[jcifs] Re: jCIFS deadlock issue - when can we expect a fix?

Michael B Allen mba2000 at ioplex.com
Mon Feb 20 19:20:04 GMT 2006


On Mon, 20 Feb 2006 18:56:47 +0100
Ronny Schuetz <usenet-01 at groombridge34.de> wrote:

> The current solution is basically an additional synchronization in
> SmbSession#send() and Transport#disconnect() on an additional object per
> Transport instance (_oSetupDisconnectMutex):

That might work. If I remember correctly, the problem is that the
transport thread grabs the transport lock and then tries to acquire
another lock (I think it's SmbTransport.response_map). But before it
can do so, the calling thread gets the response_map lock and then
tries to get the transport lock. So now you have:

Thread-T has Lock-T trying to get Lock-M
Thread-C has Lock-M trying to get Lock-T

So what you're doing is introducing a third lock, Lock-X. Thread-T gets
Lock-X and then Lock-T. If a context switch occurs at that point, Thread-C
will try and fail to get Lock-X, allowing Thread-T to get Lock-M and
finish. Then Thread-C can get Lock-X, Lock-M, and Lock-T, do its thing,
and complete.

It could kill concurrency, and it's a little impure because you're
basically using a lock to protect locks, but I think it will work. If it
doesn't, post a thread dump.
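
For what it's worth, the shape of the fix is roughly this (a made-up
sketch, not actual jCIFS code; the class and method names are invented,
and lockX stands in for your _oSetupDisconnectMutex):

// Hypothetical illustration of the guard-lock idea, not jCIFS source.
// lockT ~ the transport lock, lockM ~ the response_map lock,
// lockX ~ the new per-transport guard (_oSetupDisconnectMutex).
public class GuardLockSketch {
    private final Object lockX = new Object(); // guard taken first by both threads
    private final Object lockT = new Object(); // "transport" lock
    private final Object lockM = new Object(); // "response map" lock

    // What the transport thread (Thread-T) does, e.g. in disconnect().
    void transportThreadWork() {
        synchronized (lockX) {          // take the guard first
            synchronized (lockT) {      // then the transport lock
                synchronized (lockM) {  // then the map lock
                    // tear down the connection / clean up the response map
                }
            }
        }
    }

    // What a calling thread (Thread-C) does, e.g. in send().
    void callingThreadWork() {
        synchronized (lockX) {          // blocks here if Thread-T holds the guard
            synchronized (lockM) {      // M and T are still taken in the old order,
                synchronized (lockT) {  // but the guard keeps the orders apart
                    // register the request and hand it to the transport
                }
            }
        }
    }
}

Both threads still take Lock-M and Lock-T in opposite orders, but since
neither can start until it holds Lock-X, the two orderings are never in
flight at the same time - which is also exactly why it serializes
everything and hurts concurrency.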

> Another point that I just noticed today: During the tests I tried to
> avoid the deadlock by setting the ssnLimit to 1, as this should
> create a single Transport per Session. The test runs 150
> concurrent threads (to quickly reproduce the issue) accessing a single
> Windows share; each thread accesses a single file and executes a loop
> that constantly calls SmbFile#getFreeSpace(), deletes the file if it is
> present, recreates it (32k), checks that it exists, reads it again, and
> deletes it again. The operations are based on the ones that were
> executing when it deadlocked before. However, the reason I write this
> is that I saw the library open a lot of sockets without ever reusing
> them - it looked like they were all timing out. It ended up with more
> than 2500 open sockets. Could it be that the library does not reuse
> connections in this special case, while it does when ssnLimit is, for
> example, left unset?
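
(For reference, the loop described above would look roughly like the
following. This is pieced together from the description, not code from
the actual test; the URL, credentials, per-thread file names, and the
getDiskFreeSpace() call are placeholders and assumptions.)

import jcifs.smb.NtlmPasswordAuthentication;
import jcifs.smb.SmbFile;
import jcifs.smb.SmbFileInputStream;
import jcifs.smb.SmbFileOutputStream;

// Rough reconstruction of the stress loop described above.
public class DeadlockStress {
    public static void main(String[] args) throws Exception {
        final NtlmPasswordAuthentication auth =
                new NtlmPasswordAuthentication("DOMAIN", "user", "password");
        for (int i = 0; i < 150; i++) {          // 150 concurrent threads
            final int id = i;
            new Thread(new Runnable() {
                public void run() {
                    try {
                        // one file per thread; the original post is ambiguous here
                        SmbFile f = new SmbFile(
                                "smb://server/share/stress-" + id + ".dat", auth);
                        byte[] buf = new byte[32 * 1024];  // 32k payload
                        while (true) {
                            f.getDiskFreeSpace();          // free-space query
                            if (f.exists())                // delete if present
                                f.delete();
                            SmbFileOutputStream out = new SmbFileOutputStream(f);
                            out.write(buf);                // recreate (32k)
                            out.close();
                            f.exists();                    // check it exists
                            SmbFileInputStream in = new SmbFileInputStream(f);
                            while (in.read(buf) != -1) ;   // read it back
                            in.close();
                            f.delete();                    // delete again
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            }).start();
        }
    }
}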

This is a combination of two things. One, JCIFS doesn't explicitly close
sockets. They are closed only after the soTimeout has expired. So if you
set ssnLimit to 1 you will create a socket for each session and you will
rapidly build up sockets. Generally you want to avoid setting ssnLimit
to 1 as it really destroys scalability. Two, look at the TCP states of
those sockets. You can see socket states using netstat -ta. If you see
CLOSE_WAIT, that means the remote end has closed the connection but the
local side hasn't closed its socket yet, so the kernel keeps the data
structure around until it does (TIME_WAIT is the state that lingers to
wait for the final ACK). What you want to do is put a long sleep at the
end of the program and run netstat -ta repeatedly to see if the sockets
finally go away. If they're still there after 30 minutes, then there
might be a problem.
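
The knobs involved are just jcifs.* properties set before the first
SmbFile is created; for example (the values below are only illustrative,
not recommendations):

import jcifs.Config;

// Example only: set these before any jCIFS classes do work.
// jcifs.smb.client.ssnLimit caps the sessions multiplexed over one
// transport (leave it near the default rather than 1), and
// jcifs.smb.client.soTimeout controls how long an idle socket is kept
// open before jCIFS closes it.
public class JcifsTuning {
    public static void main(String[] args) throws Exception {
        Config.setProperty("jcifs.smb.client.soTimeout", "15000"); // 15s idle timeout
        Config.setProperty("jcifs.smb.client.ssnLimit", "250");    // sessions per transport
        // ... create SmbFile instances and run the workload here ...
    }
}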

Mike

