[jcifs] Re: jcifs-1.2.4 stability fixes

Michael B Allen mba2000 at ioplex.com
Sat Oct 1 02:32:46 GMT 2005


On Fri, 30 Sep 2005 10:40:58 +0200
Lars heete <hel at admin.de> wrote:

> > >  jcifs-threading.diff -
> > >      the biggest one ... there is a big locking problem between
> > >      treeConnect/sessionSetup/sessionExpiration  and transport closedown
> > > via disconnect. Both take the transport-monitor, and if the transport
> > > socket-connection closes with sessions at a high frequency, the wrong
> > > thread may wait on transport.response_map with the transport monitor
> > > held, so transport.disconnect will not run, and the transport is
> > > unusable for a long time (at least REQUEST_TIMEOUT)
> > >      The changes try to address this problem by signalling not only
> > > responses but also transport state changes via transport.response_map,
> > > and setting transport.state to CLOSING to indicate transport closedown
> > > before disconnect, so any waiting thread can leave the transport monitor.
> > >
> > > the changes are well tested (increasing the application's survival time
> > > from 15 min to at least 2 weeks)
> >
> > Lars,
> >
> > So you're saying Transport.disconnect() is blocked because the transport
> > is locked by a thread waiting on Transport.response_map. Is the server
> > closing the socket? And that causes disconnect() to be called. But
> > because there are outstanding requests the disconnect state is basically
> > ignored. I don't understand the part about the "wrong thread".
> >
> > Can you explain this in more detail. Where is the deadlock exactly?
>
> it's actually not a deadlock but a livelock. If one thread opens a new
> session, sessionSetup locks the transport. If at this time, for example, the
> transport times out, it is possible that the response map is still empty, so
> the transport thread tries to enter disconnect. Because the transport lock
> is still held by the session thread, the transport thread has to wait until
> the session thread's wait on the response map times out (there will be no
> notify). By that time there may be many other session setups pending that
> wait for the transport monitor. Now it is very unlikely that the transport
> thread gets the monitor, and with high session load we have a pretty stable
> livelock.

I see. So if something happens that causes the transport thread to call
disconnect() it may wait a long time to get the lock if many other
threads are also trying to acquire it in sessionSetup.
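The sequence Lars describes can be reproduced in miniature. The following
standalone sketch is hypothetical, not jCIFS code: the names transport and
responseMap, and the 2000 ms timeout standing in for REQUEST_TIMEOUT, are
only meant to mirror the shapes involved. A "session" thread holds the
transport monitor while waiting on an empty response map, which keeps a
would-be disconnect() out for the full timeout:

```java
// Hypothetical miniature of the livelock described above; not jCIFS code.
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CountDownLatch;

public class LivelockDemo {
    static final Object transport = new Object();             // the transport monitor
    static final Map<Integer, Object> responseMap = new HashMap<>();
    static final CountDownLatch monitorHeld = new CountDownLatch(1);
    static volatile long blockedMs;

    public static void main(String[] args) throws Exception {
        // "Session" thread: takes the transport monitor, then waits on the
        // still-empty response map. No response ever arrives, so it wakes up
        // only when its own timeout expires -- holding the monitor throughout.
        Thread session = new Thread(() -> {
            synchronized (transport) {
                monitorHeld.countDown();
                synchronized (responseMap) {
                    try {
                        responseMap.wait(2000);   // stand-in for REQUEST_TIMEOUT
                    } catch (InterruptedException ignored) { }
                }
            }
        });
        session.start();
        monitorHeld.await();

        // "Transport" thread: the socket died and it wants to disconnect(),
        // but it cannot enter the monitor until the session thread's wait
        // times out.
        long t0 = System.currentTimeMillis();
        synchronized (transport) {
            blockedMs = System.currentTimeMillis() - t0;
        }
        session.join();
        System.out.println("disconnect() was kept out for ~" + blockedMs + " ms");
    }
}
```

With many session threads queued the same way, the transport thread keeps
losing the race for the monitor, which is the "pretty stable livelock".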

I'm not going to fix this. I admit it's a wart, but it's a pretty unusual
case for a variety of reasons and I cannot risk making significant
changes to the code. I tried for a while to think of an easy way to dodge
this problem, but I can't think of anything solid.

Your fix is not quite right either. Aside from some extraneous changes,
the crux of your fix is to change the transport state without holding
the lock:

    +    public boolean disconnect( boolean hard ) {
    +        if (state == STATE_OPEN)
    +            changeState(STATE_CLOSING);
    +
    +        synchronized(this) {

This opens up all sorts of possible synchronization flaws. If you're
going to mess with the state without the lock, you might as well just flip
the state to error just before disconnect() in Transport.loop() and then
bail out in sendrecv() if the transport is in error.
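A sketch of that alternative, under the assumption that a volatile state
flag checked at the top of sendrecv() is acceptable. MiniTransport,
STATE_ERROR and the method bodies here are hypothetical stand-ins, not the
actual jCIFS Transport:

```java
// Illustrative sketch only; not the actual jCIFS Transport class.
import java.io.IOException;

class MiniTransport {
    static final int STATE_OPEN = 0, STATE_ERROR = 5;
    // volatile so the flip in loop() is visible to sender threads without
    // them having to take the transport monitor first
    volatile int state = STATE_OPEN;

    // Run only by the transport thread; on a dead socket it marks the
    // transport bad *before* contending for the monitor.
    void loop() {
        state = STATE_ERROR;
        disconnect();
    }

    synchronized void disconnect() {
        // tear down the socket, fail outstanding requests, etc.
    }

    Object sendrecv(Object request) throws IOException {
        if (state == STATE_ERROR)
            throw new IOException("transport in error; refusing to queue request");
        synchronized (this) {
            // normal send/receive under the monitor
            return null;
        }
    }
}
```

The point is that the only unlocked write is a single flag flip done by one
designated thread, rather than a state transition that other paths also
depend on seeing consistently.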

Or, what you really want to figure out is why so many threads are trying
to communicate with the same unresponsive server. If you're doing network
crawling there are MUCH better algorithms you could use here. The server
can only handle 2 or 3 busy threads efficiently.

> The current SmbTransport actually has a deadlock in the paths 
> doSend->disconnect (first rmap lock then transport lock) and  
> sessionSetup->send (first transport lock then rmap lock).
> This is fixed doing the error handling only in the transport thread.

I don't know about this. I think the logic actually prevents this from
ever happening, because doSend() cannot be called without first doing
a sessionSetup(). So you do sessionSetup() and then send(); a subsequent
sessionSetup() will just return because the session is already "setup".

I agree it looks dangerous as hell and should probably be changed, but at
the moment I'd like to leave it alone because technically I do not think
it can actually deadlock. If you can get it to happen, please send me a
thread dump.
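Purely as an illustration of why opposite acquisition orders are worrying,
independent of whether jCIFS's call graph can actually interleave these two
paths, here is a minimal ABBA deadlock between two stand-in locks. The JVM's
own deadlock detector (ThreadMXBean.findDeadlockedThreads) confirms the
cycle; none of this is jCIFS code:

```java
// Minimal ABBA deadlock mirroring the two acquisition orders named above
// (rmap-then-transport vs transport-then-rmap). Illustration only.
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

public class AbbaDemo {
    static final Object transportLock = new Object();   // "transport monitor"
    static final Object rmapLock = new Object();        // "response_map"
    static final CountDownLatch bothHoldFirst = new CountDownLatch(2);
    static volatile long[] deadlocked;

    public static void main(String[] args) throws Exception {
        // Path 1 (doSend -> disconnect): rmap lock first, then transport lock.
        spawn(rmapLock, transportLock);
        // Path 2 (sessionSetup -> send): transport lock first, then rmap lock.
        spawn(transportLock, rmapLock);

        // Poll the JVM's deadlock detector until it sees the cycle.
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (int i = 0; i < 100 && deadlocked == null; i++) {
            Thread.sleep(100);
            deadlocked = mx.findDeadlockedThreads();
        }
        System.out.println("deadlocked threads: "
                + (deadlocked == null ? 0 : deadlocked.length));
    }

    static void spawn(Object first, Object second) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                bothHoldFirst.countDown();
                try { bothHoldFirst.await(); } catch (InterruptedException ignored) { }
                synchronized (second) { }   // blocks forever: the other thread holds it
            }
        });
        t.setDaemon(true);   // let the JVM exit despite the deadlocked pair
        t.start();
    }
}
```

The usual cure is exactly what Lars suggests in spirit: make every path take
the locks in one fixed order, or confine the conflicting work (here, error
handling) to a single thread.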

I did find a different deadlock in the code however. It was in the
getDfsReferral code and was rather easy to fix.

Mike
