[jcifs] Re: jcifs-1.2.4 stability fixes

Tue Oct 4 21:35:10 GMT 2005

On Mon, 3 Oct 2005 11:48:03 +0200
Lars heete <hel at admin.de> wrote:

> >     239 +    public boolean disconnect( boolean hard ) {
> >     240 +        if (state == STATE_OPEN)
> >     241 +            changeState(STATE_CLOSING);
> >     242 +
> >     243 +        synchronized(this) {
> >
> > This opens up all sorts of possible synchronization flaws. If you're
> > mess with the state without the lock you might as well just flip the
> > state to error just before disconnect() in Transport.loop() and then
> > bail out in sendrecv() if the transport is in error.
> In my changed transport version actually the state changes are protected by 
> the response_map mutex (look at changeState), which gets released when client 
> threads are waiting for responses.

But between the time you inspect the state and call changeState a context switch could occur resulting on changeState(STATE_CLOSING) being called multiple times.

> The problem actually is that disconnect 
> may block the transport thread. For the hard error case it should be ok using 
> only the respose_map lock for state changes. The soft disconnect (expiration) 
> case  ideally should be moved to a different "cleanup" thread, where blocking 
> isn't that fatal. 

Meaning while the transport thread is busy calling disconnect it cannot process any responses. In this case yes a separate thread should be used if the transport thread is to call disconnect(). However I'm going to avoid this for now because it so happends that we do not wait for TreeDisconnect or LogoffAndX responses so it doesn't really help us.

> > > The current SmbTransport actually has a deadlock in the paths
> > > doSend->disconnect (first rmap lock then transport lock) and
> > > sessionSetup->send (first transport lock then rmap lock).
> > > This is fixed doing the error handling only in the transport thread.
> >
> > I don't know about this. I think logic actually prevents this from
> > ever happening because doSend() cannot be called without first doing
> > a sessionSetup(). So you either do sessionSetup() and then send() but a
> > subsequent sessionSetup() will just return because the session is already
> > "setup".
> there may be actually multiple sessions on one transport.

True. I didn't think about that.

Ok, I've made some significant changes. The technique of locking
the transport while setting up/logging off/mounting/unmounting is too
coarse. It was very simple to implement but because Transport.disconnect()
and other related routines can call back into the upper layers while
holding the transport lock it is too deadlock prone. This is really the
source of all issues you've found.

So I've change locking throughout in a fundamental
way. SmbSession and SmbTree now have explicit state machine
driven locking for treeConnect/disconnect, sessionSetup/logoff like
Transport.connect/disconnect do. But all of these methods now *unlock*
the lock before calling the body of their routine [1]. Meaning all of
these methods now are roughly as follows:

    void foo() {
        synchronized( transport ) {
            look at state and allow only one thread to pass
            all other threads either return or wait for the state to change
        }

        do real work here without holding the lock

        synchronized( transport ) {
            update state with success or failure and notifyAll waiters
        }
    }

Now, beware, this new model is much looser. There is much more potential
for concurrency on each transport. Before, we knew that because the
transport was locked the whole time during sessionSetup/logoff etc that
the number of possible code paths was very limited. This is not true
of the new model and therefore it is more complicated.

I've tested this as best I can in my limited environment. Can you try
it before I unleash this on everyone else? Your webcrawler setup seems
to be good at stressing the code.

The code is in the download area labeled jcifs-1.2.6test.

Mike

[1] There is one exception that I know of so far that I'm not sure how
to fix. That is in SmbTransport.getSmbSession() which is synchronized
but SmbSession.logoff() is called inside a loop. The loop needs to be
protected by the lock so unlocking within the loop isn't possible. To
fix this it would be necessary to remove the sessions from the list,
unlock, and then logoff the sessions.