[jcifs] Deadlock while cleaning up old sessions in transport

Tue Aug 14 17:35:19 GMT 2007

Hi Mike,

glad you are still not fed up with this thread.

I think I will have to explain how I am looking at the problem. 

The deadlock I saw occured while trying to free up old sessions in
SmbTransport.getSmbSession(NtlmPasswordAuthentication). [The part
triggered when 
  "if (SO_TIMEOUT > 0 && sessionExpiration < (now =
System.currentTimeMillis()))" is the part I would put into its own
thread and waiting for a signal, which could be "times up", or "no
sessions left for a new auth". This thread would live as an instance
variable of the SmbTransport instance. (I have included a sample with
Condition/ReentrantLock at the end)]
>From there (ssn.logoff()) I descended to
SmbTree.treeConnect(ServerMessageBlock, ServerMessageBlock). 
On the way down there it always had a reference to an instance of
SmbTransport and it was locking the two Objects transport and
setupDiscolock in that order, though cyclic.
>From the method SmbTree.treeConnect I then went up the callstack of
another thread to say for example SmbFile.copyTo(SmbFile). If I would
follow that track back down to treeConnect, I would pass the
synchronization locks in the order setupDiscolock, transport. So there
was one/the culprit for the experienced deadlock.

The way I am seeing it, is that the transport lock is used, so that no
other thread can interfere with the multiplexing done on the socket and
the setup logik before and after sending messages over the wire. But
since each SmbTransport instance has its own socket and variables, we
have to guard that instance from seperate thread access only, so there
is only one lock needed (per SmbTransport instance).

I don't think it is wise to unlock that (smb)transport instance right
before a write to a socket. I wonder what would happen when two threads
were sending data on the socket the same time. And beside, what would
you like to do with that transport instance? You certainly can't use it
for sending data to the server?

Given my thoughts as outlined above, the only reason for giving up a
lock, would be knowing, that I can't use the underlying socket for the
next moments.

But maybe I am completely wrong and only stealing your time.

Bye Felix

// Example for using ReentrantLock as I would think it would make sense
class SmbTransport {
  Lock transportLock = new ReentrantLock();
  Condition cleanUpCondition = transportLock.newCondition();
  volatile boolean cleanUpSessions = false;
  Thread cleanUpThread;
...
//in constructor
     cleanUpThread = new Thread(new CleanUpSessions());
     cleanUpThread.start();
...
  class CleanUpSessions implements Runnable {
    public void run() {
        transportLock.lock();
        try {
           for(;;) {
              // while we have nothing to do, we can wait and give up
our lock
              while (!cleanUpSessions) 
                 tidyUp.await();
              sessionExpiration = now + SO_TIMEOUT;
              iter = sessions.listIterator();
              while( iter.hasNext() ) {
                ssn = (SmbSession)iter.next();
                if( ssn.expiration < now ) {
                    ssn.logoff( false );
                }
              }
              cleanUpSessions = false;
           }          
        } finally {
           transportLock.unlock();
        }
    }
  }
...
}

Am Dienstag, den 14.08.2007, 12:20 -0400 schrieb Michael B Allen:
> On Tue, 14 Aug 2007 14:30:55 +0200 (CEST)
> "Felix Schumacher" <felix.schumacher at internetallee.de> wrote:
> 
> > Am Mo, 13.08.2007, 21:07, schrieb Michael B Allen:
> > > On Mon, 13 Aug 2007 17:17:52 +0200 (CEST)
> > > "Felix Schumacher" <felix.schumacher at internetallee.de> wrote:
> > >
> > >> > Like I said the other day, the only way to fix this problem is to
> > >> create
> > >> > a custom lock class and replace all of the synchronized calls so that
> > >> one
> > >> > lock can be used and the lock can be *unlocked* deep in hte
> > >> call-stack.
> > >> So given an implementation of concurrent.ReentrantLock (with which one
> > >> could start to have a known working implementation), how would you
> > >> decide
> > >> when to unlock the transport and where?
> > >
> > > Hi Felix,
> > >
> > > If you have a lock class that can be unlocked then the solution is simple
> > > because you can have one lock for the entire VM. If there's only one
> > > lock then a deadlock is theoretically impossible.
> > If you only want to have ONE lock for the entire VM, it should be enough
> > to synchronize on a static Object Variable (which should exist only once
> > in the VM). But I had the impression, it would be nicer to lock every
> > instance of Transport seperatly. When we use only one Object to
> > synchronize there, it should not be possible to deadlock either. Or do we
> > only have one instance of Transport? In that case we effectivly have the
> > wanted situation already.
> 
> Hi Felix,
> 
> I don't see a problem with doing that. The only possible issue is that
> you might not actually have access to the right transport object at the
> point you need it so having a static member assures that you'll always
> have access to the lock no matter where you are. But if you find that
> you can do it it probably would be a little better because it ensures
> that any mistake in the code (blocking with the lock) will not stall
> the whole client.
> 
> > > Meaning, create a static instance of ReentrantLock associated with
> > > jcifs.util.transport.Transport and add static lock/unlock methods that
> > > lock/unlock that ReentrantLock. Now, eliminate all synchronized calls
> > > and replace them with Transport.lock() / Transport.unlock().
> > >
> > > That will solve the deadlock but there is a pitfall. You must never ever
> > > ever EVER block while that lock is locked. For example if the transport
> > > tries to read() from a socket with the lock locked the entire client
> > > will lockup (but not deadlock) waiting for that I/O. The end result will
> > > be that performance will go subterranean.
> > The performance will be bad having one lock only, since every write to all
> > Transport instances would be queued.
> 
> No, that's not true. Like I said you can never block with the lock. So
> before you write you must unlock the lock. You must unlock the lock
> whenever you open files, read / write, create directories, etc.  And when
> you're not sure if the code can block (e.g. calling into some mystery
> library) then you should probably unlock there as well. Basically I
> you should have a handful of instances that look something like:
> 
>   Transport.unlock()
>   socket.write(buf, index, len);
>   Transport.lock();
> 
> > I thought you wanted to have a thread to disconnect old sessions, which
> > would lock the corresponding transport instance and wait for a condition
> > to be true (time, transport has too many sessions, ...).
> 
> I suspect you're talking about the "transport thread". It's purpose is
> to read from the socket so that callers do not have to commit to a read
> which could block for longer than the caller wants to wait. But if an
> error occurs it calls disconnect(). That is one of the threads in the
> deadlock you're seeing. But you can't eliminate that call because the
> application thread may be long gone.
> 
> > While this thread
> > waits it would release it's lock, so other threads could do their work.
> 
> Right.
> 
> > When a working thread or a supervising thread signals that the awaited
> > condition is true, the sleeping cleanup thread would awake, reclaim it's
> > lock and do the dirty work, while keeping the other jobs at bay.
> 
> I'm not sure I understand this. The transport thread (there's no "cleanup
> thread") should only call disconnect if there's some kind of error.
> 
> > If the design of the cleanup job stays like it is today, I see no real win
> > in using a lock, but to loose the compiler checked synchronization.
> 
> The whole point of using ReentrantLock is to completely get rid of the
> synchronized() statements. The problem with synchronized is that you
> cannot unlock the lock deep in the call stack without calling wait on
> the object on which you locked. That is a very coarse mechanism and not
> suitable for our purposes. In fact IIRC there is no way to do it correctly
> using synchronized statements - there will *always* be a hole. It's easy
> to patch the code so that it doesn't occur in your test scenario but if
> you stare at the code long enough you'll see that there is in fact a hole.
> 
> Mike
> 
> > > If you actually do this and you know what you're doing and it works,
> > > please send a patch and I'll put it in the patches directory.
> > >
> > > Mike
> > >
> > > --
> > > Michael B Allen
> > > PHP Active Directory Kerberos SSO
> > > http://www.ioplex.com/
> > >
> > 
> > 
> 
>