[jcifs] Deadlock while cleaning up old sessions in transport

Tue Aug 14 18:21:57 GMT 2007

Felix,

I think you're going in the wrong direction. Your making it more complex
than it needs to be. If you use a single condition variable and get rid
of the over-complicated morass of synchronized statements the locking
could be very simple.

Back in the Linux 2.0 days they had something called the Big Kernel Lock
(BKL). Everything used that one single lock. If you never wait while
holding the CPU is always working and therefore the technique is very
fast. You could run X windows, Netscape, Apache, etc and thats a lot
more work than JCIFS will ever do.

Using two or three threads JCIFS can crawl over every file on a
Windows workstation in under 10 seconds. I know something about I/O
multiplexing. Use the static Transport condvar idea and KISS.

Ok, now I'm procrastinating ... back to work.

Mike

On Tue, 14 Aug 2007 19:35:19 +0200
Felix Schumacher <felix.schumacher at internetallee.de> wrote:

> Hi Mike,
> 
> glad you are still not fed up with this thread.
> 
> I think I will have to explain how I am looking at the problem. 
> 
> The deadlock I saw occured while trying to free up old sessions in
> SmbTransport.getSmbSession(NtlmPasswordAuthentication). [The part
> triggered when 
>   "if (SO_TIMEOUT > 0 && sessionExpiration < (now =
> System.currentTimeMillis()))" is the part I would put into its own
> thread and waiting for a signal, which could be "times up", or "no
> sessions left for a new auth". This thread would live as an instance
> variable of the SmbTransport instance. (I have included a sample with
> Condition/ReentrantLock at the end)]
> >From there (ssn.logoff()) I descended to
> SmbTree.treeConnect(ServerMessageBlock, ServerMessageBlock). 
> On the way down there it always had a reference to an instance of
> SmbTransport and it was locking the two Objects transport and
> setupDiscolock in that order, though cyclic.
> >From the method SmbTree.treeConnect I then went up the callstack of
> another thread to say for example SmbFile.copyTo(SmbFile). If I would
> follow that track back down to treeConnect, I would pass the
> synchronization locks in the order setupDiscolock, transport. So there
> was one/the culprit for the experienced deadlock.
> 
> The way I am seeing it, is that the transport lock is used, so that no
> other thread can interfere with the multiplexing done on the socket and
> the setup logik before and after sending messages over the wire. But
> since each SmbTransport instance has its own socket and variables, we
> have to guard that instance from seperate thread access only, so there
> is only one lock needed (per SmbTransport instance).
> 
> I don't think it is wise to unlock that (smb)transport instance right
> before a write to a socket. I wonder what would happen when two threads
> were sending data on the socket the same time. And beside, what would
> you like to do with that transport instance? You certainly can't use it
> for sending data to the server?
> 
> Given my thoughts as outlined above, the only reason for giving up a
> lock, would be knowing, that I can't use the underlying socket for the
> next moments.
> 
> But maybe I am completely wrong and only stealing your time.
> 
> Bye Felix
> 
> // Example for using ReentrantLock as I would think it would make sense
> class SmbTransport {
>   Lock transportLock = new ReentrantLock();
>   Condition cleanUpCondition = transportLock.newCondition();
>   volatile boolean cleanUpSessions = false;
>   Thread cleanUpThread;
> ...
> //in constructor
>      cleanUpThread = new Thread(new CleanUpSessions());
>      cleanUpThread.start();
> ...
>   class CleanUpSessions implements Runnable {
>     public void run() {
>         transportLock.lock();
>         try {
>            for(;;) {
>               // while we have nothing to do, we can wait and give up
> our lock
>               while (!cleanUpSessions) 
>                  tidyUp.await();
>               sessionExpiration = now + SO_TIMEOUT;
>               iter = sessions.listIterator();
>               while( iter.hasNext() ) {
>                 ssn = (SmbSession)iter.next();
>                 if( ssn.expiration < now ) {
>                     ssn.logoff( false );
>                 }
>               }
>               cleanUpSessions = false;
>            }          
>         } finally {
>            transportLock.unlock();
>         }
>     }
>   }
> ...
> }
> 
> Am Dienstag, den 14.08.2007, 12:20 -0400 schrieb Michael B Allen:
> > On Tue, 14 Aug 2007 14:30:55 +0200 (CEST)
> > "Felix Schumacher" <felix.schumacher at internetallee.de> wrote:
> > 
> > > Am Mo, 13.08.2007, 21:07, schrieb Michael B Allen:
> > > > On Mon, 13 Aug 2007 17:17:52 +0200 (CEST)
> > > > "Felix Schumacher" <felix.schumacher at internetallee.de> wrote:
> > > >
> > > >> > Like I said the other day, the only way to fix this problem is to
> > > >> create
> > > >> > a custom lock class and replace all of the synchronized calls so that
> > > >> one
> > > >> > lock can be used and the lock can be *unlocked* deep in hte
> > > >> call-stack.
> > > >> So given an implementation of concurrent.ReentrantLock (with which one
> > > >> could start to have a known working implementation), how would you
> > > >> decide
> > > >> when to unlock the transport and where?
> > > >
> > > > Hi Felix,
> > > >
> > > > If you have a lock class that can be unlocked then the solution is simple
> > > > because you can have one lock for the entire VM. If there's only one
> > > > lock then a deadlock is theoretically impossible.
> > > If you only want to have ONE lock for the entire VM, it should be enough
> > > to synchronize on a static Object Variable (which should exist only once
> > > in the VM). But I had the impression, it would be nicer to lock every
> > > instance of Transport seperatly. When we use only one Object to
> > > synchronize there, it should not be possible to deadlock either. Or do we
> > > only have one instance of Transport? In that case we effectivly have the
> > > wanted situation already.
> > 
> > Hi Felix,
> > 
> > I don't see a problem with doing that. The only possible issue is that
> > you might not actually have access to the right transport object at the
> > point you need it so having a static member assures that you'll always
> > have access to the lock no matter where you are. But if you find that
> > you can do it it probably would be a little better because it ensures
> > that any mistake in the code (blocking with the lock) will not stall
> > the whole client.
> > 
> > > > Meaning, create a static instance of ReentrantLock associated with
> > > > jcifs.util.transport.Transport and add static lock/unlock methods that
> > > > lock/unlock that ReentrantLock. Now, eliminate all synchronized calls
> > > > and replace them with Transport.lock() / Transport.unlock().
> > > >
> > > > That will solve the deadlock but there is a pitfall. You must never ever
> > > > ever EVER block while that lock is locked. For example if the transport
> > > > tries to read() from a socket with the lock locked the entire client
> > > > will lockup (but not deadlock) waiting for that I/O. The end result will
> > > > be that performance will go subterranean.
> > > The performance will be bad having one lock only, since every write to all
> > > Transport instances would be queued.
> > 
> > No, that's not true. Like I said you can never block with the lock. So
> > before you write you must unlock the lock. You must unlock the lock
> > whenever you open files, read / write, create directories, etc.  And when
> > you're not sure if the code can block (e.g. calling into some mystery
> > library) then you should probably unlock there as well. Basically I
> > you should have a handful of instances that look something like:
> > 
> >   Transport.unlock()
> >   socket.write(buf, index, len);
> >   Transport.lock();
> > 
> > > I thought you wanted to have a thread to disconnect old sessions, which
> > > would lock the corresponding transport instance and wait for a condition
> > > to be true (time, transport has too many sessions, ...).
> > 
> > I suspect you're talking about the "transport thread". It's purpose is
> > to read from the socket so that callers do not have to commit to a read
> > which could block for longer than the caller wants to wait. But if an
> > error occurs it calls disconnect(). That is one of the threads in the
> > deadlock you're seeing. But you can't eliminate that call because the
> > application thread may be long gone.
> > 
> > > While this thread
> > > waits it would release it's lock, so other threads could do their work.
> > 
> > Right.
> > 
> > > When a working thread or a supervising thread signals that the awaited
> > > condition is true, the sleeping cleanup thread would awake, reclaim it's
> > > lock and do the dirty work, while keeping the other jobs at bay.
> > 
> > I'm not sure I understand this. The transport thread (there's no "cleanup
> > thread") should only call disconnect if there's some kind of error.
> > 
> > > If the design of the cleanup job stays like it is today, I see no real win
> > > in using a lock, but to loose the compiler checked synchronization.
> > 
> > The whole point of using ReentrantLock is to completely get rid of the
> > synchronized() statements. The problem with synchronized is that you
> > cannot unlock the lock deep in the call stack without calling wait on
> > the object on which you locked. That is a very coarse mechanism and not
> > suitable for our purposes. In fact IIRC there is no way to do it correctly
> > using synchronized statements - there will *always* be a hole. It's easy
> > to patch the code so that it doesn't occur in your test scenario but if
> > you stare at the code long enough you'll see that there is in fact a hole.
> > 
> > Mike
> > 
> > > > If you actually do this and you know what you're doing and it works,
> > > > please send a patch and I'll put it in the patches directory.
> > > >
> > > > Mike
> > > >
> > > > --
> > > > Michael B Allen
> > > > PHP Active Directory Kerberos SSO
> > > > http://www.ioplex.com/
> > > >
> > > 
> > > 
> > 
> > 
> 

-- 
Michael B Allen
PHP Active Directory Kerberos SSO
http://www.ioplex.com/