[jcifs] SmbTransport thread accumulation issues

Mon Oct 17 14:58:36 MDT 2011

Hi all,

I apologize if this lacks specificity (and evidence). I've inherited some jcifs-related work from a former co-worker and am essentially trying to get caught up on what he did. This is intended as a disclosure of some changes we made to jcifs rather than a request for help. We seem to have the issue resolved, but since jcifs is licensed under the LGPL, I figure we should share the alterations with the community in case they are useful to others, and perhaps they can be considered for inclusion in a later release of jcifs.

My company's product uses jcifs to connect to a number of remote Windows machines (depending on the customer, this could be just a handful or several hundred). We have two components that use jcifs for different purposes, and each employs its own retry mechanism. One uses a retry queue: if a connection attempt to a remote machine does not complete successfully within 45 seconds, we wait 30 seconds and make another attempt, and we keep trying every 30 seconds until it succeeds. The other component doubles its wait interval after each failed attempt: 1 second after the first connect failure, 2 seconds after the second, 4 seconds after the third, and so on, capped at 15 minutes.
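For readers who want the two retry schedules spelled out, here is a minimal sketch of the delay arithmetic only. The class and method names are made up for illustration and are not taken from our product or from jcifs; the 30-second and 15-minute figures are the ones described above.

```java
import java.util.concurrent.TimeUnit;

class RetryDelays {
    /** Fixed wait used by the retry-queue component. */
    static final long FIXED_DELAY_MS = TimeUnit.SECONDS.toMillis(30);

    /** Upper bound on the doubling component's wait. */
    static final long MAX_BACKOFF_MS = TimeUnit.MINUTES.toMillis(15);

    /**
     * Wait before the next attempt in the doubling scheme.
     * failures is the number of consecutive connect failures so far (1-based):
     * 1 -> 1s, 2 -> 2s, 3 -> 4s, ... capped at 15 minutes.
     */
    static long backoffDelayMs(int failures) {
        if (failures < 1) {
            throw new IllegalArgumentException("failures must be >= 1");
        }
        // Clamp the shift so a long failure streak cannot overflow the long.
        int shift = Math.min(failures - 1, 20);
        long delay = 1000L << shift;
        return Math.min(delay, MAX_BACKOFF_MS);
    }
}
```

With these numbers the doubling scheme reaches its cap on the eleventh consecutive failure (1024 seconds would exceed 15 minutes), after which it behaves like a fixed 15-minute retry.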

A while back we came across a problem where we were accumulating threads at such a rate that we would eventually hit an OOM that killed our JVM. The Windows machines we were trying to connect to were not responsive, and whatever issue they were having caused the threads created in jcifs.util.transport.Transport.connect(long timeout) to block and stay alive even after the timeout expired and the creating thread had set its reference to the thread to null. The next time a connect attempt was initiated, another thread would be created in connect(), and this would continue until the accumulation of stranded, blocked threads became a serious problem. I can't give details as to what the threads were blocking on because I can't find any thread dumps from when the original issue was investigated, nor any explanation of how to reproduce the problem (it was discovered at a customer site).
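The general mechanism (a worker that outlives the caller's timeout keeps running even after the caller drops its reference) can be illustrated with a small stand-alone sketch. This is not jcifs code: the class, the thread names, and the CountDownLatch used as a stand-in for "whatever the real threads were blocked on" are all assumptions made for the demonstration.

```java
import java.util.concurrent.CountDownLatch;

class StrandedThreadDemo {
    /**
     * One "connect attempt": start a worker that blocks until release is
     * counted down (swallowing interrupts, as a stand-in for an unknown
     * blocking call), wait up to timeoutMs, then give up on it.
     */
    static void attempt(CountDownLatch release, long timeoutMs, int id) {
        Thread worker = new Thread(() -> {
            boolean released = false;
            while (!released) {
                try {
                    release.await();
                    released = true;
                } catch (InterruptedException ignored) {
                    // keep blocking, like a call that does not honor interrupts
                }
            }
        }, "demo-connect-" + id);
        worker.setDaemon(true);
        worker.start();
        try {
            worker.join(timeoutMs);          // the caller's timeout
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        // Returning drops the only reference; the worker itself keeps running.
    }

    /** Number of demo workers that are still alive in this JVM. */
    static int liveWorkers() {
        int n = 0;
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.isAlive() && t.getName().startsWith("demo-connect-")) {
                n++;
            }
        }
        return n;
    }

    /** Polls until no demo workers remain or maxMs elapses. */
    static boolean awaitNoWorkers(long maxMs) {
        long deadline = System.currentTimeMillis() + maxMs;
        while (liveWorkers() > 0 && System.currentTimeMillis() < deadline) {
            try {
                Thread.sleep(20);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return liveWorkers() == 0;
    }
}
```

Calling attempt() repeatedly against a latch that is never released leaves one live thread per call, which is the kind of growth described above, only with the retry interval setting the pace.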

My predecessor's solution to this was to:

a) Add a thread.interrupt() call to the synchronized(thread){} block of jcifs.util.transport.Transport.connect(long timeout), in an effort to make sure the thread does not hang around forever:

    synchronized (thread) {
        thread.start();
        thread.wait( timeout );          /* wait for doConnect */

        switch (state) {
            case 1: /* doConnect never returned */
                state = 0;
                thread.interrupt();
                thread = null;
                throw new TransportException( "Connection timeout" );
            case 2:
                if (te != null) { /* doConnect throw Exception */
                    state = 4;                        /* error */
                    thread = null;
                    throw te;
                }
                state = 3;                         /* Success! */
                return;
        }
    }

b) Add a cleanupThread() method to jcifs.util.transport.Transport, called from connect(long timeout) before the new thread is created. It checks whether a thread left over from a previous call to connect() still exists and, if so, interrupts it and sets the reference to null.

    state = 1;
    te = null;
    cleanupThread();
    thread = new Thread( this, name + "-" + threadId++ );
    thread.setDaemon( true );

    private void cleanupThread()
    {
        if (thread == null)
        {
            return;
        }

        if (thread.isAlive())
        {
            thread.interrupt();
        }
        thread = null;
    }

These two changes seem redundant to me, and a bit dangerous (I'd prefer not to blindly interrupt a thread in progress), but without knowing how to reproduce the problem to test, I'm forced to take my predecessor's word for it (and the fact that the customer's ticket was closed) that it worked.
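One caveat worth keeping in mind when evaluating these changes: Thread.interrupt() only releases a thread that is blocked in an interruptible call. Since we don't know what the stranded threads were blocked on, the following is only a sketch of one possibility. If a thread is blocked in a read on a plain java.net.Socket, an interrupt does not wake it, whereas closing the socket does. The class name and timings below are invented for the demonstration, and nothing here is jcifs code.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

class InterruptVsCloseDemo {
    /**
     * Returns { aliveAfterInterrupt, aliveAfterClose } for a thread that is
     * blocked reading from a socket whose peer never sends anything.
     */
    static boolean[] run() {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, server.getLocalPort())) {

            Thread reader = new Thread(() -> {
                try {
                    client.getInputStream().read();   // blocks: nothing is ever sent
                } catch (IOException closed) {
                    // reached when the socket is closed underneath the read
                }
            }, "demo-reader");
            reader.setDaemon(true);
            reader.start();

            Thread.sleep(200);            // give the reader time to enter read()

            reader.interrupt();           // what the patch does
            reader.join(500);
            boolean aliveAfterInterrupt = reader.isAlive();

            client.close();               // the alternative: close the resource
            reader.join(2000);
            boolean aliveAfterClose = reader.isAlive();

            return new boolean[] { aliveAfterInterrupt, aliveAfterClose };
        } catch (IOException | InterruptedException e) {
            throw new RuntimeException(e);
        }
    }
}
```

If the real threads were stuck in something like this, the interrupt calls alone would not have freed them, which is one more reason a thread dump from an affected system would be valuable.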

I don't necessarily expect these changes to be included in a future release because of the vagueness of the problem, but if anyone has seen a similar thread accumulation, you could try making these same changes (one or the other, or both). If that helps, please consider sharing a thread dump taken before the change, so that the problem (i.e. what the threads end up blocked on) can be properly documented. If the problem can be properly identified and reproduced, and the solution proven to be effective, it might make a good addition to a future release (whenever that might be).

Cheers,

Colin
