[jcifs] SmbTransport thread accumulation issues

Michael B Allen ioplex at gmail.com
Tue Oct 18 12:46:17 MDT 2011


Hi Sebastian and Colin,

This type of application where you want to "touch" as many machines as
possible in the shortest amount of time and with the least amount of
resources is somewhat popular actually. However, in general this is
largely an algorithm issue. As Sebastian points out, one should
probably not create so many threads that you reach an OOM condition.
Use a configurable fixed maximum number. Also, note that a recent
release of JCIFS made the soTimeout property control the socket
connect timeout. Unfortunately it is a static global property, so it
cannot be adjusted dynamically on a per-connection basis. If the
connect timeout could be dynamically adjusted on a per-SmbFile basis,
you could use a short soTimeout initially and if it fails, try again
with a larger soTimeout, then larger still, etc. That would be quite
efficient I think.
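A configurable fixed maximum is straightforward to get with a standard thread pool instead of one thread per machine. A minimal sketch (the probe method and host names below are placeholders, not jcifs API):

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class FixedPoolScan {

    // Submit one task per host to a pool capped at maxThreads;
    // returns how many hosts were processed.
    static int scan(List<String> hosts, int maxThreads) {
        AtomicInteger touched = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        for (String host : hosts) {
            pool.submit(() -> {
                probe(host);                 // placeholder for the real SMB work
                touched.incrementAndGet();
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return touched.get();
    }

    // Placeholder for the real per-host work (e.g. an SmbFile operation).
    static void probe(String host) { /* connect, query, disconnect */ }

    public static void main(String[] args) {
        System.out.println(scan(Arrays.asList("host1", "host2", "host3"), 8));
        // prints 3
    }
}
```

However many hosts are on the list, at most maxThreads are in flight at once, so a batch of unresponsive machines cannot accumulate threads without bound.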

But to Colin directly, note that in 1.3.16, the
jcifs.smb.client.soTimeout value is passed to the Socket constructor
to control the connect timeout. So by just using a relatively small
soTimeout value you could speed things up considerably. And you could
make one pass over the list of hosts with a low soTimeout and if they
respond, process and remove them from the list and then double the
soTimeout and go over the list again and remove and repeat so that you
try successively longer soTimeouts. In general this type of
application is handled well with a thoughtful algorithm. You should
almost never interrupt() threads. That's generally considered bad
form. I don't think I would ever include code to interrupt threads. It
would make much more sense to get rid of the global static
jcifs.Config properties so that the connect timeout can be supplied on
a per-connection basis.
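The multi-pass idea could be sketched like this; `reachable` is a hypothetical stand-in for the actual jcifs probe (e.g. setting jcifs.smb.client.soTimeout via Config and attempting an SmbFile operation), injected here so the sweep logic is visible on its own:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.BiPredicate;

public class EscalatingScan {

    // Repeatedly sweep the host list, doubling the timeout each pass,
    // and drop a host from the list as soon as it answers. Hosts still
    // present after the final pass never responded at any timeout.
    static List<String> sweep(List<String> hosts, int startTimeoutMs, int maxTimeoutMs,
                              BiPredicate<String, Integer> reachable) {
        List<String> remaining = new ArrayList<>(hosts);
        for (int t = startTimeoutMs; t <= maxTimeoutMs && !remaining.isEmpty(); t *= 2) {
            for (Iterator<String> it = remaining.iterator(); it.hasNext(); ) {
                if (reachable.test(it.next(), t)) {
                    it.remove();   // responded: process it and stop retrying it
                }
            }
        }
        return remaining;
    }

    public static void main(String[] args) {
        // Fake probe: "fast" answers at any timeout, "slow" only at >= 4 s,
        // "dead" never answers.
        List<String> dead = sweep(Arrays.asList("fast", "slow", "dead"), 1000, 8000,
                (host, timeout) -> host.equals("fast")
                        || (host.equals("slow") && timeout >= 4000));
        System.out.println(dead);   // prints [dead]
    }
}
```

With the real library, each pass would set the (currently global) soTimeout once before sweeping, which is exactly why a per-connection timeout would make this cleaner.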

Note: The 1.3.16 release introduced a mildly serious bug that I am
about to fix after I finish typing this message. The
socket.setSoTimeout call was removed and the SO_TIMEOUT was passed to
the Socket constructor instead. However, that timeout is actually NOT
the SO_TIMEOUT. It is the connect timeout. So
jcifs.smb.client.soTimeout in 1.3.16 does not actually control the
SO_TIMEOUT which means if JCIFS successfully connects but then the
server hangs, JCIFS could sit there indefinitely. Not good. I will
release 1.3.17 with a fix ASAP.
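For reference, the plain JDK pattern that keeps the two timeouts distinct looks like this (standard java.net API, not the jcifs internals themselves):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class TwoTimeouts {

    // connectTimeoutMs bounds the TCP handshake; soTimeoutMs bounds each
    // subsequent blocking read, so a server that accepts the connection
    // and then hangs cannot stall the caller indefinitely.
    static Socket open(String host, int port, int connectTimeoutMs, int soTimeoutMs)
            throws IOException {
        Socket s = new Socket();
        s.connect(new InetSocketAddress(host, port), connectTimeoutMs);
        s.setSoTimeout(soTimeoutMs);
        return s;
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocket ss = new ServerSocket(0);
             Socket s = open("127.0.0.1", ss.getLocalPort(), 1000, 2500)) {
            System.out.println(s.getSoTimeout());   // prints 2500
        }
    }
}
```

Passing the timeout only to connect() (as 1.3.16 does) leaves SO_TIMEOUT at its default of zero, i.e. reads block forever, which is the bug described above.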

Mike

-- 
Michael B Allen
Java Active Directory Integration
http://www.ioplex.com/

On Tue, Oct 18, 2011 at 12:22 AM, Sebastian Sickelmann
<sebastian.sickelmann at gmx.de> wrote:
> Hi,
>
> I don't have the problems with the thread accumulation, because I do not
> create so many connections, but I see the point that creating a thread for
> every machine you connect to can lead to such problems. I think the main
> problem is that jcifs uses blocking I/O instead of non-blocking I/O. I
> made some experiments using non-blocking I/O in jcifs, handling the I/O
> with a small number of threads, but it is far from being ready for a first
> review by the community and the committers (I am not a committer) of
> jcifs.
>
> The question I have is: "are there many/some users of jcifs who are using
> many connections in a single JVM process?"
>
> -- Sebastian
>
> On 17.10.2011 at 22:58, Colin Hay wrote:
>
> Hi all,
>
>
>
> I apologize if this lacks a degree of specificity (and evidence); I’ve
> inherited some jcifs-related work from a former co-worker and am essentially
> trying to get caught up on what he did. This is really intended as a
> disclosure of some changes we made to jcifs rather than a request for help;
> we seem to have the issue resolved but since jcifs is licensed under the
> LGPL I figure we should share the alterations with the community in case
> they might be useful to others, and maybe they can be considered for
> inclusion in a later release of jcifs.
>
>
>
> My company’s product uses jcifs to connect to a number of remote Windows
> machines (depending on the customer, this could be just a handful, or
> several hundred). We have two components that make use of jcifs for
> different purposes; each employs its own retry mechanism. One utilizes a
> retry queue such that if a given connection attempt to a remote machine does
> not complete successfully within 45 seconds, we wait 30 seconds and make
> another connection attempt. We keep trying every 30 seconds until success.
> The other component doubles its wait interval between retry attempts,
> starting at 1 second after the first connect failure, 2 seconds after the
> second, 4 seconds after the third, etc., though we cap it at 15 minutes.
>
>
>
> A while back we came across a problem where we were accumulating threads at
> a rate such that we would eventually hit an OOM condition that would kill
> our JVM. This was because the Windows machines we were trying to connect to were not
> responsive, and whatever issue they were having resulted in the threads
> created in jcifs.util.transport.Transport.connect(long timeout) blocking and
> staying active even after the timeout expired and the reference to the
> thread was nullified by the creating thread. The next time a connect attempt
> was initiated, another thread would be created in the connect() method, and
> this would continue until we had a serious problem on our hands because of
> the accumulation of stranded blocked threads. I can’t give details as to
> what the threads were blocking on because I can’t find any thread dumps from
> when the original issue was investigated, nor any explanation as to how to
> reproduce the problem (it was discovered at a customer site).
>
>
>
> My predecessor’s solution to this was to:
>
>
>
> a)       add a thread.interrupt() call to the synchronized(thread){} block
> of jcifs.util.transport.Transport.connect(long timeout) in an effort to make
> sure the thread does not hang around forever:
>
>
>
> synchronized (thread) {
>     thread.start();
>     thread.wait( timeout );              /* wait for doConnect */
>
>     switch (state) {
>         case 1:                          /* doConnect never returned */
>             state = 0;
>             thread.interrupt();
>             thread = null;
>             throw new TransportException( "Connection timeout" );
>         case 2:
>             if (te != null) {            /* doConnect threw an exception */
>                 state = 4;               /* error */
>                 thread = null;
>                 throw te;
>             }
>             state = 3;                   /* Success! */
>             return;
>     }
> }
>
>
>
>
>
> b)       add a cleanupThread() method to jcifs.util.transport.Transport,
> called from connect(long timeout) before creating the new thread, to check
> if the thread has already been initialized by a previous call to connect()
> and if so, interrupt and nullify it.
>
>
>
> state = 1;
> te = null;
> cleanupThread();
> thread = new Thread( this, name + "-" + threadId++ );
> thread.setDaemon( true );
>
>
> private void cleanupThread()
> {
>     if (thread == null)
>     {
>         return;
>     }
>
>     if (thread.isAlive())
>     {
>         thread.interrupt();
>     }
>     thread = null;
> }
>
>
>
> These two changes seem redundant to me, and a bit dangerous (I’d prefer not
> to blindly interrupt a thread in progress), but without knowing how to
> reproduce the problem to test, I’m forced to take my predecessor’s word for
> it (and the fact that the customer’s ticket was closed) that it worked.
>
>
>
> I don’t necessarily expect these changes to be included in a future release
> because of the vagueness of the problem, but if anyone has seen a similar
> thread accumulation, you could try making these same changes (one or the
> other or both), and if it helps, maybe you could share a thread dump of the
> situation prior to the change, so it can be properly documented as to what
> the problem is (i.e. what the threads end up blocked on). If the problem can
> be properly identified and reproduced, and the solution proven to be
> effective, it might make a good addition to a future release (whenever that
> might be).
>
>
>
>
>
> Cheers,
>
>
>
> Colin
>
>
>
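The doubling retry interval Colin describes (1 second, 2 seconds, 4 seconds, capped at 15 minutes) can be sketched as a standalone helper; this is an illustration of that schedule, not code from either component:

```java
public class Backoff {
    static final long MAX_MS = 15 * 60 * 1000L;   // cap at 15 minutes

    // Wait interval before retry attempt n (n = 1 after the first failure):
    // 1 s, 2 s, 4 s, ... doubling until the cap is reached.
    static long delayMs(int attempt) {
        if (attempt >= 21) {
            return MAX_MS;   // 2^20 seconds is already far past the cap
        }
        long d = 1000L << (attempt - 1);
        return Math.min(d, MAX_MS);
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 12; n++) {
            System.out.println("attempt " + n + ": " + delayMs(n) + " ms");
        }
    }
}
```

With a 1-second start the cap kicks in at the 11th consecutive failure, after which the component simply retries every 15 minutes.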