[jcifs] SmbTransport thread accumulation issues

Sebastian Sickelmann sebastian.sickelmann at gmx.de
Mon Oct 17 22:22:30 MDT 2011


Hi,

I don't have the thread-accumulation problem myself, because I don't 
create that many connections, but I see how creating a thread for 
every machine you connect to can lead to it. I think the main problem 
is that jcifs uses blocking I/O instead of non-blocking I/O. I have 
experimented with using non-blocking I/O in jcifs and handling the I/O 
with a small number of threads, but it is far from being ready for a 
first review by the community and the committers (I am not a 
committer) of jcifs.
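
For illustration, here is a minimal sketch of that idea: a single 
selector thread driving several non-blocking connection attempts at 
once, instead of one blocking thread per machine. This is not jcifs 
code and every name in it is made up; it connects to a local listening 
socket only so the example is self-contained.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.SocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

public class NioConnectSketch {

    /* Drive `count` non-blocking connection attempts through one
     * selector; returns how many completed. A local ServerSocketChannel
     * stands in for the remote machines. */
    static int connectAll(int count) throws IOException {
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress("127.0.0.1", 0));
        SocketAddress addr = server.getLocalAddress();

        Selector selector = Selector.open();
        int connected = 0;
        for (int i = 0; i < count; i++) {
            SocketChannel ch = SocketChannel.open();
            ch.configureBlocking(false);
            if (ch.connect(addr)) {      // loopback may finish immediately
                connected++;
                ch.close();
            } else {
                ch.register(selector, SelectionKey.OP_CONNECT);
            }
        }
        while (connected < count) {      // one thread services all pending connects
            selector.select(1000);
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                SocketChannel ch = (SocketChannel) key.channel();
                if (key.isConnectable() && ch.finishConnect()) {
                    connected++;
                    ch.close();
                }
            }
        }
        selector.close();
        server.close();
        return connected;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(connectAll(5)); // 5
    }
}
```

The point of the sketch is only the shape: connection state lives in 
selector keys rather than in per-connection threads, so hundreds of 
slow or dead machines cost selector registrations, not threads.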

The question I have is: "are there many/some users of jcifs who run 
many connections in a single JVM process?"

-- Sebastian

On 17.10.2011 22:58, Colin Hay wrote:
>
> Hi all,
>
> I apologize if this lacks a degree of specificity (and evidence); I've 
> inherited some jcifs-related work from a former co-worker and am 
> essentially trying to get caught up on what he did. This is really 
> intended as a disclosure of some changes we made to jcifs rather than 
> a request for help; we seem to have the issue resolved but since jcifs 
> is licensed under the LGPL I figure we should share the alterations 
> with the community in case they might be useful to others, and maybe 
> they can be considered for inclusion in a later release of jcifs.
>
> My company's product uses jcifs to connect to a number of remote 
> Windows machines (depending on the customer, this could be just a 
> handful, or several hundred). We have two components that make use of 
> jcifs for different purposes; each employs its own retry mechanism. 
> One utilizes a retry queue such that if a given connection attempt to 
> a remote machine does not complete successfully within 45 seconds, we 
> wait 30 seconds and make another connection attempt. We keep trying 
> every 30 seconds until success. The other component doubles its wait 
> interval between each retry attempt; starting at 1 second for the 
> first connect failure, 2 seconds for the second, 4 seconds for the 
> third, etc, though we max out at 15 minutes.
>
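
For illustration, the doubling interval described above can be 
sketched as a small helper. The names here are hypothetical, not 
anything from jcifs or Colin's product: start at 1 second, double 
after each failed attempt, cap at 15 minutes.

```java
public class RetryBackoff {
    static final long MAX_WAIT_MS = 15 * 60 * 1000L; // 15-minute cap

    /** Wait before retry N (1-based): 1 s, 2 s, 4 s, ... capped at 15 min. */
    static long waitMillis(int attempt) {
        int shift = Math.min(attempt - 1, 20); // clamp so the shift can't overflow
        return Math.min(1000L << shift, MAX_WAIT_MS);
    }

    public static void main(String[] args) {
        System.out.println(waitMillis(1));  // 1000   (1 s)
        System.out.println(waitMillis(5));  // 16000  (16 s)
        System.out.println(waitMillis(30)); // 900000 (15 min cap)
    }
}
```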
> A while back we came across a problem where we were accumulating 
> threads at a rate such that we would eventually hit an OOM that would 
> kill our JVM. This was because the Windows machines we were trying to 
> connect to were not responsive, and whatever issue they were having 
> resulted in the threads created in 
> jcifs.util.transport.Transport.connect(long timeout) blocking and 
> staying active even after the timeout expired and the reference to the 
> thread was nullified by the creating thread. The next time a connect 
> attempt was initiated, another thread would be created in the 
> connect() method, and this would continue until we had a serious 
> problem on our hands because of the accumulation of stranded blocked 
> threads. I can't give details as to what the threads were blocking on 
> because I can't find any thread dumps from when the original issue was 
> investigated, nor any explanation as to how to reproduce the problem 
> (it was discovered at a customer site).
>
> My predecessor's solution to this was to:
>
> a) add a thread.interrupt() call to the synchronized(thread) {} block 
> of jcifs.util.transport.Transport.connect(long timeout), in an effort 
> to make sure the thread does not hang around forever:
>
>     synchronized (thread) {
>         thread.start();
>         thread.wait( timeout );                /* wait for doConnect */
>
>         switch (state) {
>             case 1:                            /* doConnect never returned */
>                 state = 0;
>                 thread.interrupt();            /* added */
>                 thread = null;
>                 throw new TransportException( "Connection timeout" );
>             case 2:
>                 if (te != null) {              /* doConnect threw Exception */
>                     state = 4;                 /* error */
>                     thread = null;
>                     throw te;
>                 }
>                 state = 3;                     /* Success! */
>                 return;
>         }
>     }
>
> b) add a cleanupThread() method to jcifs.util.transport.Transport, 
> called from connect(long timeout) before creating the new thread, to 
> check whether the thread has already been initialized by a previous 
> call to connect() and, if so, interrupt and nullify it.
>
>     state = 1;
>     te = null;
>     cleanupThread();                           /* added */
>     thread = new Thread( this, name + "-" + threadId++ );
>     thread.setDaemon( true );
>
>     private void cleanupThread()
>     {
>         if (thread == null)
>         {
>             return;
>         }
>         if (thread.isAlive())
>         {
>             thread.interrupt();
>         }
>         thread = null;
>     }
>
> These two changes seem redundant to me, and a bit dangerous (I'd 
> prefer not to blindly interrupt a thread in progress), but without 
> knowing how to reproduce the problem to test, I'm forced to take my 
> predecessor's word for it (and the fact that the customer's ticket was 
> closed) that it worked.
>
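
One concrete reason for that caution, sketched below: Thread.interrupt() 
does not unblock a thread stuck in classic java.net socket I/O, so the 
interrupt call in the fix may be a no-op for exactly the blocked 
threads it targets, while closing the socket does unblock them. This 
is a self-contained demonstration, not jcifs code, and the names are 
invented for the example.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

public class InterruptVsSocket {

    /** Returns { aliveAfterInterrupt, aliveAfterClose } for a thread
     *  blocked in Socket.getInputStream().read(). */
    static boolean[] demo() throws Exception {
        ServerSocket server = new ServerSocket(0);          // local stand-in peer
        Socket client = new Socket("127.0.0.1", server.getLocalPort());
        Socket peer = server.accept();

        Thread reader = new Thread(() -> {
            try {
                client.getInputStream().read();             // blocks: no data is ever sent
            } catch (IOException ignored) {                 // thrown when the socket closes
            }
        });
        reader.start();
        Thread.sleep(200);                                  // let it block in read()

        reader.interrupt();                                 // what the fix does
        Thread.sleep(200);
        boolean aliveAfterInterrupt = reader.isAlive();     // still blocked

        client.close();                                     // closing the socket unblocks it
        reader.join(2000);
        boolean aliveAfterClose = reader.isAlive();

        peer.close();
        server.close();
        return new boolean[] { aliveAfterInterrupt, aliveAfterClose };
    }

    public static void main(String[] args) throws Exception {
        boolean[] r = demo();
        System.out.println("alive after interrupt: " + r[0]); // true
        System.out.println("alive after close:     " + r[1]); // false
    }
}
```

If the stranded jcifs threads were blocked in socket I/O, this would 
explain why nullifying the reference alone accumulated threads, and it 
suggests closing the underlying socket as an alternative to (or in 
addition to) the interrupt.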
> I don't necessarily expect these changes to be included in a future 
> release, given the vagueness of the problem. But if anyone has seen a 
> similar thread accumulation, you could try making these same changes 
> (one or the other, or both). If they help, perhaps you could share a 
> thread dump of the situation prior to the change, so that the problem 
> can be properly documented (i.e. what the threads end up blocked on). 
> If the problem can be properly identified and reproduced, and the 
> solution proven effective, it might make a good addition to a future 
> release (whenever that might be).
>
> Cheers,
>
> Colin
>
