[jcifs] pile up of Transport threads in BLOCKED state

Fri Jun 24 21:32:19 MDT 2011

The jcifs-1.3.16 release now uses InetSocketAddress and sets the
SO_TIMEOUT before calling connect.

Mike

-- 
Michael B Allen
Java Active Directory Integration
http://www.ioplex.com/

On Thu, Dec 16, 2010 at 7:32 PM, Michael B Allen <ioplex at gmail.com> wrote:
> Hi Adam,
>
> Interrupting threads is generally considered bad form (although I
> wouldn't rule it out entirely if it can be shown conclusively that the
> Thread is useless and not doing anything that could cause some kind of
> corruption) and you do not make it clear as to why going through
> InetSocketAddress is superior to simply calling soTimeout. But I have
> added your patch to the TODO for further consideration. There is no
> ETA.
>
> Mike
>
> On Wed, Dec 15, 2010 at 2:49 PM, Adam Morgan <adam.morgan at q1labs.com> wrote:
>> Ok, try this again… I’ve reattached the diff.
>>
>>
>>
>> Have you had a chance to look at it, and do you have any idea about its
>> inclusion in the next version, and when that might be?
>>
>>
>>
>> Thanks
>>
>>
>>
>> Adam
>>
>>
>>
>> ________________________________
>>
>> From: Adam Morgan
>> Sent: Wednesday, December 01, 2010 10:56 AM
>> To: 'Michael B Allen'
>> Subject: RE: [jcifs] pile up of Transport threads in BLOCKED state
>>
>>
>>
>> Hi Mike
>>
>>
>>
>> Here’s a diff of my changes.
>>
>>
>>
>> SmbTransport.java – changes to binding/connect calls to allow for timeout to
>> be administered.
>>
>> Transport.java – altered naming of threads to make them distinct, and added
>> call to interrupt the thread created during connect() to prevent it from
>> blocking indefinitely and being orphaned
>>
>>
>>
>> Let me know what you think.
>>
>>
>>
>> Cheers
>>
>>
>>
>>
>>
>>
>>
>> ________________________________
>>
>> From: jcifs-bounces at lists.samba.org [mailto:jcifs-bounces at lists.samba.org]
>> On Behalf Of Adam Morgan
>> Sent: Tuesday, November 30, 2010 11:53 AM
>> To: Michael B Allen
>> Cc: jcifs at lists.samba.org
>> Subject: Re: [jcifs] pile up of Transport threads in BLOCKED state
>>
>>
>>
>> Hi Mike
>>
>>
>>
>> So I've managed to reproduce the issue in house, and I’ve come up with a fix
>> as well.  It only occurs when configured with a non-existent IP (or
>> unreachable).
>>
>>
>>
>> Basically what is happening is that as our process gets going, thread A
>> calls connect() on the Transport thread.  This in turn instantiates a new
>> Thread (thread B) (Transport:163) and starts it (Transport:167).  Of course
>> run() is called on Transport in thread B, and as u can see in the trace
>> below, it tries to connect to the bad IP via creation of a Socket
>> (SmbTransport:245) and hangs here indefinitely b/c the timeout has not yet
>> been set to SO_TIMEOUT.
>>
>>
>>
>> Nov 30 09:38:34 VM205 "Transport1" daemon prio=10 tid=0x00000000562f4400
>> nid=0x45d6 runnable [0x0000000052e61000..0x0000000052e61b90]
>>
>> Nov 30 09:38:34 VM205    java.lang.Thread.State: RUNNABLE
>>
>> Nov 30 09:38:34 VM205   at java.net.PlainSocketImpl.socketConnect(Native
>> Method)
>>
>> Nov 30 09:38:34 VM205   at
>> java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
>>
>> Nov 30 09:38:34 VM205   - locked <0x00002aaaf3815dd8> (a
>> java.net.SocksSocketImpl)
>>
>> Nov 30 09:38:34 VM205   at
>> java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
>>
>> Nov 30 09:38:34 VM205   at
>> java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
>>
>> Nov 30 09:38:34 VM205   at
>> java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
>>
>> Nov 30 09:38:34 VM205   at java.net.Socket.connect(Socket.java:519)
>>
>> Nov 30 09:38:34 VM205   at java.net.Socket.connect(Socket.java:469)
>>
>> Nov 30 09:38:34 VM205   at java.net.Socket.<init>(Socket.java:366)
>>
>> Nov 30 09:38:34 VM205   at java.net.Socket.<init>(Socket.java:180)
>>
>> Nov 30 09:38:34 VM205   at
>> jcifs.smb.SmbTransport.negotiate(SmbTransport.java:245)
>>
>> Nov 30 09:38:34 VM205   - locked <0x00002aaafc77cbe0> (a [B)
>>
>> Nov 30 09:38:34 VM205   at
>> jcifs.smb.SmbTransport.doConnect(SmbTransport.java:305)
>>
>> Nov 30 09:38:34 VM205   at
>> jcifs.util.transport.Transport.run(Transport.java:233)
>>
>> Nov 30 09:38:34 VM205   at java.lang.Thread.run(Thread.java:619)
>>
>>
>>
>> Since the configured server IP is unreachable, thread B spins for long
>> enough for thread A to throw a new TransportException( "Connection timeout"
>> ) (Transport:174).  Our code catches this and then kicks off our own
>> reconnect thread, trying to kick the whole thing off again.  However, since
>> the thread B is still trying to connect to the unreachable IP and is not
>> interrupted, all subsequent connect() calls pile up on the synchronized
>> mutex in negotiate
>>
>>
>>
>> Nov 30 09:38:34 VM205 "Transport1" daemon prio=10 tid=0x00000000562f2800
>> nid=0x3ff8 waiting for monitor entry
>> [0x0000000052c5f000..0x0000000052c5fc90]
>>
>> Nov 30 09:38:34 VM205    java.lang.Thread.State: BLOCKED (on object monitor)
>>
>> Nov 30 09:38:34 VM205   at
>> jcifs.smb.SmbTransport.negotiate(SmbTransport.java:239)
>>
>> Nov 30 09:38:34 VM205   - waiting to lock <0x00002aaafc77cbe0> (a [B)
>>
>> Nov 30 09:38:34 VM205   at
>> jcifs.smb.SmbTransport.doConnect(SmbTransport.java:302)
>>
>> Nov 30 09:38:34 VM205   at
>> jcifs.util.transport.Transport.run(Transport.java:233)
>>
>> Nov 30 09:38:34 VM205   at java.lang.Thread.run(Thread.java:619)
>>
>>
>>
>>
>>
>> Now, as I see it the problem is that the negotiate() call is passing in the
>> connection info to new Socket(), which results in an immediate connect
>> attempt, and THEN setting the timeout, which means that it is set to
>> infinity on the initial socket connect attempt.
>>
>>
>>
>> I’ve refactored the code in a couple places to resolve the issue.  How can I
>> submit the code to be reviewed and included in v16?
>>
>>
>>
>> Cheers
>>
>>
>>
>> Adam
>>
>>
>>
>>
>>
>> -----Original Message-----
>>
>> From: Michael B Allen [mailto:ioplex at gmail.com]
>>
>> Sent: Monday, November 29, 2010 6:27 PM
>>
>> To: Adam Morgan
>>
>> Cc: jcifs at lists.samba.org
>>
>> Subject: Re: [jcifs] pile up of Transport threads in BLOCKED state
>>
>>
>>
>> Hi Adam,
>>
>>
>>
>> What is the thread blocked on?
>>
>>
>>
>> Get a backtrace.
>>
>>
>>
>> Mike
>>
>>
>>
>> On Fri, Nov 26, 2010 at 10:51 AM, Adam Morgan <adam.morgan at q1labs.com>
>> wrote:
>>
>>> Hi Mike
>>
>>>
>>
>>>
>>
>>>
>>
>>> We’ve run into an issue around a pile up of the Transport threads, to the
>>
>>> point of using all available os threads, and causing OOMs across ALL jvms
>>> on
>>
>>> the box.
>>
>>>
>>
>>>
>>
>>>
>>
>>> A customer had a configuration set up successfully for a while, but at
>>> some
>>
>>> point decided to shut it off (ie shut down samba on the target box, plus
>>
>>> potentially some other services), but forgot to turn off our code so it
>>
>>> would stop trying to reconnect.  As a result, the box made roughly 20k
>>
>>> reconnect attempts over the next couple days.  Each reconnect attempt met
>>> a
>>
>>> NoRouteToHostException, and for whatever reason the Transport threads
>>> remain
>>
>>> in a BLOCKED state, waiting on an object monitor, and are never released.
>>
>>>  In light of this ‘runaway’ reconnect, we’ve refactored the code to
>>> attempt
>>
>>> the reconnects on a increasingly-delayed basis (1s, 2s, 4s,… up to 15min
>>
>>> between attempts, resetting to 0s on a successful reconnect) in hopes that
>>
>>> we were just not waiting long enough and the issue would be resolved.
>>
>>> However, this refactoring did not resolve the issue and we still see the
>>
>>> ‘Transport’ thread count creep up over time.
>>
>>>
>>
>>>
>>
>>>
>>
>>> Nov 16 10:02:13 evlqradar01 ecs[25797]: java.net.NoRouteToHostException:
>>> No
>>
>>> route to host
>>
>>>
>>
>>> Nov 16 10:02:13 evlqradar01 ecs[25797]:         at
>>
>>> java.net.PlainSocketImpl.socketConnect(Native Method)
>>
>>>
>>
>>> Nov 16 10:02:13 evlqradar01 ecs[25797]:         at
>>
>>> java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
>>
>>>
>>
>>> Nov 16 10:02:13 evlqradar01 ecs[25797]:         at
>>
>>> java.net.Socket.<init>(Socket.java:180)
>>
>>>
>>
>>> Nov 16 10:02:13 evlqradar01 ecs[25797]:         at
>>
>>> jcifs.smb.SmbTransport.ssn139(SmbTransport.java:178)
>>
>>>
>>
>>>
>>
>>>
>>
>>> Nov 16 10:03:17 evlqradar01 ecs[31981]: java.net.NoRouteToHostException:
>>> No
>>
>>> route to host
>>
>>>
>>
>>> Nov 16 10:03:17 evlqradar01 ecs[31981]:         at
>>
>>> java.net.PlainSocketImpl.socketConnect(Native Method)
>>
>>>
>>
>>> Nov 16 10:03:17 evlqradar01 ecs[31981]:         at
>>
>>> java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
>>
>>>
>>
>>> Nov 16 10:03:17 evlqradar01 ecs[31981]:         at
>>
>>> java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
>>
>>>
>>
>>> Nov 16 10:03:17 evlqradar01 ecs[31981]:         at
>>
>>> java.net.Socket.connect(Socket.java:469)
>>
>>>
>>
>>> Nov 16 10:03:17 evlqradar01 ecs[31981]:         at
>>
>>> java.net.Socket.<init>(Socket.java:180)
>>
>>>
>>
>>> Nov 16 10:03:17 evlqradar01 ecs[31981]:         at
>>
>>> jcifs.smb.SmbTransport.negotiate(SmbTransport.java:242)
>>
>>>
>>
>>> Nov 16 10:03:17 evlqradar01 ecs[31981]:         at
>>
>>> jcifs.smb.SmbTransport.doConnect(SmbTransport.java:305)
>>
>>>
>>
>>> Nov 16 10:03:17 evlqradar01 ecs[31981]:         at
>>
>>> java.lang.Thread.run(Thread.java:619)
>>
>>>
>>
>>>
>>
>>>
>>
>>> Here’s a snippet from our logs showing the thread state and blocked-time:
>>
>>>
>>
>>>
>>
>>>
>>
>>> Nov 17 14:23:07 172.24.251.251 [ecs] [Folder Monitor [SMTP
>>
>>> tailer][smb://172.24.1.162/c$/Program Files/Microsoft/Exchange
>>
>>> Server/TransportRoles/Logs/ProtocolLog/] - Reconnect]
>>
>>> com.q1labs.frameworks.core.ThreadExceptionHandler: [INFO]
>>
>>> [NOT:0000006000][172.24.251.251/- -] [-/- -]124143,Transport1 in Byte
>>> Code,
>>
>>> BLOCKED, blocked-count: 1, blocked-time: 1003721 ms, wait-count: 0,
>>
>>> wait-time: 0 ms, user cpu: 0 nanos, sys/user cpu time: 0 nanos, Transport1
>>
>>> locked on [B at 72faf9b5
>>
>>>
>>
>>> Nov 17 14:23:07 172.24.251.251 [ecs] [Folder Monitor [SMTP
>>
>>> tailer][smb://172.24.1.162/c$/Program Files/Microsoft/Exchange
>>
>>> Server/TransportRoles/Logs/ProtocolLog/] - Reconnect]
>>
>>> com.q1labs.frameworks.core.ThreadExceptionHandler: [INFO]
>>
>>> [NOT:0000006000][172.24.251.251/- -] [-/- -]124142,Transport2 in Byte
>>> Code,
>>
>>> BLOCKED, blocked-count: 1, blocked-time: 1003753 ms, wait-count: 0,
>>
>>> wait-time: 0 ms, user cpu: 0 nanos, sys/user cpu time: 0 nanos, Transport2
>>
>>> locked on [B at 5e36d88a
>>
>>>
>>
>>>
>>
>>>
>>
>>>
>>
>>>
>>
>>> As you can see, the threads are locked and have been blocked for ~1000s…
>>
>>> well above any of the default timeouts listed in SmbConstants.
>>
>>>
>>
>>>
>>
>>>
>>
>>> Can you shed any light into this as to whether we have any configuration
>>
>>> options or if this is a jcifs bug?
>>
>>>
>>
>>>
>>
>>>
>>
>>> Thanks
>>
>>>
>>
>>>
>>
>>>
>>
>>> Adam