[jcifs] pile up of Transport threads in BLOCKED state

Wed Dec 1 08:42:02 MST 2010

Hi Adam,

Just post a code fragment. Or if the changes are spread out, generate
a patch with "diff -Nuar old new" (or maybe it's diff -Nuar new old, I
can never remember). Or if you don't have diff, just attach the .java
file and I'll diff it myself.

Mike

On Tue, Nov 30, 2010 at 10:52 AM, Adam Morgan <adam.morgan at q1labs.com> wrote:
> Hi Mike
>
>
>
> So I've managed to reproduce the issue in house, and I’ve come up with a fix
> as well.  It only occurs when configured with a non-existent IP (or
> unreachable).
>
>
>
> Basically what is happening is that as our process gets going, thread A
> calls connect() on the Transport thread.  This in turn instantiates a new
> Thread (thread B) (Transport:163) and starts it (Transport:167).  Of course
> run() is called on Transport in thread B, and as u can see in the trace
> below, it tries to connect to the bad IP via creation of a Socket
> (SmbTransport:245) and hangs here indefinitely b/c the timeout has not yet
> been set to SO_TIMEOUT.
>
>
>
> Nov 30 09:38:34 VM205 "Transport1" daemon prio=10 tid=0x00000000562f4400
> nid=0x45d6 runnable [0x0000000052e61000..0x0000000052e61b90]
>
> Nov 30 09:38:34 VM205    java.lang.Thread.State: RUNNABLE
>
> Nov 30 09:38:34 VM205   at java.net.PlainSocketImpl.socketConnect(Native
> Method)
>
> Nov 30 09:38:34 VM205   at
> java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
>
> Nov 30 09:38:34 VM205   - locked <0x00002aaaf3815dd8> (a
> java.net.SocksSocketImpl)
>
> Nov 30 09:38:34 VM205   at
> java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
>
> Nov 30 09:38:34 VM205   at
> java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
>
> Nov 30 09:38:34 VM205   at
> java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
>
> Nov 30 09:38:34 VM205   at java.net.Socket.connect(Socket.java:519)
>
> Nov 30 09:38:34 VM205   at java.net.Socket.connect(Socket.java:469)
>
> Nov 30 09:38:34 VM205   at java.net.Socket.<init>(Socket.java:366)
>
> Nov 30 09:38:34 VM205   at java.net.Socket.<init>(Socket.java:180)
>
> Nov 30 09:38:34 VM205   at
> jcifs.smb.SmbTransport.negotiate(SmbTransport.java:245)
>
> Nov 30 09:38:34 VM205   - locked <0x00002aaafc77cbe0> (a [B)
>
> Nov 30 09:38:34 VM205   at
> jcifs.smb.SmbTransport.doConnect(SmbTransport.java:305)
>
> Nov 30 09:38:34 VM205   at
> jcifs.util.transport.Transport.run(Transport.java:233)
>
> Nov 30 09:38:34 VM205   at java.lang.Thread.run(Thread.java:619)
>
>
>
> Since the configured server IP is unreachable, thread B spins for long
> enough for thread A to throw a new TransportException( "Connection timeout"
> ) (Transport:174).  Our code catches this and then kicks off our own
> reconnect thread, trying to kick the whole thing off again.  However, since
> the thread B is still trying to connect to the unreachable IP and is not
> interrupted, all subsequent connect() calls pile up on the synchronized
> mutex in negotiate
>
>
>
> Nov 30 09:38:34 VM205 "Transport1" daemon prio=10 tid=0x00000000562f2800
> nid=0x3ff8 waiting for monitor entry
> [0x0000000052c5f000..0x0000000052c5fc90]
>
> Nov 30 09:38:34 VM205    java.lang.Thread.State: BLOCKED (on object monitor)
>
> Nov 30 09:38:34 VM205   at
> jcifs.smb.SmbTransport.negotiate(SmbTransport.java:239)
>
> Nov 30 09:38:34 VM205   - waiting to lock <0x00002aaafc77cbe0> (a [B)
>
> Nov 30 09:38:34 VM205   at
> jcifs.smb.SmbTransport.doConnect(SmbTransport.java:302)
>
> Nov 30 09:38:34 VM205   at
> jcifs.util.transport.Transport.run(Transport.java:233)
>
> Nov 30 09:38:34 VM205   at java.lang.Thread.run(Thread.java:619)
>
>
>
>
>
> Now, as I see it the problem is that the negotiate() call is passing in the
> connection info to new Socket(), which results in an immediate connect
> attempt, and THEN setting the timeout, which means that it is set to
> infinity on the initial socket connect attempt.
>
>
>
> I’ve refactored the code in a couple places to resolve the issue.  How can I
> submit the code to be reviewed and included in v16?
>
>
>
> Cheers
>
>
>
> Adam
>
>
>
>
>
> -----Original Message-----
>
> From: Michael B Allen [mailto:ioplex at gmail.com]
>
> Sent: Monday, November 29, 2010 6:27 PM
>
> To: Adam Morgan
>
> Cc: jcifs at lists.samba.org
>
> Subject: Re: [jcifs] pile up of Transport threads in BLOCKED state
>
>
>
> Hi Adam,
>
>
>
> What is the thread blocked on?
>
>
>
> Get a backtrace.
>
>
>
> Mike
>
>
>
> On Fri, Nov 26, 2010 at 10:51 AM, Adam Morgan <adam.morgan at q1labs.com>
> wrote:
>
>> Hi Mike
>
>>
>
>>
>
>>
>
>> We’ve run into an issue around a pile up of the Transport threads, to the
>
>> point of using all available os threads, and causing OOMs across ALL jvms
>> on
>
>> the box.
>
>>
>
>>
>
>>
>
>> A customer had a configuration set up successfully for a while, but at
>> some
>
>> point decided to shut it off (ie shut down samba on the target box, plus
>
>> potentially some other services), but forgot to turn off our code so it
>
>> would stop trying to reconnect.  As a result, the box made roughly 20k
>
>> reconnect attempts over the next couple days.  Each reconnect attempt met
>> a
>
>> NoRouteToHostException, and for whatever reason the Transport threads
>> remain
>
>> in a BLOCKED state, waiting on an object monitor, and are never released.
>
>>  In light of this ‘runaway’ reconnect, we’ve refactored the code to
>> attempt
>
>> the reconnects on a increasingly-delayed basis (1s, 2s, 4s,… up to 15min
>
>> between attempts, resetting to 0s on a successful reconnect) in hopes that
>
>> we were just not waiting long enough and the issue would be resolved.
>
>> However, this refactoring did not resolve the issue and we still see the
>
>> ‘Transport’ thread count creep up over time.
>
>>
>
>>
>
>>
>
>> Nov 16 10:02:13 evlqradar01 ecs[25797]: java.net.NoRouteToHostException:
>> No
>
>> route to host
>
>>
>
>> Nov 16 10:02:13 evlqradar01 ecs[25797]:         at
>
>> java.net.PlainSocketImpl.socketConnect(Native Method)
>
>>
>
>> Nov 16 10:02:13 evlqradar01 ecs[25797]:         at
>
>> java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
>
>>
>
>> Nov 16 10:02:13 evlqradar01 ecs[25797]:         at
>
>> java.net.Socket.<init>(Socket.java:180)
>
>>
>
>> Nov 16 10:02:13 evlqradar01 ecs[25797]:         at
>
>> jcifs.smb.SmbTransport.ssn139(SmbTransport.java:178)
>
>>
>
>>
>
>>
>
>> Nov 16 10:03:17 evlqradar01 ecs[31981]: java.net.NoRouteToHostException:
>> No
>
>> route to host
>
>>
>
>> Nov 16 10:03:17 evlqradar01 ecs[31981]:         at
>
>> java.net.PlainSocketImpl.socketConnect(Native Method)
>
>>
>
>> Nov 16 10:03:17 evlqradar01 ecs[31981]:         at
>
>> java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
>
>>
>
>> Nov 16 10:03:17 evlqradar01 ecs[31981]:         at
>
>> java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
>
>>
>
>> Nov 16 10:03:17 evlqradar01 ecs[31981]:         at
>
>> java.net.Socket.connect(Socket.java:469)
>
>>
>
>> Nov 16 10:03:17 evlqradar01 ecs[31981]:         at
>
>> java.net.Socket.<init>(Socket.java:180)
>
>>
>
>> Nov 16 10:03:17 evlqradar01 ecs[31981]:         at
>
>> jcifs.smb.SmbTransport.negotiate(SmbTransport.java:242)
>
>>
>
>> Nov 16 10:03:17 evlqradar01 ecs[31981]:         at
>
>> jcifs.smb.SmbTransport.doConnect(SmbTransport.java:305)
>
>>
>
>> Nov 16 10:03:17 evlqradar01 ecs[31981]:         at
>
>> java.lang.Thread.run(Thread.java:619)
>
>>
>
>>
>
>>
>
>> Here’s a snippet from our logs showing the thread state and blocked-time:
>
>>
>
>>
>
>>
>
>> Nov 17 14:23:07 172.24.251.251 [ecs] [Folder Monitor [SMTP
>
>> tailer][smb://172.24.1.162/c$/Program Files/Microsoft/Exchange
>
>> Server/TransportRoles/Logs/ProtocolLog/] - Reconnect]
>
>> com.q1labs.frameworks.core.ThreadExceptionHandler: [INFO]
>
>> [NOT:0000006000][172.24.251.251/- -] [-/- -]124143,Transport1 in Byte
>> Code,
>
>> BLOCKED, blocked-count: 1, blocked-time: 1003721 ms, wait-count: 0,
>
>> wait-time: 0 ms, user cpu: 0 nanos, sys/user cpu time: 0 nanos, Transport1
>
>> locked on [B at 72faf9b5
>
>>
>
>> Nov 17 14:23:07 172.24.251.251 [ecs] [Folder Monitor [SMTP
>
>> tailer][smb://172.24.1.162/c$/Program Files/Microsoft/Exchange
>
>> Server/TransportRoles/Logs/ProtocolLog/] - Reconnect]
>
>> com.q1labs.frameworks.core.ThreadExceptionHandler: [INFO]
>
>> [NOT:0000006000][172.24.251.251/- -] [-/- -]124142,Transport2 in Byte
>> Code,
>
>> BLOCKED, blocked-count: 1, blocked-time: 1003753 ms, wait-count: 0,
>
>> wait-time: 0 ms, user cpu: 0 nanos, sys/user cpu time: 0 nanos, Transport2
>
>> locked on [B at 5e36d88a
>
>>
>
>>
>
>>
>
>>
>
>>
>
>> As you can see, the threads are locked and have been blocked for ~1000s…
>
>> well above any of the default timeouts listed in SmbConstants.
>
>>
>
>>
>
>>
>
>> Can you shed any light into this as to whether we have any configuration
>
>> options or if this is a jcifs bug?
>
>>
>
>>
>
>>
>
>> Thanks
>
>>
>
>>
>
>>
>
>> Adam
>
>
>
>
>
>
>
> --
>
> Michael B Allen
>
> Java Active Directory Integration
>
> http://www.ioplex.com/

-- 
Michael B Allen
Java Active Directory Integration
http://www.ioplex.com/