[jcifs] pile up of Transport threads in BLOCKED state

Adam Morgan adam.morgan at Q1Labs.com
Tue Nov 30 08:52:32 MST 2010


Hi Mike



So I've managed to reproduce the issue in house, and I've come up with a fix as well.  It only occurs when configured with a non-existent IP (or unreachable).



Basically what is happening is that as our process gets going, thread A calls connect() on the Transport thread.  This in turn instantiates a new Thread (thread B) (Transport:163) and starts it (Transport:167).  Of course run() is called on Transport in thread B, and as u can see in the trace below, it tries to connect to the bad IP via creation of a Socket (SmbTransport:245) and hangs here indefinitely b/c the timeout has not yet been set to SO_TIMEOUT.



Nov 30 09:38:34 VM205 "Transport1" daemon prio=10 tid=0x00000000562f4400 nid=0x45d6 runnable [0x0000000052e61000..0x0000000052e61b90]

Nov 30 09:38:34 VM205    java.lang.Thread.State: RUNNABLE

Nov 30 09:38:34 VM205   at java.net.PlainSocketImpl.socketConnect(Native Method)

Nov 30 09:38:34 VM205   at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)

Nov 30 09:38:34 VM205   - locked <0x00002aaaf3815dd8> (a java.net.SocksSocketImpl)

Nov 30 09:38:34 VM205   at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)

Nov 30 09:38:34 VM205   at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)

Nov 30 09:38:34 VM205   at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)

Nov 30 09:38:34 VM205   at java.net.Socket.connect(Socket.java:519)

Nov 30 09:38:34 VM205   at java.net.Socket.connect(Socket.java:469)

Nov 30 09:38:34 VM205   at java.net.Socket.<init>(Socket.java:366)

Nov 30 09:38:34 VM205   at java.net.Socket.<init>(Socket.java:180)

Nov 30 09:38:34 VM205   at jcifs.smb.SmbTransport.negotiate(SmbTransport.java:245)

Nov 30 09:38:34 VM205   - locked <0x00002aaafc77cbe0> (a [B)

Nov 30 09:38:34 VM205   at jcifs.smb.SmbTransport.doConnect(SmbTransport.java:305)

Nov 30 09:38:34 VM205   at jcifs.util.transport.Transport.run(Transport.java:233)

Nov 30 09:38:34 VM205   at java.lang.Thread.run(Thread.java:619)



Since the configured server IP is unreachable, thread B spins for long enough for thread A to throw a new TransportException( "Connection timeout" ) (Transport:174).  Our code catches this and then kicks off our own reconnect thread, trying to kick the whole thing off again.  However, since the thread B is still trying to connect to the unreachable IP and is not interrupted, all subsequent connect() calls pile up on the synchronized mutex in negotiate



Nov 30 09:38:34 VM205 "Transport1" daemon prio=10 tid=0x00000000562f2800 nid=0x3ff8 waiting for monitor entry [0x0000000052c5f000..0x0000000052c5fc90]

Nov 30 09:38:34 VM205    java.lang.Thread.State: BLOCKED (on object monitor)

Nov 30 09:38:34 VM205   at jcifs.smb.SmbTransport.negotiate(SmbTransport.java:239)

Nov 30 09:38:34 VM205   - waiting to lock <0x00002aaafc77cbe0> (a [B)

Nov 30 09:38:34 VM205   at jcifs.smb.SmbTransport.doConnect(SmbTransport.java:302)

Nov 30 09:38:34 VM205   at jcifs.util.transport.Transport.run(Transport.java:233)

Nov 30 09:38:34 VM205   at java.lang.Thread.run(Thread.java:619)





Now, as I see it the problem is that the negotiate() call is passing in the connection info to new Socket(), which results in an immediate connect attempt, and THEN setting the timeout, which means that it is set to infinity on the initial socket connect attempt.



I've refactored the code in a couple places to resolve the issue.  How can I submit the code to be reviewed and included in v16?



Cheers



Adam





-----Original Message-----

From: Michael B Allen [mailto:ioplex at gmail.com]

Sent: Monday, November 29, 2010 6:27 PM

To: Adam Morgan

Cc: jcifs at lists.samba.org

Subject: Re: [jcifs] pile up of Transport threads in BLOCKED state



Hi Adam,



What is the thread blocked on?



Get a backtrace.



Mike



On Fri, Nov 26, 2010 at 10:51 AM, Adam Morgan <adam.morgan at q1labs.com> wrote:

> Hi Mike

>

>

>

> We've run into an issue around a pile up of the Transport threads, to the

> point of using all available os threads, and causing OOMs across ALL jvms on

> the box.

>

>

>

> A customer had a configuration set up successfully for a while, but at some

> point decided to shut it off (ie shut down samba on the target box, plus

> potentially some other services), but forgot to turn off our code so it

> would stop trying to reconnect.  As a result, the box made roughly 20k

> reconnect attempts over the next couple days.  Each reconnect attempt met a

> NoRouteToHostException, and for whatever reason the Transport threads remain

> in a BLOCKED state, waiting on an object monitor, and are never released.

>  In light of this 'runaway' reconnect, we've refactored the code to attempt

> the reconnects on a increasingly-delayed basis (1s, 2s, 4s,... up to 15min

> between attempts, resetting to 0s on a successful reconnect) in hopes that

> we were just not waiting long enough and the issue would be resolved.

> However, this refactoring did not resolve the issue and we still see the

> 'Transport' thread count creep up over time.

>

>

>

> Nov 16 10:02:13 evlqradar01 ecs[25797]: java.net.NoRouteToHostException: No

> route to host

>

> Nov 16 10:02:13 evlqradar01 ecs[25797]:         at

> java.net.PlainSocketImpl.socketConnect(Native Method)

>

> Nov 16 10:02:13 evlqradar01 ecs[25797]:         at

> java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)

>

> Nov 16 10:02:13 evlqradar01 ecs[25797]:         at

> java.net.Socket.<init>(Socket.java:180)

>

> Nov 16 10:02:13 evlqradar01 ecs[25797]:         at

> jcifs.smb.SmbTransport.ssn139(SmbTransport.java:178)

>

>

>

> Nov 16 10:03:17 evlqradar01 ecs[31981]: java.net.NoRouteToHostException: No

> route to host

>

> Nov 16 10:03:17 evlqradar01 ecs[31981]:         at

> java.net.PlainSocketImpl.socketConnect(Native Method)

>

> Nov 16 10:03:17 evlqradar01 ecs[31981]:         at

> java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)

>

> Nov 16 10:03:17 evlqradar01 ecs[31981]:         at

> java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)

>

> Nov 16 10:03:17 evlqradar01 ecs[31981]:         at

> java.net.Socket.connect(Socket.java:469)

>

> Nov 16 10:03:17 evlqradar01 ecs[31981]:         at

> java.net.Socket.<init>(Socket.java:180)

>

> Nov 16 10:03:17 evlqradar01 ecs[31981]:         at

> jcifs.smb.SmbTransport.negotiate(SmbTransport.java:242)

>

> Nov 16 10:03:17 evlqradar01 ecs[31981]:         at

> jcifs.smb.SmbTransport.doConnect(SmbTransport.java:305)

>

> Nov 16 10:03:17 evlqradar01 ecs[31981]:         at

> java.lang.Thread.run(Thread.java:619)

>

>

>

> Here's a snippet from our logs showing the thread state and blocked-time:

>

>

>

> Nov 17 14:23:07 172.24.251.251 [ecs] [Folder Monitor [SMTP

> tailer][smb://172.24.1.162/c$/Program Files/Microsoft/Exchange

> Server/TransportRoles/Logs/ProtocolLog/] - Reconnect]

> com.q1labs.frameworks.core.ThreadExceptionHandler: [INFO]

> [NOT:0000006000][172.24.251.251/- -] [-/- -]124143,Transport1 in Byte Code,

> BLOCKED, blocked-count: 1, blocked-time: 1003721 ms, wait-count: 0,

> wait-time: 0 ms, user cpu: 0 nanos, sys/user cpu time: 0 nanos, Transport1

> locked on [B at 72faf9b5

>

> Nov 17 14:23:07 172.24.251.251 [ecs] [Folder Monitor [SMTP

> tailer][smb://172.24.1.162/c$/Program Files/Microsoft/Exchange

> Server/TransportRoles/Logs/ProtocolLog/] - Reconnect]

> com.q1labs.frameworks.core.ThreadExceptionHandler: [INFO]

> [NOT:0000006000][172.24.251.251/- -] [-/- -]124142,Transport2 in Byte Code,

> BLOCKED, blocked-count: 1, blocked-time: 1003753 ms, wait-count: 0,

> wait-time: 0 ms, user cpu: 0 nanos, sys/user cpu time: 0 nanos, Transport2

> locked on [B at 5e36d88a

>

>

>

>

>

> As you can see, the threads are locked and have been blocked for ~1000s...

> well above any of the default timeouts listed in SmbConstants.

>

>

>

> Can you shed any light into this as to whether we have any configuration

> options or if this is a jcifs bug?

>

>

>

> Thanks

>

>

>

> Adam







--

Michael B Allen

Java Active Directory Integration

http://www.ioplex.com/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.samba.org/pipermail/jcifs/attachments/20101130/0541e352/attachment-0001.html>


More information about the jCIFS mailing list