[jcifs] Fwd: pile up of Transport threads in BLOCKED state

Tue Feb 22 14:14:45 MST 2011

Reposting diff for Gergan.

Note that you probably are only interested in the soTimeout part. And
the default timeout is not "infinity". On my Linux workstation it
takes about 90 seconds for an unanswered SYN to timeout.

Mike

---------- Forwarded message ----------
From: Adam Morgan <adam.morgan at q1labs.com>
Date: Wed, Dec 15, 2010 at 2:49 PM
Subject: RE: [jcifs] pile up of Transport threads in BLOCKED state
To: Adam Morgan <adam.morgan at q1labs.com>, Michael B Allen <ioplex at gmail.com>

Ok, try this again… I’ve reattached the diff.

Have you had a chance to look at it, and do you have any idea about
its inclusion in the next version, and when that might be?

Thanks

Adam

________________________________

From: Adam Morgan
Sent: Wednesday, December 01, 2010 10:56 AM
To: 'Michael B Allen'
Subject: RE: [jcifs] pile up of Transport threads in BLOCKED state

Hi Mike

Here’s a diff of my changes.

SmbTransport.java – changes to binding/connect calls to allow for
timeout to be administered.

Transport.java – altered naming of threads to make them distinct, and
added call to interrupt the thread created during connect() to prevent
it from blocking indefinitely and being orphaned

Let me know what you think.

Cheers

________________________________

From: jcifs-bounces at lists.samba.org
[mailto:jcifs-bounces at lists.samba.org] On Behalf Of Adam Morgan
Sent: Tuesday, November 30, 2010 11:53 AM
To: Michael B Allen
Cc: jcifs at lists.samba.org
Subject: Re: [jcifs] pile up of Transport threads in BLOCKED state

Hi Mike

So I've managed to reproduce the issue in house, and I’ve come up with
a fix as well.  It only occurs when configured with a non-existent IP
(or unreachable).

Basically what is happening is that as our process gets going, thread
A calls connect() on the Transport thread.  This in turn instantiates
a new Thread (thread B) (Transport:163) and starts it (Transport:167).
 Of course run() is called on Transport in thread B, and as u can see
in the trace below, it tries to connect to the bad IP via creation of
a Socket (SmbTransport:245) and hangs here indefinitely b/c the
timeout has not yet been set to SO_TIMEOUT.

Nov 30 09:38:34 VM205 "Transport1" daemon prio=10
tid=0x00000000562f4400 nid=0x45d6 runnable
[0x0000000052e61000..0x0000000052e61b90]

Nov 30 09:38:34 VM205    java.lang.Thread.State: RUNNABLE

Nov 30 09:38:34 VM205   at java.net.PlainSocketImpl.socketConnect(Native Method)

Nov 30 09:38:34 VM205   at
java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)

Nov 30 09:38:34 VM205   - locked <0x00002aaaf3815dd8> (a
java.net.SocksSocketImpl)

Nov 30 09:38:34 VM205   at
java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)

Nov 30 09:38:34 VM205   at
java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)

Nov 30 09:38:34 VM205   at
java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)

Nov 30 09:38:34 VM205   at java.net.Socket.connect(Socket.java:519)

Nov 30 09:38:34 VM205   at java.net.Socket.connect(Socket.java:469)

Nov 30 09:38:34 VM205   at java.net.Socket.<init>(Socket.java:366)

Nov 30 09:38:34 VM205   at java.net.Socket.<init>(Socket.java:180)

Nov 30 09:38:34 VM205   at
jcifs.smb.SmbTransport.negotiate(SmbTransport.java:245)

Nov 30 09:38:34 VM205   - locked <0x00002aaafc77cbe0> (a [B)

Nov 30 09:38:34 VM205   at
jcifs.smb.SmbTransport.doConnect(SmbTransport.java:305)

Nov 30 09:38:34 VM205   at
jcifs.util.transport.Transport.run(Transport.java:233)

Nov 30 09:38:34 VM205   at java.lang.Thread.run(Thread.java:619)

Since the configured server IP is unreachable, thread B spins for long
enough for thread A to throw a new TransportException( "Connection
timeout" ) (Transport:174).  Our code catches this and then kicks off
our own reconnect thread, trying to kick the whole thing off again.
However, since the thread B is still trying to connect to the
unreachable IP and is not interrupted, all subsequent connect() calls
pile up on the synchronized mutex in negotiate

Nov 30 09:38:34 VM205 "Transport1" daemon prio=10
tid=0x00000000562f2800 nid=0x3ff8 waiting for monitor entry
[0x0000000052c5f000..0x0000000052c5fc90]

Nov 30 09:38:34 VM205    java.lang.Thread.State: BLOCKED (on object monitor)

Nov 30 09:38:34 VM205   at
jcifs.smb.SmbTransport.negotiate(SmbTransport.java:239)

Nov 30 09:38:34 VM205   - waiting to lock <0x00002aaafc77cbe0> (a [B)

Nov 30 09:38:34 VM205   at
jcifs.smb.SmbTransport.doConnect(SmbTransport.java:302)

Nov 30 09:38:34 VM205   at
jcifs.util.transport.Transport.run(Transport.java:233)

Nov 30 09:38:34 VM205   at java.lang.Thread.run(Thread.java:619)

Now, as I see it the problem is that the negotiate() call is passing
in the connection info to new Socket(), which results in an immediate
connect attempt, and THEN setting the timeout, which means that it is
set to infinity on the initial socket connect attempt.

I’ve refactored the code in a couple places to resolve the issue.  How
can I submit the code to be reviewed and included in v16?

Cheers

Adam

-----Original Message-----

From: Michael B Allen [mailto:ioplex at gmail.com]

Sent: Monday, November 29, 2010 6:27 PM

To: Adam Morgan

Cc: jcifs at lists.samba.org

Subject: Re: [jcifs] pile up of Transport threads in BLOCKED state

Hi Adam,

What is the thread blocked on?

Get a backtrace.

Mike

On Fri, Nov 26, 2010 at 10:51 AM, Adam Morgan <adam.morgan at q1labs.com> wrote:

> Hi Mike

>

>

>

> We’ve run into an issue around a pile up of the Transport threads, to the

> point of using all available os threads, and causing OOMs across ALL jvms on

> the box.

>

>

>

> A customer had a configuration set up successfully for a while, but at some

> point decided to shut it off (ie shut down samba on the target box, plus

> potentially some other services), but forgot to turn off our code so it

> would stop trying to reconnect.  As a result, the box made roughly 20k

> reconnect attempts over the next couple days.  Each reconnect attempt met a

> NoRouteToHostException, and for whatever reason the Transport threads remain

> in a BLOCKED state, waiting on an object monitor, and are never released.

>  In light of this ‘runaway’ reconnect, we’ve refactored the code to attempt

> the reconnects on a increasingly-delayed basis (1s, 2s, 4s,… up to 15min

> between attempts, resetting to 0s on a successful reconnect) in hopes that

> we were just not waiting long enough and the issue would be resolved.

> However, this refactoring did not resolve the issue and we still see the

> ‘Transport’ thread count creep up over time.

>

>

>

> Nov 16 10:02:13 evlqradar01 ecs[25797]: java.net.NoRouteToHostException: No

> route to host

>

> Nov 16 10:02:13 evlqradar01 ecs[25797]:         at

> java.net.PlainSocketImpl.socketConnect(Native Method)

>

> Nov 16 10:02:13 evlqradar01 ecs[25797]:         at

> java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)

>

> Nov 16 10:02:13 evlqradar01 ecs[25797]:         at

> java.net.Socket.<init>(Socket.java:180)

>

> Nov 16 10:02:13 evlqradar01 ecs[25797]:         at

> jcifs.smb.SmbTransport.ssn139(SmbTransport.java:178)

>

>

>

> Nov 16 10:03:17 evlqradar01 ecs[31981]: java.net.NoRouteToHostException: No

> route to host

>

> Nov 16 10:03:17 evlqradar01 ecs[31981]:         at

> java.net.PlainSocketImpl.socketConnect(Native Method)

>

> Nov 16 10:03:17 evlqradar01 ecs[31981]:         at

> java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)

>

> Nov 16 10:03:17 evlqradar01 ecs[31981]:         at

> java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)

>

> Nov 16 10:03:17 evlqradar01 ecs[31981]:         at

> java.net.Socket.connect(Socket.java:469)

>

> Nov 16 10:03:17 evlqradar01 ecs[31981]:         at

> java.net.Socket.<init>(Socket.java:180)

>

> Nov 16 10:03:17 evlqradar01 ecs[31981]:         at

> jcifs.smb.SmbTransport.negotiate(SmbTransport.java:242)

>

> Nov 16 10:03:17 evlqradar01 ecs[31981]:         at

> jcifs.smb.SmbTransport.doConnect(SmbTransport.java:305)

>

> Nov 16 10:03:17 evlqradar01 ecs[31981]:         at

> java.lang.Thread.run(Thread.java:619)

>

>

>

> Here’s a snippet from our logs showing the thread state and blocked-time:

>

>

>

> Nov 17 14:23:07 172.24.251.251 [ecs] [Folder Monitor [SMTP

> tailer][smb://172.24.1.162/c$/Program Files/Microsoft/Exchange

> Server/TransportRoles/Logs/ProtocolLog/] - Reconnect]

> com.q1labs.frameworks.core.ThreadExceptionHandler: [INFO]

> [NOT:0000006000][172.24.251.251/- -] [-/- -]124143,Transport1 in Byte Code,

> BLOCKED, blocked-count: 1, blocked-time: 1003721 ms, wait-count: 0,

> wait-time: 0 ms, user cpu: 0 nanos, sys/user cpu time: 0 nanos, Transport1

> locked on [B at 72faf9b5

>

> Nov 17 14:23:07 172.24.251.251 [ecs] [Folder Monitor [SMTP

> tailer][smb://172.24.1.162/c$/Program Files/Microsoft/Exchange

> Server/TransportRoles/Logs/ProtocolLog/] - Reconnect]

> com.q1labs.frameworks.core.ThreadExceptionHandler: [INFO]

> [NOT:0000006000][172.24.251.251/- -] [-/- -]124142,Transport2 in Byte Code,

> BLOCKED, blocked-count: 1, blocked-time: 1003753 ms, wait-count: 0,

> wait-time: 0 ms, user cpu: 0 nanos, sys/user cpu time: 0 nanos, Transport2

> locked on [B at 5e36d88a

>

>

>

>

>

> As you can see, the threads are locked and have been blocked for ~1000s…

> well above any of the default timeouts listed in SmbConstants.

>

>

>

> Can you shed any light into this as to whether we have any configuration

> options or if this is a jcifs bug?

>

>

>

> Thanks

>

>

>

> Adam

--

Michael B Allen

Java Active Directory Integration

http://www.ioplex.com/

-- 
Michael B Allen
Java Active Directory Integration
http://www.ioplex.com/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: jcifs.diff
Type: application/octet-stream
Size: 4278 bytes
Desc: not available
URL: <http://lists.samba.org/pipermail/jcifs/attachments/20110222/76b79fd7/attachment-0001.obj>