[jcifs] Issues with connections to servers that reboot

Sean Daley spdaley at gmail.com
Wed Jun 22 14:20:46 MDT 2011


Hi Mike,

I've spent some more time troubleshooting this in my environment, and I
think the problem is more on my end than on the JCIFS end.  For some
reason, when I reboot one particular server, the connection to it stays
in an ESTABLISHED state for 15 - 16 minutes before it eventually dies.
I haven't fully figured out why that's happening yet, but it seems to be
the main factor that's causing me problems.

So if I run my test utility for at least 15 minutes after the server
reboots, eventually the program will start working again without having
to re-start it.

I also discovered that if I wait about 30 seconds after listFiles first
fails, this gives JCIFS enough time to disconnect the connection itself.
Because I keep executing listFiles almost immediately after the previous
call finishes, it keeps re-using the same broken connection until that
connection finally times out after 15 minutes.

What I saw here was the following:

1) listFiles is executed.
2) After about 30 seconds, a "Read timed out" exception is thrown and
    caught in the loop() method of Transport.java.
3) This then calls disconnect with hard == false, since it's a timeout.
4) In disconnect we get to this code:
            case 3: /* connected - go ahead and disconnect */
                if (response_map.size() != 0 && !hard) {
                    break; /* outstanding requests */
                }
                doDisconnect( hard );
    In my case, response_map.size() is still greater than 0, so we don't
    actually disconnect.  I'm assuming it's non-zero because the listAll
    call hasn't fully completed yet at this point?
5) If I immediately execute listAll again, we hit a perpetual cycle of
    steps 2 - 4 until my 15-minute timeout comes into play.
6) If I pause for 30 seconds or so after listFiles, then steps 2 - 4
    happen again before listFiles is called again, but this time
    response_map.size() is 0 and we disconnect the connection.
7) Once we've disconnected, if I then call listFiles, it re-connects
    successfully (assuming my server is back up again).
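
The soft-disconnect behavior in steps 3 - 6 can be modeled with a small
toy class.  This is only a sketch and not the jCIFS source: ToyTransport,
addPending, completeAll and the state constants are hypothetical names, and
the only thing taken from the excerpt in step 4 is the
"response_map.size() != 0 && !hard" check.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the soft/hard disconnect decision; NOT the jCIFS source. */
class ToyTransport {
    static final int STATE_DISCONNECTED = 0; // hypothetical value
    static final int STATE_CONNECTED = 3;    // mirrors "case 3" in the excerpt

    int state = STATE_CONNECTED;
    final Map<Integer, String> responseMap = new HashMap<>();

    /** Registers a request that is still waiting for its response. */
    void addPending(int mid, String name) {
        responseMap.put(mid, name);
    }

    /** Models every outstanding request finishing (or giving up). */
    void completeAll() {
        responseMap.clear();
    }

    /** Returns true only if the connection was actually torn down. */
    boolean disconnect(boolean hard) {
        if (state != STATE_CONNECTED) {
            return false;
        }
        if (!responseMap.isEmpty() && !hard) {
            return false; // outstanding requests: soft disconnect is skipped
        }
        responseMap.clear();
        state = STATE_DISCONNECTED;
        return true;
    }

    public static void main(String[] args) {
        ToyTransport t = new ToyTransport();
        t.addPending(73, "Trans2FindFirst2");
        System.out.println(t.disconnect(false)); // request pending: stays connected
        t.completeAll();
        System.out.println(t.disconnect(false)); // nothing pending: disconnects
    }
}
```

In this model a new request arriving before the previous one has been
cleared keeps the map non-empty, which is the loop described in step 5.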

Hope that helps or makes sense.

I've been able to reduce some of the impact in my environment by
detecting some of these errors and holding off on making any additional
calls for a little while.  That's definitely helping.  I'll also try out
1.3.16, as I think it will help with some of the other things I've run
into.
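
A minimal sketch of that kind of hold-off, assuming a generic helper rather
than anything that ships with jCIFS: HoldOff and callWithHoldOff are
hypothetical names, and the pause length and attempt count are left to the
caller.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper: pause after a failure before issuing the next call. */
class HoldOff {
    /**
     * Runs op. After a failure, sleeps pauseMillis before the next attempt,
     * so that no new request is issued while the old transport is still
     * being cleaned up. Rethrows the last failure once maxAttempts is used up.
     */
    static <T> T callWithHoldOff(Callable<T> op, long pauseMillis, int maxAttempts)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(pauseMillis); // hold off before retrying
                }
            }
        }
        throw last;
    }
}
```

With jCIFS, op would wrap the listFiles call, and pauseMillis would need to
be somewhat longer than the roughly 30 seconds it takes for the timeout in
step 2 to fire, so that the soft disconnect sees an empty response_map.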

Thanks for the help on this.

Sean

On Wed, Jun 22, 2011 at 1:29 PM, Michael B Allen <ioplex at gmail.com> wrote:
> Hi Sean,
>
> I was not able to reproduce this issue. Windows Server 2008r2 produces
> the "timedout waiting for response" error (as opposed to Windows
> Server 2003 which produces "connection reset") but after about a
> minute, the program recovered and correctly listed the target
> directory.
>
> I have applied Simon's try / catch anyway (but in
> util/transport/Transport.java) since an exception from doDisconnect is
> clearly bad for the Transport.java state machine. I don't know if it
> will help with your issue but I recommend trying the
> soon-to-be-released 1.3.16.
>
> Mike
>
> --
> Michael B Allen
> Java Active Directory Integration
> http://www.ioplex.com/
>
>
> On Tue, Jun 21, 2011 at 10:10 PM, Sean Daley <spdaley at gmail.com> wrote:
>> I seem to be running into a random issue with JCIFS re-connecting to a
>> server that is rebooted.  I've attached a simple Java program which
>> connects to the Admin$ share and calls listFiles on it.  It then repeats
>> this every second.
>>
>> Sometimes, when I reboot a server and/or shut it down for a few minutes
>> and re-start it, the JCIFS connection to that server never seems to
>> recover.  This doesn't seem to happen to all of my servers but it does
>> happen to some of them.
>>
>> I'm currently using Fedora 14 x86_64 as the JCIFS client connecting to a
>> wide variety of Windows boxes.  The biggest Windows culprit I have seems
>> to be a Windows 2008r2 box.
>>
>> For this particular box, I get the following logs from this test class:
>> 0: fileList returned 80 and took 190(ms).
>> 1: fileList returned 80 and took 11(ms).
>> ...
>> 33: fileList returned 80 and took 5(ms).
>> 34: fileList failed: Transport1[testhost/10.20.14.15:445] timedout
>> waiting for response to
>> Trans2FindFirst2[command=SMB_COM_TRANSACTION2,received=false,errorCode=0,flags=0x0018,flags2=0xC803,signSeq=0,tid=2048,pid=63708,uid=2048,mid=73,wordCount=15,byteCount=19,totalParameterCount=18,totalDataCount=0,maxParameterCount=10,maxDataCount=65535,maxSetupCount=0,flags=0x00,timeout=0,parameterCount=18,parameterOffset=66,parameterDisplacement=0,dataCount=0,dataOffset=84,dataDisplacement=0,setupCount=1,pad=1,pad1=0,searchAttributes=0x16,searchCount=200,flags=0x00,informationLevel=0x104,searchStorageType=0,filename=\]
>> took 30001(ms).
>> 35: ... (repeats the exact same thing as 34: every 30 seconds).
>>
>> I've let it run for a while now and it will just continuously report the
>> "timedout waiting for ..." error every 30 seconds.
>>
>> If I stop and re-start the program, though, it will re-connect just fine.
>> If I enable
>> jcifs.Config.setProperty("jcifs.smb.client.ssnLimit", "1");
>> the problem also does not occur, but I'd really rather not do that as I'm
>> going to potentially be working with the same set of hosts many times and
>> I rather like the caching that's being done here.
>>
>> I've played around with this program, trying different target servers
>> and swapping the listFiles check for something else (like an exists
>> check), and I've seen differing behaviors along the way.  For some of my
>> environment, with the exists check, I got similar timeout behavior but
>> with a more straightforward exception of "connection timed out".  What
>> was worse, though, was that each time I got that, I was left with a new
>> Thread running with the following stack trace:
>>
>> #########
>> Daemon Thread [Transport1] (Suspended)
>>        PlainSocketImpl.socketConnect(InetAddress, int, int) line: not available [native method]
>>        SocksSocketImpl(PlainSocketImpl).doConnect(InetAddress, int, int) line: 333
>>        SocksSocketImpl(PlainSocketImpl).connectToAddress(InetAddress, int, int) line: 195
>>        SocksSocketImpl(PlainSocketImpl).connect(SocketAddress, int) line: 182
>>        SocksSocketImpl.connect(SocketAddress, int) line: 366
>>        Socket.connect(SocketAddress, int) line: 529
>>        Socket.connect(SocketAddress) line: 478
>>        Socket.<init>(SocketAddress, SocketAddress, boolean) line: 375
>>        Socket.<init>(String, int) line: 189
>>        SmbTransport.ssn139() line: 185
>>        SmbTransport.negotiate(int, ServerMessageBlock) line: 240
>>        SmbTransport.doConnect() line: 302
>>        SmbTransport(Transport).run() line: 232
>>        Thread.run() line: 662
>> #########
>>
>> So every 30 seconds, I'd get the connection timed out error, then we'd
>> try to connect again and a new Daemon Thread Transport1 would start.
>> These threads would take upwards of 4 - 5 minutes (at least) before they
>> finally terminated.  During that time, though, we'd keep accumulating
>> more and more of them as we tried to re-connect.  Once again, if I stop
>> and re-start the test program, it works just fine again right away.
>>
>> Is there any way to force a new SmbTransport to get created without
>> setting ssnLimit to 1?  I briefly tried setting it to 1, but I have some
>> concerns about doing that because we lose the benefit of caching; plus,
>> unless I'm misreading the code, it looks like the CONNECTIONS LinkedList
>> can grow unbounded.  So with ssnLimit == 1, we're just constantly
>> creating new SmbTransports and adding them to CONNECTIONS.  I didn't find
>> any place where we were removing them from the list, though.
>>
>> Any thoughts on this?  Or is there any additional information I can get you?
>> Any help would be greatly appreciated.
>>
>> Sean
>>
>

