[jcifs] Re: jcifs-1.2.4 stability fixes

Mon Oct 10 11:21:32 GMT 2005

Hello,

> > > Lars Heete <hel at admin.de> wrote:
> > > > > One of the main problems with your original patch was that it was
> > > > > mostly irrelevant stuff. If you resubmit it with only the changes
> > > > > necessary to make the client behave the way you want then I will
> > > > > reconsider it. Aside from the bug regarding adding the session back to
> > > > > transport.sessions introduced in 1.2.5 (and your original "live lock"
> > > > > issue), that code
> > > >
> > > > there also is the deadlock between sessionSetup and send introduced in
> > > > 1.2.4 (doSend0/sendrecv calling disconnect on IOError). Anyway I think
> > >
> > > Do I know about this one? What is this one all about?
> > 
> > To help your memories
> > 
> > -- excerpt from prevoius mails in this thread --
> > > > > > The current SmbTransport actually has a deadlock in the paths
> > > > > > doSend->disconnect (first rmap lock then transport lock) and
> > > > > > sessionSetup->send (first transport lock then rmap lock).
> > > > > > This is fixed doing the error handling only in the transport thread.
> > > > >
> > > > > I don't know about this. I think logic actually prevents this from
> > > > > ever happening because doSend() cannot be called without first doing
> > > > > a sessionSetup(). So you either do sessionSetup() and then send() but a
> > > > > subsequent sessionSetup() will just return because the session isalready
> > > > > "setup".
> > > > there may be actually multiple sessions on one transport.
> > > 
> > > True. I didn't think about that.
> > ----
> 
> Ahh, yes. I remember now. This was the inspiration for the "test"
> branch. I'm not sure how to handle this differently.
> 
> > > > disconnect
> > > > connect
> > > > disconnect
> > > > connect
> > > >
> > > > untill all threads with sending errors are done.
> > >
> > > Correct. Each thread is serialized so Thread1 tries and fails to connect
> > > and calls disconnect. Then Thread2 wakes up tries and fails to connent
> > > and calls disconnect. Then Thread3 and so on ad infinitum. AFAIC this
> > > is the correct behavior because only one thread can be allowed to work
> > > on changing the connection state of the transport.
> > Not exactly what I meant. The first thread gets an IOException in doSend 
> > because the transport socket had been closed. It calls disconnect and exits.
> > The second thread now gets the monitor, the transport connects and he tries to 
> > send. The third thread now gets the monitor and disconnects the transport, 
> > probably causing the second thread to fail. And so on. Is this sane 
> > behavior ?
> 
> What exactly do you mean by "tries to send"? Sending anything other than
> a session setup, logoff, tree connect, or tree disconnect doesn't happen
> with the transport locked. So a disconnect can shutdown the transport even
> though doSend() has been called. Access to the socket is serialized by
> the lock on SmbTransport.BUF. If the third thread decided the transport
> should be shutdown then yes, it is normal for the transport to shutdown
> before another thread has had a chance to send. What do you expect? Should
> the second thread be allowed to proceed and write it's message to a broken
> transport?
No, the transport has been reconnected by the second thread and is actually
running. The third thread is just waiting to handle the error it got in parallel to
the first thread. It does not know that the transport already had been reconnected,
so it makes the wrong decision to close it again.
Socket events like "connection reset by remote peer" cannot be synchronized
by a monitor. You have to be prepared that if multiple threads are using this socket,
you can have multiple threads with exceptions for one error.
Even if you check for "open transport" (not thinking about the locking implications)
in doSend which is serialized by the BUF monitor, there will still be the transport 
thread reading from the socket in peekKey.
Attached is a log that documents this behavior short after a single disconnect from server
under heavy load, and a network trace (only the SYN, FIN and RST packets).
Because of the deadlock I was forced to use your jcifs-1.2.6test.

> > > The "test" branch changed this behavior slightly. When Thread1 gets an
> > > error it calls disconnect and sets the state to be in "error". Thread2
> > > then sees the state is "error", resets the state to 0 (not connected)
> > > and then throws a "transport in error" exception. Thread3 then tries and
> > > fails to connect and calls disconnect, etc. So the number of cycles is
> > > effectively cut in half.
> > >
> > > So what do you *want* to happen?
> > report errors in the sending threads and close down the transport from the 
> > reading transport thread (this is what my transport changes try to 
> > implement).
> 
> Uhh, I'm not sure I fully understand what you're saying but I think this
> is what we do now. Any caller thread will get an exception if there is
> an error either in sending the request it initiated or if there is an
> error in the underlying transport and the transport is shutdown by the
> thread that detects a problem. Again, what do you expect to happen?
see above (there may be multiple threads that detect errors).
> > > > > [1] Note that lowering jcifs.smb.client.ssnLimit would probably suffice
> > > > >     to resolve this issue. It was not really anticipated that the
> > > > > number of outstanding requests on a transport at one time would be more
> > > > > than say 2 or 3.
> > > >
> > > > Not really an option if you may have hundreds of sessions per server.
> > >
> > > Why not? I don't think you understand what ssnLimit does. If ssnLimit is
> > > say 250 (the default) that means that after the 250th session, instead
> > > of reusing the existing transport, a new one is created which itself
> > > can then support 250 sessions. Meaning if you set ssnLimit to 25 and you
> > > have 500 sessions, they will be spread across 500 / 25 = 20 transports
> > > cutting your response_map.notifyAll() overhead by a factor of 20. That
> > > sounds like an option worth trying to me.
> > Like memory threads are a limited resource. Also there may be more than one 
> > thread using a session.
> 
> So what? Lowering the ssnLimit will proportionally reduce contention
> for the response_map. That's a fact.
You are right, but for stability reasons I would try to avoid ssnLimit for now.

> > Anyway there are some issues with ssnLimit not beeing effective when sessions 
> > reconnect and not being disassociated from transport.
> 
> You mean not being RE-associated with the transport. I think this is the "Session
> Management" bug I just fixed in 1.2.6.
yes, and now casing the sessions not to be removed from session-list. If you have
several thousand users (meaning several thousand sessions over the time), this could
cause _many_ transport threads running.
 
> > > The SmbThreadTest Results
> > >
> > > I have spent some time looking at your SmbThreadTest.java example. I
> > > did find one issue. It turns out that the SmbTransport.matches() routine
> > > wasn't quite right. It needs to look like this:
> > >
> > >   boolean matches( UniAddress address, int port,
> > >            InetAddress localAddr, int localPort ) {
> > >     return address.equals( this.address ) &&
> > >           (port == 0 || port == this.port ||
> > >               /* port 139 is ok if 445 was requested */
> > >               (port == 445 && this.port == 139)) &&
> > >           (localAddr == this.localAddr ||
> > >
> > > because if the transport tries to connect and falls back to port 139 then
> > > the ports will not match and SmbTransport.getSmbTransport() will end
> > > up creating a new transport every time. This will result in many many
> > > sockets being opened to the target machine which will eventually choke
> > > and the client will throw exceptions like SocketeException: socket closed.
> > if you had used not only parts of my change, this would have worked correctly.  
> 
> I think you're referring to your change to getDefaultPort returning
> 0. The java.net.URL handler implementation has a contract associated
> with it that we cannot change. So I don't think it's ok for it to return
> 0 to satisfy some logic elsewhere. That was a poor fix.
No, I think my change is reasonable.
URL-Default ports means the port used, when nothing else is specified, eg.
smb://server vs. smb://server:139 and smb://server:445
This is no problem for http, ftp where such a default port exists.
With smb there are two alternative used ports. Which one you 
use does not depend on the URL, but on the server-version, needed
features, network restrictions, etc. Also with using 0 as a placeholder for
"URL default port", you can clealy distinct between the cases port specified
and no port specified 

> > There is still a problem if the netbios port has to be used, and the 
> > transport gets an exception on first connect. Then port 445 is used and the 
> > sessions with workstation limitations fail until the transport is closed.
> 
> No. If port 139 fails we do NOT try to use port 445. The connect just
> fails. So AFAIK there is no problem with this code.
Ok, no problem here (actually doConnect retries after ConnectException on port 445, 
but negotiate then sets the port back to 139).

> > > Managing the SmbTransport.sessions List
> > >
> > > As for the outstanding issue of when and how to add and remove sessions
> > > from the SmbTransport.sessions list this is still not handled correctly
> > > even in the "test" branch. I didn't realize that the SmbTree.session
> > > member is never released once set. This means that (as you originally
> > > discovered) if we remove the session from the SmbTransport.sessions list
> > > (such as when disconnect logs off all sessions) it will not be re-added
> > > when the session is re-"setup". This is not catastrophic but extra
> > > redundant sessions may be created if other threads request a session with
> > > the same credentials.
> > this _is_ catastrophic if the transport closes and reconnects, having 
> > non-working open sessions.
> 
> No, actually I don't think it is. The only effect caused by removing and
> not readding a session from the SmbTransport.sessions list is that it will
> not be automatically logged off when the transport disconnects. So the
> server will not get polite tree disconnect and logoff messages. That's
> not a big deal. The sessions should still work even though they're not
> in the list.
thats not the problem, but the session remains in the state open on transport 
disconnect, and therefore does no logon after transport reconnect. If it's get used then 
(this is still possible), strange things will happen

> Regardless this behavior will no longer occur in 1.2.6 so it's not
> an issue.
but now we have an issue there with a large number of users!
 
> > > For now I have simply stoped removing the sessions from the sessions
> > > list. That should keep things normalized, but the downside to this is that
> > > the sessions will never be released which is a waste of memory. I think
> > this makes ssnLimit even worse. You also could have used my patch.
> > 
> > > the correct solution here might be to dynamically aquire the session in
> > > SmbTree.java like we do with the transport() in SmbSession but I'm not
> > > certain of the implications of doing that.
> > I think this will only help gradually. Anyway the number of trees per session 
> > is quite limited.
...so distributing trees between multiple sessions is not very useful.
> 
> Now I get the feeling you're just grasping at random crap. This statement
> is pretty vague, don't you think? If you want me to help you make this
Its just an opinion, may be it helps may be it helps not. I just think it is not very
useful.

> work with your application then you're going to have to provide some
> concrete issues. I can't fix "this makes ssnLimit even worse". You have
> to tell me HOW and provide data.
see above. You have many users, that will cause many sessions to be
generated. When these sessions don't get removed from session-list, the
limited number of sessions per transport cause many transport objects
being created. And don't expect these session to distribute in a nice way
over these transport objects, that they will close down on inactivity. 
Anyway, the current close-down of expired sessions, that is currently done
only for transports with numbers of sessions < ssnLimit (in transport.getSmbSession
that is called afted SmbTransport.getSmbTransport that handles the limit),
you will have no session expiration with ssnLimit set on many transport instances.

> As for your patches, they were not correct. You basically moved
> code around until you stopped seeing whatever it was that bothered
> you. But the biggest problem with them was that they had extra stuff
> that wasn't relivant to the "livelock" issue. But again, if you make
> minimal modifications to 1.2.6 that eliminates or elieviates the issue
> (and possibly solves the rmap->transport/transport->rmap deadlock)
> then I will *consider* your patch.
There may be no _minimal_ modifications possible. As you may have
experienced with your own jcifs-1.2.6test, these kind of bugs are
not simple to fix (actually you _have_ to move some code around).
I could have split this patch  into several smaller ones, but because
these changes depend on each other, this would not  help very much.
The locking is safe, the transport thread maintains it state transitions 
from CONNECTING to CLOSED alone (disconnect() is only called
from transport thread). Therefore there is no need for taking any monitor
for reading the state.

> As of right now I will assume that all of your issues are basically
> "resolved". My solution for the rmap->transport/trasnport->rmap deadlock
> was the "test" branch but you didn't seem like you wanted to pursue that
> so for now that issue is on the shelf. As for any of your other issues
> you MUST provide concrete data with explicit descriptions of individual
> deadlocks, invalid exceptions, or incorrect logic AND provide appropriate
> thread dumps and or stack traces. I'm not going to follow through on
> vague discriptions.
reread the whole thread and think about it, if you have the time.
If you don't want, for any reason, anyone else to contribute significant
changes to your code, it's ok. Just say it, and don't waste my time.

Lars
-------------- next part --------------
A non-text attachment was scrubbed...
Name: transport_flapping.log
Type: text/x-log
Size: 6924 bytes
Desc: not available
Url : http://lists.samba.org/archive/jcifs/attachments/20051010/59956acf/transport_flapping.bin