[jcifs] Re: jcifs-1.2.4 stability fixes

Sat Oct 8 19:31:09 GMT 2005

On Sat, 8 Oct 2005 12:32:20 +0200
Lars Heete <hel at admin.de> wrote:

> Hello
> > On Fri, 7 Oct 2005 11:07:42 +0200
> >
> > Lars Heete <hel at admin.de> wrote:
> > > > One of the main problems with your original patch was that it was
> > > > mostly irrelevant stuff. If you resubmit it with only the changes
> > > > necessary to make the client behave the way you want then I will
> > > > reconsider it. Aside from the bug regarding adding the session back to
> > > > transport.sessions introduced in 1.2.5 (and your original "live lock"
> > > > issue), that code
> > >
> > > there also is the deadlock between sessionSetup and send introduced in
> > > 1.2.4 (doSend0/sendrecv calling disconnect on IOError). Anyway I think
> >
> > Do I know about this one? What is this one all about?
> 
> To help your memories
> 
> -- excerpt from prevoius mails in this thread --
> > > > > The current SmbTransport actually has a deadlock in the paths
> > > > > doSend->disconnect (first rmap lock then transport lock) and
> > > > > sessionSetup->send (first transport lock then rmap lock).
> > > > > This is fixed doing the error handling only in the transport thread.
> > > >
> > > > I don't know about this. I think logic actually prevents this from
> > > > ever happening because doSend() cannot be called without first doing
> > > > a sessionSetup(). So you either do sessionSetup() and then send() but a
> > > > subsequent sessionSetup() will just return because the session isalready
> > > > "setup".
> > > there may be actually multiple sessions on one transport.
> > 
> > True. I didn't think about that.
> ----

Ahh, yes. I remember now. This was the inspiration for the "test"
branch. I'm not sure how to handle this differently.

> > > disconnect
> > > connect
> > > disconnect
> > > connect
> > >
> > > untill all threads with sending errors are done.
> >
> > Correct. Each thread is serialized so Thread1 tries and fails to connect
> > and calls disconnect. Then Thread2 wakes up tries and fails to connent
> > and calls disconnect. Then Thread3 and so on ad infinitum. AFAIC this
> > is the correct behavior because only one thread can be allowed to work
> > on changing the connection state of the transport.
> Not exactly what I meant. The first thread gets an IOException in doSend 
> because the transport socket had been closed. It calls disconnect and exits.
> The second thread now gets the monitor, the transport connects and he tries to 
> send. The third thread now gets the monitor and disconnects the transport, 
> probably causing the second thread to fail. And so on. Is this sane 
> behavior ?

What exactly do you mean by "tries to send"? Sending anything other than
a session setup, logoff, tree connect, or tree disconnect doesn't happen
with the transport locked. So a disconnect can shutdown the transport even
though doSend() has been called. Access to the socket is serialized by
the lock on SmbTransport.BUF. If the third thread decided the transport
should be shutdown then yes, it is normal for the transport to shutdown
before another thread has had a chance to send. What do you expect? Should
the second thread be allowed to proceed and write it's message to a broken
transport?

> > The "test" branch changed this behavior slightly. When Thread1 gets an
> > error it calls disconnect and sets the state to be in "error". Thread2
> > then sees the state is "error", resets the state to 0 (not connected)
> > and then throws a "transport in error" exception. Thread3 then tries and
> > fails to connect and calls disconnect, etc. So the number of cycles is
> > effectively cut in half.
> >
> > So what do you *want* to happen?
> report errors in the sending threads and close down the transport from the 
> reading transport thread (this is what my transport changes try to 
> implement).

Uhh, I'm not sure I fully understand what you're saying but I think this
is what we do now. Any caller thread will get an exception if there is
an error either in sending the request it initiated or if there is an
error in the underlying transport and the transport is shutdown by the
thread that detects a problem. Again, what do you expect to happen?

> > > > [1] Note that lowering jcifs.smb.client.ssnLimit would probably suffice
> > > >     to resolve this issue. It was not really anticipated that the
> > > > number of outstanding requests on a transport at one time would be more
> > > > than say 2 or 3.
> > >
> > > Not really an option if you may have hundreds of sessions per server.
> >
> > Why not? I don't think you understand what ssnLimit does. If ssnLimit is
> > say 250 (the default) that means that after the 250th session, instead
> > of reusing the existing transport, a new one is created which itself
> > can then support 250 sessions. Meaning if you set ssnLimit to 25 and you
> > have 500 sessions, they will be spread across 500 / 25 = 20 transports
> > cutting your response_map.notifyAll() overhead by a factor of 20. That
> > sounds like an option worth trying to me.
> Like memory threads are a limited resource. Also there may be more than one 
> thread using a session.

So what? Lowering the ssnLimit will proportionally reduce contention
for the response_map. That's a fact.

> Anyway there are some issues with ssnLimit not beeing effective when sessions 
> reconnect and not being disassociated from transport.

You mean not being RE-associated with the transport. I think this is the "Session
Management" bug I just fixed in 1.2.6.

> > The SmbThreadTest Results
> >
> > I have spent some time looking at your SmbThreadTest.java example. I
> > did find one issue. It turns out that the SmbTransport.matches() routine
> > wasn't quite right. It needs to look like this:
> >
> >   boolean matches( UniAddress address, int port,
> >            InetAddress localAddr, int localPort ) {
> >     return address.equals( this.address ) &&
> >           (port == 0 || port == this.port ||
> >               /* port 139 is ok if 445 was requested */
> >               (port == 445 && this.port == 139)) &&
> >           (localAddr == this.localAddr ||
> >
> > because if the transport tries to connect and falls back to port 139 then
> > the ports will not match and SmbTransport.getSmbTransport() will end
> > up creating a new transport every time. This will result in many many
> > sockets being opened to the target machine which will eventually choke
> > and the client will throw exceptions like SocketeException: socket closed.
> if you had used not only parts of my change, this would have worked correctly.  

I think you're referring to your change to getDefaultPort returning
0. The java.net.URL handler implementation has a contract associated
with it that we cannot change. So I don't think it's ok for it to return
0 to satisfy some logic elsewhere. That was a poor fix.

> There is still a problem if the netbios port has to be used, and the 
> transport gets an exception on first connect. Then port 445 is used and the 
> sessions with workstation limitations fail until the transport is closed.

No. If port 139 fails we do NOT try to use port 445. The connect just
fails. So AFAIK there is no problem with this code.

> > Managing the SmbTransport.sessions List
> >
> > As for the outstanding issue of when and how to add and remove sessions
> > from the SmbTransport.sessions list this is still not handled correctly
> > even in the "test" branch. I didn't realize that the SmbTree.session
> > member is never released once set. This means that (as you originally
> > discovered) if we remove the session from the SmbTransport.sessions list
> > (such as when disconnect logs off all sessions) it will not be re-added
> > when the session is re-"setup". This is not catastrophic but extra
> > redundant sessions may be created if other threads request a session with
> > the same credentials.
> this _is_ catastrophic if the transport closes and reconnects, having 
> non-working open sessions.

No, actually I don't think it is. The only effect caused by removing and
not readding a session from the SmbTransport.sessions list is that it will
not be automatically logged off when the transport disconnects. So the
server will not get polite tree disconnect and logoff messages. That's
not a big deal. The sessions should still work even though they're not
in the list.

Regardless this behavior will no longer occur in 1.2.6 so it's not
an issue.

> > For now I have simply stoped removing the sessions from the sessions
> > list. That should keep things normalized, but the downside to this is that
> > the sessions will never be released which is a waste of memory. I think
> this makes ssnLimit even worse. You also could have used my patch.
> 
> > the correct solution here might be to dynamically aquire the session in
> > SmbTree.java like we do with the transport() in SmbSession but I'm not
> > certain of the implications of doing that.
> I think this will only help gradually. Anyway the number of trees per session 
> is quite limited.

Now I get the feeling you're just grasping at random crap. This statement
is pretty vague, don't you think? If you want me to help you make this
work with your application then you're going to have to provide some
concrete issues. I can't fix "this makes ssnLimit even worse". You have
to tell me HOW and provide data.

As for your patches, they were not correct. You basically moved
code around until you stopped seeing whatever it was that bothered
you. But the biggest problem with them was that they had extra stuff
that wasn't relivant to the "livelock" issue. But again, if you make
minimal modifications to 1.2.6 that eliminates or elieviates the issue
(and possibly solves the rmap->transport/transport->rmap deadlock)
then I will *consider* your patch.

As of right now I will assume that all of your issues are basically
"resolved". My solution for the rmap->transport/trasnport->rmap deadlock
was the "test" branch but you didn't seem like you wanted to pursue that
so for now that issue is on the shelf. As for any of your other issues
you MUST provide concrete data with explicit descriptions of individual
deadlocks, invalid exceptions, or incorrect logic AND provide appropriate
thread dumps and or stack traces. I'm not going to follow through on
vague discriptions.

Mike