[jcifs] Re: jcifs-1.2.4 stability fixes

Sat Oct 1 02:27:53 GMT 2005

Please send all messages (minus packet captures) to the jcifs mailing
list. Especially these messages as they are a valuable reference when
dealing with future problems.

On Fri, 30 Sep 2005 10:21:39 +0200
Lars heete <hel at admin.de> wrote:

> > >  jcifs-handle_close.diff - handle close on negotiate
> >
> > Ok. This one's good. But I think we might as well just throw an Exception
> > out of peekKey.
>
> this would be better. There also is a bug in current Transport.loop, which 
> leaves if peekKey returns zero without disconnect.

Actually in that case I've changed the contract for peekKey to now
to return null whenever less than the specified bytes are read from
readn. That basically means the end of the stream has been reached. I
put your exception in negotiate and added one in Transport.loop. That
reduces the number of cases a little.

> > >  jcifs-buffer.diff  - use one buffer per transport instance (performance)
> >
> > No. These buffers are 64K. If you have multiple transports it's a serious
> > drain on memory so the associated page faults will far outstrip any
> > performance benifits of giving each transport it's own buffer.
> having 64MB of memory used for 1000 transport connections is worse than having 
> 20 - 30 threads waiting for the same monitor and doing probably blocking IO 
> with this monitor hold (there may be servers connected via slow links) ?
> It would be possible to use a separate buffer lock and do lazy buffer 
> allocation buffer dallocation on closed transports to reduce memory drain.

Ahh, you must be doing some kind of crawling. First, note that
everything up to and including the SMB header is read and written from
SmbTransport.sbuf which is NOT static. So unresponsive hosts do not
impact performance that much. In the case of a network crawler though,
yes, you might make more progess with separate buffers if the servers
are very cranky. But in all other cases the performance impact of non
static buffers is definitely negative. Originally we used separate
buffers and found that switching to a static one significantly improved
performance. If you want optimal performance for crawling choosing a
good algorithm can speed things up considerably more than messing around
with buffers (avoid stale file attributes!). Otherwise you can modify
SmbTransport to get buffers from the BufferCache (used for transactions
but they're perfectly good for regular SMBs) if that's what you really
want. But for stock JCIFS we're definitely not going to change this.

> > >  jcifs-session-assoc.diff - fix session->transport association to avoid
> > > 				     connects on NULL_TRANSPORT and active
> > > 				     sessions that aren't registered in transport.sessions.
> >
> > I don't understand this. I had to think for a while to recall what
> > NULL_TRANSPORT does but I believe it's simply to give logoff() something
> > to synchronize on in the event that it's called when the session isn't
> > setup. There should be no way for the NULL_TRANSPORT to actually be used.
>
> This is actually possible, if one thread calls sessionSetup and at the same 
> time the session.transport disconnects. Because sessionSetup synchronizes on
> this transport it has to wait for disconnect to leave and then 
> session.transport is already set to NULL_TRANSPORT.
>
> > But that doesn't look like what your patch is really targeting. It looks
> > like it trys to resolve a conflict between setting up the session and
> > logging it off at the same time.
>
> It could actually happen if logoff sets session.transport to NULL_TRANSPORT at 
> the time session.transport(), that session.transport() may returs 
> NULL_TRANSPORT and sessionSetup may synchronize on NULL_TRANSPORT.

Actually session.transport() NEVER returns NULL_TRANSPORT. That
method's purpose is to set session.transport to something other than
NULL_TRANSPORT.

But your correct that setting session.transport to NULL_TRANSPORT in
logoff() causes the subsequent sessionSetup to operate on NULL_TRANSPORT.

I think the solution for this one is to simply NOT set transport to
NULL_TRANSPORT in logoff(). In fact, we can just get rid of NULL_TRANSPORT
alltogether since once an SmbTransport has been created it is never
destroyed (and minus ssnLimit constraints SmbTransport.getSmbTransport()
ensures there is only one per set of constructor parameters). A
transport's thread will exit after SO_TIMEOUT ms of inactivity but the
object itself will sit in the static SmbConstants.CONNECTIONS linked list
from which nothing is ever removed. So setting SmbSession.transport to
something else doesn't do anything.

> Since logoff is done then, this is not the problem. But a second session.setup 
> called at the same time then gets a real transport obejct and synchronizes on 
> this different object, so two session setups run in parallel. This is easy to 
> fix (syncronizing on session when setting session.transport to 
> NULL_TRANSPORT)

It does this already. SmbSession.transport() is a synchronized method.

> > >  jcifs-tcon_hostname.diff - reuse available host-name to avoid reconnects
> >
> > SmbTransport.tconHostName is set in SmbTransport.doConnect() which
> > is called by Transport.connect() which is called before the two
> > instances where you inserted getHostName which calls connect(). So it
> > can never be null at that point. Perhaps you assumed connect() would
> > cause some protocol communication. This is not true. If you look at
> > jcifs.util.transport.Transport.connect() you will see it is a state
> > machine that only acts if the transport is in state 0 (not connected).
> > What was the original symptom that lead you to look at this? Are you
> > seeing tconHostName null?
>
> The reason is to avoid most of the transport.connect calls at treeConnect.
> The assumption is, that transport.tconHostname does not change once a 
> transport has been connected for the first time. 

I think you misread what I just wrote. I'm saying Transport.connect()
returns immediately if the transport is already connected. So there's
no advantage in trying to avoid calling it.

> > Again, is this based on observed behavior or are you simply drawing on
> > your understanding of the code. Submitting patches is good because it
> > helps me understand more precisely where the issues are but you have
> > to explain more about what was observed that lead you to conclude the
> > change was necessary.
>
> session.transport() can return different objects which does not have the 
> desired effect. If the transport disconnects at the same time, treeConnect 
> has to wait for the transport monitor. When it gets the monitor 
> session.transport is NULL_TRANSPORT. A second treeConnect called in another
> thread may then get a different transport object.

Again this is not correct - SmbSession can never return
NULL_TRANSPORT. But logoff() can clobber SmbSession.transport resulting
in using SmbSession.transport while it's set to NULL_TRANSPORT. I think
I've fixed this.

> > > The last two are transport fixes:
> > >
> > >  jcifs-transport_setup.diff -
> > >      modify transport setup to allow timeouts on socket.connect,
> >
> > Connect timeouts are handled in Transport.connect():166. Your technique
> > is superior but it requires Java 1.4 so I'd rather hold off on that for
> > a while.
>
> That leaves the transport thread open for a very long time, probably causing 
> many waiting threads if a server is not reachable. 

I know. I'll incorporate this into 2.0 but right now I'm trying to stick
to stability fixes.

Note that the timeout shouldn't be *that* long. On Linux I think a
"Connection timed out" takes about 1 minute 15 seconds.

> > >      fix session-transport matching if transport.port != DEFAULT_PORT
> >
> > Good one. I love changes that make the code simpler (and correct :-).
> >
> > But does this depend on Handler.getDefaultPort returning 0 in some way? I
> > don't think we should do that because that API contract is not defined
> > by us. Port 445 is conceptually the default port. 139 is more like a
> > "fallback" port.
>
> This is needed to detect if no port was specified (0 meaning any available 
> port)

But we don't want "any availble port". We specifically want port 445
and then only if that doesn't work, port 139.

> > Also, why did you widen the scope of the catch in doConnect from
> > ConnectException to IOException? Did you encounter an exception that
> > warranted retrying on a different port? If so, do you have a stack trace?
>
> needed for the Connect timeouts

Well if the connect times out you will probably get the same error with
port 139. If you show me a stack trace and I'll think about it. I want
to be somewhat selective here.

Mike