[cifs-protocol] [REG: 110120160951867] Requesting clarification of CIFS client timeout behavior

Wed Dec 1 15:08:32 MST 2010

Below...

Jeff Layton wrote:
> On Wed, 01 Dec 2010 14:44:44 -0600
> "Christopher R. Hertel" <crh at samba.org> wrote:
> 
>> Jeff Layton wrote:
>> :
>>> Yes, this is probably stretching the definition of protocol
>>> clarification, but I figured it wouldn't hurt to ask... :)
>> Not at all.
>>
>> Keep in mind that I worked with Microsoft to get these docs out, so I know
>> how important such details are to them, as well as third party implementers.
>>
>> The interesting thing about your questions is that they touch on very
>> obscure boundaries between old LANMAN behavior, NT behavior, Windows
>> behavior, and actual protocol.  Perfect storm.  I love this stuff.
>>
>> The more I think about it, the more I believe that the Echo is sent to
>> determine whether the physical connection is still up.  If it's not, then
>> there is no sense in sending an SMB_COM_NT_CANCEL anyway, since the other
>> end would likely never receive it.
>>
>> As I mentioned, NT and OS/2 were able to support a single logical SMB
>> Session over multiple connections (think of a client with three dial-up
>> modems connected to a server with three or more modems).  I think that the
>> idea was to use the Echo to test a specific link, and shut down the
>> connection bound to that link if it was down.
>>
>> The client closes the entire session only if the server is non-responsive.
>>
>> ...but that's guess-work based upon my memory.  The real answer is in the
>> Windows source and I don't have access to that any more (thank goodness!).
>>
> 
> 
> Perfect storm indeed, especially since MS-CIFS also says:
> 
> 3.2.7.1 Handling a Transport Disconnect:
> 
> When the transport indicates a disconnection, the client MUST walk
> through the PIDMIDList and return an error for each outstanding command
> to the calling application. All resources associated with the
> connection MUST be freed. Finally, the connection MUST be freed.
> 
> ...so I guess you'd have to stretch "connection" in that case to mean
> the entire bonded connection group...(Blech!)

No, just the connection, not the entire virtual circuit.  If you sent
messages on channel 2, but channels 1 and 3 are still active, you only need
to report errors for outstanding requests that were on channel 2.  The
others are fine.
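
To make that concrete, here is a rough sketch of per-connection failure
handling.  It is not taken from any real client; the Session class, the
"channel" field, and the error string are all names I made up for
illustration.  The only thing borrowed from the spec is the idea of a
pending list keyed by PID and MID.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class PendingRequest:
    pid: int
    mid: int
    channel: int                 # hypothetical: which connection carried it
    error: Optional[str] = None  # filled in when the request is failed


class Session:
    """Toy stand-in for one SMB session spanning several connections."""

    def __init__(self) -> None:
        # Plays the role of the PIDMIDList, keyed by (PID, MID).
        self.pending: Dict[Tuple[int, int], PendingRequest] = {}

    def send(self, pid: int, mid: int, channel: int) -> PendingRequest:
        req = PendingRequest(pid, mid, channel)
        self.pending[(pid, mid)] = req
        return req

    def on_disconnect(self, channel: int,
                      error: str = "connection lost") -> List[PendingRequest]:
        """Fail only the requests that were outstanding on `channel`."""
        failed = []
        # Iterate over a copy so entries can be removed while walking.
        for key, req in list(self.pending.items()):
            if req.channel == channel:
                req.error = error
                del self.pending[key]
                failed.append(req)
        return failed
```

If channel 2 drops, on_disconnect(2) returns just the requests that were
sent on channel 2; anything pending on channels 1 and 3 stays in the table.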

Then again, even though the Windows implementations (up to and including
W2K, I believe) had support for multiple connections, that support was limited
and possibly broken.  It was a hold-over from the OS/2 implementations and
NT never fully implemented it.  Lots of dead or useless code there.

As far as I can tell, the ability to handle multiple connections was only
ever used with "Direct Hosted IPX" transport.  See section 2.1.3.

> In any case, this may all be a matter of opinion since the spec doesn't
> really spell it out. It is of concern however -- it can take a VERY long
> time for some reads or writes to complete.

I'm spending a lot of time explaining the history behind the confusion.
These days, *no one* uses multiple physical connections bound to a single
SMB session.  NT and above don't support it properly anyway.  This is all
about vestigial code.

You are correct that actual behavior should be spelled out.  The problem is
that a lot of the actual behavior is due to the requirements of unused
transports and features, earlier dialects, and incomplete implementations.

> Consider, for instance, a small write that is a long way past EOF on a
> server with NTFS under the hood. My understanding is that NTFS will
> zero-fill the files, and on slow storage that can take a *really* long
> time (far longer than the default 45 second SESSTIMEOUT).

Yep.

> It would seem to make far more sense to simply apply a timeout to the
> socket as a whole. IOW, only perform a reconnect if the server doesn't
> respond to echoes within a reasonable amount of time (whatever
> "reasonable" is).

This sounds like something that should be tested and then verified against
the source code.

Probably needs two tests.  One to see what happens if the (single)
connection is lost, and another to see what happens if a single operation
takes a very, very long time to complete (as you describe).
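
For reference, the socket-wide policy you describe could be sketched like
this.  It is a guess at one reasonable design, not a description of what
Windows does; the EchoLiveness class, the 60 second interval, and the
"two missed intervals" threshold are all assumptions of mine.  The point is
that any response from the server (including an Echo reply) resets the
timer, so a slow request alone never triggers a reconnect.

```python
class EchoLiveness:
    """Decide when to send an Echo and when to give up on the socket."""

    def __init__(self, echo_interval: float = 60.0, max_missed: int = 2,
                 now: float = 0.0) -> None:
        self.echo_interval = echo_interval  # assumed value, in seconds
        self.max_missed = max_missed        # assumed threshold
        self.last_response = now            # time of last server response

    def note_response(self, now: float) -> None:
        # Call for every response received, Echo replies included.
        self.last_response = now

    def should_send_echo(self, now: float) -> bool:
        # The server has been quiet for a full interval: probe it.
        return now - self.last_response >= self.echo_interval

    def should_reconnect(self, now: float) -> bool:
        # Nothing at all has come back for max_missed intervals.
        quiet = now - self.last_response
        return quiet >= self.echo_interval * self.max_missed
```

The two tests map onto this directly.  In the slow-write case the Echo
replies keep calling note_response(), so should_reconnect() stays false no
matter how long the write takes.  In the lost-connection case nothing
arrives, and should_reconnect() turns true after two silent intervals.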

> That said, since Windows is the reference platform here, I'm quite
> interested in what it does in this situation...

I have a one-year ban on working on CIFS implementations, specifically so I
will forget what I learned from looking at the Windows source code.  That
seems to be working.  :)

Chris -)-----

-- 
"Implementing CIFS - the Common Internet FileSystem" ISBN: 013047116X
Samba Team -- http://www.samba.org/     -)-----   Christopher R. Hertel
jCIFS Team -- http://jcifs.samba.org/   -)-----   ubiqx development, uninq.
ubiqx Team -- http://www.ubiqx.org/     -)-----   crh at ubiqx.mn.org
OnLineBook -- http://ubiqx.org/cifs/    -)-----   crh at ubiqx.org