[cifs-protocol] [REG: 110120160951867] Requesting clarification of CIFS client timeout behavior

Fri Dec 3 15:23:51 MST 2010

Just an FYI that I filed a technical document issue on this and will update you as soon as we complete our investigation.
I have been researching Windows NT and 2K and passed on my observations to the product team. 

Thanks,
Edgar

-----Original Message-----
From: Christopher R. Hertel [mailto:crh at samba.org] 
Sent: Wednesday, December 01, 2010 4:09 PM
To: Jeff Layton
Cc: Edgar Olougouna; pfif at tridgell.net; cifs-protocol at samba.org; MSSolve Case Email
Subject: Re: [cifs-protocol] [REG: 110120160951867] Requesting clarification of CIFS client timeout behavior

Below...

Jeff Layton wrote:
> On Wed, 01 Dec 2010 14:44:44 -0600
> "Christopher R. Hertel" <crh at samba.org> wrote:
> 
>> Jeff Layton wrote:
>> :
>>> Yes, this is probably stretching the definition of protocol 
>>> clarification, but I figured it wouldn't hurt to ask... :)
>> Not at all.
>>
>> Keep in mind that I worked with Microsoft to get these docs out, so I 
>> know how important such details are to them, as well as third party implementers.
>>
>> The interesting thing about your questions is that they touch on very 
>> obscure boundaries between old LANMAN behavior, NT behavior, Windows 
>> behavior, and actual protocol.  Perfect storm.  I love this stuff.
>>
>> The more I think about it, the more I believe that the Echo is sent 
>> to determine whether the physical connection is still up.  If it's 
>> not, then there is no sense in sending an SMB_COM_NT_CANCEL anyway, 
>> since the other end would likely never receive it.
>>
>> As I mentioned, NT and OS/2 were able to support a single logical SMB 
>> Session over multiple connections (think of a client with three 
>> dial-up modems connected to a server with three or more modems).  I 
>> think that the idea was to use the Echo to test a specific link, and 
>> shut down the connection bound to that link if it was down.
>>
>> The client closes the entire session only if the server is non-responsive.
>>
>> ...but that's guess-work based upon my memory.  The real answer is in 
>> the Windows source and I don't have access to that any more (thank goodness!).
>>
> 
> 
> Perfect storm indeed, especially since MS-CIFS also says:
> 
> 3.2.7.1 Handling a Transport Disconnect:
> 
> When the transport indicates a disconnection, the client MUST walk 
> through the PIDMIDList and return an error for each outstanding 
> command to the calling application. All resources associated with the 
> connection MUST be freed. Finally, the connection MUST be freed.
> 
> ...so I guess you'd have to stretch "connection" in that case to mean 
> the entire bonded connection group...(Blech!)

No, just the connection, not the entire virtual circuit.  If you sent messages on channel 2, but channels 1 and 3 are still active, you only need to report errors for outstanding requests that were on channel 2.  The others are fine.
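
A minimal sketch of that per-channel reading of 3.2.7.1, in Python. Everything here is an assumption made for illustration: the `Session` class, the per-channel pending table standing in for the PIDMIDList, and the `ERR_DISCONNECTED` label are hypothetical and are not taken from MS-CIFS or from any real client.

```python
from collections import defaultdict


class Session:
    """Hypothetical model of one SMB session spread over several connections."""

    def __init__(self):
        # channel id -> {(pid, mid): command name}; stands in for the PIDMIDList
        self.pending = defaultdict(dict)

    def send(self, channel, pid, mid, command):
        """Record a request as outstanding on the channel it was sent on."""
        self.pending[channel][(pid, mid)] = command

    def on_disconnect(self, channel):
        """Fail only the requests that were outstanding on the dropped channel."""
        failed = self.pending.pop(channel, {})
        # "ERR_DISCONNECTED" is a placeholder label, not a real status code.
        return [(pid, mid, cmd, "ERR_DISCONNECTED")
                for (pid, mid), cmd in sorted(failed.items())]

    def outstanding(self):
        """Requests still waiting for a reply, grouped by channel."""
        return {ch: sorted(reqs) for ch, reqs in self.pending.items() if reqs}
```

With requests outstanding on channels 1, 2, and 3, calling `on_disconnect(2)` returns errors for the channel 2 requests only and leaves the entries for channels 1 and 3 in place.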

Then again, even though the Windows implementations (up to and including W2K, I believe) had support for multiple connections, that support was limited and possibly broken.  It was a hold-over from the OS/2 implementations and NT never fully implemented it.  Lots of dead or useless code there.

As far as I can tell, the ability to handle multiple connections was only ever used with "Direct Hosted IPX" transport.  See section 2.1.3.

> In any case, this may all be a matter of opinion since the spec 
> doesn't really spell it out. It is of concern however -- it can take a 
> VERY long time for some reads or writes to complete.

I'm spending a lot of time explaining the history behind the confusion.
These days, *no one* uses multiple physical connections bound to a single SMB session.  NT and above don't support it properly anyway.  This is all about vestigial code.

You are correct that actual behavior should be spelled out.  The problem is that a lot of the actual behavior is due to the requirements of unused transports and features, earlier dialects, and incomplete implementations.

> Consider, for instance, a small write that is a long way past EOF on a 
> server with NTFS under the hood. My understanding is that NTFS will 
> zero-fill the files, and on slow storage that can take a *really* long 
> time (far longer than the default 45 second SESSTIMEOUT).

Yep.

> It would seem to make far more sense to simply apply a timeout to the 
> socket as a whole. IOW, only perform a reconnect if the server doesn't 
> respond to echoes within a reasonable amount of time (whatever 
> "reasonable" is).

This sounds like something that should be tested and then verified against the source code.

Probably needs two tests.  One to see what happens if the (single) connection is lost, and another to see what happens if a single operation takes a very, very long time to complete (as you describe).
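
For reference while testing, here is a rough sketch of the socket-wide timeout policy proposed in the quoted text above: send an Echo when the server has been quiet for a while, and reconnect only when nothing at all comes back. It is not a description of what Windows does. The `EchoWatchdog` name and the 90-second `max_silence` value are assumptions; the 45-second echo interval simply reuses the default SESSTIMEOUT figure mentioned earlier.

```python
import time


class EchoWatchdog:
    """Hypothetical per-socket timer: slow replies alone never force a reconnect."""

    def __init__(self, echo_interval=45.0, max_silence=90.0, clock=time.monotonic):
        self.echo_interval = echo_interval  # quiet time before probing with an Echo
        self.max_silence = max_silence      # assumed cutoff for declaring the server dead
        self.clock = clock
        self.last_response = clock()

    def on_response(self):
        """Any response from the server, Echo reply or otherwise, resets the timer."""
        self.last_response = self.clock()

    def action(self):
        """Decide what to do based on how long the server has been silent."""
        silence = self.clock() - self.last_response
        if silence >= self.max_silence:
            return "reconnect"
        if silence >= self.echo_interval:
            return "send_echo"
        return "wait"
```

Under this policy a write that takes minutes to complete does not trigger a reconnect as long as the server keeps answering Echo requests in the meantime, which corresponds to the second of the two tests. The first test (a lost connection) corresponds to `action()` eventually returning `"reconnect"`.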

> That said, since Windows is the reference platform here, I'm quite 
> interested in what it does in this situation...

I have a one-year ban on working on CIFS implementations, specifically so I will forget what I learned from looking at the Windows source code.  That seems to be working.  :)

Chris -)-----

--
"Implementing CIFS - the Common Internet FileSystem" ISBN: 013047116X
Samba Team -- http://www.samba.org/     -)-----   Christopher R. Hertel
jCIFS Team -- http://jcifs.samba.org/   -)-----   ubiqx development, uninq.
ubiqx Team -- http://www.ubiqx.org/     -)-----   crh at ubiqx.mn.org
OnLineBook -- http://ubiqx.org/cifs/    -)-----   crh at ubiqx.org