[cifs-protocol] [REG: 110120160951867] Requesting clarification of CIFS client timeout behavior

Tue Dec 7 13:56:17 MST 2010

Jeff,

I have been working with the product group on the timeout behavior of Windows-based CIFS clients. The answers to your questions are below. Let me know if you need further clarification. The product group is also working on fleshing out the relevant Windows behavior notes.

1) If the server is responding to the echo requests, does the client still eventually return an error to the application or does it wait indefinitely for the response?

Answer:
If the server is responding to the echo requests, the client will wait until the session times out, and the client will not send any "interim" response to the calling application.

2) If it returns an error to the application, does the client send a SMB_COM_NT_CANCEL to cancel the outstanding request?

Answer:
The client will not send a CANCEL request on any outstanding request; it simply tears down the connection after the session times out.

3) If it waits indefinitely, does it send more than one echo request?   If so, how frequently are they sent?

Answer:
Echo requests are sent only when the connection has been idle for longer than the session timeout. The default session timeout is 45 seconds in Windows NT, and 60 seconds in Windows 2000 and later. If there is no response on the connection for another session timeout period, the client tears down the connection.  If there is any response at all, it does not disconnect, and the cycle then repeats.

4) Do more recent versions of Windows behave similarly?

Answer:
Yes, there is no behavior change in recent versions of Windows.
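
As a rough illustration of answers 1 through 3, the sketch below models the idle/echo cycle as a small state machine in Python. It is not taken from Windows source. The names (IdleMonitor, on_response, poll, Action) are hypothetical, and it assumes that "idle" means no response of any kind has arrived from the server. Time is passed in explicitly so the logic can be exercised without a real clock.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    NONE = "none"              # nothing to do yet
    SEND_ECHO = "send_echo"    # connection idle too long: probe the server
    DISCONNECT = "disconnect"  # echo went unanswered: tear down the connection


@dataclass
class IdleMonitor:
    """Hypothetical model of the echo/timeout cycle described in the answers."""
    session_timeout: float = 60.0         # 45 on Windows NT, 60 on Windows 2000 and later
    last_response: float = 0.0            # time of the last response of any kind
    echo_sent_at: Optional[float] = None  # set while an echo is outstanding

    def on_response(self, now: float) -> None:
        # Any response (echo reply or otherwise) restarts the cycle.
        self.last_response = now
        self.echo_sent_at = None

    def poll(self, now: float) -> Action:
        if self.echo_sent_at is not None:
            # An echo is outstanding: wait one more session timeout.
            if now - self.echo_sent_at > self.session_timeout:
                # No SMB_COM_NT_CANCEL is sent; the connection is simply dropped.
                return Action.DISCONNECT
            return Action.NONE
        if now - self.last_response > self.session_timeout:
            self.echo_sent_at = now
            return Action.SEND_ECHO
        return Action.NONE
```

With a 60 second timeout and no traffic, a poll at t=61 yields SEND_ECHO, and a further 61 seconds of silence yields DISCONNECT. Any response in between resets the cycle, which is why a slow but responsive server never triggers a disconnect in this model.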

Regards,
Edgar

-----Original Message-----
From: Edgar Olougouna 
Sent: Friday, December 03, 2010 4:24 PM
To: Jeff Layton; 'Christopher R. Hertel'
Cc: pfif at tridgell.net; cifs-protocol at samba.org; MSSolve Case Email
Subject: RE: [cifs-protocol] [REG: 110120160951867] Requesting clarification of CIFS client timeout behavior

Just an FYI that I filed a technical document issue for this and will update you as soon as we complete our investigation.
I have been researching Windows NT and 2K and passed on my observations to the product team.

Thanks,
Edgar

-----Original Message-----
From: Christopher R. Hertel [mailto:crh at samba.org]
Sent: Wednesday, December 01, 2010 4:09 PM
To: Jeff Layton
Cc: Edgar Olougouna; pfif at tridgell.net; cifs-protocol at samba.org; MSSolve Case Email
Subject: Re: [cifs-protocol] [REG: 110120160951867] Requesting clarification of CIFS client timeout behavior

Below...

Jeff Layton wrote:
> On Wed, 01 Dec 2010 14:44:44 -0600
> "Christopher R. Hertel" <crh at samba.org> wrote:
> 
>> Jeff Layton wrote:
>> :
>>> Yes, this is probably stretching the definition of protocol 
>>> clarification, but I figured it wouldn't hurt to ask... :)
>> Not at all.
>>
>> Keep in mind that I worked with Microsoft to get these docs out, so I 
>> know how important such details are to them, as well as third party implementers.
>>
>> The interesting thing about your questions is that they touch on very 
>> obscure boundaries between old LANMAN behavior, NT behavior, Windows 
>> behavior, and actual protocol.  Perfect storm.  I love this stuff.
>>
>> The more I think about it, the more I believe that the Echo is sent 
>> to determine whether the physical connection is still up.  If it's 
>> not, then there is no sense in sending an SMB_COM_NT_CANCEL anyway, 
>> since the other end would likely never receive it.
>>
>> As I mentioned, NT and OS/2 were able to support a single logical SMB 
>> Session over multiple connections (think of a client with three 
>> dial-up modems connection to a server with three or more modems).  I 
>> think that the idea was to use the Echo to test a specific link, and 
>> shut down the connection bound to that link if it was down.
>>
>> The client closes the entire session only if the server is non-responsive.
>>
>> ...but that's guess-work based upon my memory.  The real answer is in 
>> the Windows source and I don't have access to that any more (thank goodness!).
>>
> 
> 
> Perfect storm indeed, especially since MS-CIFS also says:
> 
> 3.2.7.1 Handling a Transport Disconnect:
> 
> When the transport indicates a disconnection, the client MUST walk 
> through the PIDMIDList and return an error for each outstanding 
> command to the calling application. All resources associated with the 
> connection MUST be freed. Finally, the connection MUST be freed.
> 
> ...so I guess you'd have to stretch "connection" in that case to mean 
> the entire bonded connection group...(Blech!)

No, just the connection, not the entire virtual circuit.  If you sent messages on channel 2, but channels 1 and 3 are still active, you only need to report errors for outstanding requests that were on channel 2.  The others are fine.

Then again, even though the Windows implementations (up to and including W2K, I believe) had support for multiple connections, that support was limited and possibly broken.  It was a hold-over from the OS/2 implementations and NT never fully implemented it.  Lots of dead or useless code there.

As far as I can tell, the ability to handle multiple connections was only ever used with "Direct Hosted IPX" transport.  See section 2.1.3.
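
The per-connection error handling described above can be sketched as follows. This is an illustration only: the PendingRequests class and its methods are hypothetical names, and the (PID, MID) table is a simplified stand-in for the PIDMIDList in MS-CIFS, kept per connection here as an assumption.

```python
from collections import defaultdict


class PendingRequests:
    """Hypothetical table of outstanding requests, grouped by connection."""

    def __init__(self):
        # connection id -> {(pid, mid): command name}
        self._by_connection = defaultdict(dict)

    def add(self, connection, pid, mid, command):
        self._by_connection[connection][(pid, mid)] = command

    def fail_connection(self, connection):
        """Remove and return the requests that were outstanding on one
        connection. The caller would return an error to the application
        for each of them; other connections are left untouched."""
        return self._by_connection.pop(connection, {})

    def outstanding(self):
        """Snapshot of the requests still pending, by connection."""
        return {c: dict(r) for c, r in self._by_connection.items() if r}


pending = PendingRequests()
pending.add(1, pid=100, mid=1, command="SMB_COM_READ_ANDX")
pending.add(2, pid=100, mid=2, command="SMB_COM_WRITE_ANDX")
pending.add(3, pid=101, mid=3, command="SMB_COM_ECHO")

# Connection 2 drops: only its requests are failed.
failed = pending.fail_connection(2)
```

After the call, `failed` holds only the write that was in flight on connection 2, while the requests on connections 1 and 3 remain pending.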

> In any case, this may all be a matter of opinion since the spec 
> doesn't really spell it out. It is of concern however -- it can take a 
> VERY long time for some reads or writes to complete.

I'm spending a lot of time explaining the history behind the confusion.
These days, *no one* uses multiple physical connections bound to a single SMB session.  NT and above don't support it properly anyway.  This is all about vestigial code.

You are correct that actual behavior should be spelled out.  The problem is that a lot of the actual behavior is due to the requirements of unused transports and features, earlier dialects, and incomplete implementations.

> Consider, for instance, a small write that is a long way past EOF on a 
> server with NTFS under the hood. My understanding is that NTFS will 
> zero-fill the files, and on slow storage that can take a *really* long 
> time (far longer than the default 45 second SESSTIMEOUT).

Yep.

> It would seem to make far more sense to simply apply a timeout to the 
> socket as a whole. IOW, only perform a reconnect if the server doesn't 
> respond to echoes within a reasonable amount of time (whatever 
> "reasonable" is).

This sounds like something that should be tested and then verified against the source code.

Probably needs two tests.  One to see what happens if the (single) connection is lost, and another to see what happens if a single operation takes a very, very long time to complete (as you describe).

> That said, since Windows is the reference platform here, I'm quite 
> interested in what it does in this situation...

I have a one-year ban on working on CIFS implementations, specifically so I will forget what I learned from looking at the Windows source code.  That seems to be working.  :)

Chris -)-----

--
"Implementing CIFS - the Common Internet FileSystem" ISBN: 013047116X
Samba Team -- http://www.samba.org/     -)-----   Christopher R. Hertel
jCIFS Team -- http://jcifs.samba.org/   -)-----   ubiqx development, uninq.
ubiqx Team -- http://www.ubiqx.org/     -)-----   crh at ubiqx.mn.org
OnLineBook -- http://ubiqx.org/cifs/    -)-----   crh at ubiqx.org