[cifs-protocol] [Pfif] [REG: 110120160951867] Requesting clarification of CIFS client timeout behavior

Fri Dec 3 16:22:01 MST 2010

On Fri, 3 Dec 2010 15:12:28 -0600
Steve French <smfrench at gmail.com> wrote:

> On Fri, Dec 3, 2010 at 2:21 PM, Volker Lendecke
> <Volker.Lendecke at sernet.de> wrote:
> 
> > On Fri, Dec 03, 2010 at 01:50:11PM -0500, Jeff Layton wrote:
> > > > Probably needs two tests.  One to see what happens if the (single)
> > > > connection is lost, and another to see what happens if a single
> > > > operation takes a very, very long time to complete (as you describe).
> > > >
> > >
> > > I did an experiment with this on win2k8. I first doctored an smbd to
> > > discard write requests. When I try to copy a file to this host (via
> > > copy.exe), the server usually waits a little while (the time seems to
> > > vary between 30-60s or so), sends a single echo request and then
> > > reconnects the socket if it still doesn't get a write reply in about
> > > 30s. copy.exe then says "The specified network name is no longer
> > > available." Heh.
> > >
> > > That said, the behavior seems to be really inconsistent. In at least
> > > one case, no echo was sent and the socket was shut down <30s after the
> > > write request was sent.
> > >
> > > The timeout before sending an echo also seems to vary quite a bit. My
> > > suspicion is that that indicates that the client has the echo ping on a
> > > separate timer, and just selectively sends it whenever the timer pops
> > > based on certain criteria.
> >
> > Probably all this timeout stuff varies too much with
> > different application behaviours. I have the same discussion
> > right now with the opposite direction: How can a server
> > reliably tell that a client died hard? The question here is:
> > When can we reliably throw away share mode entries? A
> > colleague just measured a W2k8 timeout of 5 minutes in this
> > case, but is this dependable? I suspect we have to develop
> > our own policies for this.
> >
> 
> A loosely related question is whether POSIX forbids
> EIO or EHOSTDOWN on some syscalls.  If such were
> specified in the standard, at least for those syscalls posix
> clients can never time out (or must timeout and either
> cancel/resubmit and/or reconnect transparently)
> Currently write beyond end of file (and operations on
> offline files) are the only known special cases where timeout would
> be inappropriate, but we may find other syscalls where it
> would be inappropriate for a client to return to the user.
> 
> 

EHOSTDOWN doesn't appear to be a valid return for any filesystem-based
syscall in POSIX. In a quick grep of the Linux manpages, it looks like
it's only documented as a return code for accept():

[jlayton at tlielax man2]$ pwd
/usr/share/man/man2
[jlayton at tlielax man2]$ zgrep EHOSTDOWN ./*
./accept.2.gz:.BR EHOSTDOWN ,

That's hardly authoritative for POSIX, but I'd be quite surprised if
it's incorrect. EIO on the other hand is allowed almost everywhere
(since it's such a non-specific error code).

I think Volker's correct. The spec really isn't going to be
particularly helpful in this regard, though understanding Windows'
behavior is an interesting datapoint for developing our own policies.

Treating different calls differently for timeouts sounds like the road
to special-case madness. It seems to me that the best behavior would be
to have the client wait for a reply indefinitely if the server is
responding to periodic echoes. If that's unacceptable, then perhaps we
could add a tunable timeout that defaults to something very long (10
minutes or so).
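
To make that concrete, here is a rough sketch of the decision in Python.
It is only an illustration, not actual client code: the function name,
the 60s echo interval, and the "two missed intervals means the server is
dead" rule are all assumptions of mine. Only the 10 minute default comes
from the suggestion above.

```python
# Hypothetical values, for illustration only.
ECHO_INTERVAL = 60.0      # assumed seconds between echo requests
DEFAULT_TIMEOUT = 600.0   # the "10 minutes or so" tunable


def should_give_up(now, request_sent, last_echo_reply,
                   echo_interval=ECHO_INTERVAL, hard_timeout=None):
    """Return True if the client should stop waiting for a reply.

    All times are in seconds on the same clock. With hard_timeout=None
    the client waits indefinitely as long as echoes are being answered.
    """
    # Assumption: the server counts as alive if it answered an echo
    # within the last two echo intervals.
    server_alive = (now - last_echo_reply) <= 2 * echo_interval
    if not server_alive:
        return True
    if hard_timeout is None:
        # Preferred policy: server answers echoes, so keep waiting.
        return False
    # Fallback policy: give up once the tunable timeout has elapsed,
    # even though the server is still answering echoes.
    return (now - request_sent) > hard_timeout
```

With hard_timeout=None, a request that has been outstanding for an hour
is still kept alive as long as the last echo reply is recent; passing
hard_timeout=DEFAULT_TIMEOUT caps the wait at 10 minutes. Either way the
same rule applies to every call, so there are no per-syscall special
cases.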

> For Windows (Windows behavior may be slightly different
> than POSIX but still important for implementers to understand)
> it would be helpful to know which operations
> are allowed to return errors to the user (if the host
> hangs or goes down) and which must retry forever.
> 

-- 
Jeff Layton <jlayton at samba.org>