Timeout waaay too long

Mon Mar 6 19:24:59 GMT 2006

> been suffering from one major problem that affects smb, nfs as well as 
>cifs: Long lock ups on server down times.
>
>If the serving computer shuts down / crashes, every process hangs when 
>trying to access the still mounted share. Just try df, mount or ls. You 
>can't even umount the share in this condition! When using smb this was 
>very severe as you couldn't mount a share with a "soft" option. So the 
>process in fact just hung until I pressed reset. Now, with cifs it is 
>much better. But still the timeout  is very long. And you still cannot 
>umount an offline share to prevent further lock ups.
To bring other users up to speed, I should recap the current implementation and
behavior of the cifs code here (if this does not match users' experiences
with current code, let me know).

SEND: We attempt to send CIFS requests (on stuck or full tcp sockets)
for approximately seven seconds in smb_send or 15 seconds in smb_send2
(since this handles large writes, commonly 52K, which presumably could
take longer), but we do not alter the sk_sndtimeo from its default
(infinite).  The Linux NFS server (but interestingly not the NFS client)
alters the socket sk_sndtimeo, but it is not clear whether the sk_sndtimeo
needs to be set to a lower value (perhaps 10-30 seconds), since
the kernel socket API seems to report back EAGAIN and ENOSPC as needed
already.
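
For readers more familiar with user space: sk_sndtimeo is the kernel
field behind the SO_SNDTIMEO socket option.  The following is a minimal
user-space sketch (not the cifs kernel code) of bounding blocking sends
to a fixed number of seconds.  The helper names set_send_timeout and
get_send_timeout_seconds are made up for illustration.

```c
#include <sys/socket.h>
#include <sys/time.h>

/* Hypothetical helper: bound blocking sends on fd to `seconds`.
 * A value of 0 restores the default (block indefinitely).
 * Returns 0 on success, -1 on error (errno is set by setsockopt). */
int set_send_timeout(int fd, long seconds)
{
    struct timeval tv;

    tv.tv_sec = seconds;
    tv.tv_usec = 0;
    return setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv));
}

/* Hypothetical helper: read the send timeout back in whole seconds.
 * Returns -1 on error; 0 means no timeout is set. */
long get_send_timeout_seconds(int fd)
{
    struct timeval tv;
    socklen_t len = sizeof(tv);

    if (getsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, &len) != 0)
        return -1;
    return (long)tv.tv_sec;
}
```

With a timeout set, a send that cannot make progress within that time
fails with EAGAIN/EWOULDBLOCK instead of blocking forever.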

RECEIVE: If the request was put on the wire successfully we wait different
amounts of time depending on the type of request.
1) blocking requests (blocking byte range lock requests and ChangeNotify
(dnotify) requests) wait forever (unless we umount or kill the request
thread)

2) "long" requests, such as single page or smaller writes past end
of file, block for 180 seconds.  This probably should be changed to
happen only when the write is far past end of file, not when
the file length is only being incremented by a page or so.
Some servers (such as Win9x IIRC) which don't make files
sparse can take a long time when a write is made far beyond
end of file.  We should also be setting this longer timeout
on "offline" files, but currently I don't have an easy way
to test this and the cifs client ignores the DOS
attribute offline flag.  Suggestions on how to get Windows
to set/return the offline flag would be appreciated.

3) "medium" requests block for 45 seconds (to allow oplock breaks
that the server may have to send to hung clients to time out,
which can take 20-40 seconds depending on the server).  This
includes NTCreateX (and legacy OpenX) but perhaps should be
expanded to include other path based calls (SetPathInfo and Delete).
This also includes calls to cifs_writepages using iovecs and
the new SMBWrite2/smb_send2 interface.  It also includes
writes which are not past the end of file.

4) "normal" requests time out in 15 seconds.

5) nonblocking requests (client responses to oplock break requests,
rfc1001 sessioninits) do not time out - we return immediately.
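
The five classes above can be summarized as a small lookup.  This is
only a sketch: the enum and function names are hypothetical and are not
the identifiers used in fs/cifs; only the durations come from the list
above.  The fallback for an unknown class is my own assumption.

```c
/* Hypothetical names; only the durations come from the list above. */
enum wait_class {
    WAIT_BLOCKING,    /* byte range locks, ChangeNotify */
    WAIT_LONG,        /* small writes past end of file */
    WAIT_MEDIUM,      /* NTCreateX/OpenX, cifs_writepages, other writes */
    WAIT_NORMAL,      /* everything else */
    WAIT_NONBLOCKING  /* oplock break responses, rfc1001 sessioninits */
};

#define WAIT_FOREVER (-1)

/* Seconds to wait for a response, or WAIT_FOREVER.
 * 0 means "do not wait at all, return immediately". */
int response_timeout_seconds(enum wait_class c)
{
    switch (c) {
    case WAIT_BLOCKING:    return WAIT_FOREVER;
    case WAIT_LONG:        return 180;
    case WAIT_MEDIUM:      return 45;
    case WAIT_NORMAL:      return 15;
    case WAIT_NONBLOCKING: return 0;
    }
    return 15; /* assumption: treat an unknown class as "normal" */
}
```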

TIMEOUTS: Stuck requests are noticed on certain errors coming
back from the socket and also in the cifs_dnotify_thread, which
wakes up the request and response queues every 15 seconds (to
allow them to check their timeout flags).  Note that this change
to cifs_dnotify_thread was added not that long ago.  If any SMB
times out (or if the socket goes dead for other reasons), we kill
the socket, try to reconnect and in some cases retry the request.
If the "hard" mount option is set ("soft" is the default) then cifs
will try to reconnect until umount (see smb_init in fs/cifs/cifssmb.c).
On path based calls (setattr, readdir, open, delete, mkdir, rmdir
etc.), we attempt to retry once even if the "soft" mount option is
specified, but we only wait 10 seconds in smb_init (or small_smb_init)
for cifsd to reconnect the dead socket before giving up and returning
to the caller.  On handle based calls (e.g. read and write) we cannot
retry within cifssmb.c since the handles on a dead session are
no longer valid.  Retries do occur in the calling function, but
with the similar restriction that we only wait 10 seconds before
giving up.  It is somewhat dangerous to time out writes/reads
back to the user since that causes a pagefault (unlike path
based calls such as open, which are expected to be able to fail
and thus most applications handle these errors more gracefully).
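
To make the periodic check concrete, here is a simplified,
single-threaded model of what happens on each 15 second wakeup: every
waiting request compares the current time against its own deadline.
The struct and function names are hypothetical, and real code would of
course need locking and wait queues, which are left out here.

```c
#include <stddef.h>

/* Hypothetical, simplified view of a request waiting for a response. */
struct pending_request {
    long sent_at;      /* when the request went on the wire, in seconds */
    long timeout_secs; /* e.g. 15, 45 or 180; negative = wait forever */
    int timed_out;     /* set once the deadline has passed */
};

/* Has this request waited at least as long as its timeout allows? */
int request_expired(const struct pending_request *r, long now)
{
    if (r->timeout_secs < 0)
        return 0; /* blocking requests never expire */
    return now - r->sent_at >= r->timeout_secs;
}

/* Model one wakeup: each request not yet flagged checks its deadline.
 * Returns how many requests were newly flagged on this pass. */
size_t wake_and_check(struct pending_request *reqs, size_t n, long now)
{
    size_t newly_flagged = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (reqs[i].timed_out)
            continue; /* already flagged on an earlier pass */
        if (request_expired(&reqs[i], now)) {
            reqs[i].timed_out = 1;
            newly_flagged++;
        }
    }
    return newly_flagged;
}
```

In this model a request that expires just after one pass is only
noticed on the next pass, so detection can lag the nominal timeout by
up to one wakeup interval.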

UMOUNT: The cifs umount code was fixed in the last six months to
handle stuck requests and responses better.  Basically, umount
tries to mark the mount as closing (see cifs_umount_begin
in fs/cifs/cifsfs.c), wake up all requests that are stuck waiting
on tcp sends, wake up all requests that are stuck waiting
for SMB responses (presumably from hung servers), then retry
waking up stuck requests (to catch requests that were blocked on the
max request count of 50 active on the wire per session, snuck in
when the request count went under 50, and thus blocked
again).  I did various tests last week, killing servers
("killall smbd"), simulating different types of hangs by pulling
the network cable, and going into smbd in gdb to simulate
server hangs, and umount always worked (even without requiring
the force flag, i.e. "umount /cifsmount --force").  The only "slow
umount" (which takes about 15 seconds) is the case of NetApp servers,
some of which can return a malformed ulogoffX SMB
response and thus cause umount to block waiting for a good response
before giving up on the server and killing the tcp session explicitly.
So I would like to know if you have an umount scenario on reasonably
current code in which cifs won't umount (within 15 seconds) and should.
If such a scenario exists we will need to look at the block ids
("echo t > /proc/sysrq-trigger" and then dmesg) to find out whether
cifs could reasonably wake up requests queued on that particular
block id (presumably blocked outside of cifs), but I am aware
of no such problem at this time.

>Perhaps it is possible to check whether the server is online by simply 
> pinging it prior to any access. I mean, in a LAN servers normally 
> respond within just a few milliseconds, so even an extremely short time 
> out would do the trick and save a lot of trouble.
SMB echo could be used for this purpose (and its payload size can be
varied to help estimate delay).  Anyone care to suggest a design?
A possible approach is to launch a new long running cifs kernel thread
or use the existing cifs_dnotify_thread (which wakes up every 15
seconds to check for stuck requests).
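
As a starting point for discussion, here is a rough sketch of the
bookkeeping such a thread might do: count consecutive unanswered
echoes and shorten the request timeout once the server looks dead.
Everything here is hypothetical, including the names, the threshold of
two missed echoes and the reduced timeout of 2 seconds; only the 15
second "normal" timeout comes from the recap above.

```c
#include <stdbool.h>

/* All names and values below are assumptions for illustration. */
#define ECHO_MISSES_BEFORE_DEAD 2  /* assumed threshold */
#define NORMAL_TIMEOUT_SECS     15 /* the "normal" timeout from the recap */
#define REDUCED_TIMEOUT_SECS    2  /* assumed value for a dead server */

struct echo_state {
    int missed; /* consecutive echoes that got no reply */
};

/* Record the outcome of one periodic SMB echo. */
void echo_result(struct echo_state *s, bool got_reply)
{
    if (got_reply)
        s->missed = 0; /* any reply clears the suspicion */
    else
        s->missed++;
}

/* True once enough echoes in a row have gone unanswered. */
bool server_unresponsive(const struct echo_state *s)
{
    return s->missed >= ECHO_MISSES_BEFORE_DEAD;
}

/* Timeout to apply to new "normal" requests given the echo history. */
int effective_timeout_secs(const struct echo_state *s)
{
    return server_unresponsive(s) ? REDUCED_TIMEOUT_SECS
                                  : NORMAL_TIMEOUT_SECS;
}
```

Requiring more than one missed echo is meant to avoid shortening the
timeouts because of a single lost packet.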

> I am sure, with the feature added to an upcoming version of CIFS, it 
> would get a far superior advantage over NFS and SMB.
There are a few cases where cifs is faster Linux to Linux/Samba
than nfs would be (certain write cases, and also cases in which the
oplock caching advantages of cifs outweigh nfs's advantages in
dispatching more read requests at one time and responding to
reads faster) and a few cases where cifs has functional
advantages (although nfs has a big advantage in one key functional
area, i.e. in handling advisory locking well, which we are close to
implementing due to the recent work on the server side from jra
of the samba team).  When mounting to Windows and similar server systems
cifs has more substantial advantages, but for the Linux to Linux/Samba
case (vs. NFS) it is hard to generalize, since both implementations
are moving targets and the Linux clients for each are among the fastest
moving (i.e. most updated) components of the Linux kernel, at least
when measured by number of changesets per month.  In addition,
NFS version 4 adds some "cifs-like" features and offers a third
interesting alternative.

I don't mind pursuing three types of changes here:
1) "poll" the server via periodic SMB echo and reduce
the timeouts sharply (or even kill the tcp session to the
server) if the server stops responding to SMB echo.  This
will be an even more powerful approach in conjunction with
failover when DFS replicas are available for that share
(which cifs code can currently recognize but not connect to).

2) Fix the "hard" vs. "soft" mount option for cifs to be recognized
in more places in the code.

3) Allow the request timeouts to be configurable (via a new mount option "timeo"),
as we see with nfs version 4.  See the excerpt from the nfs man page below:

       hard   The program accessing a file on a NFS mounted file system will hang when the server
              crashes. The process cannot be interrupted or killed unless you also specify intr. When
              the NFS server is back online the program will continue undisturbed from where it was.
              This is probably what you want.

       soft   This option allows the kernel to time out if the NFS server is not responding for some
              time. The time can be specified with timeo=time. This timeout value is expressed in
              tenths of a second. The soft option might be useful if your NFS server sometimes doesn’t
              respond or will be rebooted while some process tries to get a file from the server.
              Avoid using this option with proto=udp or with a short timeout.
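
If cifs were to accept a similar option, parsing it could look roughly
like the sketch below.  Note that "timeo" is only a proposal for cifs at
this point, and the function names are hypothetical.  The sketch assumes
the same unit as nfs (tenths of a second) and a comma-separated option
string, and falls back to a caller-supplied default when the option is
missing or malformed.

```c
#include <stdlib.h>
#include <string.h>

/* Return the value of "timeo=" (in tenths of a second) from a
 * comma-separated option string such as "soft,timeo=150".
 * Returns default_tenths if the option is absent, not a positive
 * number, or followed by trailing garbage. */
long parse_timeo_tenths(const char *opts, long default_tenths)
{
    const char *p = opts;

    while (p && *p) {
        if (strncmp(p, "timeo=", 6) == 0) {
            char *end;
            long v = strtol(p + 6, &end, 10);

            if (end != p + 6 && v > 0 && (*end == ',' || *end == '\0'))
                return v;
            return default_tenths;
        }
        p = strchr(p, ',');
        if (p)
            p++; /* skip the comma and look at the next option */
    }
    return default_tenths;
}

/* Convert tenths of a second to whole seconds, rounding up. */
long tenths_to_seconds_rounded_up(long tenths)
{
    return (tenths + 9) / 10;
}
```

For example, "soft,timeo=150" would yield 150 tenths, i.e. the current
15 second "normal" timeout.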

I would like to hear experimental feedback and suggestions from users on this topic, as it is
hard to predict the types of failure scenarios that today's complex networks (with routers that
lose packets, firewalls that silently "swallow" connection requests on certain ports, and
servers/OS with various bugs that can cause different types of requests to hang) can produce.