Samba, NT, and transient network failures

Tue Jan 26 14:32:08 GMT 1999

Hi Frank,

We rarely see this problem. When we do, the cause is usually that our
servers have lost their ethernet carrier (yes, we're still using those
HME quad cards that don't like our switches so much); I imagine that
WAN outages can cause it too.

Thanks for the info about AMD. We're still using the standard Solaris
automounter, so we're not affected. The Solaris automounter is [mostly]
multi-threaded, so it doesn't hang just because a mount request is
hanging... And only some of us geeks re-export NFS-mounted partitions,
from our workstations :)

We did think to set the SO_KEEPALIVE option in smb.conf, hoping that
TCP keepalives sent to a dead client would elicit a RST and so get the
old smbd processes to exit sooner, but, going by W. Richard Stevens'
TCP/IP and Unix books, the default timeout values associated with
SO_KEEPALIVE seem too large to be helpful with this problem.
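
To illustrate the timeout point: SO_KEEPALIVE on its own uses the
kernel-wide defaults (the idle timer is commonly two hours). Some
platforms also expose per-socket timers. The Python sketch below assumes
the platform defines TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT and
silently skips any that are missing; the function name and the timer
values are made up for illustration, and this is not Samba code.

```python
import socket

def enable_keepalive(sock, idle=60, interval=10, count=3):
    """Enable SO_KEEPALIVE and, where available, shorten the per-socket
    keepalive timers. Returns the names of the options actually set."""
    applied = []

    # Portable part: keepalives on, timers left at the kernel defaults.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    applied.append("SO_KEEPALIVE")

    # Non-portable part (assumption: these constants exist on this platform).
    wanted = (
        ("TCP_KEEPIDLE", idle),       # seconds of idleness before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between unanswered probes
        ("TCP_KEEPCNT", count),       # unanswered probes before the drop
    )
    for name, value in wanted:
        opt = getattr(socket, name, None)
        if opt is None:
            continue  # this stack does not expose the option per socket
        try:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
        except OSError:
            continue  # defined but rejected; fall back to the kernel default
        applied.append(name)
    return applied
```

Where none of the per-socket options exist, only SO_KEEPALIVE is set and
the kernel-wide timers still apply, which is the situation described
above.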

Anyways, I should have mentioned the SO_KEEPALIVE bit. It may seem like
having a preexec/postexec script with a per-{client, share, server, user}
pid file is overkill. But it also seems like the SO_KEEPALIVE option
won't help on every platform; on some platforms SO_KEEPALIVE might not
do any good unless you fiddle with some global kernel variables.
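
For illustration only, here is a minimal Python sketch of the pid-file
idea. It is not the actual script: the function names, the file naming
scheme, and the choice of SIGTERM are all assumptions. The preexec hook
records the pid of the new smbd for a {client, share, server, user}
tuple and signals whatever pid was recorded there before; the postexec
hook removes the file if it still belongs to the exiting process.

```python
import os
import signal

def pidfile_path(base_dir, client, share, server, user):
    # One file per {client, share, server, user} tuple (hypothetical layout).
    parts = [p.replace("/", "_") for p in (client, share, server, user)]
    return os.path.join(base_dir, ".".join(parts) + ".pid")

def read_pid(path):
    # Return the recorded pid, or None if the file is missing or garbled.
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None

def preexec(base_dir, client, share, server, user, new_pid, kill=os.kill):
    """Record new_pid; signal a different pid left over from an earlier
    connection by the same tuple. Returns the pid signalled, or None."""
    path = pidfile_path(base_dir, client, share, server, user)
    old_pid = read_pid(path)
    signalled = None
    if old_pid is not None and old_pid != new_pid:
        try:
            # Caveat: the old pid may have been reused by an unrelated
            # process; a real script should verify that it is an smbd.
            kill(old_pid, signal.SIGTERM)
            signalled = old_pid
        except OSError:
            pass  # the old process is already gone
    with open(path, "w") as f:
        f.write(str(new_pid))
    return signalled

def postexec(base_dir, client, share, server, user, pid):
    # Remove the pid file only if it still names the exiting process.
    path = pidfile_path(base_dir, client, share, server, user)
    if read_pid(path) == pid:
        os.remove(path)
```

In smb.conf such hooks would presumably be wired up through the
per-share preexec and postexec parameters, passing the %m, %S, %L, %u
and %d substitutions as arguments.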

If there is a CIFS call that the server can make on the client just to
check that the client is still present, then Samba should probably use
it. That would work more consistently than SO_KEEPALIVE. It would also
help if all/most TCP/IP stacks offered an API call to probe TCP peers
with keepalives, rather than relying solely on per-kernel timeout
policies.

We sought a workaround just to be complete. We don't want egg on our
faces. It was hard enough to convince the powers that be that Samba was
the better choice for so many reasons (not just that it's cheaper
[supporting Samba internally does not cost "nothing"!]).

Thanks,

Nico

On Wed, Jan 27, 1999 at 12:33:50AM +1100, Frank Varnavas wrote:
> Hi Nick,
> 
> Glad to see you made the right decision regarding Samba vs Syntax.
> 
> I've seen this problem before.  In extreme cases it can lead to samba
> server failure.   The root cause is that the NT redirector has a timeout
> of approximately 45 seconds on smb requests.  If the timeout is exceeded
> then the redirector will log an event, close the connection, and open a
> new connection.  I did not test increasing the timeout value since the
> value could not be set on a per-share basis, analogous to hard vs soft
> mounts on NFS.
> 
> I almost always saw this error as a result of NFS problems or network
> delays causing the smbd to hang for long enough to trigger the timeout.
> The most extreme cases occurred when I was using a particular Samba
> server as a gateway for AMD mounted homedirs from other machines.  If a
> server were to hang the smbd requests on the hard mounts would hang as
> well.  These processes could not be terminated.  Smbd processes would
> continue to be spawned (and subsequently hung) until the NFS problem was
> corrected.  Once the problem was corrected the processes would die by
> themselves upon receiving an error sending the reply (or a keepalive) to
> the client.
> 
> I had one other problem as well with AMD and Solaris that probably does
> not affect you.  In this instance, if the AMD in use was not compiled to
> add device IDs to the mount entries in the mnttab, the getcwd() calls in
> an AMD mounted directory would hang if ANY amd-mounted server hung.
> 
> If you are seeing this problem enough to warrant this workaround I think
> more debugging is in order to determine why it is happening. AFAIK error
> recovery for files on bounced shares is an application responsibility
> and not all apps are so well written.
> 
> Good luck,
> Frank V
> 

Nico
--
Nicolas Williams	(x5220, Stamford, CT)
Stamford SysAdmin