Stale smbd processes (was: DOS: Clients can freeze other clients smbd)

Tue Sep 7 20:34:21 GMT 1999

Nicolas Williams wrote:

> 
> > Oh, I meant if the script was tested on a Unix box with Samba and the
> > clients were multiuser Windows NT systems. If I understand it right
> > the same user may have more than one connection to the server from
> > that type of client.
> 
> The key is in the parameters passed to the script. Here's my root
> preexec/postexecs:
> 
> root preexec  = /somepath/samba/libexec/chkStaleSession preexec  %d %I %h %S %U
> root postexec = /somepath/samba/libexec/chkStaleSession postexec %d %I %h %S %U
> 
> chkStaleSession requires the following arguments:
> 
> $1 - action (preexec or postexec)
> $2 - smbd PID
> $* - tokens which altogether identify a share connection
> 
> I'm assuming that NT won't try more than one share connection with all
> of those parameters. I.e., if I've got a share mapped to, say, I: and
> then map the same share as the same user to, say, J:, then I assume that
> NT won't open a new share connection.
> 
> If this assumption is wrong then you could end up with a situation where
> smbd keeps getting itself killed when it runs chkStaleSession. Still,
> this can be avoided in chkStaleSession.
> 
> The chkStaleSession basically creates/deletes PID/lock files named by
> its concatenated arguments (skipping $1 and $2). This is what
> chkStaleSession does: if the given lock file already exists and it
> contains a PID and it's the PID for a different smbd and it's still
> running, then stomp that stale smbd, otherwise create/overwrite the lock
> file and store given PID in it.
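
The lock-file logic described above can be sketched roughly as follows.
This is not the actual chkStaleSession script, which is not shown in this
thread; it is a minimal Python sketch under stated assumptions. The
function names, the lock_dir parameter, the ".pid" suffix, the use of
SIGTERM, recording the new PID after killing the old one, and removing
the file on postexec only when it still holds our PID are all
assumptions. It also does not verify that the old PID really belongs to
an smbd, which a real script run as root would need to do.

```python
import os
import signal


def lock_path(lock_dir, tokens):
    """Build one lock-file path from the tokens identifying a share connection."""
    # Replace path separators so the joined tokens form a single file name.
    name = "_".join(t.replace(os.sep, "_") for t in tokens)
    return os.path.join(lock_dir, name + ".pid")


def read_pid(path):
    """Return the PID stored in the lock file, or None if missing or garbled."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return None


def pid_alive(pid):
    """Return True if a process with this PID exists (signal 0 only probes)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def chk_stale_session(action, smbd_pid, tokens, lock_dir):
    """Hypothetical equivalent of: chkStaleSession <action> <pid> <tokens...>"""
    path = lock_path(lock_dir, tokens)
    if action == "preexec":
        old_pid = read_pid(path)
        # A different, still-running PID for the same share connection is
        # treated as a stale smbd. ASSUMPTION: no check that it is an smbd.
        if old_pid is not None and old_pid != smbd_pid and pid_alive(old_pid):
            os.kill(old_pid, signal.SIGTERM)
        # ASSUMPTION: the new PID is recorded in every case.
        with open(path, "w") as f:
            f.write(str(smbd_pid))
    elif action == "postexec":
        # ASSUMPTION: only remove the lock if it still names this smbd.
        if read_pid(path) == smbd_pid:
            os.remove(path)
    else:
        raise ValueError("action must be 'preexec' or 'postexec'")
```

With the smb.conf lines quoted above, action would correspond to $1,
smbd_pid to $2 (%d), and tokens to the remaining values (%I %h %S %U).
The read-then-write sequence is not atomic, so two smbd processes
starting at the same moment could still race on the same lock file.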

You may be right. Our Hydra licenses timed out and I haven't had any time
to look into this further. Your script may do the right thing, but this
tends to be a less-tested case and I do not have time to test it right now
:-(

> > > Samba needs a way to deal with these stale smbd processes. I'm still not
> > > exactly clear on what goes on that causes Samba to block on a socket,
> > > that ought to be dead, waiting for input; I've not spent enough time
> > > tracing the packets or the smbd processes so my analysis is partly based
> > > on guessing (I had no idea about the FIN-FIN/ACK bug when I sent my very
> > > first e-mail about this to the list); it could even be that there's a
> > > bug in the way the NT clients abandon the old connection (i.e., maybe
> > > they don't explicitly close it) or maybe there's a bug in NT's TCP/IP
> > > stack that causes TCP shutdown to not be reliable.
> >
> > I have investigated it, it has always been receive_smb() calling
> > read_socket_data() as of 2.0.5a source.
> 
> This doesn't tell us what goes wrong. I'm rather busy, but if I can I
> may set up a test and sniff the wire, see what's wrong...

Yesterday I debugged one client which couldn't upload files to the
server and saw that the timeout was triggered. We sniffed the network and
it looked like the client sent the packets, but after a while there were
no more replies from the server. The client resent the packet several
times but got no reply. Looking instead from the server's point of view,
the packet in question never reached the server. The network just ate the
packet! So the problems I see are not Samba's fault; it is just that
Samba in the official release isn't that good at handling bad networks
:-(...

> Once I figured out what was wrong as far as Samba was concerned it was
> not hard to come up with a workaround. Once I had a workaround I lost
> any curiosity about what was causing the problem in the first place. Now
> I'm curious again.
> 
> Technically TCP ought to recover connection shutdowns from short term
> packet loss network conditions. What's happening indicates that this is
> not happening. This is why the other day I theorized that the
> FIN-FIN/ACK bug may be to blame, but it could be other things. Whatever
> it is we ought to find it and get Sun and/or Microsoft to fix it.

Or 3COM! But I do not expect them to fix bugs.

> [...]
> > Your system may have a lower keepalive timer in the TCP stack; I timed
> > our Solaris 2.5.1 server to take 2 hours to recover.
> 
> Like I said, we're using Solaris 2.6. I know it's got many TCP/IP stack
> changes with respect to 2.5.1 (like the routing code; 2.6 implements
> VLSM/CIDR, for example, whereas 2.5.1 does not).
> 
> > > We've only got experience with Samba running on Solaris, so the above
> > > might only apply to Solaris. I wonder what others' experiences on other
> > > platforms have been.
> 
> > It looks like there is a mismatch between Solaris and M$...
> 
> I guess so, at least until someone tells us it happens under some other
> server OS as well...
> 
> It's easy to test folks, connect to some [non-Solaris] Samba server from
> an NT workstation, open a file on that share with some editor, yank
> either the client's or the server's network connection, try to save,
> wait 1 minute, put the network back, continue trying to save the file.
> 
> If after you put the network back your editor hangs for a long time, then you
> have the same problem we're talking about in this thread.

This sounds odd. My understanding of Samba's behaviour is like this:
o The editor gets an oplock that the smbd process registers.
o The smbd process waits for oplock breaks or client requests.
o You cut the network.
o The client tries to save, but times out and closes its side of the
  socket.
o You reconnect the network and the client makes a new connection, which
  starts a new smbd process.
o The client tries to get the oplock again, so the new smbd process
  contacts the "stale" process, which should try to contact the client.
  The client has closed the connection and should therefore respond with
  a reset of the connection.
o The stale smbd should let go of the lock, report to the other smbd,
  and die.
o The new smbd should get the oplock.
o This should go fast.

But I see that there may be some open questions:
Does the client respond with a reset?
Does the server handle the reset the right way?
Does smbd understand the closed connection?
Does smbd release the oplocks before terminating?
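
To make the difference between these cases concrete, here is a small
Python sketch of how a server process can tell them apart on a blocking
read. It is not Samba code and says nothing about what receive_smb() or
read_socket_data() actually do; the function names and the timeout
parameter are assumptions for illustration only. The point is that an
orderly close and a reset are both visible to the reader, while a peer
that vanished without any packet arriving looks exactly like an idle
peer unless a read timeout or TCP keepalive is in place.

```python
import socket


def enable_keepalive(sock):
    """Ask the kernel to probe an idle peer; probe timing is set by the OS."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)


def classify_read(sock, timeout_seconds, max_bytes=4096):
    """Attempt one read and report what the connection looks like.

    Returns (state, data), where state is one of:
      "data"    - bytes arrived from the peer
      "closed"  - the peer closed its side in an orderly way
      "reset"   - the peer answered with a connection reset
      "timeout" - nothing arrived; the peer may be idle or gone
    """
    sock.settimeout(timeout_seconds)
    try:
        data = sock.recv(max_bytes)
    except socket.timeout:
        return ("timeout", b"")
    except ConnectionResetError:
        return ("reset", b"")
    if data == b"":
        # recv() returning no bytes on a stream socket means end of stream.
        return ("closed", b"")
    return ("data", data)
```

Without the timeout, the "timeout" case would simply block. With only
keepalive enabled, the read would block until the OS keepalive timer
expires, which would be consistent with the two-hour recovery mentioned
earlier in the thread if that timer was left at a typical default.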

I do not have time to test this for a while, but it looks interesting.

/Mattias