Samba, NT, and transient network failures

Mon Jan 25 20:55:59 GMT 1999

We've recently completed an internal evaluation of products to serve Unix
filesystems from our Unix fileservers to our NT clients via SMB.

We evaluated two products: Samba 1.9.18pl10 and a commercial product.

We've selected Samba. We had been using the commercial product for some
time.

All is great, fine, dandy.

But, as a result of our experience with the commercial product, we knew
of a "problem" with it and tested for it in Samba. Samba suffers from
the same problem.

The problem can be summarized as follows:

 - An NT client is connected to a Samba server. The user has some file(s)
   open on a Samba-exported share.
 - The NT client and the Samba server lose network connectivity, for
   whatever reason, for a short period of time.
 - The user or his application attempts to save his file, and the
   application hangs because the Samba server is unreachable.
 - The NT client kernel abandons its TCP connection to the server and
   attempts to establish a new connection.
 - Network connectivity between the NT client and the Samba server is
   restored.
 - The NT client establishes a new TCP connection to the Samba server.

 PROBLEM:

 - The smbd process for the old TCP connection hangs around because as
   far as it and the Unix kernel are concerned the old TCP connection is
   still alive. This process still holds and honors all locks, oplocks
   and deny modes held by the NT client.
 - The new smbd honors the locks held by the old smbd.
 - The user's applications on the NT client continue to hang as they
   block while the old smbd holds the old locks.

I believe that the system recovers from this lockup within 15 minutes or
so after the restoration of network connectivity between the client and
the server.

This problem can be very annoying, even if it occurs very rarely.

There is a solution to this problem. If the new smbd kills the old smbd,
then the old smbd releases all of its locks, leaving the NT client free
to re-establish its locks (which it does, though only as each
application accesses files it had open on the Samba share, rather than
attempting to re-obtain all locks on that share in one fell swoop).

I've implemented this solution via the "root preexec" and "root
postexec" configuration parameters in smb.conf. Samba is configured to
call an external script when a client connects to a share and again when
smbd closes that connection, passing it "preexec" or "postexec", the
smbd PID, the IP address of the client, the hostname of the server, the
share name and the username. When run by "preexec", the script creates a
PID file named after the client, server, share and user; if the PID file
already exists, it first kills the old smbd whose PID is stored in it.
When run by "postexec", the script simply removes the PID file.

I've tested Samba running with this configuration and emulated the
network connectivity problem as above. With this configuration the
applications running on the NT client recover very quickly once the NT
client and the Samba server re-establish their communication.
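
The PID-file step described above depends on creating the file only if
it does not already exist, which is presumably why the script sets
noclobber. Here is a minimal sketch of that behaviour in plain sh. The
helper name claim_pid_file, the temporary directory and the PIDs are
made up for illustration and are not part of Samba or of the script.

```shell
#!/bin/sh
# Sketch: claim a PID file using noclobber (set -C).
# claim_pid_file is a hypothetical helper used only for this example.

claim_pid_file() {
    # $1 = pid file path, $2 = pid to record
    # Returns 0 if the file was created, non-zero if it already existed.
    (
        set -C              # noclobber: '>' refuses to overwrite a file
        echo "$2" > "$1"
    ) 2>/dev/null
}

demo_dir=$(mktemp -d)
demo_file=$demo_dir/client-server-share-user.pid

# The first caller creates the file; the second one is refused.
if claim_pid_file "$demo_file" 1111; then first=claimed; else first=busy; fi
if claim_pid_file "$demo_file" 2222; then second=claimed; else second=busy; fi
recorded=$(cat "$demo_file")

echo "first=$first second=$second recorded=$recorded"
rm -rf "$demo_dir"
```

The second attempt fails and the file still holds the first PID, so a
later preexec run can tell that an older smbd has already claimed the
same client, server, share and user.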

The preexec and postexec configuration parameters are set as follows:

	root preexec = /usr/local/libexec/samba/chkStaleSession preexec %d %I %h %S %U
	root postexec = /usr/local/libexec/samba/chkStaleSession postexec %d %I %h %S %U
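
To show how those substitutions line up with the script's positional
parameters, here is a small sketch that builds the PID file name the
same way the script does. The function name pid_file_for and all of the
example values (PID, address, host, share, user) are invented for
illustration.

```shell
#!/bin/sh
# Sketch: map the expanded smb.conf arguments to the PID file name.
# Argument order follows the config lines: op %d %I %h %S %U

pid_file_for() {
    # $1 = preexec|postexec   $2 = %d (smbd PID)   $3 = %I (client IP)
    # $4 = %h (server host)   $5 = %S (share)      $6 = %U (user)
    lockdir=/var/locks      # same value as SMBLOCKS in the script
    echo "$lockdir/$3-$4-$5-$6.pid"
}

# Hypothetical connection: every value here is made up.
example=$(pid_file_for preexec 4242 192.0.2.10 fileserv projects alice)
echo "$example"
```

The smbd PID is deliberately not part of the file name: it is the
file's content, so a reconnecting client maps to the same file and
finds the PID of the smbd it left behind.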

If our analysis of the problem is incorrect, or if there's a better way
to attack this problem, please let me know.

The chkStaleSession script (Korn Shell) follows.

Nico

************************************************************
#!/bin/ksh
#
# NAME: chkStaleSession
# AUTHOR: nicolas.williams at wdr.com based on an analysis of a bug by
#	  roman.gollent at wdr.com
# PURPOSE: kill old smbd processes when a client abandons its TCP
#	   connection and establishes a new connection, but the Samba
#	   server doesn't know. To be called by smbd as part of share
#	   preexec/postexec configuration.
#
# (c) 1999, Perot Systems Corporation.

SMBLOCKS=/var/locks
SMBLOGS=/var/log/smblogs

set -o noclobber

# Set stderr
exec 2>> $SMBLOGS/staleSessions.log
print -u2 -- "$@"

SELF=${0##*/}
OP=$1
SMBPID=$2
SMBCLIENT=$3
SMBSERVER=$4
SMBSERVICE=$5
SMBUSER=$6

status=""

# This sets the name of the pid file for this
# {client, server, share, user} tuple
PID_FILE=$SMBLOCKS/$SMBCLIENT-$SMBSERVER-$SMBSERVICE-$SMBUSER.pid

###
### The idea is to store the calling smbd's pid in $PID_FILE so that it
### can be killed if the client reconnects through a different TCP session.
###

case $OP in
	preexec)
		while [[ -f "$PID_FILE" ]]
		do
			PID=$(cat "$PID_FILE")

			# Check that the old process is still around ...
			# [I use grep smb because smbd, when started via
			#  inetd sets its ps strings to its arguments
			#  only, thus we cannot grab the process name
			#  from ps or ptree or whatever!]
			#
			# Replace ptree(1) with ps(1) on non-Solaris systems
			/usr/proc/bin/ptree $PID|grep smb > /dev/null 2>&1 || {
				# ... If not just remove the pid file and go on
				rm "$PID_FILE" || status=rmfail
				continue
			}

			# Ok, kill the old smbd
			print -u2 "$SELF: sending SIGTERM to $PID on behalf of $SMBUSER at $SMBCLIENT for $SMBSERVER:$SMBSERVICE"
			kill -TERM $PID

			# Wait for the old smbd to exit
			/usr/proc/bin/pwait $PID
		done

		# There's no pid file now so we can create it
		print -- "$SMBPID" > "$PID_FILE" || print -u2 -- "$SELF: Error: ($OP) could not create pid file $PID_FILE"
		;;
	postexec)
		# Remove the pid file for this session if it exists and
		# contains the pid of the smbd that called this script
		if [[ -f  "$PID_FILE" ]]
		then
			PID=$(cat "$PID_FILE")
			if [[ $PID = $SMBPID ]]
			then
				rm "$PID_FILE" || status=rmfail
			fi
		fi
		;;
esac

case $status in
	rmfail) print -u2 -- "$SELF: Error: ($1) pid file $PID_FILE has not been removed!"
		;;
	*)	:
		;;
esac

exit 0
************************************************************
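
The script relies on ptree(1) and pwait(1), which are Solaris proc
tools. On other systems, one possible replacement for the "kill and
wait" step is to poll with kill -0, as in the sketch below. The helper
name term_and_wait and the default 10-second limit are assumptions of
mine and are not part of the script above. Unlike pwait, this version
gives up after the limit instead of blocking indefinitely.

```shell
#!/bin/sh
# Sketch: send SIGTERM and wait for the process to exit without pwait(1).
# term_and_wait is a hypothetical helper; the time limit is arbitrary.
# kill -0 only tests whether the PID can be signalled, so this assumes
# the caller is root (as with "root preexec") or owns the process.

term_and_wait() {
    # $1 = pid, $2 = maximum seconds to wait (default 10)
    pid=$1
    limit=${2:-10}
    kill -TERM "$pid" 2>/dev/null || return 0   # process already gone
    waited=0
    while kill -0 "$pid" 2>/dev/null; do
        [ "$waited" -ge "$limit" ] && return 1  # still alive, give up
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Demo: a long-running sleep stands in for the stale smbd.
sleep 60 &
victim=$!
if term_and_wait "$victim" 5; then result=exited; else result=timeout; fi
echo "result=$result"
```

A non-zero return means the old process ignored SIGTERM within the
limit. What to do then (log it, escalate to SIGKILL, or carry on) is a
policy choice the original script does not address.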