losing connections to password server

Brandon Craig Rhodes brandon at oit.gatech.edu
Tue May 27 13:54:06 GMT 2003

   [This question seemed most appropriate for developers, but I tried
   submitting it to the general list first.  The one response
   contained no technical details, so now I am posting here.]

Last year the samba servers for our Georgia Tech computer clusters
were crashing about once a day running samba 2.0.9.  Upgrading to
2.2.5 was disastrous - the Windows machines in the cluster started
giving chronic bad-password errors.  Since daily crashes seemed an
easier problem to fix, we dug in and fixed the 2.0.9 source code.
(Our patch is attached, for those interested in running 2.0.9 stably.)

With the recent security issues, we have attempted upgrading to samba
2.2.8a; it promptly started giving our users bad-password errors for
what we knew were good passwords.  We discovered that once a Windows
box started giving these bad-password errors, no one would be able to
log on until we killed the samba process serving that Windows box, or
rebooted the Windows box to force it to hang up on the process.

I should explain that we have two levels of samba server.  Passwords
are resolved by our central password samba server; it is configured
purely for authentication.  Actual user home directories and printers
are served from subordinate samba servers in each public computer
cluster.  Each cluster server specifies:

        security = server
        password server = <the central campus samba server>

in its smb.conf file.

The problem wound up being that, as a password server, samba closes
idle connections after 60 seconds; but as a client, it has no idea how
to re-open a connection to its password server!  We were stunned, but
there it was: in the server_validate() function, if cli->initialised
is false, server validation is simply abandoned with an error message.

For those trying to diagnose similar errors, look for messages on your
password server that report `Closing idle connection'.  On the cluster
server, you will then see smbd processes reporting:

 lib/util_sock.c:(499) write_socket_data: write failure. Error = Broken pipe

indicating that the password server hung up on them.  After that, the
process will log each authentication attempt with:

  smbd/password.c:(1102) password server  is not connected
  smbd/password.c:(545) Couldn't find user 'burdell' in passdb.
  smbd/password.c:(545) Couldn't find user 'burdell' in passdb.
  smbd/reply.c:(1023) Rejecting user 'burdell': authentication failed

and from this point until it is killed, the process will be unable to
authenticate any more users.
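
For anyone who wants to scan logs for these symptoms automatically,
the following is a rough sketch of a line classifier in C.  The enum
and the function name classify_log_line() are made up for this
example; the substrings it searches for are taken from the log
messages quoted above, and other samba versions may word them
differently.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical labels for the three symptoms described above. */
enum log_symptom {
    SYMPTOM_NONE = 0,
    SYMPTOM_IDLE_CLOSE,     /* password server dropped an idle link */
    SYMPTOM_BROKEN_PIPE,    /* cluster server wrote to the dead link */
    SYMPTOM_NOT_CONNECTED   /* cluster server gave up on validation */
};

/* Classify one log line by plain substring matching. */
static enum log_symptom classify_log_line(const char *line)
{
    assert(line != NULL);
    if (strstr(line, "Closing idle connection") != NULL)
        return SYMPTOM_IDLE_CLOSE;
    if (strstr(line, "write_socket_data") != NULL &&
        strstr(line, "Broken pipe") != NULL)
        return SYMPTOM_BROKEN_PIPE;
    if (strstr(line, "password server") != NULL &&
        strstr(line, "is not connected") != NULL)
        return SYMPTOM_NOT_CONNECTED;
    return SYMPTOM_NONE;
}
```

Feeding each line of a log file through this function (with fgets(),
for example) and counting the non-zero results should show whether a
given server is affected.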

(For those interested, both password server and cluster server in this
instance are running Solaris 8.)

Why does server_validate() simply give up rather than re-establishing
its connection to the password server?  Though I am not fluent in the
SMB protocol, perhaps the cluster server process passes along to its
client workstation the session key it receives from the password
server, which means the password hashes submitted by the client would
not work on a subsequent connection, whose session key would be
different.  So server_validate() must give up.

But my knowledge of the protocol is fuzzy - someone else can probably
better explain why server_validate() does not, or cannot, attempt a
reconnection.  Both our password server and cluster server use:

        encrypt passwords = yes

in their smb.conf files, and the password server is currently using an
smbpasswd containing all of our users.

Anyway, unless reconnecting becomes an option, I must keep the
connection open between the password server and the cluster server.
For the moment I have solved the problem by raising the password
server's idle timeout from sixty seconds to twelve hours (!), which
involves changing a line in include/local.h:

        #define IDLE_CLOSED_TIMEOUT (60)

to a larger value.  I am experimenting with setting

        keepalive = 30

on our cluster samba servers instead, but this does not seem to fix
the problem; I have not yet had time to determine whether the cluster
server is really sending keepalives every thirty seconds with this
option, or whether the password server is refusing to be convinced to
leave its connections open.
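
As a quick check on the numbers, here is a small C sketch of the two
timeout values and of what they mean for an idle connection.  The
macro names and idle_would_close() are hypothetical, chosen only for
this example; the sketch assumes a connection is closed once its idle
time reaches the timeout, and the actual test inside smbd may differ.

```c
#include <assert.h>
#include <stdbool.h>

#define IDLE_TIMEOUT_DEFAULT (60)            /* seconds, as shipped */
#define IDLE_TIMEOUT_RAISED  (12 * 60 * 60)  /* twelve hours = 43200 s */

/* Hypothetical helper: would a connection that has been idle for
 * idle_seconds be closed under the given timeout?  Assumes closure
 * happens once the idle time reaches the timeout. */
static bool idle_would_close(long idle_seconds, long timeout_seconds)
{
    assert(idle_seconds >= 0 && timeout_seconds > 0);
    return idle_seconds >= timeout_seconds;
}
```

Under that assumption, a cluster server that goes ninety seconds
between logins loses its connection with the shipped value but keeps
it with the raised one.  Whether keepalive packets count as activity
on the password server side is exactly what I have not yet verified.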

Anyway, does anyone have recommendations for solving this problem?
Google searches have left me with the uncomfortable feeling that we
alone in the world have configured our samba servers this way - does
this configuration (central password server, local file and printer
servers since the load could not be sustained centrally) somehow
constitute abuse of the intentions behind samba?

Thanks for any help or ideas.

  [This patch prevents samba 2.0.9 from crashing by making sure the
  main thread, after forking a new process, correctly records that it
  is not connected to a remote host.]

-------------- next part --------------
diff -ur samba-orig/source/smbd/server.c samba-2.0.9/source/smbd/server.c
--- samba-2.0.9/source/smbd/server.c	2000-03-16 17:59:52.000000000 -0500
+++ samba-maybe/source/smbd/server.c	2002-12-09 15:44:18.000000000 -0500
@@ -253,6 +253,7 @@
 			/* The parent doesn't need this socket */
+			Client = -1;
 			/* Force parent to check log size after
 			 * spawning child.  Fix from
-------------- next part --------------

Brandon Craig Rhodes                         http://www.rhodesmill.org/brandon
Georgia Tech                                            brandon at oit.gatech.edu
