[Samba] Mutex lock contention against Active Directory domain controllers causing authentication failures

Bell D. D.Bell at soton.ac.uk
Thu Jul 19 09:39:46 MDT 2012


Hello,

We are using Samba 3.4.6 (packaged by opencsw.org) against Active Directory 2003 on our primary University filestore. The operating system is Solaris 10 Update 10. We have a number of domain controllers. For the past two days our main filestore has been failing connections from a number of clients.

When using smbclient (or indeed any client) to connect to the Samba server, we see logs similar to the following at log level 3:

[2012/07/18 20:00:33.762539,  3] libsmb/namequery.c:2461(get_dc_list)
  get_dc_list: preferred server list: ", *"
[2012/07/18 20:00:43.756966,  1] ../lib/util/tdb_wrap.c:65(tdb_wrap_log)
  tdb(/var/opt/csw/samba/locks/mutex.tdb): tdb_lock failed on list 126 ltype=2 (Interrupted system call)
[2012/07/18 20:00:43.757104,  0] lib/util_tdb.c:72(tdb_chainlock_with_timeout_internal)
  tdb_chainlock_with_timeout_internal: alarm (10) timed out for key UOS-ADS00002-SI.SOTON.AC.UK in tdb /var/opt/csw/samba/locks/mutex.tdb
[2012/07/18 20:00:43.757214,  1] lib/server_mutex.c:74(grab_named_mutex)
  Could not get the lock for UOS-ADS00002-SI.SOTON.AC.UK
[2012/07/18 20:00:53.756881,  1] ../lib/util/tdb_wrap.c:65(tdb_wrap_log)
  tdb(/var/opt/csw/samba/locks/mutex.tdb): tdb_lock failed on list 126 ltype=2 (Interrupted system call)
[2012/07/18 20:00:53.757009,  0] lib/util_tdb.c:72(tdb_chainlock_with_timeout_internal)
  tdb_chainlock_with_timeout_internal: alarm (10) timed out for key UOS-ADS00002-SI.SOTON.AC.UK in tdb /var/opt/csw/samba/locks/mutex.tdb
[2012/07/18 20:00:53.757130,  1] lib/server_mutex.c:74(grab_named_mutex)
  Could not get the lock for UOS-ADS00002-SI.SOTON.AC.UK
[2012/07/18 20:01:03.756905,  1] ../lib/util/tdb_wrap.c:65(tdb_wrap_log)
  tdb(/var/opt/csw/samba/locks/mutex.tdb): tdb_lock failed on list 126 ltype=2 (Interrupted system call)
[2012/07/18 20:01:03.757102,  0] lib/util_tdb.c:72(tdb_chainlock_with_timeout_internal)
  tdb_chainlock_with_timeout_internal: alarm (10) timed out for key UOS-ADS00002-SI.SOTON.AC.UK in tdb /var/opt/csw/samba/locks/mutex.tdb
[2012/07/18 20:01:03.757260,  1] lib/server_mutex.c:74(grab_named_mutex)
  Could not get the lock for UOS-ADS00002-SI.SOTON.AC.UK
[2012/07/18 20:01:03.757420,  0] auth/auth_domain.c:292(domain_client_validate)
  domain_client_validate: Domain password server not available.
[2012/07/18 20:01:03.757527,  2] auth/auth.c:319(check_ntlm_password)
  check_ntlm_password:  Authentication for user [db2z07] -> [db2z07] FAILED with error NT_STATUS_NO_LOGON_SERVERS

After reading through the Samba source code, it looks like whenever a new session setup happens smbd tries to authenticate the user, but to do this it must first lock a key in the mutex.tdb file. It tries to lock the key but fails three times, each after a 10-second timeout, before giving up (presumably because another process has it locked). Sadly, when it is unable to lock the key in the mutex TDB file, the code returns NT_STATUS_NO_LOGON_SERVERS (despite the fact that it never tried to connect to a logon server), giving the message "Domain password server not available".

When using ONE of our domain controllers - UOS-ADS00003-SI - no problems occur. When Samba switches to another domain controller (such as UOS-ADS00001-SI or UOS-ADS00002-SI), the errors shown in the above logs occur again. My current working theory is that there is a problem talking to some of our domain controllers and one smbd locks the key in the mutex, preventing the other smbd processes from getting a lock (and thus resulting in the above logs).

Sadly we can't find what is holding the lock, and with 1800 smbd processes running most of the time it is very difficult to spot any other errors in talking to the domain controller. In the Samba source code there is an ironic comment in the mutex locking code:

From source3/lib/util_tdb.c:

/* TODO: If we time out waiting for a lock, it might
 * be nice to use F_GETLK to get the pid of the
 * process currently holding the lock and print that
 * as part of the debugging message. -- mbp */

Right now we've worked around the problem by forcing Samba to use a particular domain controller (password server = uos-ads00003-).

My questions are:

1. Can somebody implement the idea above, using F_GETLK to log the PID of the process that has the mutex key locked?
2. Why does Samba switch between domain controllers every so often?
3. Can anybody think of a way to determine what is holding the lock and why it is holding the lock?
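
On questions 1 and 3, a standalone diagnostic along the following lines might be a starting point. It is only a sketch under my own assumptions, not a patch against Samba: report_lock_holders is a hypothetical name, and it knows nothing about the TDB layout. It walks the file with F_GETLK and prints the PID that fcntl() reports for each byte range it finds locked. It has to run as a separate process, since F_GETLK does not report locks held by the caller, and where several processes hold overlapping read locks it will only show one of them per range.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: print the locks that other processes hold on
 * `path`. Returns the number of locks reported, or -1 on error.
 */
static int report_lock_holders(const char *path)
{
    struct flock fl;
    off_t start = 0;
    int found = 0;
    int fd;

    assert(path != NULL);

    /* Opened read-write so that asking about a write lock is valid. */
    fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    for (;;) {
        /* Ask: would a write lock from `start` onwards be blocked? */
        memset(&fl, 0, sizeof(fl));
        fl.l_type = F_WRLCK;
        fl.l_whence = SEEK_SET;
        fl.l_start = start;
        fl.l_len = 0;               /* 0 means "to end of file and beyond" */

        if (fcntl(fd, F_GETLK, &fl) == -1) {
            close(fd);
            return -1;
        }
        if (fl.l_type == F_UNLCK)
            break;                  /* nothing else would block us */

        printf("pid %ld holds a %s lock: start=%lld len=%lld\n",
               (long)fl.l_pid,
               fl.l_type == F_WRLCK ? "write" : "read",
               (long long)fl.l_start, (long long)fl.l_len);
        found++;

        if (fl.l_len == 0)
            break;                  /* that lock already runs to EOF */
        start = fl.l_start + fl.l_len;   /* continue past this lock */
    }

    close(fd);
    return found;
}
```

Pointed at /var/opt/csw/samba/locks/mutex.tdb while the timeouts are occurring, something like this should name the smbd that is holding the lock, which could then be inspected on its own.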

Sadly I cannot replicate the problem on other Solaris or Linux systems running Samba. 

I'd greatly appreciate any help anybody can offer!

Cheers,

David Bell
UNIX Systems Administrator
University of Southampton

