Samba 2.2.0 locking problem (long)

Tue Jun 5 17:27:36 GMT 2001

On Thu, 31 May 2001, David Collier-Brown wrote:

> 	HEY!!! just a sec, you said
> 		fcntl(13, F_SETLKW, 0xFFBEF218)   = 0
> 		fcntl(13, F_SETLKW, 0xFFBEF218)   = 0
>
> 	That's a success indication: Samba is looping on success,
> 	not failure (-1).
>
> 	Do a pstack the next time and mail it to the list, this
> 	shouldn't be happening!

I've seen similar things. I'm currently recovering from a situation
where the number of smbds increased dramatically (800 smbd processes
on a network which typically has 150-odd clients). The excess smbds
appeared to consist of one "busy" process, which was monopolizing the
processor, and a bunch of sleeping processes.

truss on the smbd that was burning CPU gives stuff like:

kill(3024, SIG#0)                               = 0
fcntl(7, F_SETLKW64, 0xEFFFF908)                = 0
fcntl(7, F_SETLKW64, 0xEFFFF908)                = 0
fcntl(7, F_SETLKW64, 0xEFFFF908)                = 0
fcntl(7, F_SETLKW64, 0xEFFFF908)                = 0
fcntl(7, F_SETLKW64, 0xEFFFF908)                = 0
fcntl(7, F_SETLKW64, 0xEFFFF978)                = 0
fcntl(7, F_SETLKW64, 0xEFFFF908)                = 0
fcntl(7, F_SETLKW64, 0xEFFFF908)                = 0
fcntl(7, F_SETLKW64, 0xEFFFF908)                = 0
fcntl(7, F_SETLKW64, 0xEFFFF978)                = 0
fcntl(7, F_SETLKW64, 0xEFFFF908)                = 0
fcntl(7, F_SETLKW64, 0xEFFFF908)                = 0
fcntl(7, F_SETLKW64, 0xEFFFF908)                = 0
fcntl(7, F_SETLKW64, 0xEFFFF978)                = 0
kill(19694, SIG#0)                              = 0
fcntl(7, F_SETLKW64, 0xEFFFF908)                = 0
fcntl(7, F_SETLKW64, 0xEFFFF908)                = 0
fcntl(7, F_SETLKW64, 0xEFFFF908)                = 0
fcntl(7, F_SETLKW64, 0xEFFFF978)                = 0
fcntl(7, F_SETLKW64, 0xEFFFF908)                = 0
fcntl(7, F_SETLKW64, 0xEFFFF908)                = 0
fcntl(7, F_SETLKW64, 0xEFFFF908)                = 0
fcntl(7, F_SETLKW64, 0xEFFFF978)                = 0

(Unfortunately, neither truss -v fcntl nor truss -v all appears to give
any additional info. This is on Solaris 2.6, if it makes a difference.)
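
For anyone reading the "= 0" results in the truss output above, here is a
minimal sketch of what a successful blocking lock call looks like from
the caller's side. It assumes plain POSIX fcntl() with F_SETLKW rather
than the F_SETLKW64 variant that truss reports, and byte_lock is a
hypothetical helper made up for illustration, not a Samba or tdb function.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper (not a Samba or tdb function): apply a blocking
 * byte-range lock operation of the given type (F_RDLCK, F_WRLCK or
 * F_UNLCK) to the range [offset, offset + len) of fd.
 * Returns fcntl()'s result: 0 on success, -1 on failure with errno set. */
static int byte_lock(int fd, short type, off_t offset, off_t len)
{
    struct flock fl;

    memset(&fl, 0, sizeof(fl));
    fl.l_type = type;
    fl.l_whence = SEEK_SET;   /* offset is measured from start of file */
    fl.l_start = offset;
    fl.l_len = len;

    /* F_SETLKW waits until the lock can be granted instead of failing
     * immediately when another process holds a conflicting lock. */
    return fcntl(fd, F_SETLKW, &fl);
}
```

On an otherwise unlocked file, byte_lock(fd, F_WRLCK, 0, 1) followed by
byte_lock(fd, F_UNLCK, 0, 1) should each return 0, so a loop that locks
and unlocks ranges repeatedly would show up in truss as a run of "= 0"
lines without any individual call having failed.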

pstack shows the smbd concerned doing a tdb_traverse of the connections
database:

 ef5b6230 __fcntl  (7, 23, effff978, 0, 3f, 1c45b0) + c
 00110d1c tdb_brlock (1c4450, 230, 3, 23, 0, ef5c672c) + 50
 00110e8c tdb_unlock (1c4450, 62, 2, 108, ef622eb4, 1123a0) + 7c
 00112330 tdb_traverse (1c4450, 3ad40, effffb48, 7efefeff, 81010100, ff0000) + 9c
 0003af18 claim_connection (0, 1209d8, 186a0, 1, ef6259c0, ef625c4c) + dc
 000384f4 main     (ffffffff, effffea4, effffeac, 1a6d10, 0, 0) + 5b0
 00036ff0 _start   (0, 0, 0, 0, 0, 0) + 5c

I don't know whether the traverse would ever finish of its own accord,
but it persists for many minutes of wall-clock time without any sign of
completing. If I kill the smbd concerned then one of the sleepers
immediately wakes up from its slumbers and gets stuck in the same place.

I tried to capture a level 10 debug log, but the smbds that are stuck in
the traverse don't appear to answer to smbcontrol messages.

If there is a problem that might affect tdb_traverse in general, this
might explain why Jeremy's changes[1] to clean up the locking.tdb on
smbd startup/shutdown caused big load problems for me when I tried them.

[1] http://lists.samba.org/pipermail/samba-technical/2001-May/013346.html

Regards,
-- 
Neil Hoggarth                                 Departmental Computer Officer
<neil.hoggarth at physiol.ox.ac.uk>                   Laboratory of Physiology
http://www.physiol.ox.ac.uk/~njh/                  University of Oxford, UK