smbd blocking in receive_smb

Tue Aug 6 08:55:03 GMT 2002

Investigating further last night (in France :) ), I found the following:

1. One smbd was blocked in a read on a FIFO.
    Since this read is restartable (I tested this by sending it a kill while it
was under strace), this smbd will NEVER terminate. (The FIFO had been left on a
share by a very old process, and this smbd tried to read it as part of the
normal processing of a share backup...)

2. While it was in this state, I had lots of other smbds stuck in:
fcntl64(14, F_SETLKW64, {type=F_WRLCK, whence=SEEK_SET, start=256, len=1} <unfinished ...>
fd 14 is the locking.tdb database here (confirmed with lsof).

3. I compiled tdbtool and tried to list the locking database.
 ==> tdbtool list or dump blocks each time at the same place in
locking.tdb: the same offset as the other smbds!

4. I removed the FIFO (rm)

5. my first smbd then terminated properly

6. all the other blocked smbd terminated properly

7. smbstatus showed ghost locks (PIDs no longer present)

8. smbclient -L localhost ... cleaned up the database as expected.
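
The hang in step 1 comes from a blocking read on a FIFO that has no writer. As
an illustration only (this is not Samba code, and the helper names `is_fifo`
and `read_regular_file` are hypothetical), here is a minimal Python sketch of
how a backup-style reader could skip FIFOs instead of blocking on them. It
assumes a POSIX system where opening a FIFO read-only with O_NONBLOCK returns
immediately:

```python
import os
import stat


def is_fifo(path):
    """Return True if path is a FIFO (named pipe); symlinks are not followed."""
    return stat.S_ISFIFO(os.lstat(path).st_mode)


def read_regular_file(path):
    """Return the contents of path if it is a regular file, else None.

    Hypothetical helper. O_NONBLOCK keeps the open() itself from blocking
    on a FIFO with no writer; fstat() on the open descriptor then checks
    the file type without a race between the check and the open.
    """
    fd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
    if not stat.S_ISREG(os.fstat(fd).st_mode):
        os.close(fd)
        return None  # FIFO, directory, device, ...: skip it
    with os.fdopen(fd, "rb") as f:  # fdopen takes ownership of fd
        return f.read()
```

A scan for stale FIFOs on a share could call `is_fifo` on each directory entry
and report the ones it finds, so they can be removed before a reader gets
stuck on them.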

I am running Samba 2.2.5 (the -14mdk Mandrake package).
Several times a week I have lock problems like this on the server, and nobody
can access the share during that time. This is the first time I have noticed
that locking.tdb cannot be scanned; I will have to check that next time too.

I started running at debug level 10 today, but to no avail: everything works
fine today :(

Maybe there is a race condition or timing problem in the lock code on locking.tdb.

I really wonder whether one smbd process can block other smbds just because it
holds a byte-range lock on this database. I know that at startup a new smbd
scans this table to remove dead locks; can this scan be blocked at some point
in the Samba code?
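
To check, the next time this happens, whether the byte the stuck smbds are
waiting on is really held by another process, one could probe it with a
non-blocking lock request instead of the blocking F_SETLKW. The following is a
minimal Python sketch, not Samba or tdb code: the helper name
`byte_range_is_free` is hypothetical, and the offset 256 with length 1 is
simply the range seen in the strace output above.

```python
import errno
import fcntl
import os


def byte_range_is_free(path, start, length=1):
    """Return True if an exclusive lock on bytes [start, start+length)
    could be taken right now, False if another process holds a
    conflicting lock. The probe lock is released before returning.
    """
    # An exclusive (write) lock needs a descriptor opened for writing.
    with open(path, "r+b") as f:
        try:
            fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB,
                        length, start, os.SEEK_SET)
        except OSError as e:
            if e.errno in (errno.EACCES, errno.EAGAIN):
                return False  # would have blocked, like the stuck smbds
            raise
        fcntl.lockf(f, fcntl.LOCK_UN, length, start, os.SEEK_SET)
        return True
```

For example, `byte_range_is_free("locking.tdb", 256)` (with the real path to
the database substituted) would return False while some process holds that
byte. Note that these are POSIX record locks, which are owned per process: the
probe only tells you something when it runs in a different process from the
lock holder.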

Any clue would be appreciated.

Pascal

On Tuesday 6 August 2002 17:02, David Collier-Brown wrote:
> Pascal wrote:
> > I've read your problem report in the samba-technical list and it seems I
> > am experiencing the same symptoms on some servers too.
> >
> > I noticed that the 'dead' smbd are in fact blocked because the network
> > socket is still to be closed by the client side (connection shown as
> > CLOSE_WAIT in netstat -an).
>
> 	Hmmn: I think this is another variant of a known
> 	Windows <censored> feature.  Try adding
> 		dead time = 10
> 		keepalive = 3600
> 	to the conf file, which should clean up broken
> 	clients after 10 minutes of apparent death.
>
> 	If ten minutes is too long (and it may well be),
> 	change just keepalive to 30 (seconds).  This will
> 	cause samba to notice if a client goes silent for 30
> 	seconds and send them an "are you alive" query.  If
> 	they aren't alive samba will tear them down and release
> 	any data structures and locks they're holding.
>
> 	This is usually a symptom of a network or client
> 	problem: I have a note about an earlier symptom at
> http://www.geocities.com/orville_torpid/papers/ethernet-note.txt
>
> Horst wrote:
> > The visible symptom is that users find themselves locked out of files by
> > their own processes.  On the server side we find that users have multiple
> > smbd's running that are talking to the same PC.  One smbd will be the
> > active one and others (older ones) will be blocked and will not respond
> > to SIGTERM.  Killing with other signals works, but leaves old locks in
> > the lock database.
> >
> > Below I've included a sample of four processes shown in smbstatus output
> > and the gdb backtrace of each of the four.  Three of them are blocked in
> > read_data and not being responsive to the client PC anymore.  Clearly, as
> > each of these processes blocked the client PC happily started a new one.
> >
> > The smbd version is 2.2.5 on Linux, Redhat 7.2.
> >
> > I realize that everyone is working on more exciting aspects of Samba but
> > wonder if anybody more familiar with the util_sock.c routines and
> > signalling would have ideas or hints for me to debug further.  The
> > receive code gets twisty and I'm not sure I totally understand how it's
> > supposed to be working with keepalive packets and such.
> >
> > --Eric
>
> --dave