Run away number of smbd children

Jeremy Allison jra at samba.org
Mon Nov 5 18:19:14 GMT 2007


On Mon, Nov 05, 2007 at 10:22:55AM -0800, Dave Daugherty wrote:
> Kevin:
> 
> A quick update...
> 
> With 32 smbtortures running on 8 clients, we have been able to reproduce
> behavior similar to this and bug 3204 on Samba 3.0.26a.   We are still
> investigating it.
> 
> The current - gory details from our engineer Yen Liew...
> 
> I was able to reproduce the thousands of smbd processes using 3.0.26a. I
> think this issue has existed since 3.0.25, but it does not occur on 3.0.23b.
> 
> The tdb lock-with-timeout code in 3.0.23b is different from the
> implementation in 3.0.25 and 3.0.26a.
> 
> In 3.0.26a, (we think) there's a bug in how tdb_chainlock_with_timeout()
> handles the SIGALRM signal, which causes the smbd process to keep calling
> fcntl() even after SIGALRM is received.
> 
> tdb_chainlock_with_timeout_internal() (in lib/util_tdb.c) sets up the
> gotalarm_sig() signal handler, which just sets the static variable
> gotalarm=1, as follows:
> 
> tdb_chainlock_with_timeout_internal()
> { .....
>         if (timeout) {
>                 CatchSignal(SIGALRM, SIGNAL_CAST gotalarm_sig);
>                 alarm(timeout);
>         }
>         if (rw_type == F_RDLCK)
>             ret = tdb_chainlock_read(tdb, key);
>         else
>             ret = tdb_chainlock(tdb, key);
>         if (timeout) {
>                 alarm(0);
>                 CatchSignal(SIGALRM, SIGNAL_CAST SIG_IGN);
> .....
> }
> 
> tdb_brlock() (in tdb/common/lock.c), which is eventually called by
> tdb_chainlock(), calls fcntl() with lck_type=F_SETLKW in a loop, as
> shown:
> 
> tdb_brlock()
> { ....
>        do {
>                 ret = fcntl(tdb->fd,lck_type,&fl);
>         } while (ret == -1 && errno == EINTR);
> .....
> }
> 
> According to the fcntl man page, EINTR is set when the call is interrupted
> by a signal that is caught (i.e. SIGALRM). So, when SIGALRM is received, the
> signal handler in util_tdb.c is called and sets gotalarm=1. After the handler
> returns, fcntl() returns ret=-1 with errno=EINTR, so the loop continues and
> the process hangs. tdb_brlock(), which is waiting for the signal, should
> either check the gotalarm value or use sigsetjmp/siglongjmp to jump to the
> desired location.
> 
> I tried using sigsetjmp/siglongjmp in
> tdb_chainlock_with_timeout_internal(), and that resolves the smbd process
> hang.

Great analysis - thanks ! I'll fix this asap.

Jeremy.

