Run away number of smbd children

Mon Nov 5 18:22:55 GMT 2007

Kevin:

A quick update...

With 32 smbtortures running on 8 clients, we have been able to reproduce
behavior similar to this and bug 3204 on Samba 3.0.26a.   We are still
investigating it.

The current - gory details from our engineer Yen Liew...

I was able to reproduce the thousands of smbd processes using 3.0.26a. I
think this issue has existed since 3.0.25, but it does not occur on
3.0.23b, since the tdb_lock with timeout code in 3.0.23b is different from
the implementation in 3.0.25 and 3.0.26a.

In 3.0.26a, (we think) there's a bug in how tdb_chainlock_with_timeout()
handles the SIGALRM signal, which causes the smbd process to keep calling
fcntl() even after SIGALRM is received.

tdb_chainlock_with_timeout_internal() (in lib/util_tdb.c) sets up the
gotalarm_sig() signal handler, which just sets the static variable
gotalarm=1, as follows:

tdb_chainlock_with_timeout_internal()
{ .....
        if (timeout) {
                CatchSignal(SIGALRM, SIGNAL_CAST gotalarm_sig);
                alarm(timeout);
        }
        if (rw_type == F_RDLCK)
            ret = tdb_chainlock_read(tdb, key);
        else
            ret = tdb_chainlock(tdb, key);
        if (timeout) {
                alarm(0);
                CatchSignal(SIGALRM, SIGNAL_CAST SIG_IGN);
.....
}

tdb_brlock() (in tdb/common/lock.c), which is eventually called by
tdb_chainlock(), calls fcntl() with lck_type=F_SETLKW in a while loop, as
shown:

tdb_brlock()
{ ....
       do {
                ret = fcntl(tdb->fd,lck_type,&fl);
        } while (ret == -1 && errno == EINTR);
.....
}

According to the fcntl man page, EINTR is set when a signal to be caught
(i.e. SIGALRM) is received. So, when SIGALRM is received, the signal
handler in util_tdb.c is called to set gotalarm=1; after returning from the
signal handler, errno=EINTR and ret=-1, so the loop continues, fcntl()
blocks again, and the process hangs. tdb_brlock(), which is waiting for the
signal, should either check the gotalarm value or use
sigsetjmp/siglongjmp to go to the desired location.

I tried using sigsetjmp/siglongjmp in
tdb_chainlock_with_timeout_internal(), and the smbd process hang issue was
resolved.
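
For anyone following along, here is a rough sketch of what the
sigsetjmp/siglongjmp approach can look like. This is not the actual change
we made to Samba; it is a standalone illustration, and every name in it
(lock_with_timeout_jmp, alarm_jump_sig, lock_timeout_env) is made up for
the example. It locks a single byte with plain fcntl() rather than going
through tdb.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical names throughout; this is not the Samba code. */
static sigjmp_buf lock_timeout_env;

static void alarm_jump_sig(int signum)
{
    (void)signum;
    /* Leave the blocked fcntl() and resume at the sigsetjmp() call site. */
    siglongjmp(lock_timeout_env, 1);
}

/*
 * Take a blocking lock on byte 0 of fd, giving up after `timeout` seconds.
 * Returns 0 when the lock is held, -1 on error; errno is ETIMEDOUT when
 * the alarm fired first.
 */
int lock_with_timeout_jmp(int fd, short lck_type, unsigned int timeout)
{
    struct flock fl;
    struct sigaction sa, old_sa;
    int ret, saved_errno;

    memset(&fl, 0, sizeof(fl));
    fl.l_type = lck_type;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 1;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = alarm_jump_sig;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, &old_sa) == -1)
        return -1;

    /* savemask=1 so the signal mask is restored when we jump back here. */
    if (sigsetjmp(lock_timeout_env, 1) != 0) {
        sigaction(SIGALRM, &old_sa, NULL);
        errno = ETIMEDOUT;
        return -1;
    }

    alarm(timeout);
    do {
        ret = fcntl(fd, F_SETLKW, &fl);
    } while (ret == -1 && errno == EINTR);  /* other signals: just retry */
    saved_errno = errno;

    alarm(0);
    sigaction(SIGALRM, &old_sa, NULL);
    errno = saved_errno;
    return ret;
}
```

The EINTR retry loop can stay as it is here, because SIGALRM never returns
into it. One caveat with this sketch: if the alarm fires in the small
window after fcntl() has succeeded but before alarm(0) runs, the function
reports a timeout while the lock is actually held, so real code would need
to handle that case.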

Running 32 smbtortures on 8 clients, I now see only around 77 smbd
processes. However, the winbind issue "no idle connection found" still
exists.

------- Comment #3 From Mike Patnode 2007-11-05 09:24:42 -------

Isn't this as simple as adding a global flag to the alarm signal handler,
and changing the while loop:

        do {
                ret = fcntl(tdb->fd,lck_type,&fl);
        } while (ret == -1 && (errno == EINTR && !timeout_signal_caught));
        if (timeout_signal_caught)
                timeout_signal_caught = 0;
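
To make the flag idea concrete, here is a standalone sketch of how it
might fit together. It is only an illustration under assumptions:
lock_with_timeout_flag and timeout_sig are made-up names, the lock is a
single byte taken with plain fcntl() rather than through tdb, and the
handler is installed with sigaction() without SA_RESTART so that fcntl()
really does fail with EINTR.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical names; the flag mirrors timeout_signal_caught above. */
static volatile sig_atomic_t timeout_signal_caught;

static void timeout_sig(int signum)
{
    (void)signum;
    timeout_signal_caught = 1;
}

/*
 * Take a blocking lock on byte 0 of fd, giving up after `timeout` seconds.
 * Returns 0 when the lock is held, -1 on error; errno is ETIMEDOUT when
 * the wait was abandoned because the alarm fired.
 */
int lock_with_timeout_flag(int fd, short lck_type, unsigned int timeout)
{
    struct flock fl;
    struct sigaction sa, old_sa;
    int ret, saved_errno;

    memset(&fl, 0, sizeof(fl));
    fl.l_type = lck_type;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 1;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = timeout_sig;
    sa.sa_flags = 0;                /* no SA_RESTART: fcntl() must see EINTR */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, &old_sa) == -1)
        return -1;

    timeout_signal_caught = 0;
    alarm(timeout);
    do {
        ret = fcntl(fd, F_SETLKW, &fl);
        /* Retry on EINTR only while the timeout has not fired. */
    } while (ret == -1 && errno == EINTR && !timeout_signal_caught);
    saved_errno = errno;

    alarm(0);
    sigaction(SIGALRM, &old_sa, NULL);
    if (ret == -1 && saved_errno == EINTR && timeout_signal_caught)
        saved_errno = ETIMEDOUT;
    timeout_signal_caught = 0;
    errno = saved_errno;
    return ret;
}
```

One thing to keep in mind with this approach: if SIGALRM arrives after the
loop condition has been checked but before fcntl() is entered again, the
flag is set but nothing interrupts the new call, so it can still block.
The window is small, but it is there.
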
-----Original Message-----

On Behalf Of Kevin Robinson
Sent: Monday, October 15, 2007 6:04 AM
To: Dave Daugherty
Cc: samba-technical at samba.org
Subject: Re: Run away number of smbd children

Replies below, and thanks!

Dave Daugherty wrote:
> Kevin:
> 
> I know what you mean... and although that problem is serious, I am not
> convinced it's your only problem.  Were your users actually able to
> logon?

Nope, and the ones that were got booted except for a fraction of them. 
If 1000 were connected, that would drop to maybe 100 -- collected via 
smbstatus.  However, the smbd processes would soar to 5000+ before I 
found it and would restart the services.

> 
> One of our customers ran into a problem that on the surface behaved like
> Bug 3204 with our customized build of 3.0.23b.  In this case, we were
> seeing millions of SID -> GID lookups being performed by winbind and the
> 200 SMBD -> WINBIND connection limit was being exceeded because each
> user authentication, with the necessary group membership lookups was
> taking too long.
> 
> We were able to work around this problem by not loading winbind and
> letting our own NSS module resolve the group memberships - but of course
> then you run into the NSS solaris group membership limits (only 16
> allowed).
> 
> I was looking for this problem in your log, and although there was a
> fair amount of IDMAP related references, it did not appear to be the
> case that your users belonged to many groups; mostly the same SIDs were
> being resolved over and over.

Yea, we have one main group and oddly enough ... that's the log file 
pointed out per our initial conversation (GACL).  I'll have to collect 
some stats on the lookups and network utilization.  How did you notice 
the number of SID->GID lookups?

> 
> We are currently targeting 3.0.26a for a new release with the hope that
> the improved IDMAP caching and bulk resolving functions will help, so we
> have a vested interest in making sure this problem is resolved :)

Going from 3.0.20b to 3.0.26a I noticed a definite improvement.  Samba 
services would start slowing down or pick and choose who could connect 
at around 900, and .26a brought that number up to 1100 -- then this...

> 
> I wonder if we can use the Samba 4 SMB torture to simulate the
> situation.
> 
> Dave Daugherty
> Centrify
> 
>> On Behalf Of Kevin Robinson
*snip*
> 

-- 
Kevin Robinson, B.Sc.
SysAdmin for University of Arkansas IT Services
(479)575-2901-office, (479)575-4753-fax
Never take life seriously.  Nobody gets out alive anyway.

01101011 01100101 01110110 01111001 00100000 01100100