[Samba] Prevent `wbinfo -u` from making Winbind unresponsive

Peter Eriksson pen at lysator.liu.se
Thu Apr 2 20:32:10 UTC 2020


I’ve also seen problems with “wbinfo -u”. I used to run it in a system monitoring script that ran every minute (checking a lot of different things), but had to disable it because it made the winbindd daemon grow without bound, until we ran out of RAM (256 GB).
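
For a per-minute monitoring check, the full user list is probably not needed. Below is a minimal sketch of a lighter check, assuming that “wbinfo -p” (ping winbindd) is enough as a liveness test and that timeout(1) is installed. The function name, the WBINFO and WB_TIMEOUT variables and the 10-second limit are made up for illustration; they are not Samba settings.

```shell
#!/bin/sh
# Hypothetical liveness check: ping winbindd instead of enumerating users.
# WBINFO and WB_TIMEOUT are illustrative knobs, not Samba settings.
WBINFO="${WBINFO:-wbinfo}"
WB_TIMEOUT="${WB_TIMEOUT:-10}"

check_winbind() {
    # "wbinfo -p" only asks winbindd whether it is alive; timeout(1)
    # bounds the wait so a hung daemon cannot stall the monitoring run.
    if timeout "$WB_TIMEOUT" "$WBINFO" -p >/dev/null 2>&1; then
        echo "winbindd OK"
        return 0
    fi
    echo "winbindd not answering within ${WB_TIMEOUT}s"
    return 1
}
```

This only tells you that winbindd answers at all. It says nothing about whether user enumeration would succeed, so it is a different (and much cheaper) test than “wbinfo -u”.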

Just for fun I tried it again here on a test server doing nothing. Same problem…


> # ps auxww | egrep winbindd
> root     14794    0.7  0.0 133372  86344  -  Ss   07:00        1:23.43 /liu/sbin/winbindd --daemon --configfile=/liu/etc/samba/smb.conf
> root     14795    0.0  0.0 265676 241044  -  S    07:00        0:44.00 winbindd: domain child [AD] (winbindd)
…


Doing “wbinfo -u” takes a little while the first time, then the following calls go pretty quickly, and you get all 140K users.

If you wait a little and try again, it again takes some time (some cache getting dropped, perhaps?).
The “AD winbind child” is fairly stable in size, however, and then things are quick again. But then suddenly - *boom*.

> # wbinfo -u >/tmp/u.log
>     (Takes a very long time and finally gives up…)
> Error looking up domain users


A truss of the winbindd process shows a lot of fcntl locking calls being made on gencache.tdb.

> # truss -p <pid of winbind-AD child>
..
> fcntl(17,F_SETLK,0x7fffffffc780)		 = 0 (0x0)
> fcntl(17,F_SETLKW,0x7fffffffc780)	 = 0 (0x0)
> fcntl(17,F_SETLK,0x7fffffffc780)		 = 0 (0x0)

> # lsof -p <pid>
...
> winbindd 14795 root   17ur VREG     152,3150381094 69443584  123054 /liu/var/samba/locks/gencache.tdb
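
To put a number on “a lot”, the truss output can be tallied per lock command. This is a rough sketch, assuming the truss output has the line format shown above and is read from a file or a pipe. The function name is made up.

```shell
# Count fcntl() calls per command (F_SETLK, F_SETLKW, ...) in truss
# output read from stdin. Assumes lines shaped like:
#   fcntl(17,F_SETLK,0x7fffffffc780) = 0 (0x0)
count_fcntl_locks() {
    # Split on "(" and ",": $1 = "fcntl", $2 = fd, $3 = lock command.
    awk -F'[(,]' '/^fcntl\(/ { n[$3]++ }
                  END { for (c in n) print c, n[c] }' | sort
}

# Usage (truss -o writes the trace to a file):
#   truss -o /tmp/truss.out -p <pid>
#   count_fcntl_locks </tmp/truss.out
```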


> root     14795  100.0  0.1 459552 411096  -  R    07:00        1:51.72 winbindd: domain child [AD] (winbindd)

<wait some time and check again; now it has shrunk a bit, but not all the way>

> root     14795    0.0  0.1 332576 287296  -  S    07:00        5:46.20 winbindd: domain child [AD] (winbindd)


Another try with “wbinfo -u” now - same “Error looking up domain users”.

> # ps auxww | egrep winbindd
> root     14795  100.0  0.1 508704 462052  -  R    07:00        6:41.44 winbindd: domain child [AD] (winbindd)

I have to “kill -9” the runaway winbindd daemon and restart winbindd (just a “kill” doesn’t help) to get a usable system again.
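
Since the growth shows up plainly in ps, a cheap guard would be to watch the RSS column rather than wait for the box to run out of RAM. This is a sketch, assuming the “ps auxww” column layout shown above (PID in field 2, RSS in KB in field 6). The function name and the 400000 KB limit are arbitrary.

```shell
# Print "PID RSS_KB" for every winbindd line whose RSS exceeds the limit.
# Reads `ps auxww` output on stdin; field 2 = PID, field 6 = RSS in KB.
big_winbindd() {
    awk -v limit="$1" '/winbindd/ && ($6 + 0) > (limit + 0) { print $2, $6 }'
}

# Usage: ps auxww | big_winbindd 400000
```

What to do with a flagged PID (alert, restart winbindd) is left to the monitoring script.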

In our smb.conf file we have:

winbind nested groups = false
winbind enum users = false
winbind enum groups = false
winbind use default domain = yes
winbind normalize names = yes
winbind max clients = 1000
winbind max domain connections = 2
winbind nss info = template

This is Samba 4.11.7 joined to a Microsoft AD domain on a FreeBSD 11.3-RELEASE-p7 server.
Our AD contains about 140,000 users now.

Tried a “smbcontrol pool-usage” on the “AD” winbindd process but didn’t see anything suspicious there. No huge allocations that I could see, at least.

- Peter



> On 2 Apr 2020, at 20:18, Jeremy Allison via samba <samba at lists.samba.org> wrote:
> 
> On Wed, Apr 01, 2020 at 03:33:00PM -0700, Jeremy Allison via samba wrote:
>> On Wed, Apr 01, 2020 at 02:09:57PM -0700, Alexey A Nikitin via samba wrote:
>>> Hi,
>>> 
>>> Recently I mistakenly ran `wbinfo -u <username>` when I actually intended to run `wbinfo -n <username>`. It ignored the <username> part and proceeded to fetch the usernames. On a small domain this shouldn't be too much of an issue, but I did it on a domain with thousands upon thousands of users. The result was that Winbind became, for all intents and purposes, unresponsive for about six minutes: I couldn't authenticate myself (or anyone else) for any new sessions, and it wouldn't even acknowledge me as a valid user in an existing session ('unknown uid: 3234505'). It blocked everything else on that user search request, even things that were supposed to be cached locally, like my UID.
>>> 
>>> I do have the following lines in smb.conf:
>>> 
>>> winbind enum users = no
>>> winbind enum groups = no
>> 
>> Ah, the winbindd code only prohibits
>> enumerating users when requested from
>> nsswitch lookups.
>> 
>> The code looks like:
>> 
>>        if (request->wb_flags & WBFLAG_FROM_NSS && !lp_winbind_enum_users()) {
>>                tevent_req_done(req);
>>                return tevent_req_post(req, ev);
>>        }
>> 
>> so making an explicit request via wbinfo will
>> still do the enumeration.
> 
> The rpc client code uses the dcerpc call_id
> field to allow multiple outstanding calls at
> once (asynchronously using tevent). It'd be
> interesting to know where exactly winbind
> is blocking (I think it might be on queuing
> calls between master and client) to see
> how we can improve the asynchronous performance.
> 
> If you're willing to reproduce and investigate,
> that is !
> 