[Samba] optimizing and scaling ntlm_auth

Volker Lendecke Volker.Lendecke at SerNet.DE
Tue Sep 9 03:52:48 MDT 2014

On Mon, Sep 08, 2014 at 12:11:05PM -0400, Louis Munro wrote:
> Hello,
> I am using ntlm_auth called from FreeRADIUS to authenticate users on a network with their Active Directory credentials.
> The problem I seem to be having is that ntlm_auth is taking longer than it should and I can't seem to get it to go faster reliably.
> Some background information:
> Users are connecting to a wireless network using 802.1x. 
> That network sends requests to FreeRADIUS which forks an ntlm_auth process to authenticate users against AD.
> ntlm_auth is called with the username and challenge contained in the RADIUS request, along with the nt-response and the domain, as in:
> ntlm_auth --username=$USERNAME --challenge=$CHALLENGE --nt-response=$NT_RESPONSE --domain=$DOMAIN
> An authentication is successful if ntlm_auth returns 0.
> Since I had error messages in the logs pointing to requests timing out on ntlm_auth, I wrote a short C wrapper around ntlm_auth to log the time it takes to return (as well as the username and domain).
> That showed that while most (~90%) authentications succeed in less than 25ms, about 10% take longer than 100ms with some taking as much as a few seconds (2-4s).
> So I increased "winbind max domain connections" on the (Linux) server while also raising MaxConcurrentApi on the DC.
> I now see 39 connections open to the DC from winbind (that number fluctuates). 
> And yet the problem remains. 
> What's more, it seems winbind is only or mostly using one out of those 39 connections to the DC.
> When I trace the processes using strace, only the first child of winbind seems to be sending any request. 
> All the others are idle. 
> Can anyone shed some light on how winbind manages its connections to the DC?
> Has anyone else encountered this problem? Any recommendations for scaling ntlm_auth? 
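
For readers who want to reproduce the timing measurement described
above, here is a minimal sketch of such a wrapper. It is not the
original poster's C code. The 100 ms threshold is taken from the
figures quoted above; the function names (timed_run, identity, main)
and the log format are assumptions made for illustration. The
arguments are passed through to ntlm_auth unchanged, and stdout is
left alone so the caller still sees whatever ntlm_auth prints.

```python
import subprocess
import sys
import time

SLOW_THRESHOLD_S = 0.1  # assumption: 100 ms, as in the figures quoted above


def timed_run(argv):
    """Run argv with inherited stdio; return (exit_code, elapsed_seconds)."""
    start = time.monotonic()
    result = subprocess.run(argv)
    return result.returncode, time.monotonic() - start


def identity(args):
    """Keep only the --username= and --domain= arguments for logging."""
    wanted = ("--username=", "--domain=")
    return " ".join(a for a in args if a.startswith(wanted))


def main(args):
    """Hypothetical entry point: time one ntlm_auth call, log it if slow."""
    code, elapsed = timed_run(["ntlm_auth", *args])
    if elapsed > SLOW_THRESHOLD_S:
        print(f"slow ntlm_auth ({elapsed * 1000:.1f} ms, exit={code}): "
              f"{identity(args)}", file=sys.stderr)
    return code
```

To use it in place of ntlm_auth, a script would end with
sys.exit(main(sys.argv[1:])) so that the exit status the caller
checks is still the one ntlm_auth returned.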

winbind *should* balance between all children, at least we
have code to do this. If it does not work, there might be a
bug or else it might be that 38 out of 39 children sit in
requests that block against the DC. Then only one child is
used because it's the only responsive one.

In that situation, can you figure out the call stack of an
unused child? Fedora has a "gstack" script that might be
helpful here.
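
As a rough sketch of how the stacks of all children could be
collected in one go: the script below parses the output of
"ps -eo pid,ppid,comm", picks every winbindd process whose parent
is itself a winbindd, and runs gstack on each. The process name
"winbindd" and the helper names are assumptions for illustration;
gstack needs gdb installed and usually root privileges.

```python
import subprocess


def winbindd_children(ps_output):
    """Return sorted PIDs of winbindd processes whose parent is a winbindd.

    ps_output is the text printed by `ps -eo pid,ppid,comm`.
    """
    rows = [line.split(None, 2) for line in ps_output.splitlines()]
    # Drop the header line and anything that is not "pid ppid comm".
    rows = [r for r in rows
            if len(r) == 3 and r[0].isdigit() and r[1].isdigit()]
    winbindd = {int(pid) for pid, _ppid, comm in rows
                if comm.strip() == "winbindd"}
    return sorted(int(pid) for pid, ppid, comm in rows
                  if comm.strip() == "winbindd" and int(ppid) in winbindd)


def dump_stacks():
    """Print a gstack backtrace for every winbindd child process."""
    ps = subprocess.run(["ps", "-eo", "pid,ppid,comm"],
                        capture_output=True, text=True, check=True)
    for pid in winbindd_children(ps.stdout):
        print(f"=== winbindd child {pid} ===", flush=True)
        subprocess.run(["gstack", str(pid)])
```

Calling dump_stacks() while the slow authentications are happening
should show whether the idle-looking children are blocked in a
request to the DC or simply waiting for work.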

Moreover, the main scalability problem is probably that
winbind only connects to one DC. It would be far better to
connect to all available DCs, but DC location is pretty
involved in winbind and needs some refactoring before this
can be implemented.



SerNet GmbH, Bahnhofsallee 1b, 37081 Göttingen
phone: +49-551-370000-0, fax: +49-551-370000-9
AG Göttingen, HRB 2816, GF: Dr. Johannes Loxen
http://www.sernet.de, mailto:kontakt at sernet.de
