Reducing LDAP delays with unreachable DCs

Thu Jun 10 19:21:48 GMT 2004

Hello,

The company that I work for uses Samba in an enterprise environment.  We
have encountered situations where winbindd has, in its DC list, one or more
DCs that are unreachable, which really bogs down the server.  I've made some
tweaks that seem to have helped quite a bit, at least in my contrived test
scenario, which I will describe below.  I'm still evaluating the
effectiveness and robustness of these changes and I thought I'd send this to
the list in case anyone has any insights into whether these changes are a
good thing and what the potential side-effects might be.

SAMBA VERSION: 3.0rc1

PROBLEM ENVIRONMENT: winbindd gets the IP address of one or more DCs that
cannot be reached.  This might be because of routing problems, incorrect
hosts/lmhosts settings or bad DNS entries.

TEST SITUATION: I have found that the easiest way to simulate this problem
is by adding bogus IP entries into 'smb.conf:password server='.  For example
"password server = 192.168.100.1, *" with the first address being
non-existent on the network.  I have a valid Win2k DC on this network as
well and am able to join its domain without any problems under normal
circumstances.

With the added bogus entry, the main problem I found was in the function
ads_try_connect(), where the call to ldap_open() takes three minutes to time
out.  Setting the LDAP timeout option doesn't help; it seems to limit the
search time but has no effect on the connect timeout.  This test setup is
faked, but it seems to simulate the behavior that I have seen when winbindd
has unreachable DCs in its DC list.

SOLUTIONS:
I found a function called open_ldap_with_timeout() in winbindd_rpc.c.  This
is a static function so I pasted a copy of it into libads/ldap.c and
replaced the call to ldap_open (in ads_try_connect()) with
open_ldap_with_timeout() (which uses an alarm to cancel the connect
request).  I also added a parameter, ldap_timeout, to smb.conf to make it
easy to try different timeout values.  With this option I can set the
LDAP connection timeout to a given number of seconds.
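
For anyone not familiar with the alarm trick, here is a minimal sketch of
the general idea.  It is not the Samba code: run_with_timeout(),
blocking_fn and wait_for_byte() are names made up for this illustration,
and a read() on a pipe stands in for the blocking connect so the example
can be built without any LDAP library.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Set by the SIGALRM handler so the caller can tell a timeout apart
 * from an ordinary failure of the blocking call. */
static volatile sig_atomic_t got_alarm = 0;

static void on_alarm(int signum)
{
    (void)signum;
    got_alarm = 1;
}

/* Hypothetical callback type: any call that may block, e.g. a connect. */
typedef int (*blocking_fn)(void *arg);

/* Run fn(arg), giving up after `seconds`.  Returns fn's result, or -1
 * with errno set to ETIMEDOUT if fn failed because the alarm fired. */
int run_with_timeout(blocking_fn fn, void *arg, unsigned int seconds)
{
    struct sigaction sa, old_sa;
    int ret;

    assert(fn != NULL);

    /* Install the handler WITHOUT SA_RESTART, so that a blocked system
     * call fails with EINTR instead of being restarted silently. */
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGALRM, &sa, &old_sa) != 0)
        return -1;

    got_alarm = 0;
    alarm(seconds);                      /* arm the timer */
    ret = fn(arg);                       /* the potentially slow call */
    alarm(0);                            /* disarm it again */
    sigaction(SIGALRM, &old_sa, NULL);   /* restore the previous handler */

    if (ret < 0 && got_alarm) {
        errno = ETIMEDOUT;
        return -1;
    }
    return ret;
}

/* Stand-in for a blocking connect: waits for one byte on a descriptor. */
int wait_for_byte(void *arg)
{
    char c;
    int fd = *(int *)arg;

    return read(fd, &c, 1) == 1 ? 0 : -1;
}
```

In ads_try_connect() the blocking call would be ldap_open() rather than the
read() used by the stand-in.  One caveat with this approach is that the
alarm is process-wide, so it should not be mixed with other code that
relies on SIGALRM at the same time.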

This change produced a marked improvement.  Whereas before even one bad IP
address would have a severe impact, now even four bad IP addresses have
only a small impact on the initial time it takes to join the domain.
Looking at ethereal traces, the behavior is exactly the same as before, only
it goes through the list of DCs much faster.

As an additional measure I also modified get_dc_list(), at the point where
it goes through the list of DCs copying them to the user buffer.  Before
copying an entry I added a call to check_negative_conn_cache(), and if the
DC is in the failed connection cache it is not added to the DC list.  These
cache entries go stale after 30 seconds so a DC should have a chance to
'redeem' itself if it was only temporarily unavailable.
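
To illustrate the filtering idea, here is a small standalone sketch.  It
does not use Samba's actual negative connection cache; neg_cache_add(),
neg_cache_is_failed() and filter_dc_list() are hypothetical names, the
table size is arbitrary, and the current time is passed in explicitly so
that the 30 second expiry is easy to exercise.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

#define NEG_CACHE_SLOTS 16   /* arbitrary size for this sketch */
#define NEG_CACHE_TTL   30   /* seconds before a failure goes stale */
#define ADDR_LEN        64

struct neg_entry {
    char   addr[ADDR_LEN];
    time_t failed_at;
    int    in_use;
};

static struct neg_entry neg_cache[NEG_CACHE_SLOTS];

/* Record that a connection to addr failed at time `now`. */
void neg_cache_add(const char *addr, time_t now)
{
    struct neg_entry *slot = NULL;
    size_t i;

    assert(addr != NULL);

    for (i = 0; i < NEG_CACHE_SLOTS; i++) {
        if (neg_cache[i].in_use && strcmp(neg_cache[i].addr, addr) == 0) {
            slot = &neg_cache[i];        /* refresh an existing entry */
            break;
        }
        if (slot == NULL && !neg_cache[i].in_use)
            slot = &neg_cache[i];        /* remember the first free slot */
    }
    if (slot == NULL)
        slot = &neg_cache[0];            /* table full: reuse first slot */

    strncpy(slot->addr, addr, ADDR_LEN - 1);
    slot->addr[ADDR_LEN - 1] = '\0';
    slot->failed_at = now;
    slot->in_use = 1;
}

/* Return 1 if addr failed within the last NEG_CACHE_TTL seconds. */
int neg_cache_is_failed(const char *addr, time_t now)
{
    size_t i;

    for (i = 0; i < NEG_CACHE_SLOTS; i++) {
        if (!neg_cache[i].in_use || strcmp(neg_cache[i].addr, addr) != 0)
            continue;
        if (now - neg_cache[i].failed_at >= NEG_CACHE_TTL) {
            neg_cache[i].in_use = 0;     /* stale: give the DC another try */
            return 0;
        }
        return 1;
    }
    return 0;
}

/* Copy every DC that is not in the failed cache from in[] to out[].
 * Returns the number of entries written to out[]. */
size_t filter_dc_list(const char *const *in, size_t n_in,
                      const char **out, time_t now)
{
    size_t i, n_out = 0;

    for (i = 0; i < n_in; i++) {
        if (!neg_cache_is_failed(in[i], now))
            out[n_out++] = in[i];
    }
    return n_out;
}
```

With 192.168.100.1 recorded as failed, a list holding it and a second
(made-up) address is reduced to just the second address until 30 seconds
have passed, after which both are returned again.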

CONCLUSION:
Early results are encouraging, but I am still awaiting more authentic
testing by our QA dept.  I just wanted to float this out there in case it
is of any use to anyone, or in case anyone knows of a better solution or
sees trouble with this one.

Thanks all!
Joe Meadows
Snap Appliance