[Samba] Samba - faster failover to other AD servers?

Wed Sep 2 10:33:20 UTC 2020

> We just had an interesting experience here. One of our AD servers was down for 90 minutes due to the server being physically moved to another location. This shouldn’t be a problem since there are 5 other AD servers in that “group” that can take over the load. However, it seems Samba (when used as a fileserver) for some reason takes quite a long time to “give up” on the first one and switch to one of the alternative ones.
> 
> Don’t know if it’s the Kerberos bits or if it’s the LDAP connection (or both) that is slow to “switch”. 
> 
> Am I the only one seeing this?

No, we're experiencing the same behaviour (FreeBSD 12.1 p8, Samba 4.10.15), although we have the impression that it also occurs when an AD server merely responds a bit (too) slowly.

> Is there something that can be done to speed that process up?
> 
> I guess I could force Samba to talk to a special virtual “AD” address we have that is behind a load balancer (it’s mainly used for equipment that needs to talk to the AD servers but can only talk to one specific server), but I’ve tried to keep the configuration as normal as possible, so...

There is a post on the FreeBSD forum about this: https://forums.freebsd.org/threads/winbind-ad-dropping-every-10-hours.70752/. This part especially intrigues me:

---
But the refreshing of the GSSAPI ticket for the openldap-sasl-client (with GSSAPI=on) that is used for the idmapper (process name: "winbindd: idmap child (winbindd)") seems to be the problem: when this ticket is expired, a connection to the DC (LDAP port) is established and stays open for 2 hours (i.e. 7200000 msecs, which is exactly the value of net.inet.tcp.keepidle). 
---

Would this also be a problem when AD servers disappear? I dug into the Samba code a while ago and found that this particular code is blocking; however, it might be a FreeBSD-specific problem.

> We have a “samba-watchdog” script that regularly attempts to connect to the file service (using smbclient), and during this time period this script was triggered a number of times: If a connection attempt takes more than 15 seconds then it sleeps 5 seconds and tries again. If that one fails too then it kills winbindd and restarts it (which is pretty quick, so most users don’t notice it).
> 
> The main reason for this script is to make smbd recover when new connections are “hung” when/if it hangs at the “10 hour lockup after winbindd start” (which probably is due to the service principal expiring and needing renewal). This doesn’t seem to happen on small servers with few users, but for us with 500-1600 users per “samba” it happens regularly: every day at 17:00 and 03:00 (we restart smbd & winbindd at 07:00). Without this watchdog, smbd would refuse new connections for 1-15 minutes (or more), which isn’t good :-)
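
For anyone following along, the retry logic described above could be sketched in Python roughly like this. It is only an illustration under assumptions, not the actual "samba-watchdog" script: the smbclient invocation (an anonymous share listing against localhost), the pkill/winbindd commands, and all function names are made up for the example. Only the 15 second timeout and the 5 second pause come from the description.

```python
import subprocess
import time

CONNECT_TIMEOUT = 15  # seconds before a connection attempt counts as hung
RETRY_DELAY = 5       # seconds to wait before the second attempt


def probe_smb(host="localhost", timeout=CONNECT_TIMEOUT):
    """Return False only if smbclient does not finish within the timeout."""
    try:
        # Assumption: an anonymous share listing is enough to detect a hang.
        subprocess.run(
            ["smbclient", "-N", "-L", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return True


def restart_winbindd():
    """Kill winbindd and start it again (both commands are assumptions)."""
    subprocess.run(["pkill", "-x", "winbindd"], check=False)
    subprocess.run(["winbindd", "-D"], check=False)


def watchdog_once(probe=probe_smb, restart=restart_winbindd,
                  sleep=time.sleep, retry_delay=RETRY_DELAY):
    """Run one check cycle and report what happened."""
    if probe():
        return "ok"
    sleep(retry_delay)      # first attempt hung: pause, then try once more
    if probe():
        return "recovered"
    restart()               # second attempt hung as well
    return "restarted"
```

Calling watchdog_once() from cron or a small loop would give one check per run. The probe and restart steps are passed in as parameters so the decision logic can be tried out without touching a live server.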

Our work-around is also a watchdog script ('guard-winbindd-idmap'). It kills the idmap child of winbindd if it has been running for 8 hours or when 'wbinfo -i administrator' fails. Obviously, this script runs on the fileserver (domain member server). Since these servers also act as NFS servers, the script also restarts gssd if it is still running; otherwise it starts gssd. This is needed because gssd stops working as well.
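
The guard logic above can be sketched in Python roughly as follows. This is an illustration under assumptions, not the actual 'guard-winbindd-idmap' script: the ps keywords, the "idmap child" match on the process title, the SIGTERM signal, the 30 second wbinfo timeout, the 'service gssd' calls, and all function names are hypothetical, and the sketch assumes gssd is only touched when the idmap child is killed.

```python
import os
import signal
import subprocess

MAX_AGE = 8 * 3600  # seconds; the 8 hour limit for the idmap child


def list_processes():
    """Return 'pid elapsed-seconds command' lines for all processes."""
    cmd = ["ps", "-ax", "-o", "pid=", "-o", "etimes=", "-o", "command="]
    return subprocess.run(cmd, capture_output=True, text=True,
                          check=True).stdout


def find_idmap_child(ps_output):
    """Return (pid, elapsed_seconds) of the idmap child, or None."""
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        # Assumption: the process title contains "idmap child".
        if len(fields) == 3 and "idmap child" in fields[2]:
            return int(fields[0]), int(fields[1])
    return None


def wbinfo_ok(user="administrator", timeout=30):
    """Return True if 'wbinfo -i <user>' succeeds within the timeout."""
    try:
        result = subprocess.run(["wbinfo", "-i", user],
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def kill_process(pid):
    os.kill(pid, signal.SIGTERM)


def refresh_gssd():
    """Restart gssd if it is running, otherwise start it."""
    running = subprocess.run(["service", "gssd", "status"],
                             stdout=subprocess.DEVNULL,
                             stderr=subprocess.DEVNULL).returncode == 0
    action = "restart" if running else "start"
    subprocess.run(["service", "gssd", action], check=False)


def guard_once(ps=list_processes, lookup_ok=wbinfo_ok, kill=kill_process,
               refresh=refresh_gssd, max_age=MAX_AGE):
    """Kill the idmap child if it is too old or lookups fail."""
    child = find_idmap_child(ps())
    if child is None:
        return False
    pid, elapsed = child
    if elapsed >= max_age or not lookup_ok():
        kill(pid)
        refresh()
        return True
    return False
```

Running guard_once() periodically (for example from cron) would match the description. The helpers are passed in as parameters so the age and lookup checks can be exercised with canned ps output.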

> Samba 4.12.5, FreeBSD 11.3 & 12.1
> 
> From krb5.conf:
> 
> [realms]
> OURREALM = {
>  kdc = server1
>  kdc = server2
>  kdc = server3
>  kdc = server4
> }
> 
> It was “server1” that was being moved.

-Remy