[Samba] winbindd: how it chooses which LDAP servers to query?
Michael Tokarev
mjt at tls.msk.ru
Wed Jun 25 17:14:08 UTC 2025
Hi!
We're having a huge issue with at least one of our samba servers
which is joined to a samba AD.
We had 2 DCs in two offices, each within its own site.
Everything worked correctly, and it looked like all queries were
made to the nearby DC, the one local to the server's site. That
lasted until we had a network/power outage and lost connectivity
for over a day.
And now at least one of the samba servers has almost completely
stopped working, because usernames can't be looked up
anymore, so only root can log in over ssh, and samba shares
do not work at all.
winbindd constantly tries to reach a DC in the remote office,
even though the local DC responds instantly. Just a simple `id mjt`
takes about a minute, despite the router immediately returning
"No route to host" to all packets destined for the remote DC
(it isn't timing out).
What's more, the results aren't being cached, so the next run
of the same `id mjt` takes another minute.
Sometimes it succeeds after a minute (querying a local DC),
and sometimes it reports "user not found", but either way this
behavior breaks the whole system almost completely.
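To put numbers on the "takes a minute, and the next run takes another
minute" symptom, a small timing helper can be used. This is only a
sketch: it assumes winbind is listed in nsswitch.conf so that Python's
pwd.getpwnam() ends up asking winbindd, and the helper name
time_lookups is made up for illustration.

```python
import pwd
import time


def time_lookups(name, lookup=pwd.getpwnam, runs=2, clock=time.monotonic):
    """Return a list of (seconds, found) tuples, one per lookup attempt."""
    results = []
    for _ in range(runs):
        start = clock()
        try:
            lookup(name)
            found = True
        except KeyError:
            # pwd.getpwnam raises KeyError when the name cannot be resolved
            found = False
        results.append((clock() - start, found))
    return results


if __name__ == "__main__":
    # "mjt" is the account from the post; use any domain user here.
    for seconds, found in time_lookups("mjt"):
        print(f"{seconds:8.3f}s  found={found}")
```

If the second attempt is about as slow as the first, the answer was
not served from a cache.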
log.winbindd has a lot of entries like
[2025/06/25 19:32:48.961557, 1, traceid=40]
source3/winbindd/wb_xids2sids.c:407(wb_xids2sids_recv)
wb_sids_to_xids failed: NT_STATUS_DOMAIN_CONTROLLER_NOT_FOUND
[2025/06/25 19:32:48.961609, 1, traceid=40]
source3/winbindd/winbindd_xids_to_sids.c:111(winbindd_xids_to_sids_recv)
Could not convert xids: NT_STATUS_DOMAIN_CONTROLLER_NOT_FOUND
We've added another DC to this office, but it didn't help at
all: the samba server still tries to contact the remote DC.
This does not depend on the DNS config in use. Exactly the same
happens whether resolv.conf points to the samba AD DCs as
nameservers or to external nameservers with static contents
for the AD functionality.
I tried to play with DNS records, temporarily removing the
remote DC from _ldap._tcp and _ldap._tcp.dc._msdcs sets of
records, but this changes nothing, at least I haven't seen a
change.
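For reference, the SRV record sets that usually matter for DC location
in an AD domain can be listed with a few lines of Python. The names
follow the standard AD naming convention; which of them winbindd
actually queries, and in what order, is not something this post
establishes, and the function name dc_srv_names is hypothetical.

```python
def dc_srv_names(realm, site=None):
    """List conventional AD SRV record names for locating DCs.

    Site-specific names come first when a site is given, followed by
    the domain-wide ones.
    """
    realm = realm.lower().rstrip(".")
    names = []
    if site:
        names.append(f"_ldap._tcp.{site}._sites.dc._msdcs.{realm}")
        names.append(f"_kerberos._tcp.{site}._sites.dc._msdcs.{realm}")
    names.append(f"_ldap._tcp.dc._msdcs.{realm}")
    names.append(f"_kerberos._tcp.dc._msdcs.{realm}")
    names.append(f"_ldap._tcp.{realm}")
    return names


if __name__ == "__main__":
    for name in dc_srv_names("TLS.MSK.RU", "Moscow-Office"):
        print(name)
```

Since the debug output speaks of resolving KDCs, the _kerberos record
sets may be worth checking in addition to the _ldap ones; that is a
guess, not a confirmed cause.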
Another samba member server I've set up locally for testing
does not have this issue: it immediately finds both DCs local
to the site and starts querying one of the two.
Below is a typical level-5 debug from winbindd (asked for my groups).
It correctly determines local site, correctly determines a good server
to query and the DC list. It performs a *lot* of DNS queries. But
it does not even try other LDAP servers besides the remote one, despite
correctly finding the right ones.
How does winbindd choose which LDAP server to query?
Thanks,
/mjt
---
here, svdcm and svdcm2 are two local DCs in Moscow-Office,
with IP addresses 192.168.177.8 and .9, and svdcp is the
remote DC with an IP 192.168.19.6, which is unreachable.
child daemon request 54
sitename_fetch: Returning sitename for realm 'TLS.MSK.RU': "Moscow-Office"
namecache_fetch: name svdcm.tls.msk.ru#20 found.
sitename_fetch: Returning sitename for realm 'TLS.MSK.RU': "Moscow-Office"
saf_fetch: Returning "svdcm.tls.msk.ru" for "TLS.MSK.RU" domain
get_dc_list: preferred server list: "svdcm.tls.msk.ru, *"
resolve_ads: Attempting to resolve KDCs for TLS.MSK.RU using DNS
dns_rr_srv_fill_done: async DNS A lookup for svdcm.tls.msk.ru [0] got
svdcm.tls.msk.ru -> 192.168.177.8
dns_rr_srv_fill_done: async DNS AAAA lookup for svdcm.tls.msk.ru
returned 0 addresses.
dns_rr_srv_fill_done: async DNS A lookup for svdcm2.tls.msk.ru [0] got
svdcm2.tls.msk.ru -> 192.168.177.9
dns_rr_srv_fill_done: async DNS AAAA lookup for svdcm2.tls.msk.ru
returned 0 addresses.
sitename_fetch: Returning sitename for realm 'TLS.MSK.RU': "Moscow-Office"
namecache_fetch: name svdcm.tls.msk.ru#20 found.
get_dc_list: returning 2 ip addresses in an ordered list
get_dc_list: 192.168.177.8 192.168.177.9
saf_fetch: Returning "svdcm.tls.msk.ru" for "TLS.MSK.RU" domain
get_dc_list: preferred server list: "svdcm.tls.msk.ru, *"
resolve_ads: Attempting to resolve KDCs for TLS.MSK.RU using DNS
dns_rr_srv_fill_done: async DNS A lookup for svdcm2.tls.msk.ru [0] got
svdcm2.tls.msk.ru -> 192.168.177.9
dns_rr_srv_fill_done: async DNS AAAA lookup for svdcm2.tls.msk.ru
returned 0 addresses.
dns_rr_srv_fill_done: async DNS A lookup for svdcm.tls.msk.ru [0] got
svdcm.tls.msk.ru -> 192.168.177.8
dns_rr_srv_fill_done: async DNS AAAA lookup for svdcm.tls.msk.ru
returned 0 addresses.
dns_rr_srv_fill_done: async DNS A lookup for svdcp.tls.msk.ru [0] got
svdcp.tls.msk.ru -> 192.168.19.6
dns_rr_srv_fill_done: async DNS AAAA lookup for svdcp.tls.msk.ru
returned 0 addresses.
sitename_fetch: Returning sitename for realm 'TLS.MSK.RU': "Moscow-Office"
namecache_fetch: name svdcm.tls.msk.ru#20 found.
get_dc_list: returning 3 ip addresses in an ordered list
get_dc_list: 192.168.177.8 192.168.177.9 192.168.19.6
At this point, strace shows that it sends UDP queries to .177.9
(which immediately replies) and to .19.6 (which never replies),
apparently dislikes the answer from .177.9, sends a few more
queries to .19.6, and finally:
get_kdc_ip_string: Failed to get KDC ip address
Finished processing child request 54
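To compare the DC lists across a longer log, the get_dc_list lines can
be pulled out with a short script. A sketch only: the local subnet
192.168.177.0/24 is an assumption derived from the two local
addresses, and the helper names are invented for this example.

```python
import ipaddress
import re

# Matches only the get_dc_list lines that carry a bare list of IPv4 addresses.
DC_LIST_RE = re.compile(r"^get_dc_list:\s+((?:\d{1,3}(?:\.\d{1,3}){3}\s*)+)$")

SAMPLE = """\
get_dc_list: returning 2 ip addresses in an ordered list
get_dc_list: 192.168.177.8 192.168.177.9
get_dc_list: returning 3 ip addresses in an ordered list
get_dc_list: 192.168.177.8 192.168.177.9 192.168.19.6
"""


def dc_lists(log_text):
    """Return every address list printed by get_dc_list, in log order."""
    lists = []
    for line in log_text.splitlines():
        match = DC_LIST_RE.match(line.strip())
        if match:
            lists.append(match.group(1).split())
    return lists


def outside(addresses, local_net):
    """Return the addresses that do not belong to local_net."""
    net = ipaddress.ip_network(local_net)
    return [a for a in addresses if ipaddress.ip_address(a) not in net]


if __name__ == "__main__":
    for addrs in dc_lists(SAMPLE):
        # 192.168.177.0/24 is assumed to be the Moscow-Office subnet.
        print(addrs, "non-local:", outside(addrs, "192.168.177.0/24"))
```

On the log above, the first list contains only the two local DCs and
the second one adds the remote 192.168.19.6.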
Here's the part from strace (note the relative timestamps in the
first column: waiting for an answer takes quite some time):
0.000034 write(1, "get_dc_list: 192.168.177.8 192.1"..., 55get_dc_list:
192.168.177.8 192.168.177.9 192.168.19.6
) = 55
0.000036 epoll_create1(EPOLL_CLOEXEC) = 19
0.000049 socket(AF_INET, SOCK_DGRAM, IPPROTO_IP) = 20
0.000034 fcntl(20, F_GETFL) = 0x2 (flags O_RDWR)
0.000029 fcntl(20, F_SETFL, O_RDWR|O_NONBLOCK) = 0
0.000029 fcntl(20, F_GETFD) = 0
0.000028 fcntl(20, F_SETFD, FD_CLOEXEC) = 0
0.000028 connect(20, {sa_family=AF_INET, sin_port=htons(389),
sin_addr=inet_addr("192.168.177.9")}, 16) = 0
0.000052 sendto(20,
"0\\\2\3\0\352GcU\4\0\n\1\0\n\1\0\2\1\0\2\1\0\1\1\0\2406\243\r\4\5"...,
94, 0, NULL, 0) = 94
0.000059 socket(AF_INET, SOCK_DGRAM, IPPROTO_IP) = 21
0.000034 fcntl(21, F_GETFL) = 0x2 (flags O_RDWR)
0.000028 fcntl(21, F_SETFL, O_RDWR|O_NONBLOCK) = 0
0.000029 fcntl(21, F_GETFD) = 0
0.000028 fcntl(21, F_SETFD, FD_CLOEXEC) = 0
0.000028 connect(21, {sa_family=AF_INET, sin_port=htons(389),
sin_addr=inet_addr("192.168.19.6")}, 16) = 0
0.000044 sendto(21,
"0[\2\2y\32cU\4\0\n\1\0\n\1\0\2\1\0\2\1\0\1\1\0\2406\243\r\4\5N"..., 93,
0, NULL, 0) = 93
0.000051 epoll_ctl(19, EPOLL_CTL_ADD, 20, {events=EPOLLIN|EPOLLRDHUP,
data={u32=91469680, u64=94218789042032}}) = 0
0.000037 epoll_ctl(19, EPOLL_CTL_ADD, 21, {events=EPOLLIN|EPOLLRDHUP,
data={u32=91478128, u64=94218789050480}}) = 0
0.000034 epoll_wait(19, [{events=EPOLLIN, data={u32=91469680,
u64=94218789042032}}], 1, 2000) = 1
0.000816 ioctl(20, FIONREAD, [131]) = 0
0.000033 recvfrom(20,
"0q\2\3\0\352Gdj\4\0000f0d\4\10netlogon1X\4V\27\0\0"..., 131, 0,
{sa_family=AF_INET, sin_port=htons>
0.000054 epoll_ctl(19, EPOLL_CTL_DEL, 20, 0x7ffd48b2a064) = 0
0.000030 close(20) = 0
0.000041 epoll_wait(19, <unfinished ...>
0.965967 <... epoll_wait resumed>[], 1, 986) = 0
0.000035 close(12) = 0
0.000058 epoll_wait(3, <unfinished ...>
1.036069 <... epoll_wait resumed>[], 1, 2000) = 0
0.000212 sendto(21,
"0[\2\2y\32cU\4\0\n\1\0\n\1\0\2\1\0\2\1\0\1\1\0\2406\243\r\4\5N"..., 93,
0, NULL, 0) = 93
0.000100 epoll_wait(19, <unfinished ...>
1.683967 <... epoll_wait resumed>[], 1, 3700) = 0
0.000042 epoll_wait(3, <unfinished ...>
0.318118 <... epoll_wait resumed>[], 1, 2000) = 0
0.000209 sendto(21,
"0[\2\2y\32cU\4\0\n\1\0\n\1\0\2\1\0\2\1\0\1\1\0\2406\243\r\4\5N"..., 93,
0, NULL, 0) = 93
0.000095 epoll_wait(19, [], 1, 1995) = 0
1.995811 epoll_ctl(19, EPOLL_CTL_DEL, 21, 0x7ffd48b29f44) = 0
0.000049 close(21) = 0
0.000054 close(19) = 0
0.000045 write(1, "get_kdc_ip_string: Failed to get"...,
48get_kdc_ip_string: Failed to get KDC ip address
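The waiting time in a trace like this can be added up mechanically.
The sketch below assumes the first column is a relative timestamp in
seconds, as in the excerpt, and treats any value of 0.1s or more as a
wait; both the threshold and the name slow_gaps are arbitrary choices
for illustration.

```python
import re

# A relative timestamp as it appears in the excerpt, e.g. "0.965967 ".
STAMP_RE = re.compile(r"^\s*(\d+\.\d{6})\s+\S")

SAMPLE = """\
0.000041 epoll_wait(19, <unfinished ...>
0.965967 <... epoll_wait resumed>[], 1, 986) = 0
0.000058 epoll_wait(3, <unfinished ...>
1.036069 <... epoll_wait resumed>[], 1, 2000) = 0
0.000100 epoll_wait(19, <unfinished ...>
1.683967 <... epoll_wait resumed>[], 1, 3700) = 0
0.000042 epoll_wait(3, <unfinished ...>
0.318118 <... epoll_wait resumed>[], 1, 2000) = 0
0.000095 epoll_wait(19, [], 1, 1995) = 0
1.995811 epoll_ctl(19, EPOLL_CTL_DEL, 21, 0x7ffd48b29f44) = 0
"""


def slow_gaps(trace_text, threshold=0.1):
    """Return the relative timestamps that are at least `threshold` seconds."""
    gaps = []
    for line in trace_text.splitlines():
        match = STAMP_RE.match(line)
        if match:
            value = float(match.group(1))
            if value >= threshold:
                gaps.append(value)
    return gaps


if __name__ == "__main__":
    gaps = slow_gaps(SAMPLE)
    print(f"{len(gaps)} waits, {sum(gaps):.3f}s in total")
```

For the lines quoted above this adds up to about six seconds spent
waiting on 192.168.19.6 in this single attempt.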