[Samba] Annoying Winbind Pause While Looking Up Permissions

Sat Apr 19 01:20:27 GMT 2008

Our linux SLES10 fileserver is running Samba (3.0.28) and is a domain
member in a Win2003 AD. Everything has been working fine for several
months with the exception of an annoying pause that occurs while
browsing shares. The pause will last 30 seconds and occurs roughly once
every 5-10 minutes. The pause is universal, meaning, that all Windows
clients trying to browse or access files will see this pause, as well
as, any Linux processes (like doing 'ls -l'). Basically, any command
that requires winbind to determine access privileges pauses. This only
occurs for objects that have AD users or groups.

Here's the system info:
Samba 3.0.28 (3.0.25a and 3.0.25c also had this problem)
Krb5 1.4.3
Linux 2.6.16 (x64)

All of the usual tests work:
wbinfo -t
wbinfo -u
wbinfo -g
net ads join
kinit
etc...

I finally had time to investigate this further. All I did was 'ls -l
/share' where /share contained files with domain permissions. The
command paused for 30 seconds before finally listing the files
correctly. All Windows clients experience this same pause at the same
time also. Here's the data I collected:

# tcpdump -n -r tcpdump
reading from file tcpdump, link-type EN10MB (Ethernet)
16:52:20.838780 IP 192.168.1.210.44072 > 192.168.1.207.389: P
117:143(26) ack 214 win 1460 <nop,nop,timestamp 1551335289 70870460>
16:52:20.838951 IP 192.168.1.207.389 > 192.168.1.210.44072: P
214:333(119) ack 143 win 65393 <nop,nop,timestamp 70870460 1551335289>
16:52:20.839894 IP 192.168.1.210.32891 > 192.168.1.207.88:  v5
16:52:20.840419 IP 192.168.1.207.88 > 192.168.1.210.32891:
16:52:20.873183 IP 192.168.1.210.58864 > 192.168.1.207.445: . ack 1771
win 2800 <nop,nop,timestamp 1551335298 70870460>
16:52:20.877180 IP 192.168.1.210.44072 > 192.168.1.207.389: . ack 333
win 1460 <nop,nop,timestamp 1551335299 70870460>
16:52:30.837599 IP 192.168.1.210.32891 > 192.168.1.207.88:  v5
16:52:30.838482 IP 192.168.1.207.88 > 192.168.1.210.32891:  v5
16:52:40.837652 IP 192.168.1.210.32891 > 192.168.1.207.88:
16:52:40.838606 IP 192.168.1.207.88 > 192.168.1.210.32891:
16:52:50.837439 IP 192.168.1.210.44072 > 192.168.1.207.389: P
143:1410(1267) ack 333 win 1460 <nop,nop,timestamp 1551342789 70870460>
16:52:50.838112 IP 192.168.1.207.389 > 192.168.1.210.44072: P
333:383(50) ack 1410 win 64126 <nop,nop,timestamp 70870761 1551342789>

192.168.1.210 is the Linux server running Samba and 192.168.1.207 is the
Windows domain controller. You'll notice that everything is humming
along smoothly through time 16:52:20. Between 16:52:20 and 16:52:50 is
the 30 second delay. My understanding is that port 88 is related to
Kerberos. Running tcpdump while executing the same command again yields
instant results with identical network traffic minus the port 88
requests.

# strace -T -t ls -l /share
16:52:20 read(4, "/var/lib/samba/winbindd_privileg"..., 35) = 35
<0.000009>
16:52:20 lstat("/var/lib/samba/winbindd_privileged",
{st_mode=S_IFDIR|0750, st_size=72, ...}) = 0 <0.000011>
16:52:20 lstat("/var/lib/samba/winbindd_privileged/pipe",
{st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0 <0.000011>
16:52:20 socket(PF_FILE, SOCK_STREAM, 0) = 5 <0.000011>
16:52:20 fcntl(5, F_GETFL)              = 0x2 (flags O_RDWR) <0.000006>
16:52:20 fcntl(5, F_SETFL, O_RDWR|O_NONBLOCK) = 0 <0.000007>
16:52:20 fcntl(5, F_GETFD)              = 0 <0.000006>
16:52:20 fcntl(5, F_SETFD, FD_CLOEXEC)  = 0 <0.000006>
16:52:20 connect(5, {sa_family=AF_FILE,
path="/var/lib/samba/winbindd_privileged/pipe"}, 110) = 0 <0.000018>
16:52:20 close(4)                       = 0 <0.000011>
16:52:20 select(6, [5], NULL, NULL, {0, 0}) = 0 (Timeout) <0.000007>
16:52:20 write(5,
"(\10\0\0\4\0\0\0d\20\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2088) =
2088 <0.000011>
16:52:20 select(6, [5], NULL, NULL, {5, 0}) = 0 (Timeout) <4.997279>
16:52:25 select(6, [5], NULL, NULL, {5, 0}) = 0 (Timeout) <4.999895>
16:52:30 select(6, [5], NULL, NULL, {5, 0}) = 0 (Timeout) <4.999885>
16:52:35 select(6, [5], NULL, NULL, {5, 0}) = 0 (Timeout) <4.999914>
16:52:40 select(6, [5], NULL, NULL, {5, 0}) = 0 (Timeout) <4.999891>
16:52:45 select(6, [5], NULL, NULL, {5, 0}) = 0 (Timeout) <4.999916>
16:52:50 select(6, [5], NULL, NULL, {5, 0}) = 1 (in [5], left {4,
968000}) <0.033682>
16:52:50 read(5, "\354\f\0\0\2\0\0\0STARTREK-phx_api_release"..., 3240)
= 3240 <0.000014>

Notice the chain of select() calls between 16:52:20 and 16:52:50 that
all timeout after 5 seconds for a total of 30 seconds!

Finally, here's the relevant section of the winbind log:
[2008/04/18 16:52:20, 4] libsmb/namequery.c:get_dc_list(1599)
  get_dc_list: returning 2 ip addresses in an ordered list
[2008/04/18 16:52:20, 4] libsmb/namequery.c:get_dc_list(1600)
  get_dc_list: 192.168.1.207:389 192.0.2.224:389
[2008/04/18 16:52:20, 5] libads/ldap.c:ads_try_connect(180)
  ads_try_connect: sending CLDAP request to 192.168.1.207 (realm:
STARTREK.LOCAL)
[2008/04/18 16:52:20, 3] libads/ldap.c:ads_connect(394)
  Connected to LDAP server 192.168.1.207
[2008/04/18 16:52:20, 4] libads/ldap.c:ads_current_time(2414)
  time offset is 23 seconds
[2008/04/18 16:52:20, 4] libads/sasl.c:ads_sasl_bind(521)
  Found SASL mechanism GSS-SPNEGO
[2008/04/18 16:52:20, 3] libads/sasl.c:ads_sasl_spnego_bind(213)
  ads_sasl_spnego_bind: got OID=1 2 840 48018 1 2 2
[2008/04/18 16:52:20, 3] libads/sasl.c:ads_sasl_spnego_bind(213)
  ads_sasl_spnego_bind: got OID=1 2 840 113554 1 2 2
[2008/04/18 16:52:20, 3] libads/sasl.c:ads_sasl_spnego_bind(213)
  ads_sasl_spnego_bind: got OID=1 2 840 113554 1 2 2 3
[2008/04/18 16:52:20, 3] libads/sasl.c:ads_sasl_spnego_bind(213)
  ads_sasl_spnego_bind: got OID=1 3 6 1 4 1 311 2 2 10
[2008/04/18 16:52:20, 3] libads/sasl.c:ads_sasl_spnego_bind(222)
  ads_sasl_spnego_bind: got server principal name =
trieste$@STARTREK.LOCAL
[2008/04/18 16:52:20, 3] libsmb/clikrb5.c:ads_krb5_mk_req(593)
  ads_krb5_mk_req: krb5_cc_get_principal failed (No credentials cache
found)
[2008/04/18 16:52:50, 4] libsmb/clikrb5.c:ads_krb5_mk_req(610)
  ads_krb5_mk_req: Advancing clock by 13 seconds to cope with clock skew
[2008/04/18 16:52:50, 3] libsmb/clikrb5.c:ads_cleanup_expired_creds(528)
  ads_cleanup_expired_creds: Ticket in ccache[MEMORY:winbind_ccache]
expiration Fri, 18 Apr 2008 17:13:03 MST
[2008/04/18 16:52:50, 5]
libads/ldap_utils.c:ads_do_search_retry_internal(64)
  Search for
(objectSid=\01\05\00\00\00\00\00\05\15\00\00\00\5F\52\F4\2D\49\0D\1F\07\
7F\6C\71\4D\C6\1F\00\00) in <dc=STARTREK,dc=LOCAL> gave 1 replies

This shows that at the start of the pause I have the following error:
krb5_cc_get_principal failed (No credentials cache found)

This is what I think is happening:
1.) winbind is unable to find it's info in the Kerberos cache
2.) winbind queries the Windows domain controller (via port 88)
3.) the select on this socket times out for 30 seconds (even though the
domain controller responds??)
4.) winbind defaults back to "search_retry_internal" and finds what it
needs to continue

The weird thing is it only does this once every 5-10 minutes. Every
other time I don't get the "krb5_cc_get_principal failed (No credentials
cache found)" error that leads to the 30 second timeout.

If I run kinit for a domain user, that user shows up when I klist.
However, that's the only ticket that shows up.

If anyone could shed some light on this issue I would greatly appreciate
it.

Thanks,
Matt