winbindd stuck at getaddrinfo.

Wed Oct 29 11:34:23 MDT 2014

Hi Volker,

Thanks for the quick response.

Since all other process stacks having fork_domain_child, I assume that this
specific one if from parent since I did not find fork_domain_child() in
stack trace.

Killing stuck winbindd child in case #2 did not recover the situation. Had
to kill the parent and restart winbindd manually. In first case, we have
actually restarted samba.
Looks like libc getaddrinfo() is having 60 seconds timeout. In my internal
recreation of problem, I could see the stuck stack trace remains there for
about a minute and start showing different stack trace (different path
leading to getaddrinfo).

We have one more getaddrinfo() in heimdal.  Since the system(libc) supports
getaddrinfo(), its picking that instead of heimdal. Is there any
recommendation that we should use heimdal version of getaddrinfo() over
libc?

Also I found that krb5_sendto() has retry logic if getaddrinfo() fails for
some reason.

krb5_sendto (krb5_context context,
     const krb5_data *send_data,
     krb5_krbhst_handle handle,
     krb5_data *receive)
{
     ....
     ....
     for (i = 0; i < context->max_retries; ++i) {
 krb5_krbhst_info *hi;

 while (krb5_krbhst_next(context, handle, &hi) == 0) {
     struct addrinfo *ai, *a;

      ....
      ....
      ret = krb5_krbhst_get_addrinfo(context, hi, &ai);
if (ret)
   continue;
....
}

Max_retries are initialized to 3. But I am not sure krb5_krbhst_next will
actually list. If we assume that there is 60 seconds timeout, then with
this retry logic it would take much more time to actually come out of
getaddrinfo() . I can say that this issue seen with customers with big DNS
and DC deployments.

Infact I am unable to relate this issue completely to DNS response.
Because, winbindd is able to recover from this hung when it is restarted.
Problem should be persistent if its specific to DNS.

Sorry for typing this big mail. I am scratching my head to root cause this
issue and find a probable solution. :-)

Thanks,
Hemanth.

On Wed, Oct 29, 2014 at 3:41 AM, Volker Lendecke <Volker.Lendecke at sernet.de>
wrote:

> On Wed, Oct 29, 2014 at 12:20:53AM -0700, Hemanth Thummala wrote:
> > We are using samba 3.6.12+ stack on FreeBSD 8.0. Recently we are hitting
> a
> > winbindd hung issue at few of our customers. Each time we could see that
> > winbindd getting hung at gettaddrinfo.
> >
> > Here are couple of instances.
> >
> > Thread 1 (Thread 8030021c0 (LWP 101442)):
> > #0  0x00000008027079dc in kevent () from /lib/libc.so.7
> > #1  0x00000008026d9d81 in ?? () from /lib/libc.so.7
> > #2  0x00000008026da6e4 in __res_nsend () from /lib/libc.so.7
> > #3  0x00000008026e91ae in ?? () from /lib/libc.so.7
> > #4  0x00000008026e94bf in ?? () from /lib/libc.so.7
> > #5  0x00000008026ea2ba in ?? () from /lib/libc.so.7
> > #6  0x00000008026fb7e3 in nsdispatch () from /lib/libc.so.7
> > #7  0x00000008026eb907 in getaddrinfo () from /lib/libc.so.7
> > #8  0x0000000801505d82 in krb5_krbhst_get_addrinfo () from
> > /usr/lib/libkrb5.so.10
> > #9  0x00000008015051c0 in krb5_sendto () from /usr/lib/libkrb5.so.10
> > #10 0x000000080150549f in krb5_sendto_context () from
> /usr/lib/libkrb5.so.10
> > #11 0x00000008014ee82c in krb5_get_init_creds () from
> /usr/lib/libkrb5.so.10
> > #12 0x00000008014ef34a in krb5_get_init_creds_password () from
> > /usr/lib/libkrb5.so.10
> > #13 0x000000000082c193 in kerberos_kinit_password_ext
> > (principal=0x803030b80 "HOSTNAME$@CORP.DOMAIN.COM", password=0x80300a4d0
> > "4hOZxLoyOqynNr", time_offset=0, expire_time=0x0, renew
> > _till_time=0x0, cache_name=0x80300cdc0 "MEMORY:cliconnect",
> > request_pac=false, add_netbios_addr=false, renewable_time=0,
> ntstatus=0x0)
> > at libads/kerberos.c:232
> > #14 0x000000000082c426 in kerberos_kinit_password (principal=0x1c
> <Address
> > 0x1c out of bounds>, password=0x7fffffff9ba0 "\035", time_offset=1,
> > cache_name=0x7fffffff9bd0 "\n") at libads/kerb
> > eros.c:657
> > #15 0x000000000058e47f in cli_session_setup_spnego (cli=0x803049f50,
> > user=0x803030b80 "HOSTNAME$@CORP.DOMAIN.COM", pass=0x80300a4d0
> > "4hOZxLoyOqynNr", user_domain=0x80300d0f0 "CORP",
> > dest_realm=0x80305f300 "corp.domain.com") at libsmb/cliconnect.c:1861
> > #16 0x000000000049f404 in cm_prepare_connection (retry=<optimized out>,
> > cli=<optimized out>, controller=<optimized out>, sockfd=<optimized out>,
> > domain=<optimized out>) at winbindd/winbindd
> > _cm.c:893
> > #17 cm_open_connection (domain=0x80305f200, new_conn=0x80305f720) at
> > winbindd/winbindd_cm.c:1606
> > #18 0x000000000049f89d in init_dc_connection_network (domain=0x80305f200)
> > at winbindd/winbindd_cm.c:1788
> > #19 0x000000000049f8ee in init_dc_connection (domain=0x1c) at
> > winbindd/winbindd_cm.c:1808
> > #20 0x000000000049f911 in init_dc_connection_rpc (domain=0x1c) at
> > winbindd/winbindd_cm.c:1815
> > #21 0x000000000049f97d in cm_connect_netlogon (domain=0x1c,
> > cli=0x7fffffff9ba0) at winbindd/winbindd_cm.c:2623
> > #22 0x00000000004983da in winbind_samlogon_retry_loop
> (domain=0x80305f200,
> > mem_ctx=0x8030095b0, logon_parameters=2080, server=0x80305f590 "
> > DC01.corp.domain.com", username=0x7fffff
> > ffe074 "1420djc", domainname=0x7fffffffe174 "CORP",
> > workstation=0x7fffffffe47c "MH2017", chal=0x7fffffffe068
> > "\375\211X\310{\303\021\300 \b", lm_response=..., nt_response=...,
> > info3=0x7ffff
> > fffd108) at winbindd/winbindd_pam.c:1178
> > #23 0x0000000000499626 in winbindd_dual_pam_auth_crap
> (domain=0x80305f200,
> > state=0x7fffffffe800) at winbindd/winbindd_pam.c:1875
> > #24 0x00000000004aeeed in child_process_request (state=<optimized out>,
> > child=<optimized out>) at winbindd/winbindd_dual.c:495
> > #25 fork_domain_child (child=<optimized out>) at
> > winbindd/winbindd_dual.c:1609
> > #26 wb_child_request_trigger (req=<optimized out>,
> private_data=<optimized
> > out>) at winbindd/winbindd_dual.c:200
> > #27 0x0000000000569db0 in tevent_common_loop_immediate (ev=0x80301e110)
> at
> > ../lib/tevent/tevent_immediate.c:139
> > #28 0x0000000000568075 in run_events_poll (ev=0x80301e110, pollrtn=0,
> > pfds=0x0, num_pfds=0) at lib/events.c:197
> > #29 0x0000000000568799 in s3_event_loop_once (ev=0x80301e110,
> > location=<optimized out>) at lib/events.c:331
> > #30 0x0000000000568bb1 in _tevent_loop_once (ev=0x80301e110,
> > location=0x8b45fa "winbindd/winbindd.c:1491") at
> ../lib/tevent/tevent.c:494
> > #31 0x0000000000489a22 in main (argc=<optimized out>, argv=<optimized
> out>,
> > envp=<optimized out>) at winbindd/winbindd.c:1491
> >
> > Above stack belongs to parent winbindd process. And below one is child
> > winbindd process.
>
> Really? I see child_process_request() in the backtrace which
> at least hints at it being in a child. That's at least
> interesting to investigate how the winbind parent ended up
> in this call stack. winbindd_dual_pam_auth_crap is at least
> designed to only be called from within a child.
>
> However, the only real way around this is to completely
> isolate and time-out the getaddinfo call in a separate
> process if the FreeBSD libc does not time out at all. This
> would be quite an undertaking, because as you can see we
> also have to much better isolate the kerberos calls.
>
> In the meantime, does killing the stuck winbind child only
> help?
>
> Thanks,
>
> Volker
>
> --
> SerNet GmbH, Bahnhofsallee 1b, 37081 Göttingen
> phone: +49-551-370000-0, fax: +49-551-370000-9
> AG Göttingen, HRB 2816, GF: Dr. Johannes Loxen
> http://www.sernet.de, mailto:kontakt at sernet.de
>