Winbindd using 100% of CPU. Any solution?

Richard Sharpe realrichardsharpe at gmail.com
Mon Dec 16 12:10:05 MST 2013


On Mon, Dec 16, 2013 at 11:05 AM, Andreas Schneider <asn at samba.org> wrote:
> On Monday 16 December 2013 10:57:24 Richard Sharpe wrote:
>> On Mon, Dec 16, 2013 at 10:56 AM, Andreas Schneider <asn at samba.org> wrote:
>> > On Monday 16 December 2013 10:46:26 Richard Sharpe wrote:
>> >> On Wed, Dec 4, 2013 at 12:06 PM, Richard Sharpe
>> >>
>> >> <realrichardsharpe at gmail.com> wrote:
>> >> > The build I got onto the customer system did not have the damn patch
>> >> > to dump core when we hit that problem.
>> >> >
>> >> > Trying again with a new build.
>> >>
>> >> We got a hit or two:
>> >>
>> >> [2013/12/16 13:28:21.454633,  0]
>> >> winbindd/winbindd_cache.c:3148(initialize_winbindd_cach
>> >> e)
>> >>
>> >>   initialize_winbindd_cache: clearing cache and re-creating with
>> >>
>> >> version number 2
>> >> [2013/12/16 13:28:21.933631,  0]
>> >> winbindd/winbindd_util.c:330(trustdom_list_done)
>> >>
>> >>   Got invalid trustdom response:OIAA. Cannot get the SID for (NULL SID).
>> >>
>> >> [2013/12/16 13:33:21.467075,  0]
>> >> winbindd/winbindd_util.c:330(trustdom_list_done)
>> >>
>> >>   Got invalid trustdom response:OIAA. Cannot get the SID for (NULL SID).
>> >>
>> >> [2013/12/16 13:33:21.469075,  0]
>> >> winbindd/winbindd_dual.c:1404(fork_domain_child)
>> >>
>> >>   adding 0x8033730a0 to list at 0xeac360
>> >>
>> >> [2013/12/16 13:33:21.473075,  0] lib/util.c:1117(smb_panic)
>> >>
>> >>   PANIC (pid 3190): duplicate!
>> >>
>> >> [2013/12/16 13:33:21.481075,  0] lib/util.c:1221(log_stack_trace)
>> >>
>> >>   BACKTRACE: 10 stack frames:
>> >>    #0 0x5a51cc <smb_panic+108> at /usr/local/sbin/winbindd
>> >>    #1 0x4bcf48 <wb_child_domain+1224> at /usr/local/sbin/winbindd
>> >>    #2 0x4b92c3 <wb_child_request_send+291> at /usr/local/sbin/winbindd
>> >>    #3 0x5bdb09 <_tevent_queue_create+361> at /usr/local/sbin/winbindd
>> >>    #4 0x5bbef8 <tevent_common_loop_immediate+488> at
>> >>
>> >> /usr/local/sbin/winbindd #5 0x5b8a42 <run_events_poll+82> at
>> >> /usr/local/sbin/winbindd
>> >>
>> >>    #6 0x5b929b <get_timed_events_timeout+395> at /usr/local/sbin/winbindd
>> >>    #7 0x5ba50f <_tevent_loop_once+223> at /usr/local/sbin/winbindd
>> >>    #8 0x48a5d5 <main+2613> at /usr/local/sbin/winbindd
>> >>    #9 0x48702e <_start+142> at /usr/local/sbin/winbindd
>> >>
>> >> [2013/12/16 13:33:21.481075,  0] lib/fault.c:372(dump_core)
>> >>
>> >>   dumping core in /core
>> >>
>> >> We are getting this consistently now.
>> >
>> > I suggest to write a talloc report before you dump the core. It could be
>> > useful to understand the broader picture:
>> >
>> > talloc_report_full(0, fopen("/tmp/talloc_report.log","w"))
>> >
>> >
>> > Maybe create one while winbind is working just fine, so you can compare
>> > it.
>>
>> I Have a talloc report, indeed, I have lots of them.
>>
>> However, there is a limit to what I can do and how many times I can
>> push a new build to the customer.
>
> The talloc report might tell you more of the codepath than gdb can do. Cause
> of the structures and talloc names allocated. Maybe also from the hierarchy of
> the tree. The question is which part of the winbind code is executed before we
> run into the duplicate. Maybe the memory of it is still allocated, then we
> could look at the code run before and with some luck spot the culprit :)
>
> This way we found the talloc_free_children() bug. It took me a week to get my
> head around it but then Simo was able to write a simpler reproducer.
>
>
> https://lists.samba.org/archive/samba-technical/2011-July/078817.html

I have more info but will have to discontinue looking at that until
after lunch with Jeremy. I will grab the talloc reports soon:

This is where it is crashing:

Program received signal SIGABRT, Aborted.
0x00000008029ec78c in kill () from /lib/libc.so.7
(gdb) where
#0  0x00000008029ec78c in kill () from /lib/libc.so.7
#1  0x00000008029eb58b in abort () from /lib/libc.so.7
#2  0x0000000000591e4d in dump_core () at lib/fault.c:391
#3  0x00000000005a5314 in smb_panic (why=0xb61913 "duplicate!") at
lib/util.c:1133
#4  0x00000000004bcf48 in fork_domain_child (child=0x80334c940)
    at winbindd/winbindd_dual.c:1404
#5  0x00000000004b92c3 in wb_child_request_trigger (req=0x80338a3d0,
private_data=0x0)
    at winbindd/winbindd_dual.c:200
#6  0x00000000005bdb09 in tevent_queue_immediate_trigger (ev=0x80331e110,
    im=0x80338a190, private_data=0x80338cad0) at
../lib/tevent/tevent_queue.c:144
#7  0x00000000005bbef8 in tevent_common_loop_immediate (ev=0x80331e110)
    at ../lib/tevent/tevent_immediate.c:135
#8  0x00000000005b8a42 in run_events_poll (ev=0x80331e110, pollrtn=0, pfds=0x0,
    num_pfds=0) at lib/events.c:197
#9  0x00000000005b929b in s3_event_loop_once (ev=0x80331e110,
    location=0xb520c4 "winbindd/winbindd.c:1456") at lib/events.c:331
#10 0x00000000005ba50f in _tevent_loop_once (ev=0x80331e110,
    location=0xb520c4 "winbindd/winbindd.c:1456") at ../lib/tevent/tevent.c:494
#11 0x000000000048a5d5 in main (argc=3, argv=0x7fffffffecb0,
envp=0x7fffffffecd0)
    at winbindd/winbindd.c:1456


It is crashing here:

                child->next = child->prev = NULL;
                DLIST_ADD(winbindd_children, child);

(gdb) p *winbindd_children
$2 = {next = 0x80334c940, prev = 0x80334c9a0, pid = 30293, domain = 0x0,
  logfilename = 0x803330300 "/var/log/samba/log.winbindd-idmap", sock = 30,
  queue = 0x8033302d0, binding_handle = 0x8033303d0, lockout_policy_event = 0x0,
  machine_password_change_event = 0x0, table = 0xe09680}
(gdb) p *winbindd_children->next
$3 = {next = 0x0, prev = 0x0, pid = 31377, domain = 0x80334f400,
  logfilename = 0x80338cb00 "/var/log/samba/log.wb-XCHANGE", sock = -1,
  queue = 0x80338cad0, binding_handle = 0x80338cbd0, lockout_policy_event = 0x0,
  machine_password_change_event = 0x0, table = 0xe09580}
(gdb) p *child
$4 = {next = 0x0, prev = 0x0, pid = 31377, domain = 0x80334f400,
  logfilename = 0x80338cb00 "/var/log/samba/log.wb-XCHANGE", sock = -1,
  queue = 0x80338cad0, binding_handle = 0x80338cbd0, lockout_policy_event = 0x0,
  machine_password_change_event = 0x0, table = 0xe09580}
(gdb) p *child->domain
$5 = {name = "XCHANGE", '\000' <repeats 248 times>,
  alt_name = "xchange.some.dom", '\000' <repeats 239 times>,
  forest_name = '\000' <repeats 255 times>, sid = {sid_rev_num = 1 '\001',
    num_auths = 4 '\004', id_auth = "\000\000\000\000\000\005", sub_auths = {21,
      782252399, 1160315966, 1364796038, 0 <repeats 11 times>}},
domain_flags = 0,
  domain_type = 0, domain_trust_attribs = 0, initialized = true,
native_mode = true,
  active_directory = true, primary = false, internal = false, online = false,
  startup_time = 0, startup = false, can_do_samlogon_ex = false,
  can_do_ncacn_ip_tcp = false, can_do_validation6 = false, methods = 0xe08840,
  backend = 0x0, private_data = 0x0, have_idmap_config = false,
id_range_low = 0,
  id_range_high = 0, dc_probe_pid = -1, dcname = '\000' <repeats 255
times>, dcaddr = {
    ss_len = 0 '\000', ss_family = 0 '\000', __ss_pad1 = "\000\000\000\000\000",
    __ss_align = 0, __ss_pad2 = '\000' <repeats 111 times>}, last_seq_check = 0,
  sequence_number = 4294967295, last_status = {v = 0}, conn = {cli = 0x0,
    samr_pipe = 0x0, sam_connect_handle = {handle_type = 0, uuid =
{time_low = 0,
        time_mid = 0, time_hi_and_version = 0, clock_seq = "\000",
        node = "\000\000\000\000\000"}}, sam_domain_handle =
{handle_type = 0, uuid = {
        time_low = 0, time_mid = 0, time_hi_and_version = 0, clock_seq = "\000",
        node = "\000\000\000\000\000"}}, lsa_pipe = 0x0, lsa_pipe_tcp = 0x0,
    lsa_policy = {handle_type = 0, uuid = {time_low = 0, time_mid = 0,
        time_hi_and_version = 0, clock_seq = "\000", node =
"\000\000\000\000\000"}},
    netlogon_pipe = 0x0}, children = 0x80334c940, check_online_timeout = 0,
  check_online_event = 0x0, prev = 0x80334ee00, next = 0x80334fa00}

(Some names changed ...)

-- 
Regards,
Richard Sharpe
(何以解憂?唯有杜康。--曹操)


More information about the samba-technical mailing list