Winbindd using 100% of CPU. Any solution?

Andreas Schneider asn at samba.org
Mon Dec 16 12:05:32 MST 2013


On Monday 16 December 2013 10:57:24 Richard Sharpe wrote:
> On Mon, Dec 16, 2013 at 10:56 AM, Andreas Schneider <asn at samba.org> wrote:
> > On Monday 16 December 2013 10:46:26 Richard Sharpe wrote:
> >> On Wed, Dec 4, 2013 at 12:06 PM, Richard Sharpe
> >> 
> >> <realrichardsharpe at gmail.com> wrote:
> >> > The build I got onto the customer system did not have the damn patch
> >> > to dump core when we hit that problem.
> >> > 
> >> > Trying again with a new build.
> >> 
> >> We got a hit or two:
> >> 
> >> [2013/12/16 13:28:21.454633,  0]
> >> winbindd/winbindd_cache.c:3148(initialize_winbindd_cache)
> >> 
> >>   initialize_winbindd_cache: clearing cache and re-creating with version number 2
> >> 
> >> [2013/12/16 13:28:21.933631,  0]
> >> winbindd/winbindd_util.c:330(trustdom_list_done)
> >> 
> >>   Got invalid trustdom response:OIAA. Cannot get the SID for (NULL SID).
> >> 
> >> [2013/12/16 13:33:21.467075,  0]
> >> winbindd/winbindd_util.c:330(trustdom_list_done)
> >> 
> >>   Got invalid trustdom response:OIAA. Cannot get the SID for (NULL SID).
> >> 
> >> [2013/12/16 13:33:21.469075,  0]
> >> winbindd/winbindd_dual.c:1404(fork_domain_child)
> >> 
> >>   adding 0x8033730a0 to list at 0xeac360
> >> 
> >> [2013/12/16 13:33:21.473075,  0] lib/util.c:1117(smb_panic)
> >> 
> >>   PANIC (pid 3190): duplicate!
> >> 
> >> [2013/12/16 13:33:21.481075,  0] lib/util.c:1221(log_stack_trace)
> >> 
> >>   BACKTRACE: 10 stack frames:
> >>    #0 0x5a51cc <smb_panic+108> at /usr/local/sbin/winbindd
> >>    #1 0x4bcf48 <wb_child_domain+1224> at /usr/local/sbin/winbindd
> >>    #2 0x4b92c3 <wb_child_request_send+291> at /usr/local/sbin/winbindd
> >>    #3 0x5bdb09 <_tevent_queue_create+361> at /usr/local/sbin/winbindd
> >>    #4 0x5bbef8 <tevent_common_loop_immediate+488> at /usr/local/sbin/winbindd
> >>    #5 0x5b8a42 <run_events_poll+82> at /usr/local/sbin/winbindd
> >>    #6 0x5b929b <get_timed_events_timeout+395> at /usr/local/sbin/winbindd
> >>    #7 0x5ba50f <_tevent_loop_once+223> at /usr/local/sbin/winbindd
> >>    #8 0x48a5d5 <main+2613> at /usr/local/sbin/winbindd
> >>    #9 0x48702e <_start+142> at /usr/local/sbin/winbindd
> >> 
> >> [2013/12/16 13:33:21.481075,  0] lib/fault.c:372(dump_core)
> >> 
> >>   dumping core in /core
> >> 
> >> We are getting this consistently now.
> > 
> > I suggest writing a talloc report before you dump the core. It could be
> > useful for understanding the broader picture:
> > 
> > talloc_report_full(0, fopen("/tmp/talloc_report.log","w"))
> > 
> > 
> > Maybe create one while winbind is working just fine, so you can compare
> > the two.
> 
> I have a talloc report; indeed, I have lots of them.
> 
> However, there is a limit to what I can do and how many times I can
> push a new build to the customer.

The talloc report might tell you more about the code path than gdb can,
because of the structures and talloc names that are allocated, and maybe also
because of the hierarchy of the tree. The question is which part of the
winbind code is executed before we run into the duplicate. Maybe its memory is
still allocated; then we could look at the code that ran before and, with some
luck, spot the culprit :)
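
To make the comparison between a report from a healthy winbindd and one taken
just before the panic less tedious, something like the following rough sketch
might help. It only counts the lines in each saved report that mention a given
talloc name and returns the difference. The function names and the file names
are made up for illustration, and it assumes nothing about the report format
beyond "one allocation per line".

```c
#include <stdio.h>
#include <string.h>

/*
 * Count the lines of a saved talloc report that mention `name`.
 * Returns -1 if the file cannot be opened.
 * Lines longer than the buffer are read in chunks and each chunk is
 * checked separately, which is good enough for a rough comparison.
 */
long count_report_lines(const char *path, const char *name)
{
	char line[4096];
	long hits = 0;
	FILE *f = fopen(path, "r");

	if (f == NULL) {
		return -1;
	}
	while (fgets(line, sizeof(line), f) != NULL) {
		if (strstr(line, name) != NULL) {
			hits++;
		}
	}
	fclose(f);
	return hits;
}

/*
 * Difference (bad - good) in the number of lines mentioning `name`.
 * Stores 0 in *ok if either report could not be read.
 */
long report_delta(const char *good, const char *bad,
		  const char *name, int *ok)
{
	long g = count_report_lines(good, name);
	long b = count_report_lines(bad, name);

	*ok = (g >= 0 && b >= 0);
	return *ok ? b - g : 0;
}
```

For example, report_delta("/tmp/talloc_good.log", "/tmp/talloc_bad.log",
"winbindd_child", &ok) returning a positive number would hint that more
objects with that name are alive in the failing case. The name to search for
is only an example; any talloc name from the report works.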

This is how we found the talloc_free_children() bug. It took me a week to get
my head around it, but then Simo was able to write a simpler reproducer.


https://lists.samba.org/archive/samba-technical/2011-July/078817.html


	-- andreas

-- 
Andreas Schneider                   GPG-ID: CC014E3D
Samba Team                             asn at samba.org
www.samba.org


