Winbind using 100% CPU
Jim McDonough
jmcd at samba.org
Thu Apr 18 10:26:32 MDT 2013
On Thu, Apr 18, 2013 at 12:16 PM, Dylan Klomparens
<dylan.klomparens at gmail.com> wrote:
> The patch has exposed the problem. I've attached the output from winbind.
> Please let me know if you need additional information!
Wow, that's quick. Any other clues, like what was going on on the
system then? what was causing winbind to do work? How many trusted
domains?
>
> -- Dylan
>
>
> On Wed, Apr 17, 2013 at 8:26 AM, Jim McDonough <jmcd at samba.org> wrote:
>>
>> On Wed, Apr 17, 2013 at 2:02 AM, Andreas Schneider <asn at samba.org> wrote:
>> > On Tuesday 16 April 2013 15:53:35 Jim McDonough wrote:
>> >> On Tue, Apr 16, 2013 at 2:45 PM, Jeremy Allison <jra at samba.org> wrote:
>> >> > On Tue, Apr 16, 2013 at 01:46:14PM -0400, Jim McDonough wrote:
>> >> >> On Tue, Apr 16, 2013 at 10:00 AM, Jim McDonough <jmcd at samba.org>
>> >> >> wrote:
>> >> >> > On Thu, Apr 11, 2013 at 6:59 PM, Jeremy Allison <jra at samba.org>
>> >> >> > wrote:
>> >> >> >> On Thu, Apr 11, 2013 at 09:58:33AM -0400, Dylan Klomparens wrote:
>> >> >> >>> (Re-posting on this email list per Jeremy Allison's request.)
>> >> >> >>>
>> >> >> >>> I am trying to figure out why winbind is using 100% CPU on my
>> >> >> >>> file
>> >> >> >>> server.
>> >> >> >>> I am using Samba version 4.0.4. Everything is fine for a few
>> >> >> >>> minutes
>> >> >> >>> when I
>> >> >> >>> start winbind, however after a while it begins using 100% CPU. I
>> >> >> >>> haven't
>> >> >> >>> been able to narrow down what triggers this CPU usage spike, but
>> >> >> >>> I
>> >> >> >>> did
>> >> >> >>> attach the GNU debugger to find out what's going on in the
>> >> >> >>> process.
>> >> >> >>> The
>> >> >> >>> backtrace revealed this information:
>> >> >> >>>
>> >> >> >>> #0 0x000000000041cf30 in _talloc_free at plt ()
>> >> >> >>> #1 0x0000000000452320 in winbindd_reinit_after_fork ()
>> >> >> >>> #2 0x00000000004524e6 in fork_domain_child ()
>> >> >> >>> #3 0x0000000000453585 in wb_child_request_trigger ()
>> >> >> >>> #4 0x000000381d2048e2 in tevent_common_loop_immediate () from
>> >> >> >>> /lib64/libtevent.so.0
>> >> >> >>> #5 0x00007fbed6b98e17 in run_events_poll () from
>> >> >> >>> /lib64/libsmbconf.so.0
>> >> >> >>> #6 0x00007fbed6b9922e in s3_event_loop_once () from
>> >> >> >>> /lib64/libsmbconf.so.0
>> >> >> >>> #7 0x000000381d204060 in _tevent_loop_once () from
>> >> >> >>> /lib64/libtevent.so.0
>> >> >> >>> #8 0x000000000042049a in main ()
>> >> >> >>>
>> >> >> >>> Apparently it's stuck in the winbindd_reinit_after_fork (and
>> >> >> >>> more
>> >> >> >>> specifically the _talloc_free function). This code resides in
>> >> >> >>> $SOURCE_HOME\source3\winbindd\winbindd_dual.c.
>> >> >> >>
>> >> >> >> That looks like corrupted memory - probably a loop
>> >> >> >> in the talloc tree.
>> >> >> >
>> >> >> > I've got a user who sees this and we're adding the same dlinklist
>> >> >> > element twice, creating a loop in the winbind child list.
>> >> >> >
>> >> >> > I've got a broken wrist so responses take a while, but that's my
>> >> >> > current hint. On 3.6.3 and 3.6.13.
>> >> >>
>> >> >> I see this in the parent winbind log. The last 3 entries are
>> >> >> changes
>> >> >> I made to the dlist macros within winbind only. you can see a 2
>> >> >> second delay and then a second add of the same item to the child
>> >> >> winbind list. No entries in between (and a production system so
>> >> >> there
>> >> >> is reluctance to increase the debug level).
>> >> >>
>> >> >> [2013/03/14 09:08:57.795585, 3]
>> >> >> winbindd/winbindd_getpwnam.c:56(winbindd_getpwnam_send)
>> >> >>
>> >> >> getpwnam xx+xxxxxxxxx-112$
>> >> >>
>> >> >> [2013/03/14 09:08:57.796185, 3]
>> >> >> winbindd/winbindd_getpwnam.c:56(winbindd_getpwnam_send)
>> >> >>
>> >> >> getpwnam xx+yyyyyyy
>> >> >>
>> >> >> [2013/03/14 09:09:00.077660, 0]
>> >> >> winbindd/winbindd_dual.c:1399(fork_domain_child)
>> >> >>
>> >> >> adding 0x7ff91105a000 to list at 0x7ff911053510
>> >> >>
>> >> >> [2013/03/14 09:09:02.435367, 0]
>> >> >> winbindd/winbindd_dual.c:1399(fork_domain_child)
>> >> >>
>> >> >> adding 0x7ff91105a000 to list at 0x7ff91105a000
>> >> >>
>> >> >> [2013/03/14 09:09:02.435510, 0] lib/util.c:1117(smb_panic)
>> >> >>
>> >> >> PANIC (pid 28628): duplicate
>> >> >
>> >> > Jim, do your changes lead to a smb_panic on an
>> >> > DLIST_ADD of a duplicate ? If so, can you post
>> >> > a complete backtrace ?
>> >> >
>> >> > Jeremy.
>> >>
>> >> [2013/03/14 09:09:02.437394, 0] lib/util.c:1221(log_stack_trace)
>> >> BACKTRACE: 11 stack frames:
>> >> #0 /usr/sbin/winbindd(log_stack_trace+0x1a) [0x7f02379883ba]
>> >> #1 /usr/sbin/winbindd(smb_panic+0x2b) [0x7f023798848b]
>> >> #2 /usr/sbin/winbindd(+0x10806e) [0x7f02378d206e]
>> >> #3 /usr/sbin/winbindd(+0x1084b5) [0x7f02378d24b5]
>> >> #4 /usr/sbin/winbindd(tevent_common_loop_immediate+0xea)
>> >> [0x7f023799917a] #5 /usr/sbin/winbindd(run_events_poll+0x3b)
>> >> [0x7f02379973db]
>> >> #6 /usr/sbin/winbindd(+0x1cdb12) [0x7f0237997b12]
>> >> #7 /usr/sbin/winbindd(_tevent_loop_once+0x90) [0x7f0237997f10]
>> >> #8 /usr/sbin/winbindd(main+0x7db) [0x7f02378aa13b]
>> >> #9 /lib64/libc.so.6(__libc_start_main+0xe6) [0x7f0234a2ec36]
>> >> #10 /usr/sbin/winbindd(+0xdd839) [0x7f02378a7839]
>> >>
>> >>
>> >> Unfortunately their core file doesn't seem quite right...this was out
>> >> of their logs and I'm having trouble with the core, but this fscking
>> >> splint keeps me moving at a snail's pace
>> >
>> > Jim, can you port this patch to master and ask Dylan to run with the
>> > patch so
>> > we get a full backtrace?
>> Here it is on master...
>>
>>
>>
>>
>> --
>> Jim McDonough
>> Samba Team
>> SUSE labs
>> jmcd at samba dot org
>> jmcd at themcdonoughs dot org
>
>
--
Jim McDonough
Samba Team
SUSE labs
jmcd at samba dot org
jmcd at themcdonoughs dot org
More information about the samba-technical
mailing list