[Samba] winbind causes Linux to lockup when connectivity to AD is lost (subject line edited for clarity)

Robert LeBlanc robert at leblancnet.us
Fri Oct 23 13:19:46 MDT 2009


Here is a capture of top at the time:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5842 root      20   0  873m 6912 4612 S  0.0  0.4   0:01.20 winbindd
 5848 root      20   0  872m 3260 2272 S  0.0  0.2   0:00.08 winbindd
 5849 root      20   0  872m 3640 2652 S  0.0  0.2   0:00.06 winbindd
 5850 root      20   0  872m 3320 2200 S  0.0  0.2   0:00.06 winbindd
 5859 root      20   0  874m 2684 1448 S  0.0  0.2   0:00.00 winbindd
 5954 root      20   0  872m 3740 2284 S  0.0  0.2   0:00.02 winbindd
 5955 root      20   0  872m 3804 2348 S  0.0  0.2   0:00.04 winbindd
 6025 root      20   0  873m 1544    4 S  0.0  0.1   0:00.00 winbindd
 6026 root      20   0  873m 1548    4 S  0.0  0.1   0:00.00 winbindd
 6518 root      20   0  873m 5048 3476 S  0.0  0.3   0:00.00 winbindd
 6576 root      20   0  873m 6228 4232 S  0.0  0.4   0:00.00 winbindd
    5 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/0
  529 root      16  -4 21076  632    0 S  0.0  0.0   0:00.16 udevd
 6574 root      20   0 18824 1264  940 R  0.0  0.1   0:00.10 top
 1761 root      20   0  5904  320  184 S  0.0  0.0   0:00.06 syslogd
 1805 root      20   0 48868  720  216 S  0.0  0.0   0:00.00 sshd
 5768 root      20   0 78572  916  200 S  0.0  0.1   0:00.14 sshd
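
To put the capture in context, here is a rough sketch that totals the RES column for the winbindd processes listed above and compares it with the RAM and pagecache page counts from the Mem-info dump quoted below. The 4 KiB page size is an assumption (typical for amd64, and consistent with the "4kB" buddy-list entries in the log), and the capture was taken some minutes after the OOM event, so the two sets of figures are not simultaneous.

```python
# RES values (KiB) copied from the winbindd rows of the top capture.
winbindd_res_kib = [6912, 3260, 3640, 3320, 2684, 3740, 3804,
                    1544, 1548, 5048, 6228]

PAGE_KIB = 4              # assumption: 4 KiB pages on this amd64 kernel
total_ram_pages = 393200  # "393200 pages of RAM" in the kernel log
pagecache_pages = 364394  # "364394 total pagecache pages" in the kernel log

# Convert page counts to KiB so they can be compared with top's RES column.
total_res_kib = sum(winbindd_res_kib)
total_ram_kib = total_ram_pages * PAGE_KIB
pagecache_kib = pagecache_pages * PAGE_KIB

# Share of RAM that is resident in winbindd vs. held as pagecache.
winbindd_share = total_res_kib / total_ram_kib
pagecache_share = pagecache_kib / total_ram_kib

if __name__ == "__main__":
    print(f"winbindd resident total: {total_res_kib} KiB "
          f"({winbindd_share:.1%} of RAM)")
    print(f"pagecache:               {pagecache_kib} KiB "
          f"({pagecache_share:.1%} of RAM)")
```

By these numbers the winbindd processes hold roughly 41 MiB resident in total, a small fraction of RAM, while pagecache accounts for most of it. That only describes this one capture; it does not by itself explain why the OOM killer selected winbindd.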


Robert LeBlanc
Life Sciences & Undergraduate Education Computer Support
Brigham Young University


On Fri, Oct 23, 2009 at 1:17 PM, Robert LeBlanc <robert at leblancnet.us> wrote:

> I also see this in the syslog sometimes:
>
> Oct 23 13:09:35 lsbeast-i2 kernel: [74133.132286] rsync invoked oom-killer:
> gfp_mask=0x201d2, order=0, oomkilladj=0
> Oct 23 13:09:35 lsbeast-i2 kernel: [74133.132649] Pid: 6516, comm: rsync
> Not tainted 2.6.26-2-amd64 #1
> Oct 23 13:09:35 lsbeast-i2 kernel: [74133.132916]
> Oct 23 13:09:35 lsbeast-i2 kernel: [74133.132917] Call Trace:
> Oct 23 13:09:35 lsbeast-i2 kernel: [74133.133470]  [<ffffffff802738c0>]
> oom_kill_process+0x57/0x1dc
> Oct 23 13:09:35 lsbeast-i2 kernel: [74133.133746]  [<ffffffff8023b551>]
> __capable+0x9/0x1c
> Oct 23 13:09:35 lsbeast-i2 kernel: [74133.133993]  [<ffffffff80273beb>]
> badness+0x188/0x1c7
> Oct 23 13:09:35 lsbeast-i2 kernel: [74133.134245]  [<ffffffff80273e1f>]
> out_of_memory+0x1f5/0x28e
> Oct 23 13:09:35 lsbeast-i2 kernel: [74133.140836]  [<ffffffff80276b70>]
> __alloc_pages_internal+0x31d/0x3bf
> Oct 23 13:09:35 lsbeast-i2 kernel: [74133.141048]  [<ffffffff80272d1c>]
> generic_file_aio_read+0x3b7/0x4ae
> Oct 23 13:09:35 lsbeast-i2 kernel: [74133.141279]  [<ffffffff8029ae47>]
> do_sync_read+0xc9/0x10c
> Oct 23 13:09:35 lsbeast-i2 kernel: [74133.141472]  [<ffffffff80246221>]
> autoremove_wake_function+0x0/0x2e
> Oct 23 13:09:35 lsbeast-i2 kernel: [74133.141682]  [<ffffffff8029b638>]
> vfs_read+0xaa/0x152
> Oct 23 13:09:35 lsbeast-i2 kernel: [74133.141864]  [<ffffffff8029ba19>]
> sys_read+0x45/0x6e
> Oct 23 13:09:35 lsbeast-i2 kernel: [74133.142046]  [<ffffffff8020beca>]
> system_call_after_swapgs+0x8a/0x8f
> Oct 23 13:09:35 lsbeast-i2 kernel: [74133.142254]
> Oct 23 13:09:35 lsbeast-i2 kernel: [74133.142376] Mem-info:
> Oct 23 13:09:35 lsbeast-i2 kernel: [74133.142511] Node 0 DMA per-cpu:
> Oct 23 13:09:35 lsbeast-i2 kernel: [74133.142662] CPU    0: hi:    0,
> btch:   1 usd:   0
> Oct 23 13:09:35 lsbeast-i2 kernel: [74133.142844] Node 0 DMA32 per-cpu:
> Oct 23 13:09:35 lsbeast-i2 kernel: [74133.142998] CPU    0: hi:  186,
> btch:  31 usd: 173
> Oct 23 13:09:35 lsbeast-i2 kernel: [74133.143183] Active:189862
> inactive:179626 dirty:0 writeback:0 unstable:0
> Oct 23 13:09:35 lsbeast-i2 kernel: [74133.143184]  free:3011 slab:7697
> mapped:76 pagetables:1122 bounce:0
> Oct 23 13:09:35 lsbeast-i2 kernel: [74133.143592] Node 0 DMA free:6020kB
> min:32kB low:40kB high:48kB active:3012kB inactive:2676kB present:10724kB
> pages_scanned:9007 all_unreclaimable? yes
> Oct 23 13:09:35 lsbeast-i2 kernel: [74133.144711] lowmem_reserve[]: 0 1499
> 1499 1499
> Oct 23 13:09:35 lsbeast-i2 kernel: [74133.144894] Node 0 DMA32 free:6024kB
> min:4936kB low:6168kB high:7404kB active:756436kB inactive:715828kB
> present:1535136kB pages_scanned:626785 all_unreclaimable? no
> Oct 23 13:09:35 lsbeast-i2 kernel: [74133.145479] lowmem_reserve[]: 0 0 0 0
> Oct 23 13:09:35 lsbeast-i2 kernel: [74133.145648] Node 0 DMA: 3*4kB 1*8kB
> 1*16kB 5*32kB 3*64kB 2*128kB 3*256kB 1*512kB 0*1024kB 0*2048kB 1*4096kB =
> 6020kB
> Oct 23 13:09:35 lsbeast-i2 kernel: [74133.146045] Node 0 DMA32: 162*4kB
> 28*8kB 9*16kB 7*32kB 1*64kB 1*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB
> 1*4096kB = 6040kB
> Oct 23 13:09:35 lsbeast-i2 kernel: [74133.155603] 364394 total pagecache
> pages
> Oct 23 13:09:35 lsbeast-i2 kernel: [74133.155831] Swap cache: add 0, delete
> 0, find 0/0
> Oct 23 13:09:35 lsbeast-i2 kernel: [74133.156064] Free swap  = 0kB
> Oct 23 13:09:35 lsbeast-i2 kernel: [74133.156064] Total swap = 0kB
> Oct 23 13:09:35 lsbeast-i2 kernel: [74133.164049] 393200 pages of RAM
> Oct 23 13:09:35 lsbeast-i2 kernel: [74133.164049] 6902 reserved pages
> Oct 23 13:09:35 lsbeast-i2 kernel: [74133.164049] 2124 pages shared
> Oct 23 13:09:35 lsbeast-i2 kernel: [74133.164247] 0 pages swap cached
> Oct 23 13:09:35 lsbeast-i2 kernel: [74133.164396] Out of memory: kill
> process 5842 (winbindd) score 76798 or a child
> Oct 23 13:09:35 lsbeast-i2 kernel: [74133.164850] Killed process 5847
> (winbindd)
>
> Looks like winbind is running out of memory?
>
> Robert LeBlanc
> Life Sciences & Undergraduate Education Computer Support
> Brigham Young University
>
>
> On Fri, Oct 23, 2009 at 9:33 AM, Robert LeBlanc <robert at leblancnet.us> wrote:
>
>> Just out of curiosity, do any of you have mdns4_minimal or mdns4 in your
>> /etc/nsswitch.conf file? I think mdns4 doesn't work too well and I usually
>> take it out, but it was alive and well on these machines. Does removing
>> those entries help anyone?
>>
>> Robert LeBlanc
>> Life Sciences & Undergraduate Education Computer Support
>> Brigham Young University
>>
>>
>> On Thu, Oct 22, 2009 at 4:45 PM, Robert LeBlanc <robert at leblancnet.us> wrote:
>>
>>> I'm using Samba 3.4.2 right now and I'm seeing a similar problem. We are
>>> using winbind to authenticate our users on our Linux cluster. The worker and
>>> interactive nodes are on a private subnet that is NATed to the local LAN.
>>> Two head nodes provide failover for the NATing. When a failover happens,
>>> winbind whacks out. The system is not unusable, but no authentication
>>> happens for about 30 minutes after the failover. I'm going to see if I can
>>> get iptables to share state between machines to help prevent this, but
>>> winbind needs to reconnect faster after the domain controllers appear to be
>>> down.
>>>
>>> Robert LeBlanc
>>> Life Sciences & Undergraduate Education Computer Support
>>> Brigham Young University
>>>
>>>
>>>
>>> On Thu, Oct 22, 2009 at 1:55 AM, Clayton Hill <admin at ateamonsite.com> wrote:
>>>
>>>> Hi Jason,
>>>>
>>>> Yup, you've got the same problem, just going about it a slightly different
>>>> way. Ouch, that must really suck having winbind\ADdomain own the account
>>>> you are logged in as. Bummer!
>>>> My problem is slightly less serious, as I am trying to use my local
>>>> accounts (such as root) and I just use Samba as a domain member to host
>>>> files with AD ACLs in the filesystem permissions. But we see the same bug,
>>>> because winbind (even with caching) kills access to my local accounts.
>>>> I hope this is fixed in 3.4. I just installed it yesterday and haven't
>>>> had a chance to run the same test on it yet.
>>>>
>>>> Possibilities:
>>>> - winbind is not caching correctly to allow smooth operation when the DC
>>>>   is offline, so the system is virtually locked up
>>>> - winbind doesn't recognize the moment it can't connect to the DC, when it
>>>>   should really use the cache or just buzz off and die somehow
>>>> - winbind may or may not reconnect to the DC immediately
>>>>
>>>> I need to play with parameters and see what the new winbind options in
>>>> 3.4 do. I have been on 3.2 until yesterday.
>>>>
>>>>
>>>> Thanks for the info on the bug report.
>>>>
>>>> Cheers,
>>>> -Clayton
>>>>
>>>> Jason Haar wrote:
>>>>
>>>>> Just an FYI, but this looks an awful lot like the bug I reported months
>>>>> ago:
>>>>>
>>>>> https://bugzilla.samba.org/show_bug.cgi?id=6103
>>>>>
>>>>> Basically I'm running Fedora 11 with no local accounts (beyond root),
>>>>> relying on winbind. On occasion winbind appears to "hang" and no local
>>>>> access works, including root, which shouldn't need winbind to succeed!
>>>>> Normally I have to reboot to fix it; however, if I was lucky enough for it
>>>>> to happen before my screensaver kicked in, then simply restarting
>>>>> winbind fixes the problem.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
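
On the mdns4 question earlier in the thread: a minimal sketch for checking whether any database in /etc/nsswitch.conf lists an mdns source (mdns4, mdns4_minimal and similar). The helper name `mdns_entries` is made up for illustration, and the parsing assumes the usual "database: source source ..." layout with `#` comments; it only reports entries and does not change the file.

```python
from pathlib import Path


def mdns_entries(nsswitch_text):
    """Return {database: [mdns sources]} for every line naming an mdns source."""
    found = {}
    for raw in nsswitch_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        database, sources = line.split(":", 1)
        # Action items such as [NOTFOUND=return] never start with "mdns".
        hits = [tok for tok in sources.split() if tok.startswith("mdns")]
        if hits:
            found[database.strip()] = hits
    return found


if __name__ == "__main__":
    path = Path("/etc/nsswitch.conf")
    if path.exists():
        for database, hits in mdns_entries(path.read_text()).items():
            print(f"{database}: {', '.join(hits)}")
```

Running it on each affected node would show quickly whether the mdns sources are still configured before trying the removal suggested above.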

