[Samba] 3.0.4: smbd's + nscd's = 100% CPU; load > 4

Wed Jul 7 07:59:39 GMT 2004

>>>>>> the new 3.0.4 Samba installation seems to work fine
>>>>>> except that from time to time but at least a couple
>>>>>> of times a day one or more smbd processes start
>>>>>> running at 20%-40% CPU each and 6 nscd processes
>>>>>> then share the remaining CPU power. System 70%-80%
>>>>>> users the rest 20%-30%. Load rises fast to over 4.
>>>>>>
>>>>>> I'm sure that each such process is just idling,
>>>>>> but why does it engage so much nscd processing?
>>>>>>
>>>>>> As soon as I kill the excessive smbd process(es)
>>>>>> the situation drops to normal, i.e. load < 0,1
>>>>>> no perceptible CPU%.
>>>>>>
>>>>>> Does anyone know what's happening?
>>>>>>
>>>>> What does strace say ? Can you attach with gdb to
>>>>> a CPU bound process and give a backtrace ?
>>>>>
>>>> Ah backtrace, to see the steps to the black hole?
>>>> OK. Will do, sooen as I get back to office (tomorrow).
>>>>
>>> A pot watched never boils.
>>> But as soon as it happens again, I'll consult
>>> strace and gdb to see how and why it happens.
>>>
>>
>> It does, when you stop watching.
>> I was away yesterday and what do I see
>> this morning:
>>
>>    top - 09:55:35 up 6 days, 17:28,  6 users,  load average: 3.59, 3.82,
3.15
>>    Tasks: 182 total,   3 running, 179 sleeping,   0 stopped,   0 zombie
>>    Cpu(s):  21.6% user,  78.4% system,   0.0% nice,   0.0% idle
>>    Mem:   2060704k total,  2004324k used,    56380k free,   185272k
buffers
>>    Swap:  2402296k total,     5236k used,  2397060k free,  1068752k
cached
>>
>>      PID USER  PR NI  VIRT  RES  SHR S %CPU %MEM     TIME+   Command
>>    12808 robf  25  0  2996 2596 2244 R 23.6  0.1   4:46.98   smbd
>>    12741 robf  22  0  3028 2628 2280 R 20.6  0.1   5:44.55   smbd
>>     2354 root  15  0   724  716  536 S 15.3  0.0 150:31.23   nscd
>>     2356 root  15  0   724  716  536 S 14.9  0.0 150:35.87   nscd
>>     2352 root  15  0   724  716  536 S  9.0  0.0 151:14.08   nscd
>>     2353 root  15  0   724  716  536 S  6.3  0.0 150:39.89   nscd
>>     2355 root  15  0   724  716  536 S  6.3  0.0 150:31.82   nscd
>>     2350 root  15  0   724  716  536 S  3.7  0.0 150:49.78   nscd
>>
>> I attached both of the smbd processes to gdb and
>> backtrace was always:
>>
>>    #0  0x402e5328 in read () from /lib/libc.so.6
>>    #1  0x40343b90 in __DTOR_END__ () from /lib/libc.so.6
>>    #2  0x4031d58b in __nscd_getpwnam_r () from /lib/libc.so.6
>>    #3  0x402c130d in getpwnam_r@@GLIBC_2.1.2 () from /lib/libc.so.6
>>    #4  0x402c0e6f in getpwnam () from /lib/libc.so.6
>>    #5  0x081298fd in get_memberuids ()
>>    #6  0x08129b09 in _samr_query_groupmem ()
>>    #7  0x081215d1 in api_samr_query_groupmem ()
>>    #8  0x081358b2 in api_rpcTNP ()
>>    #9  0x08135632 in api_pipe_request ()
>>    #10 0x0812fc10 in process_request_pdu ()
>>    #11 0x0812fdec in process_complete_pdu ()
>>    #12 0x08130069 in process_incoming_data ()
>>    #13 0x08130220 in write_to_internal_pipe ()
>>    #14 0x081301a4 in write_to_pipe ()
>>    #15 0x080883e0 in api_fd_reply ()
>>    #16 0x080885b9 in named_pipe ()
>>    #17 0x080891bc in reply_trans ()
>>    #18 0x080c80e0 in switch_message ()
>>    #19 0x080c8172 in construct_reply ()
>>    #20 0x080c8491 in process_smb ()
>>    #21 0x080c9004 in smbd_process ()
>>    #22 0x081f812e in main ()
>>    #23 0x402268ae in __libc_start_main () from /lib/libc.so.6
>>
>> After killing both smbd processes with -9 the top soon
>> stabilizes at:
>>
>>    top - 10:12:01 up 6 days, 17:45,  6 users,  load average: 0.03, 0.19,
1.17
>>    Tasks: 175 total,   2 running, 173 sleeping,   0 stopped,   0 zombie
>>    Cpu(s):   0.0% user,   0.7% system,   0.0% nice,  99.3% idle
>>    Mem:   2060704k total,  2034272k used,    26432k free,   185272k
buffers
>>    Swap:  2402296k total,     5236k used,  2397060k free,  1100020k
cached
>>
>>      PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+
Command
>>    15182 root      15   0   956  956  700 R  0.3  0.0   0:03.67   top
>>
>> Unfortunately I didn't trace the nscd processes.
>> What a shame! I'll do it next time.
>>
>> Nobody complained yet about reduced performance.
>>
>> It's hard to tell when this behaviour started.
>> The upper bound seems to be 9 hours,
>> the combined run times of the nscd processe,
>> some time during the night when the computers
>> were totally quiet. The lower bound based on
>> the run times of the smbd processes is more
>> like half an hour ago.
>>
>> This is the fourth out of 5 times that the same user,
>> "robf", is involved as the effective UID of the smbd process.
>> The other one time was root's own smbd.
>>
>> Jeremy, can I provide more information?
>
> I had a similiar Problem , and a loglevel of 4 shows ,
> that samba was trying to look up a user nobody and a
> user Administrator, all the time.
> If I killed nscd the load of the ldap server becomes high...
>
> I added these user to my ldap backend, and the problem disappears.

It's a valuable pointer but I'm not sure it really
applies here, Hans. For one, your problem seems to
have been persistent up until you added those users
to your ldap backend. In my case, it happens very
intermittently. Besides, my passdb backend is the
default "smbpasswd".

I'll give it a more thorough check next time it
happens, including user name lookups.