recursion error in nmbd

Andrew Tridgell tridge at samba.org
Thu Dec 17 06:16:09 GMT 1998


Jeremy,

John reported a problem of nmbd going crazy, chewing lots of memory
until it crashed the system. This was 2.0betaX on Intel Linux. The
logs show this:

[1998/12/17 14:49:37, 0] nmbd/nmbd_become_lmb.c:become_local_master_fail2(427)
become_local_master_fail2: failed to register name XXXXXX<1d> on subnet XXX.XX.0.2. Failed to become a local master browser.
[1998/12/17 14:49:37, 0] nmbd/nmbd_become_lmb.c:become_local_master_fail2(427)
become_local_master_fail2: failed to register name XXXXXX<1d> on subnet XXX.XX.0.2. Failed to become a local master browser.
[1998/12/17 14:49:37, 0] nmbd/nmbd_become_lmb.c:become_local_master_fail2(427)
become_local_master_fail2: failed to register name XXXXXX<1d> on subnet XXX.XX.0.2. Failed to become a local master browser.
[1998/12/17 14:49:37, 0] nmbd/nmbd_become_lmb.c:become_local_master_fail2(427)
become_local_master_fail2: failed to register name XXXXXX<1d> on subnet XXX.XX.0.2. Failed to become a local master browser.

This continues for quite some time, at about 50 messages per
second. The machine then dies after nmbd chews up all available
memory.

My best guess as to the cause is the recursive
retransmit_or_expire_response_records() call in nmbd_become_lmb.c. I
think the problem is that if the registration failure was caused by a
timeout (the WINS server going down at the wrong moment for example)
then the initial call to retransmit_or_expire_response_records() will
call the registration timeout function without first removing the
response entry from the queue. So when we get to
unbecome_local_master_browser() as part of the
become_local_master_fail2() call we end up recursively calling
retransmit_or_expire_response_records() with the current response
record still on the queue! This means we chew stack until the machine
dies.
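
A minimal sketch of the suspected failure mode, not the real nmbd
code. The struct, the function names (expire_records, fail_handler,
demo_reentrancy_depth) and the DEPTH_CAP guard are all made up for
illustration; the cap exists only so the toy terminates, the real code
path has no such limit. It shows how calling the timeout function
while the record is still queued lets the failure path re-enter the
scan and hit the same record again:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for an nmbd response record. */
struct response_record {
    struct response_record *next;
    int expired;                                  /* non-zero once timed out */
    void (*timeout_fn)(struct response_record *); /* called on expiry */
};

static struct response_record *queue_head;
static int depth;       /* current nesting depth of expire_records() */
static int max_depth;   /* deepest nesting observed */

enum { DEPTH_CAP = 8 }; /* toy safety cap so the demo terminates */

static void expire_records(void);

/* Plays the role of the failure path that re-enters the queue scan. */
static void fail_handler(struct response_record *rr)
{
    (void)rr;
    expire_records();
}

/* Buggy ordering: the callback runs while the record is still queued. */
static void expire_records(void)
{
    if (depth >= DEPTH_CAP)
        return;                     /* assumption: real code has no cap */
    depth++;
    if (depth > max_depth)
        max_depth = depth;

    for (struct response_record *rr = queue_head; rr != NULL; ) {
        struct response_record *next = rr->next;
        if (rr->expired && rr->timeout_fn != NULL)
            rr->timeout_fn(rr);     /* rr is still on the queue here */
        rr = next;
    }
    depth--;
}

/* Runs the scenario once and returns the deepest nesting reached. */
int demo_reentrancy_depth(void)
{
    static struct response_record rr;

    rr.next = NULL;
    rr.expired = 1;
    rr.timeout_fn = fail_handler;
    queue_head = &rr;
    depth = 0;
    max_depth = 0;

    expire_records();
    assert(depth == 0);             /* every nested call has unwound */

    queue_head = NULL;
    return max_depth;
}
```

With a single expired record on the queue, demo_reentrancy_depth()
returns DEPTH_CAP: only the artificial cap stops the nesting. Without
it, each level would log the failure again and push another stack
frame.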

The obvious solutions are:

1) always remove the response entry from the queue in
   retransmit_or_expire_response_records() before calling the timeout
   function

or 

2) mark response records with a "being dealt with" flag and don't deal
   with them if the flag is set.

or

3) remove the call to
   retransmit_or_expire_response_records() in
   unbecome_local_master_browser(). I'm a little dubious about this
   call being there in the first place, as recursion doesn't sit well
   with our current nmbd design (as this problem demonstrates).
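
Options 1 and 2 could look roughly like the sketch below. This is a toy
model under stated assumptions, not a patch against nmbd: the struct
layout, the in_progress field, and every function name here
(unlink_record, expire_records_unlink_first, expire_records_flagged,
run_scenario) are hypothetical. Both variants restart the scan from the
head after each callback, since a re-entrant call may have changed the
queue underneath the outer loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for an nmbd response record. */
struct response_record {
    struct response_record *next;
    int expired;        /* non-zero once timed out */
    int in_progress;    /* option 2: "being dealt with" flag */
    void (*timeout_fn)(struct response_record *);
};

static struct response_record *queue_head;
static int timeout_calls;       /* how many times a timeout_fn ran */
static void (*scan_fn)(void);   /* which scan the failure path re-enters */

/* Remove rr from the queue if it is present. */
static void unlink_record(struct response_record *rr)
{
    struct response_record **pp = &queue_head;

    while (*pp != NULL && *pp != rr)
        pp = &(*pp)->next;
    if (*pp == rr) {
        *pp = rr->next;
        rr->next = NULL;
    }
}

/* Option 1: take the record off the queue before calling timeout_fn. */
void expire_records_unlink_first(void)
{
    for (;;) {
        struct response_record *rr = queue_head;

        while (rr != NULL && !rr->expired)
            rr = rr->next;
        if (rr == NULL)
            return;
        unlink_record(rr);          /* a nested scan can no longer see rr */
        if (rr->timeout_fn != NULL)
            rr->timeout_fn(rr);
    }
}

/* Option 2: leave the record queued but skip it while it is flagged. */
void expire_records_flagged(void)
{
    for (;;) {
        struct response_record *rr = queue_head;

        while (rr != NULL && !(rr->expired && !rr->in_progress))
            rr = rr->next;
        if (rr == NULL)
            return;
        rr->in_progress = 1;        /* a nested scan will skip rr */
        if (rr->timeout_fn != NULL)
            rr->timeout_fn(rr);
        unlink_record(rr);
    }
}

/* Failure path that re-enters whichever scan is under test. */
static void fail_handler(struct response_record *rr)
{
    (void)rr;
    timeout_calls++;
    scan_fn();
}

/* Queue two expired records, run the scan, return the callback count. */
int run_scenario(void (*scan)(void))
{
    static struct response_record a, b;

    a = (struct response_record){ &b, 1, 0, fail_handler };
    b = (struct response_record){ NULL, 1, 0, fail_handler };
    queue_head = &a;
    timeout_calls = 0;
    scan_fn = scan;

    scan();
    assert(queue_head == NULL);     /* both records were retired */
    return timeout_calls;
}
```

In this model run_scenario() returns 2 for either variant: each of the
two expired records has its timeout function called exactly once even
though the callback re-enters the scan. Option 3 would correspond to
fail_handler simply not calling scan_fn() at all.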

I'm sure there are other solutions. I'll let you choose which one to
use as you are most familiar with this code.

I think a fix had better go into 2.0. Right now nmbd can bring the
machine down :)

Cheers, Tridge

