[PATCH] Don't set child->requests to NULL in parent process after fork

Sun Jan 11 04:09:04 GMT 2009

boyang wrote:
> Jeremy Allison wrote:
>   
>> On Sat, Jan 10, 2009 at 02:02:25PM +0800, boyang wrote:
>>   
>>     
>>> Hi, everyone:
>>>      I think it is a logic error to set child->requests to NULL in
>>> parent process. Because requests are added to requests list before
>>> fork() and schedule_async_request(). If it is set to NULL, only the
>>> first request gets scheduled, others are lost. Why does the parent shoot
>>> itself like this? :-)
>>>      Please review the patch, Thanks!
>>>     
>>>       
>> Ok, whilst shopping @ Costco I've been doing a lot of
>> thinking about why we haven't been bitten harder by
>> this bug over the time we've had it :-). It's a pretty
>> bad bug.
>>
>> I think it's because this would only happen after
>> we had a long-running winbind child, with requests
>> queued for it, which then died. The first request
>> would cause the winbindd child to be created, and
>> then get NULL'ed out, but this wouldn't cause a
>> crash as DLIST_REMOVE doesn't check that the value
>> being removed is actually on the list, so all that
>> would happen is the request queue would be one shorter
>> than it should have been, and the first request would
>> have been handled by the child and then disappear
>> (not noticing it had already been NULL'ed out from
>> the request queue). All subsequent requests would
>> have been handled correctly until the winbindd
>> child died, and was re-created.
>>
>> The bug would have bitten us when there was a
>> slow winbindd child (so lots of requests queued
>> in the parent) and the winbindd child died, and
>> then the queued requests other than the one in
>> process would get dropped. This would explain
>> client processes not getting a response. But
>> the question would be why did the winbindd child
>> die and what state is the connection to the
>> domain in at that point ? So it comes down to
>> the robustness of the recovery logic, which
>> depends on the reason the child died.
>>   
>>     
> Ha, This is a hard problem... I think too many reasons can cause child
> dying. For a slow child, I can image one reason for it. Assume request
> arrived for the child every second, and child needs 2 seconds to process
> one request. The the request arrives in order have to wait 2*n + 1
> seconds. If 2*n + 1 > 300(the default time out value), child get killed
> because the parent times out the request and send SIGTERM to child.
>   
Hmmm, It doesn't work this way. Timer is set when request is sent out,
it is not set when request is queued... The above words is not a case of
why child dies. :-(
> Other reason might be panic in child itself.
>
> Anyhow, it might not be easy as I thought. We might have to spend some 
> more time to analyze this. :-)
>   
>> My guess is we didn't get bitten by this
>> often, but when we did it bit us *hard*.
>>
>> Anyway boyang, thanks a *LOT* for finding this
>> one - please keep looking over the winbindd
>> logic and challenging the assumptions there,
>> you're getting really good at this :-) :-).
>>
>> Cheers,
>>
>> 	Jeremy.
>>
>>   
>>     
>
>   

-------------- next part --------------
A non-text attachment was scrubbed...
Name: boyang.vcf
Type: text/x-vcard
Size: 187 bytes
Desc: not available
Url : http://lists.samba.org/archive/samba-technical/attachments/20090111/f826bbe1/boyang.vcf