[PATCH] Don't set child->requests to NULL in parent process after fork

Sun Jan 11 03:56:23 GMT 2009

Jeremy Allison wrote:
> On Sat, Jan 10, 2009 at 02:02:25PM +0800, boyang wrote:
>   
>> Hi, everyone:
>>      I think it is a logic error to set child->requests to NULL in
>> parent process. Because requests are added to requests list before
>> fork() and schedule_async_request(). If it is set to NULL, only the
>> first request gets scheduled, others are lost. Why does the parent shoot
>> itself like this? :-)
>>      Please review the patch, Thanks!
>>     
>
> Ok, whilst shopping @ Costco I've been doing a lot of
> thinking about why we haven't been bitten harder by
> this bug over the time we've had it :-). It's a pretty
> bad bug.
>
> I think it's because this would only happen after
> we had a long-running winbind child, with requests
> queued for it, which then died. The first request
> would cause the winbindd child to be created, and
> then get NULL'ed out, but this wouldn't cause a
> crash as DLIST_REMOVE doesn't check that the value
> being removed is actually on the list, so all that
> would happen is the request queue would be one shorter
> than it should have been, and the first request would
> have been handled by the child and then disappear
> (not noticing it had already been NULL'ed out from
> the request queue). All subsequent requests would
> have been handled correctly until the winbindd
> child died, and was re-created.
>
> The bug would have bitten us when there was a
> slow winbindd child (so lots of requests queued
> in the parent) and the winbindd child died, and
> then the queued requests other than the one in
> process would get dropped. This would explain
> client processes not getting a response. But
> the question would be why did the winbindd child
> die and what state is the connection to the
> domain in at that point ? So it comes down to
> the robustness of the recovery logic, which
> depends on the reason the child died.
>   
Ha, This is a hard problem... I think too many reasons can cause child
dying. For a slow child, I can image one reason for it. Assume request
arrived for the child every second, and child needs 2 seconds to process
one request. The the request arrives in order have to wait 2*n + 1
seconds. If 2*n + 1 > 300(the default time out value), child get killed
because the parent times out the request and send SIGTERM to child.

Other reason might be panic in child itself.

Anyhow, it might not be easy as I thought. We might have to spend some 
more time to analyze this. :-)
> My guess is we didn't get bitten by this
> often, but when we did it bit us *hard*.
>
> Anyway boyang, thanks a *LOT* for finding this
> one - please keep looking over the winbindd
> logic and challenging the assumptions there,
> you're getting really good at this :-) :-).
>
> Cheers,
>
> 	Jeremy.
>
>   

-------------- next part --------------
A non-text attachment was scrubbed...
Name: boyang.vcf
Type: text/x-vcard
Size: 187 bytes
Desc: not available
Url : http://lists.samba.org/archive/samba-technical/attachments/20090111/9290a871/boyang.vcf