PATCH/RFC: Another stab at the Cygwin hang problem
jw schultz
jw at pegasys.ws
Sun Jul 13 01:25:07 EST 2003
On Sat, Jul 12, 2003 at 11:42:52PM +0900, Anthony Heading wrote:
> On Sat, Jul 12, 2003 at 03:52:59AM -0700, jw schultz wrote:
> > There is a huge window between the write() and the return of
> > waitpid() that depending on scheduling and signal delivery
> > allows the child pid to be reaped by the SIGCHLD handler. That
> > results in this waitpid() returning -1 with errno of ECHILD.
> > EINTR would also be possible. The timing dependencies
> > account for intermittency of the error.
>
> Hi JW -
>
> Afraid I've not really been following the rsync mailing list,
> and it seems you've been addressing your comments about
> my patch to James Tillman?
Not in the least. I've addressed them to the list.
> As I said originally, it was an illustrative patch - I didn't
> flesh out the error handling since that made the concept
> more difficult to follow.
>
> Catching up now, I think your observation here is right.
> In fact I'd made a similar change already myself locally.
>
> Only one difference - I was consciously avoiding calling
> wait_process(), since that function calls msleep() - which
> was implicated in the original hanging problem! Since
> there is no signal being sent any more, hopefully it's not
> a problem (except for the SIGUSR2 cases?) - however I
> was wanting to ensure that the hangs were _completely_
> eliminated, and thus didn't want to take any chances.
>
> So my own patch here is checking the errno and gives
> the OK for ECHILD. I would worry that the whole
> msleep NOHANG io_flush stuff is a very complex loop
> to run simply to collect an exit status, particularly
> when we believe that the root of the hang lies with
> the underlying Cygwin OS.
I don't recall msleep being a hang problem, and I don't see how
it could be. Myself, I wonder why wait_process uses the WNOHANG
and msleep loop instead of a normal waitpid. I initially had
waitpid with a check of the pid_stat_table on ECHILD but disliked
having the duplicate code. Besides, if wait_process has a
hang problem, let's fix that instead of orphaning it.
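
A minimal sketch of that alternative (a plain blocking waitpid that
falls back to a handler-filled table on ECHILD), for illustration
only. It is not the rsync code: the names reaped, sigchld_handler,
install_sigchld_handler and collect_status are hypothetical, and the
fixed-size table merely stands in for pid_stat_table.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_REAPED 16

/* Hypothetical stand-in for a pid/status table filled by the handler. */
struct reaped_child {
    pid_t pid;
    int status;
};
static volatile struct reaped_child reaped[MAX_REAPED];
static volatile sig_atomic_t reaped_count;

/* Reap every exited child and remember its status for later lookup. */
static void sigchld_handler(int signo)
{
    int saved_errno = errno;   /* waitpid() may clobber errno */
    int status;
    pid_t pid;

    (void)signo;
    while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
        if (reaped_count < MAX_REAPED) {
            reaped[reaped_count].pid = pid;
            reaped[reaped_count].status = status;
            reaped_count++;
        }
    }
    errno = saved_errno;
}

static int install_sigchld_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = sigchld_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGCHLD, &sa, NULL);
}

/* Returns 0 and fills *status, or -1 if the status cannot be recovered. */
static int collect_status(pid_t pid, int *status)
{
    pid_t ret;
    int i;

    assert(status != NULL);

    /* Ordinary blocking wait; retry if a signal interrupts it. */
    do {
        ret = waitpid(pid, status, 0);
    } while (ret == -1 && errno == EINTR);

    if (ret == pid)
        return 0;
    if (errno != ECHILD)
        return -1;

    /* ECHILD: the handler reaped the child first, so use its record. */
    for (i = 0; i < reaped_count; i++) {
        if (reaped[i].pid == pid) {
            *status = reaped[i].status;
            return 0;
        }
    }
    return -1;
}
```

The caller installs the handler once, forks, and later calls
collect_status(pid, &status). Either waitpid wins the race and returns
the status directly, or the handler wins and the status comes from the
table. The table lookup is the duplicate code mentioned above.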
> But I think as long as the hangs don't reappear, your
> updated patch is obviously more concise. Otherwise, I'll be
> further tempted to take the axe to the SIGCHLD handling,
> which looks somewhat jammed with voodoo cruft.
Layer on layer. I don't care for it myself, but changes in
this area tend to cause problems on less popular platforms.
> Anyhow, just to let you know. If you're happy tidying
> up and refining the patch yourself, please go ahead. If
> you want me to do anything, or have any comments on
> what I've done, I'd appreciate an email. However I
> will try to follow the rsync list for the next few
> weeks at least.
As I said earlier, I intuit you are on to something with
this patch. If you care to clean it up, that would be good.
I would rather someone experiencing the hangs do the fix.
That tends to reduce the cycle times.
--
________________________________________________________________
J.W. Schultz Pegasystems Technologies
email address: jw at pegasys.ws
Remember Cernan and Schmitt