PATCH/RFC: Another stab at the Cygwin hang problem

Sun Jul 13 01:25:07 EST 2003

On Sat, Jul 12, 2003 at 11:42:52PM +0900, Anthony Heading wrote:
> On Sat, Jul 12, 2003 at 03:52:59AM -0700, jw schultz wrote:
> > There is a huge window between the write() and the return of
> > waitpid() that, depending on scheduling and signal delivery,
> > allows the child pid to be reaped by the SIGCHLD handler.  That
> > results in this waitpid() returning -1 with errno of ECHILD.
> > EINTR would also be possible.  The timing dependencies
> > account for the intermittency of the error.
> 
> Hi JW - 
> 
> Afraid I've not really been following the rsync mailing list,
> and it seems you've been addressing your comments about
> my patch to James Tillman?

Not in the least.  I've addressed them to the list.

> As I said originally, it was an illustrative patch - I didn't
> flesh out the error handling since that made the concept
> more difficult to follow.
> 
> Catching up now, I think your observation here is right.
> In fact I'd made a similar change already myself locally.
> 
> Only one difference - I was consciously avoiding calling
> wait_process(), since that function calls msleep() - which
> was implicated in the original hanging problem!  Since
> there is no signal being sent any more, hopefully it's not
> a problem (except for the SIGUSR2 cases?) - however I
> was wanting to ensure that the hangs were _completely_
> eliminated, and thus didn't want to take any chances.
> 
> So my own patch here is checking the errno and gives
> the OK for ECHILD.  I would worry that the whole
> msleep NOHANG io_flush stuff is a very complex loop
> to run simply to collect an exit status, particularly
> when we believe that the root of the hang lies with
> the underlying Cygwin OS.

I don't recall msleep being a hang problem.  I don't see how
it could be.  Myself, I wonder why wait_process uses the WNOHANG
and msleep loop instead of a normal waitpid.  I initially had a
plain waitpid that checked the pid_stat_table on ECHILD, but
disliked having the duplicate code.  Besides, if wait_process has
a hang problem, let's fix that instead of orphaning it.
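
For illustration, here is a minimal sketch of the "plain waitpid,
then check the table on ECHILD" idea.  It is not rsync's code: the
names reaped, MAX_REAPED, sigchld_handler, install_sigchld_handler
and collect_status are hypothetical, and the reaped array only
stands in for pid_stat_table.  It assumes a single-threaded process
with at most MAX_REAPED children.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_REAPED 16   /* hypothetical table size */

/* Hypothetical stand-in for pid_stat_table: statuses reaped by the handler. */
static volatile struct { pid_t pid; int status; } reaped[MAX_REAPED];
static volatile sig_atomic_t reaped_count;

/* SIGCHLD handler: reap every exited child and remember its status. */
void sigchld_handler(int signo)
{
    int saved_errno = errno;    /* waitpid() may clobber errno */
    int status;
    pid_t pid;

    (void)signo;
    while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
        if (reaped_count < MAX_REAPED) {
            reaped[reaped_count].pid = pid;
            reaped[reaped_count].status = status;
            reaped_count++;
        }
    }
    errno = saved_errno;
}

/* Install the handler above; returns 0 on success, -1 on error. */
int install_sigchld_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = sigchld_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGCHLD, &sa, NULL);
}

/*
 * Collect the exit status of pid with a blocking waitpid().
 * Retries on EINTR; on ECHILD, assumes the handler got there first
 * and looks the status up in the table.
 * Returns 0 with *status filled in, or -1 if the status is unavailable.
 */
int collect_status(pid_t pid, int *status)
{
    pid_t r;
    int i;

    assert(status != NULL);
    do {
        r = waitpid(pid, status, 0);
    } while (r == -1 && errno == EINTR);

    if (r == pid)
        return 0;
    if (errno != ECHILD)
        return -1;

    for (i = 0; i < reaped_count; i++) {
        if (reaped[i].pid == pid) {
            *status = reaped[i].status;
            return 0;
        }
    }
    return -1;
}
```

A caller would run install_sigchld_handler() once, fork the child,
and later call collect_status(pid, &status).  Whether the handler
or the blocking waitpid wins the race, the caller sees the same
status, which is the property the ECHILD check is meant to give.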

> But I think as long as the hangs don't reappear, your
> updated patch is obviously more concise.  Otherwise, I'll be
> further tempted to take the axe to the SIGCHLD handling,
> which looks somewhat jammed with voodoo cruft.

Layer on layer.  I don't care for it myself but changes in
this tend to cause problems on less popular platforms.

> Anyhow, just to let you know.  If you're happy tidying
> up and refining the patch yourself, please go ahead. If
> you want me to do anything, or have any comments on
> what I've done, I'd appreciate an email.  However I
> will try to follow the rsync list for the next few
> weeks at least.

As I said earlier, I intuit you are on to something with
this patch.  If you care to clean it up, that would be good.
I would rather someone experiencing the hangs do the fix.
That tends to reduce the cycle times.

-- 
________________________________________________________________
	J.W. Schultz            Pegasystems Technologies
	email address:		jw at pegasys.ws

		Remember Cernan and Schmitt