PATCH/RFC: Another stab at the Cygwin hang problem

Tillman, James JamesTillman at fdle.state.fl.us
Mon Jul 14 21:05:50 EST 2003


Ah, I just found the patch that jw sent (our email system had quarantined it
as a potential virus).  I will try to compile and test it this week.  Note
that my own environment uses only SSH push.

jpt

> -----Original Message-----
> From: jw schultz [mailto:jw at pegasys.ws]
> Sent: Saturday, July 12, 2003 6:53 AM
> To: rsync at lists.samba.org
> Subject: Re: PATCH/RFC: Another stab at the Cygwin hang problem
> 
> 
> On Wed, Jul 09, 2003 at 06:47:35AM -0400, Tillman, James wrote:
> > 
> > 
> > > -----Original Message-----
> > > From: jw schultz [mailto:jw at pegasys.ws]
> > > Sent: Wednesday, July 09, 2003 5:59 AM
> > > To: rsync at lists.samba.org
> > > Subject: Re: PATCH/RFC: Another stab at the Cygwin hang problem
> > > 
> > > 
> > > > I can't quite place why but my instincts inform me that you
> > > > have latched onto something.  Some sort of one character
> > > > buffering error in the io libraries under cygwin.  Most
> > > > likely in the Windows libs.
> > > > 
> > > > Well, we have two reports of this fixing the rsync hang
> > > > problem when signals failed.  I'd like a little more testing
> > > > before mainlining it.
> > > 
> > > Nope!  This is a no-go.  It intermittently produces
> > > 
> > > 	error (10) -- error in socket IO
> > > 
> > > on both network and local transfers.
> > > 
> > 
> > I guess I'd better double check my processes to make sure that I'm
> > getting a satisfactory success rate on my own servers.  If I see any
> > clues, I'll report them here.  Any hope for a fix, or does this look
> > like an inherent problem in the method being used?
> 
> It looks like the method is fairly sound.  The problem seems
> to primarily be in dealing with the child termination.
> 
>  	io_set_error_fd(-1);
> -	kill(pid, SIGUSR2);
> -	wait_process(pid, &status);
> +	write(cleanup_pipe[1], ".", 1);
> +	if (waitpid(pid, &status, 0) != pid) {
> +		rprintf(FERROR,"cleanup in do_recv failed\n");
> +		exit_cleanup(RERR_SOCKETIO);
> +	}
>  	return status;
> 
> There is a huge window between the write() and the return of
> waitpid() in which, depending on scheduling and signal delivery,
> the child pid can be reaped by the SIGCHLD handler.  That
> results in this waitpid() returning -1 with errno set to ECHILD.
> EINTR would also be possible.  The timing dependencies
> account for the intermittency of the error.
> 
> I've attached an altered patch.  I've only dealt with this
> one location, which produced errors doing an ssh pull.  I
> haven't addressed the local transfer errors but I suspect
> that they derive from this waitpid error.  Further testing will
> still be needed to ensure that ssh push and rsyncd usage are
> unbroken.  This really needs testing in cygwin, which I don't
> have.  If it takes care of the cygwin hang then we can
> polish it.  There remains the issue of an error status
> when the only failure is termination.
> 
> -- 
> ________________________________________________________________
> 	J.W. Schultz            Pegasystems Technologies
> 	email address:		jw at pegasys.ws
> 
> 		Remember Cernan and Schmitt
> 
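
The race jw describes between the write() and the waitpid() can be
illustrated with a small standalone sketch.  This is not the attached
patch and not rsync code: wait_child_tolerant() and demo_pipe_shutdown()
are hypothetical names invented for illustration, and treating ECHILD as
"someone else already reaped the child" is an assumption about how a
caller might want to handle it.  It only shows one way to retry on EINTR
and to tell ECHILD apart from other failures.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not part of rsync): wait for `pid`.
 * Returns  0 if we reaped the child ourselves (*status is valid),
 *          1 if the child was already reaped elsewhere (ECHILD),
 *         -1 on any other waitpid() failure. */
static int wait_child_tolerant(pid_t pid, int *status)
{
    for (;;) {
        pid_t r = waitpid(pid, status, 0);
        if (r == pid)
            return 0;
        if (r == -1 && errno == EINTR)
            continue;       /* interrupted by a signal: just retry */
        if (r == -1 && errno == ECHILD)
            return 1;       /* e.g. a SIGCHLD handler got there first */
        return -1;
    }
}

/* Hypothetical demo: tell a child to exit by writing one byte to a
 * pipe (instead of sending it a signal), then wait for it. */
static int demo_pipe_shutdown(void)
{
    int fds[2];
    int status = 0;
    char c;
    pid_t pid;

    if (pipe(fds) < 0)
        return -1;
    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: block until the parent writes a byte or closes the pipe. */
        close(fds[1]);
        while (read(fds[0], &c, 1) < 0 && errno == EINTR)
            ;
        _exit(0);
    }
    /* Parent: wake the child.  If the write fails, closing the write
     * end still gives the child EOF, so it exits either way. */
    close(fds[0]);
    (void)write(fds[1], ".", 1);
    close(fds[1]);
    return wait_child_tolerant(pid, &status);
}
```

With no SIGCHLD handler installed, demo_pipe_shutdown() should reap the
child itself; calling wait_child_tolerant() a second time on the same pid
then takes the ECHILD path, which is the case the SIGCHLD handler
triggers intermittently in the scenario above.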


