PATCH/RFC: Another stab at the Cygwin hang problem
Tillman, James
JamesTillman at fdle.state.fl.us
Mon Jul 14 21:05:50 EST 2003
Ah, I just found the patch that jw sent (our email system had locked it as a
potential virus). I will try to compile and test it this week. My own
environment uses only SSH push.
jpt
> -----Original Message-----
> From: jw schultz [mailto:jw at pegasys.ws]
> Sent: Saturday, July 12, 2003 6:53 AM
> To: rsync at lists.samba.org
> Subject: Re: PATCH/RFC: Another stab at the Cygwin hang problem
>
>
> On Wed, Jul 09, 2003 at 06:47:35AM -0400, Tillman, James wrote:
> >
> >
> > > -----Original Message-----
> > > From: jw schultz [mailto:jw at pegasys.ws]
> > > Sent: Wednesday, July 09, 2003 5:59 AM
> > > To: rsync at lists.samba.org
> > > Subject: Re: PATCH/RFC: Another stab at the Cygwin hang problem
> > >
> > >
> > > > > I can't quite place why but my instincts inform me that you
> > > > > have latched onto something. Some sort of one-character
> > > > > buffering error in the io libraries under Cygwin. Most
> > > > > likely in the Windows libs.
> > > >
> > > > Well, we have two reports of this fixing the rsync hang
> > > > problem when signals failed. I'd like a little more testing
> > > > before mainlining it.
> > >
> > > > Nope! This is a no-go. It intermittently produces
> > >
> > > error (10) -- error in socket IO
> > >
> > > on both network and local transfers.
> > >
> >
> > I guess I'd better double check my processes to make sure that I'm
> > getting a satisfactory success rate on my own servers. If I see any
> > clues, I'll report them here. Any hope for a fix, or does this look
> > like an inherent problem in the method being used?
>
> It looks like the method is fairly sound. The problem seems
> to primarily be in dealing with the child termination.
>
> io_set_error_fd(-1);
> - kill(pid, SIGUSR2);
> - wait_process(pid, &status);
> + write(cleanup_pipe[1], ".", 1);
> + if (waitpid(pid, &status, 0) != pid) {
> + rprintf(FERROR,"cleanup in do_recv failed\n");
> + exit_cleanup(RERR_SOCKETIO);
> + }
> return status;
>
> There is a huge window between the write() and the return of
> waitpid() that, depending on scheduling and signal delivery,
> allows the child pid to be reaped by the SIGCHLD handler. That
> results in this waitpid() returning -1 with errno of ECHILD.
> EINTR would also be possible. The timing dependencies
> account for the intermittency of the error.
>
> I've attached an altered patch. I've only dealt with this
> one location, which produced errors doing an ssh pull. I
> haven't addressed the local transfer errors, but I suspect
> that they derive from this waitpid error. Further testing will
> still be needed to ensure that ssh push and rsyncd usage are
> unbroken. This really needs testing in Cygwin, which I don't
> have. If it takes care of the Cygwin hang then we can
> polish it. There remains the issue of an error status
> when the only failure is termination.
>
> --
> ________________________________________________________________
> J.W. Schultz Pegasystems Technologies
> email address: jw at pegasys.ws
>
> Remember Cernan and Schmitt
>