Quick analysis of 'Broken pipe' error at io.c(463)

Thu Jul 11 18:37:05 EST 2002

	This one has been biting me recently, and it seems to be completely 
deterministic: it works fine on most directories, but when it fails 
on one, it does so reproducibly.  Also, I'm doing a simple copy from one 
directory on my machine to another - there is no remote server or 
network to fail.  The drive is only 24% full.

	Here's my take on it:

1) The error is printed as occurring in io.c: writefd_unbuffered, when a 
SIGPIPE is being thrown by the system from (I'm guessing) a lack of pipe 
readers.  main.c (907) sets this up to be ignored, claiming "we'll see 
the EPIPE"... but EPIPE is never explicitly checked for in io.c, just 
assumed as the last case.  Perhaps this should be handled more cleanly?  
There may be a timing issue here that could be handled with a sleep().

2) The core file being thrown by my setup is for the parent rsync only - 
during flist.c: send_file_name, which would be for the writer task... 
but it's the reader task forked off that's apparently going into lala 
land and tanking the enterprise by triggering a SIGPIPE/EPIPE on write.
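
	To make points 1) and 2) concrete, here is a minimal sketch of 
the failure mode as I understand it. This is not rsync code: 
write_all_or_errno() and writer_sees_dead_reader() are hypothetical 
helpers written for this illustration. With SIGPIPE ignored, a write to 
a pipe whose reader has died returns -1 with errno set to EPIPE, and 
the wait status of the child shows which signal actually killed it.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (NOT rsync's writefd_unbuffered): write all of buf,
 * retrying on EINTR.  Returns 0 on success, or the errno value on failure,
 * so the caller can test for EPIPE explicitly. */
int write_all_or_errno(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return errno;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Fork a stand-in "reader" that dies from child_sig without ever reading,
 * then write to the pipe from the parent.  Stores the errno of the write
 * in *write_err.  Returns the signal that killed the child, 0 if it exited
 * normally, or -1 if the setup itself failed. */
int writer_sees_dead_reader(int child_sig, int *write_err)
{
    int fds[2], status;
    pid_t pid;

    signal(SIGPIPE, SIG_IGN);      /* same idea as main.c: rely on EPIPE */
    if (pipe(fds) < 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {                /* child: the would-be reader */
        close(fds[1]);
        signal(child_sig, SIG_DFL);
        raise(child_sig);          /* simulate the reader crashing */
        _exit(0);
    }

    close(fds[0]);                 /* parent keeps only the write end */

    /* Reap the child first so the pipe is guaranteed to have no reader. */
    if (waitpid(pid, &status, 0) < 0) {
        close(fds[1]);
        return -1;
    }

    *write_err = write_all_or_errno(fds[1], "x", 1);
    close(fds[1]);

    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

	The point of the sketch is that the EPIPE seen by the writer is 
only a symptom; the wait status of the reader is where the real cause 
shows up.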

	So, advice to all seeing this error: have gdb handy to attach to the 
*second* rsync process, as it is the reader process that seems to be 
failing, and throwing a monkey wrench in the works.

	Someone correct me if I'm wrong, but this seems like the best 
approach.  Also, if anyone has a good slick way of attaching gdb to the 
second rsync process, I'd love to hear it.  Right now it's a mad 
scramble with ps -aux | grep rsync and gdb.
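
	One common way to avoid the scramble is to have the forked 
process announce its pid and wait until a debugger shows up. A rough 
sketch follows; this hook is not part of rsync, and both the 
wait_for_debugger() function and the DEMO_WAIT_FOR_GDB environment 
variable are names invented for this example. It would have to be 
patched in by hand, called in the child right after fork().

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Set to nonzero from inside gdb ("set var debugger_attached = 1")
 * to let the waiting process continue. */
volatile int debugger_attached = 0;

/* Hypothetical debugging hook: if the environment variable env_name is
 * set, print this process's pid and sleep until debugger_attached is
 * nonzero.  Returns 1 if the hook was active, 0 if it was skipped. */
int wait_for_debugger(const char *env_name)
{
    if (getenv(env_name) == NULL)
        return 0;

    fprintf(stderr, "pid %ld waiting for debugger (gdb -p %ld)\n",
            (long)getpid(), (long)getpid());

    while (!debugger_attached)
        sleep(1);

    return 1;
}
```

	With something like this in place, you would run with the 
variable set, attach with gdb -p using the pid it prints, set 
debugger_attached to 1, and continue. Without the variable set, the 
hook does nothing.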

	A few details:

	rsync --archive --update --relative --exclude-from=exlist is the 
option string.  It fails on any set of options, however, including none, 
and -vv.

	-vvv causes it to *hang*, not die.  Waiting for information from 
the hung reader process?

	I set up a series of directories to be synced, including the 'bad' 
one.  The point at which it choked differed depending on where in the 
list the 'bad' directory was placed.  The 'bad' one isn't extremely 
huge, certainly smaller than many that are successfully syncing.  This 
may point to a memory allocation problem?

	Hope this helps.