[Bug 5478] rsync: writefd_unbuffered failed to write 4092 bytes [sender]: Broken pipe (32)

Thu Sep 29 14:39:55 MDT 2011

https://bugzilla.samba.org/show_bug.cgi?id=5478

--- Comment #18 from Eric Shubert <ejs at shubes.net> 2011-09-29 20:39:54 UTC ---
(In reply to comment #17)
> 
> I'm presently thinking that the problem is not with rsync itself, but with some
> aspect of the network connection. I'm seeing no corresponding ethernet errors
> on either end though, and the network seems to be fine otherwise.
> 
> I'd have to say at this point that the failures appear to be happening only
> with large files.
> 

I've reread all 5 bug reports that Tim identified. The general consensus is
that this happens with relatively large files and relatively slow writes
(i.e. loaded targets).

Upon closer examination of a pair of straces, I see this on the target side:
6814  11:34:25.018774 read(0,
"\3037sE\303{\264\16\377\333\357\375\7\274}1c\21\306\377\215\256\202\7\376\245\26\24\302{\16c"...,
4092) = 4092
6814  11:34:25.020201 select(1, [0], [], NULL, {60, 0} <unfinished ...>
6814  11:35:25.009953 <... select resumed> ) = 0 (Timeout)

I found the corresponding block of data in the strace on the source side:
940   11:33:57.494543 write(4,
"\3037sE\303{\264\16\377\333\357\375\7\274}1c\21\306\377\215\256\202\7\376\245\26\24\302{\16c"...,
4092 <unfinished ...>
940   11:33:57.494681 <... write resumed> ) = 4092

I had previously set both hosts to use ntp with the same time server, so their
times should be very close.

In the source side strace, I counted ~510 (successful) writes of data between
the one listed above, and the point at which the source host timed out:
940   11:34:25.073394 select(5, NULL, [4], [4], {60, 0}) = 1 (out [4], left
{60, 0})
940   11:34:25.073514 write(4, "\374\17\0\7", 4) = 4
940   11:34:25.073620 select(5, NULL, [4], [4], {60, 0}) = 0 (Timeout)

Note that the last successful read on the target side happened roughly 0.05
seconds before the last write on the source side, i.e. at essentially the same
time. At this point, there were ~510 x 4K blocks of data (or ~2MB) "in the
pipe" that had been written but not read.
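
For anyone who wants to check the arithmetic, here is a small sketch that
recomputes the figures from the strace timestamps quoted above. The helper name
seconds_between is made up for this example, and the byte total assumes every
one of the ~510 counted writes was a full 4092-byte block, which is only an
approximation.

```python
from datetime import datetime

FMT = "%H:%M:%S.%f"  # strace -tt style timestamps, as quoted above


def seconds_between(earlier: str, later: str) -> float:
    """Return the number of seconds from `earlier` to `later` (same day)."""
    delta = datetime.strptime(later, FMT) - datetime.strptime(earlier, FMT)
    return delta.total_seconds()


# Timestamps copied from the straces in this comment.
block_written = "11:33:57.494543"      # source: write() of the matching block
last_target_read = "11:34:25.018774"   # target: last successful read()
last_source_write = "11:34:25.073514"  # source: last successful write()

# Gap between the target's last read and the source's last write.
gap = seconds_between(last_target_read, last_source_write)

# How long the matching block took to get from write() to read().
lag = seconds_between(block_written, last_target_read)

# Assumption: ~510 writes, each a full 4092-byte block.
in_flight = 510 * 4092

print(f"last read -> last write: {gap:.3f} s")
print(f"same block, write -> read: {lag:.3f} s")
print(f"written but unread: {in_flight} bytes ({in_flight / 2**20:.2f} MiB)")
```

The second figure is only as good as the clock sync between the two hosts, but
it gives a feel for how far the reader was running behind the writer.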

So I'm wondering: is it possible that the buffers making up "the pipe" fill up,
and that when they do, select() is unable to handle it? I don't (yet) know how
these things work, so I could be totally off base. At the same time, this would
seem to explain things.
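
To make the question concrete, here is a minimal sketch of how select() behaves
on a full buffer in general. It is not rsync code: it uses a local socket pair
in place of the real ssh/pipe chain, a 0.5 second timeout instead of the 60
seconds seen in the straces, and the helper names fill_until_blocked and drain
are invented for this example.

```python
import select
import socket

BLOCK = b"\0" * 4092  # same block size as the writes in the straces


def fill_until_blocked(sock: socket.socket) -> int:
    """Send blocks on a non-blocking socket until the kernel buffers are full."""
    sock.setblocking(False)
    queued = 0
    while True:
        try:
            queued += sock.send(BLOCK)
        except BlockingIOError:
            # The kernel refuses more data: the "pipe" is full.
            return queued


def drain(sock: socket.socket) -> int:
    """Read everything currently queued, as a receiver that keeps up would."""
    sock.setblocking(False)
    received = 0
    while True:
        try:
            data = sock.recv(65536)
        except BlockingIOError:
            return received
        if not data:
            return received
        received += len(data)


writer, reader = socket.socketpair()

queued = fill_until_blocked(writer)

# Nobody is reading, so select() reports nothing writable and times out.
_, writable, _ = select.select([], [writer], [], 0.5)
timed_out_while_full = not writable

# Once the reader drains the data, the writer becomes writable again.
drained = drain(reader)
_, writable, _ = select.select([], [writer], [], 0.5)
writable_after_drain = bool(writable)

print(f"queued {queued} bytes before the writer blocked")
print(f"select timed out while full: {timed_out_while_full}")
print(f"writable after draining {drained} bytes: {writable_after_drain}")

writer.close()
reader.close()
```

If this generic behaviour carries over, a select() timeout on the write side
is what one would expect whenever the far end stops reading for the length of
the timeout, rather than a sign that select() itself mishandles a full buffer.
It does not show where the ~2MB in flight got stuck.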

I'll continue to poke at this to see what I can come up with. In the meantime,
any feedback from Wayne and Tim would be appreciated.
