[Bug 5478] rsync: writefd_unbuffered failed to write 4092 bytes [sender]: Broken pipe (32)

Wed Sep 28 14:40:16 MDT 2011

https://bugzilla.samba.org/show_bug.cgi?id=5478

--- Comment #16 from Eric Shubert <ejs at shubes.net> 2011-09-28 20:40:15 UTC ---
I've done some further testing and would like to report my findings.

I tried the rsync to a different target host on a different IP address via a
different ISP, and lo and behold, it worked. I conclude from this that the
problem is outside of the source host, either in the network transport
somewhere or in the the target host.

I have a quickbooks backup (.qbb) file that fails pretty regularly, so I took a
closer look at the strace on the target host for this file.
On one failure, I get:
3060  19:43:19.737510 select(1, [0], [], NULL, {60, 0}) = 1 (in [0], left {60,
0})
3060  19:43:19.737768 read(0,
"\252\220\346\2\f\216\t3\\\256(\t\202\3\222\211\16\0246\310H\1\342\202\203\2\374\346\0273l\317"...,
4092) = 4092
3060  19:43:19.738449 select(1, [0], [], NULL, {60, 0} <unfinished ...>
3059  19:43:33.847680 <... select resumed> ) = 0 (Timeout)
3059  19:43:33.847918 select(4, [3], [], NULL, {60, 0} <unfinished ...>
3060  19:44:19.730067 <... select resumed> ) = 0 (Timeout)

On a subsequent run a few minutes later, I also get a failure, but
approximately 3 seconds later. Curiously, I see this in the 2nd run at the
point of failure in the first run (notice the identical data):
3170  20:08:03.378981 select(1, [0], [], NULL, {60, 0}) = 1 (in [0], left {60,
0})
3170  20:08:03.379237 read(0,
"\252\220\346\2\f\216\t3\\\256(\t\202\3\222\211\16\0246\310H\1\342\202\203\2\374\346\0273l\317"...,
4092) = 4092
3170  20:08:03.379744 select(1, [0], [], NULL, {60, 0}) = 1 (in [0], left {59,
640000})
3170  20:08:03.728257 read(0, "\374\17\0\7", 4) = 4

The 2nd run continues for 3 seconds, then times out like the previous run:
3170  20:08:07.624052 select(1, [0], [], NULL, {60, 0}) = 1 (in [0], left {60,
0})
3170  20:08:07.624424 read(0,
"\3,\234\37J\341\374\240\v\347\227W\30\277\220\342\377\222\v\347\27W\30\277\214\342\377\202\v\347\247\366"...,
4092) = 4092
3170  20:08:07.625000 select(1, [0], [], NULL, {60, 0} <unfinished ...>
3169  20:08:48.205338 <... select resumed> ) = 0 (Timeout)
3169  20:08:48.205547 select(4, [3], [], NULL, {60, 0} <unfinished ...>
3170  20:09:07.627769 <... select resumed> ) = 0 (Timeout)

I thought perhaps the "left {59, 640000}" was a clue given that it happened at
the exact same place as a previous failure, but I see the same sort of thing
after every 8 reads, so I don't expect it's indicative of a problem.

So now I'm beginning to think that the problem is (indeed) outside of the rsync
code. I've examined the 2 target hosts (one that worked and one that doesn't),
and tried to determine "what's different?". These hosts are all VMs that are
running under VMware Server2. The host that worked is running CentOS 5.6, while
the one that's failing is running CentOS 5.4 (450 days uptime). I'm going to
upgrade the failing host OS to CentOS 5.7 and see if that changes anything. I
know that some versions of glibc (> 2.5-34, < 2.5.58) had mysterious problems
with VMware Server 2 (guests terminating for no apparent reason). I'm wondering
if there's something else in glibc that's perhaps having a problem with VMware
Server 2. (I realize that Server 2 is essentially dead and will be migrating to
KVM in the near future, although Server2 has been quite stable for me.)

I'll report my findings when the upgrade is complete.
Any pointers toward how to troubleshoot this thing would be more than welcome.
TIA.

-- 
Configure bugmail: https://bugzilla.samba.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the QA contact for the bug.