[Bug 5478] rsync: writefd_unbuffered failed to write 4092 bytes [sender]: Broken pipe (32)

Fri Oct 7 15:57:55 MDT 2011

https://bugzilla.samba.org/show_bug.cgi?id=5478

--- Comment #20 from Tim Taiwanese Liim <tim.liim at alcatel-lucent.com> 2011-10-07 21:57:54 UTC ---
I agree with Wayne and Eric that Eric's issue is outside of rsync,
somewhere in the transport.

Eric,
Have you tried checking the TCP buffers of the ssh process on both
ends?  For example,
    p=192.168.51.98:22
    while true; do date; netstat -tn | grep "$p"; sleep 1; done
        # sample output (header line shown for reference; grep filters it out)
        Proto Recv-Q Send-Q Local Address    Foreign Address     State
        tcp        0  19328 192.168.51.98:22 192.168.51.51:53010 ESTABLISHED
In this example, the sender has 19328 bytes in its TCP send queue
(Send-Q), i.e. bytes not yet acknowledged by the remote host.
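
If it helps, here is a rough sketch of the same check as a reusable
filter that prints only the two queue columns.  The function name
queue_sizes is my own invention, and it assumes the Linux net-tools
column layout shown above (Recv-Q in column 2, Send-Q in column 3);
other netstat implementations may differ.

```shell
# Print "Recv-Q Send-Q" for every TCP connection whose local or foreign
# address equals $1.  Reads `netstat -tn` output on stdin, so it also
# works on saved output.  (queue_sizes is a hypothetical helper name.)
queue_sizes() {
    awk -v addr="$1" '$1 ~ /^tcp/ && ($4 == addr || $5 == addr) { print $2, $3 }'
}

# Hypothetical usage: sample once per second, prefixed with the time.
# while true; do
#     printf '%s ' "$(date +%T)"
#     netstat -tn | queue_sizes 192.168.51.98:22
#     sleep 1
# done
```

A Send-Q that stays large and never drains while the transfer is
stalled would suggest that the peer is not acknowledging data.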

You can also use tcpdump and wireshark to graph how well the TCP
connection is flowing:
    # Capture the first 100 bytes of each packet on eth0 and write
    # them to t.pcap.  We need only the first 100 bytes because we
    # care about the TCP sequence numbers, not the file content.
    # Sniffing packets requires root access.
    tcpdump -i eth0 -s 100 -w t.pcap
then feed the trace to wireshark:
    wireshark t.pcap
    # then select menu Statistics --> TCP Stream Graph -->
    # Time-Sequence Graph (Stevens)
With this you can visualize how the TCP flow goes (smooth, stalled,
fluctuating, or with excessive retransmissions).  With tcpdump traces
from both ends, you can also check for lost packets.  (A few years ago
I worked on a case of stalled ssh; it turned out the NIC firmware was
defective, causing excessive packet loss.)  Comparing the two targets
(the good one and the bad one) may show the difference.  Do your two
targets run on the same host machine?  Or on host machines of the same
configuration (same NIC etc.)?  Could one have a bad NIC (e.g. working,
but with excessive packet loss in bursts)?

As Wayne pointed out, there is a pipe (or Unix domain socket) between
rsync and ssh as well.  I don't know how to track the queue size in
a pipe yet, so let's track the known ones (TCP) first.

> Notice, the last successful read on the target side happened .06
> seconds before the last write on the source side, which is pretty
> much at the same time.
This is an important clue, although I don't know what to make of
it yet; maybe a few lost TCP ACKs in a row?

How did your "bwlimit=32" test go?

BTW, I am not an rsync developer; I use rsync a lot and try to help
when possible, so please do not take my response as official.
