rsync 3.0.7 network errors on MS-Windows

Wed Jun 2 06:55:24 MDT 2010

andrew.marlow at uk.bnpparibas.com wrote:
> 
>    I am experiencing intermittent network failures on rsync 3.0.7 built
>    using cygwin for Windows-XP (SP2). I am using GCC v4.4.2 and the
>    latest version of cygwin.
>    The rsync error log indicates things like:
>    rsync: writefd_unbuffered failed to write 4092 bytes to socket [generator]: Connection reset by peer (104)
>    rsync: read error: Connection reset by peer (104)
>    rsync error: error in rsync protocol data stream (code 12) at io.c(1530) [generator=3.0.7]
>    rsync error: error in rsync protocol data stream (code 12) at io.c(760) [receiver=3.0.7]
>    Googling, I see that these problems were put down to the way sockets are
>    cleaned up in Windows and a fix put in place in cleanup.c, in
>    close_all(). But the fix is surrounded by conditional compilation:-
>    #ifdef SHUTDOWN_ALL_SOCKETS
>       :
>       :
>    #endif
>    Can someone please explain why that is? Shouldn't the fix just be
>    there always, and regardless of which operating system?

It's not needed on most operating systems - as the comment there implies.

According to the notes copied below, SO_LINGER is off by default on
unix sockets, and this means close() will gracefully send the
remaining data in the background, rather than TCP RST.  You can assume
that program exit has the same effect as close().
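
As a quick check of the "off by default" claim, here is a minimal
sketch that reads SO_LINGER back from a freshly created TCP socket.
The function name linger_off_by_default() is made up for this example;
the calls are the ordinary BSD socket API, and the result is whatever
your platform's stack reports.

```c
#include <sys/socket.h>
#include <unistd.h>

/* Returns 1 if a freshly created TCP socket reports SO_LINGER as
 * disabled, 0 if it reports it as enabled, and -1 on error.
 * (Hypothetical helper, for illustration only.) */
int linger_off_by_default(void)
{
    struct linger lg;
    socklen_t len = sizeof lg;
    int rc;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    /* Ask the stack what linger setting the new socket starts with. */
    rc = getsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, &len);
    close(fd);
    if (rc < 0)
        return -1;

    return lg.l_onoff == 0;
}
```

On a typical unix system I would expect this to return 1, but it is
worth running on the platform in question rather than assuming.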

If SO_LINGER is turned on with a zero timeout, the notes below say TCP
RST is sent on close, which is much like what the comment for
SHUTDOWN_ALL_SOCKETS says is happening on Windows without SO_LINGER.
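
To see the abortive close in isolation, the following sketch connects
a TCP client to a listener on the loopback interface, closes the
client with SO_LINGER on and a zero timeout, and reports the errno
that recv() sees on the accepted side. The function name is invented
for this example, and it assumes a unix-like stack with a working
loopback interface; it is not taken from rsync.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns the errno reported by recv() on the accepted socket after
 * the client does an abortive close, 0 if recv() saw a clean EOF,
 * or -1 if the setup failed. (Hypothetical helper.) */
int errno_after_abortive_close(void)
{
    int result = -1;
    int lfd = -1, cfd = -1, afd = -1;
    struct sockaddr_in addr;
    socklen_t alen = sizeof addr;
    struct linger lg;
    char buf[16];
    ssize_t n;

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                  /* let the kernel pick a port */

    lfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd < 0) goto out;
    if (bind(lfd, (struct sockaddr *)&addr, sizeof addr) < 0) goto out;
    if (listen(lfd, 1) < 0) goto out;
    /* Find out which port was chosen so the client can connect. */
    if (getsockname(lfd, (struct sockaddr *)&addr, &alen) < 0) goto out;

    cfd = socket(AF_INET, SOCK_STREAM, 0);
    if (cfd < 0) goto out;
    if (connect(cfd, (struct sockaddr *)&addr, sizeof addr) < 0) goto out;
    afd = accept(lfd, NULL, NULL);
    if (afd < 0) goto out;

    /* Linger on, zero timeout: close() should abort the connection. */
    lg.l_onoff = 1;
    lg.l_linger = 0;
    if (setsockopt(cfd, SOL_SOCKET, SO_LINGER, &lg, sizeof lg) < 0) goto out;
    close(cfd);
    cfd = -1;

    /* The peer's next read shows how the connection ended. */
    n = recv(afd, buf, sizeof buf, 0);
    result = (n < 0) ? errno : 0;

out:
    if (cfd >= 0) close(cfd);
    if (afd >= 0) close(afd);
    if (lfd >= 0) close(lfd);
    return result;
}
```

If the stack behaves as the notes describe, the return value is
ECONNRESET, the same "Connection reset by peer" seen in the rsync log.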

Presumably Windows sockets - or at least some version of them (there
are several versions of Winsock) - behave differently from unix
sockets in this area.  It wouldn't be surprising, as historically
Winsock ran inside the process not the kernel, so an exiting process
couldn't implement the unix graceful close behaviour, and maybe they
kept that behaviour the same in later versions.

That said, I still don't see why SHUTDOWN_ALL_SOCKETS would fix it.
Calling shutdown(fd,2) closes it in both directions, and at least with
usual unix sockets, that would trigger TCP RST anyway if the other end
sends any data after the shutdown.
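
For reference, shutdown(fd,2) is shutdown(fd, SHUT_RDWR). The sketch
below shows what the other end observes afterwards, using a local
AF_UNIX socketpair as a stand-in because it needs no network setup.
That is an assumption made for convenience: on a local pair the
refused write fails straight away, whereas with TCP the refusal
travels back as a RST. The struct and function names are invented for
this example, and MSG_NOSIGNAL is used to avoid SIGPIPE on systems
that provide it.

```c
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/* What the peer sees after the other end calls shutdown(fd, SHUT_RDWR).
 * (Hypothetical type, for illustration only.) */
struct shutdown_result {
    ssize_t peer_read;        /* return value of the peer's read() */
    int     peer_write_errno; /* errno from the peer's send(), 0 if it worked */
};

/* Returns 0 on success and fills in *out, or -1 if setup failed. */
int shutdown_both_ways(struct shutdown_result *out)
{
    int sv[2];
    char c = 'x';
    ssize_t n;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    /* Same call as shutdown(fd,2): close both directions on sv[0]. */
    if (shutdown(sv[0], SHUT_RDWR) < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    /* The peer's read reports end-of-file because nothing more will arrive. */
    out->peer_read = read(sv[1], &c, 1);

    /* The peer's write is refused because sv[0] no longer accepts data. */
    c = 'x';
    n = send(sv[1], &c, 1, MSG_NOSIGNAL);
    out->peer_write_errno = (n < 0) ? errno : 0;

    close(sv[0]);
    close(sv[1]);
    return 0;
}
```

On Linux I would expect peer_read to be 0 and peer_write_errno to be
EPIPE; other systems may report the refused write differently.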

Which it seems to be doing: "writefd_unbuffered failed to write 4092
bytes to socket" implies the other end has closed, or called
shutdown(fd,1) or shutdown(fd,2), and that data was then sent to it
which it couldn't accept, so the other end sent back TCP RST anyway.

If rsync is doing that in normal operation, that ought to be a problem
on unix just as much as Windows - and SHUTDOWN_ALL_SOCKETS ought to be
insufficient to prevent the reset.

Which suggests to me that "writefd_unbuffered failed to write 4092
bytes to socket" is a symptom of a different problem.

Here are the notes I referred to above, which explain SO_LINGER's
behaviour:

   Unix Socket FAQ
   http://www.developerweb.net/forum/archive/index.php/t-2982.html

   4.6 - What exactly does SO_LINGER do?

   Contributed by Cyrus Patel

   SO_LINGER affects the behaviour of the close() operation as described
   below. No socket operation other than close() is affected by SO_LINGER.

   The following description of the effect of SO_LINGER has been culled
   from the setsockopt() and close() man pages for several systems, but may
   still not be applicable to your system. Implementations range from not
   supporting SO_LINGER at all, to supporting it only partially, to having
   "peculiarities" that must be dealt with (see portability notes at
   end).

   Moreover, the purpose of SO_LINGER is very, very specific and only a
   tiny minority of socket applications actually need it. Unless you are
   extremely familiar with the intricacies of TCP and the BSD socket API,
   you could very easily end up using SO_LINGER in a way for which it was
   not designed.

   The effect of a setsockopt(..., SO_LINGER,...) depends on what the
   values in the linger structure (the fourth parameter passed to
   setsockopt()) are:

   Case 1: linger->l_onoff is zero (linger->l_linger has no meaning):
   This is the default.

   On close(), the underlying stack attempts to gracefully shutdown the
   connection after ensuring all unsent data is sent. In the case of
   connection-oriented protocols such as TCP, the stack also ensures that
   sent data is acknowledged by the peer. The stack will perform the
   above-mentioned graceful shutdown in the background (after the call to
   close() returns), regardless of whether the socket is blocking or
   non-blocking.

   Case 2: linger->l_onoff is non-zero and linger->l_linger is zero:

   A close() returns immediately. The underlying stack discards any unsent
   data, and, in the case of connection-oriented protocols such as TCP,
   sends a RST (reset) to the peer (this is termed a hard or abortive
   close). All subsequent attempts by the peer's application to
   read()/recv() data will result in an ECONNRESET.

   Case 3: linger->l_onoff is non-zero and linger->l_linger is non-zero:

   A close() will either block (if a blocking socket) or fail with
   EWOULDBLOCK (if non-blocking) until a graceful shutdown completes or the
   time specified in linger->l_linger elapses (time-out). Upon time-out the
   stack behaves as in case 2 above.

   ---------------------------------------------------------------

   Portability note 1: Some implementations of the BSD socket API do not
   implement SO_LINGER at all. On such systems, applying SO_LINGER either
   fails with EINVAL or is (silently) ignored. Having SO_LINGER defined in
   the headers is no guarantee that SO_LINGER is actually implemented.

   Portability note 2: Since the BSD documentation on SO_LINGER is sparse
   and inadequate, it is not surprising to find the various implementations
   interpreting the effect of SO_LINGER differently. For instance, the
   effect of SO_LINGER on non-blocking sockets is not mentioned at all in
   BSD documentation, and is consequently treated differently on different
   platforms. Taking case 3 for example: Some implementations behave as
   described above. With others, a non-blocking socket close() succeeds
   immediately, leaving the rest to a background process. Others ignore
   non-blocking'ness and behave as if the socket were blocking. Yet others
   behave as if SO_LINGER wasn't in effect [as if case 1, the default,
   were in effect], or ignore linger->l_linger [case 3 is treated as case
   2]. Given the lack of adequate documentation, such differences are not
   (by themselves) indicative of an "incomplete" or "broken"
   implementation. They are simply different, not incorrect.

   Portability note 3: Some implementations of the BSD socket API do not
   implement SO_LINGER completely. On such systems, the value of
   linger->l_linger is ignored (always treated as if it were zero).

   Technical/Developer note: SO_LINGER does (should) not affect a stack's
   implementation of TIME_WAIT. In any event, SO_LINGER is not the way to
   get around TIME_WAIT. If an application expects to open and close many
   TCP sockets in quick succession, it should be written to use only a
   fixed number and/or range of ports, and apply SO_REUSEPORT to sockets
   that use those ports.

   Related note: Many BSD sockets implementations also support a
   SO_DONTLINGER socket option. This socket option has the exact opposite
   meaning of SO_LINGER, and the two are treated (after inverting the value
   of linger->l_onoff) as equivalent. In other words, SO_LINGER with a zero
   linger->l_onoff is the same as SO_DONTLINGER with a non-zero
   linger->l_onoff, and vice versa.
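
The three cases in the notes above map onto struct linger roughly as
follows. This is only a sketch: the enum and the function
set_linger_case() are names I have made up, and as the portability
notes say, what the stack then does on close() varies by platform.

```c
#include <sys/socket.h>

/* The three cases from the FAQ. (Hypothetical names.) */
enum linger_case {
    LINGER_DEFAULT = 1, /* case 1: l_onoff == 0, graceful background close */
    LINGER_ABORT   = 2, /* case 2: l_onoff != 0, l_linger == 0, hard close */
    LINGER_WAIT    = 3  /* case 3: l_onoff != 0, l_linger > 0, timed wait  */
};

/* Applies one of the three cases to fd. timeout_seconds is only used
 * for LINGER_WAIT and must then be positive. Returns 0 on success and
 * -1 on a bad argument or if setsockopt() fails. */
int set_linger_case(int fd, enum linger_case which, int timeout_seconds)
{
    struct linger lg;

    lg.l_onoff = 0;
    lg.l_linger = 0;

    switch (which) {
    case LINGER_DEFAULT:
        break;                      /* leave both fields at zero */
    case LINGER_ABORT:
        lg.l_onoff = 1;             /* on, with a zero timeout */
        break;
    case LINGER_WAIT:
        if (timeout_seconds <= 0)
            return -1;              /* a zero timeout would be case 2 */
        lg.l_onoff = 1;
        lg.l_linger = timeout_seconds;
        break;
    default:
        return -1;
    }

    return setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof lg);
}
```

For example, set_linger_case(fd, LINGER_ABORT, 0) requests the hard
close that the SHUTDOWN_ALL_SOCKETS comment describes, and
set_linger_case(fd, LINGER_DEFAULT, 0) puts the socket back to the
default.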

-- Jamie