discussing a "reverse rsync algorithm"
Wayne Davison
wayned at samba.org
Fri Apr 28 18:23:34 GMT 2006
On Thu, Apr 27, 2006 at 05:11:41PM -0400, Matt McCutchen wrote:
> I believe the sender should compute and send the block hashes.
This has come up before, and it may well be something to make available
as an option, but I don't think it should be the default for a number
of reasons:
(1) This increases the amount of data flowing in the transfer direction
while removing the largest part of the data that currently flows in the
opposite direction. This is a bad thing if the network is the bottleneck
for the transfer. The current algorithm helps to balance out the network
use by sending the checksum data in the opposite direction of the
transfer.
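To put rough numbers on that (the block size and per-block checksum
sizes below are purely illustrative -- rsync picks its real values per
file):

    # Rough size of the checksum stream for one file, assuming a 4-byte
    # rolling checksum plus a 16-byte strong checksum per block and a
    # hypothetical 8 KiB block size.
    FILE_SIZE = 1 * 1024**3         # 1 GiB file
    BLOCK_SIZE = 8 * 1024
    PER_BLOCK = 4 + 16

    blocks = (FILE_SIZE + BLOCK_SIZE - 1) // BLOCK_SIZE
    print(blocks * PER_BLOCK / 1024, "KiB of checksums")   # ~2560 KiB

A couple of MiB of checksums per GiB is small next to the file data,
but in a reverse scheme it rides on the already-busy transfer direction
instead of on the mostly idle reverse channel.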
(2) This increases the disk I/O on the sending side because it would
need to read the file twice (assuming that cached checksums aren't
available). Since rsync is running in a pipelined mode, the sender
will be iterating over several future files by the time it gets back a
request for data chunks from the receiving side. Hopefully the data
will still be in the disk cache, but if it is not, the transfer would
bog down to a significant degree.
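Here is a minimal sketch (Python, just to illustrate; zlib.adler32 and
MD5 stand in for rsync's actual rolling and strong checksums) of the
extra first pass a reverse scheme would add on the sender:

    import hashlib, zlib

    def block_signatures(path, block_size=8192):
        # Read the whole file once and emit a (weak, strong) pair per block.
        sigs = []
        with open(path, "rb") as f:
            while True:
                block = f.read(block_size)
                if not block:
                    break
                sigs.append((zlib.adler32(block),
                             hashlib.md5(block).digest()))
        return sigs

The second read happens later, when the receiver's requests for
unmatched chunks come back -- hopefully while the data is still in the
disk cache.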
(3) This increases the time during which a change on the sending side
results in the sending of a bogus file to the receiver. The reason is
that the checksums are sent some amount of elapsed time prior to when
the requests for file hunks begin arriving. If changes happen during
this period, the file will not be constructed correctly. The current
algorithm has a similar problem, but is sensitive to changes occurring
on the receiving side. For many applications, the receiving side only
changes when the files are updated by rsync (e.g. backups), so the
current algorithm has the advantage that the checksum-aging window is
on the side that is less likely to be changing.
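(The usual guard against a file that changed during that window is a
whole-file strong checksum compared after reconstruction; something
along these lines, though the real digest and retry handling differ in
detail:

    import hashlib

    def file_digest(path, chunk=1 << 16):
        # Whole-file strong checksum of the reconstructed file.
        h = hashlib.md5()
        with open(path, "rb") as f:
            for piece in iter(lambda: f.read(chunk), b""):
                h.update(piece)
        return h.digest()

If the reconstructed file's digest doesn't match the one computed on
the sending side, the file has to be done over.)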
(4) Either the number of active temporary files increases, or the
memory needed for remembering the matching blocks increases. Either
way, it looks like the disk I/O on the receiving side also increases.
This occurs because the receiver gets the checksum data well before the
sender data starts arriving for the file. One
algorithm to handle this is to start constructing a sparse temporary
file as the basis data is read, keeping the temp file open until the
server data arrives to finish the file. The bad thing about this
algorithm is that it makes it hard to compute the full-file checksum
without re-reading all the matched data from the temp file (it also
increases the number of active temporary files, which increases the
demands on free disk space). The alternative is to not do any temp-
file building until the sender data starts arriving, at which point
we'd need to re-read the matched data from the basis file -- this would
require caching (in memory?) all the matching block info we computed
for all the files that are awaiting sender-side data.
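As a sketch of that second alternative (the names and structure here
are mine, and it is simplified to block-aligned matches, whereas the
real matching can hit any byte offset):

    def remember_matches(new_file_sigs, basis_sigs):
        # Per block of the new file: index of the matching basis block,
        # or None for a miss.  One such list must stay in memory for
        # every file still awaiting sender-side data.
        lookup = {sig: i for i, sig in enumerate(basis_sigs)}
        return [lookup.get(sig) for sig in new_file_sigs]

    def rebuild(tmp, basis, matches, literal_chunks, block_size=8192):
        # Only runs once the sender's data arrives: matched blocks are
        # re-read from the basis file, misses come from the sender.
        literals = iter(literal_chunks)
        for match in matches:
            if match is None:
                tmp.write(next(literals))
            else:
                basis.seek(match * block_size)
                tmp.write(basis.read(block_size))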
One solution to some of the above is to remove the pipelining from
the transfer so that rsync is operating on one file at a time, adding
delays to the transfer as each file is requested in sequence.
The main benefit of a reverse-algorithm is that the sender does not
need as much CPU, especially if the checksum data is cached. It seems
to me that the extra disk I/O needed on both sides might outweigh this
advantage, especially for large files that cannot remain in the disk-
cache long enough for the necessary re-reading.
..wayne..