Possible ways to increase the performance of rsync

Tue Nov 8 22:04:08 GMT 2005

Rsync's method of taking advantage of information common to two machines
to speed the transfer of additional information is interesting
philosophically, and I suspect that additional performance improvements
might be possible if the rsync algorithm were exploited more fully.  For
instance, consider the following speculations about behavior that could
be added as an option to rsync:

Chris Shoemaker wrote:
> [...] where log.4 appears to be a missing file but is really just a
> renamed log.3.  And, log.3, log.2 and log.1, will probably be
> retransmitted in full (there's a problem for another day, but this is
> why I was thinking of a hashtable of all files --checksums). [...]

If rsync is going to make a table of file checksums beforehand, it could
compute the rsync-algorithm hashes of all the blocks of all the
receiver's files in the transfer at the very beginning and send all the
hashes to the sender.  That way, rsync can efficiently handle not only
renamed or moved files but also files that were split, joined, or
otherwise rearranged.  The disadvantage is that the number of block
hashes grows with the total size of the receiver's data, so in a very
large transfer the sender will need enough memory to hold them all and a
hash table with enough buckets to keep lookups efficient.
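
To make the idea concrete, here is a rough sketch of such a global
table.  It is only an illustration under assumptions of my own: the
function names are hypothetical, zlib.adler32 and SHA-256 stand in for
rsync's real weak and strong checksums, the block size is tiny, and the
weak checksum is recomputed at every offset rather than rolled.

```python
import hashlib
import zlib
from collections import defaultdict

BLOCK_SIZE = 4  # unrealistically small, for illustration only


def block_hashes(data, block_size=BLOCK_SIZE):
    """Yield (offset, weak, strong) for each full block of data."""
    for offset in range(0, len(data) - block_size + 1, block_size):
        block = data[offset:offset + block_size]
        yield offset, zlib.adler32(block), hashlib.sha256(block).digest()


def build_global_table(receiver_files, block_size=BLOCK_SIZE):
    """Map weak checksum -> [(strong, path, offset)] over ALL receiver files."""
    table = defaultdict(list)
    for path, data in receiver_files.items():
        for offset, weak, strong in block_hashes(data, block_size):
            table[weak].append((strong, path, offset))
    return table


def find_matches(sender_data, table, block_size=BLOCK_SIZE):
    """Return [(sender_offset, receiver_path, receiver_offset)] for blocks
    of sender_data that already exist somewhere on the receiver."""
    matches = []
    pos = 0
    while pos + block_size <= len(sender_data):
        block = sender_data[pos:pos + block_size]
        hit = None
        # Check the weak checksum first, then confirm with the strong hash.
        for strong, path, offset in table.get(zlib.adler32(block), ()):
            if strong == hashlib.sha256(block).digest():
                hit = (pos, path, offset)
                break
        if hit:
            matches.append(hit)
            pos += block_size  # skip past the matched block
        else:
            pos += 1           # slide one byte and try again
    return matches
```

With receiver files {"log.3": b"AAAABBBB", "log.2": b"CCCC"}, a sender
file named log.4 containing b"AAAABBBB" matches entirely against log.3,
and a joined file such as b"CCCCxAAAA" matches blocks from both log.2
and log.3, so only the stray byte would need to be sent literally.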

Traditionally, if two files associated with the same path in the
transfer pass rsync's quick check, they are considered identical for
rsync's purposes.  Consider this: after the sender constructs a nice,
organized hash table from the gigantic list of receiver block hashes, it
dumps the hash table into a cache file, noting which receiver file each
block hash came from /and/ that file's size, mtime, and checksum (if the
checksum was ever computed) on the receiver at the time of the transfer.
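
One possible shape for such a cache file is sketched below.  The JSON
layout and the names build_cache, save_cache, and load_cache are my own
assumptions for illustration, not anything rsync defines; block hashes
are represented as hex strings and blocks are identified by index.

```python
import json
import os
import tempfile


def build_cache(receiver_meta, block_hashes):
    """receiver_meta: path -> (size, mtime, checksum or None)
    block_hashes:  path -> list of hex block hashes, in block order"""
    cache = {"files": {}, "blocks": {}}
    for path, (size, mtime, checksum) in receiver_meta.items():
        cache["files"][path] = {
            "size": size, "mtime": mtime, "checksum": checksum,
        }
    for path, hashes in block_hashes.items():
        for index, h in enumerate(hashes):
            # Record which receiver file (and block) each hash came from.
            cache["blocks"].setdefault(h, []).append([path, index])
    return cache


def save_cache(cache, filename):
    """Write to a temporary file first so a crash cannot leave a torn cache."""
    directory = os.path.dirname(os.path.abspath(filename))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(cache, f)
    os.replace(tmp, filename)


def load_cache(filename):
    with open(filename) as f:
        return json.load(f)
```

Keeping the per-file metadata next to the hashes is what lets a later
run decide, file by file, whether the stored hashes can still be
trusted.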

At the beginning of future transfers, the sender reads the cache into
memory in bulk and sends the expected file metadata from the cache along
with the file list.  If a receiver file matches the corresponding cache
clump according to the quick check, then the sender already has the
file's block hashes in memory and the receiver doesn't need to do
anything!  If the file does not match, the sender discards its cache
clump, the receiver computes the hashes, and the sender stores them in
the table.
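
The decision described above might look roughly like this.  It is a
single-process sketch with hypothetical names (quick_check, reconcile,
compute_hashes); in a real transfer the comparison and the hashing would
happen on the receiver and the results would travel over the wire.  I
assume here that the quick check compares size and mtime only.

```python
def quick_check(cached, actual):
    """True if the cached metadata still describes the receiver's file."""
    return (cached["size"] == actual["size"]
            and cached["mtime"] == actual["mtime"])


def reconcile(cache_files, cache_hashes, receiver_files, compute_hashes):
    """Return (hashes, recomputed).

    cache_files:    path -> {"size", "mtime"} as stored in the cache
    cache_hashes:   path -> list of block hashes stored in the cache
    receiver_files: path -> {"size", "mtime"} as they are on the receiver now
    compute_hashes: callable(path) -> list of block hashes (the slow path)
    """
    hashes = {}
    recomputed = []
    for path, actual in receiver_files.items():
        cached = cache_files.get(path)
        if (cached is not None and path in cache_hashes
                and quick_check(cached, actual)):
            # Cache clump is still valid: the receiver does no work.
            hashes[path] = cache_hashes[path]
        else:
            # Stale or missing clump: discard it and hash the file again.
            hashes[path] = compute_hashes(path)
            recomputed.append(path)
    return hashes, recomputed
```

Only the files listed in recomputed cost the receiver any hashing, so a
mostly unchanged tree would pay almost nothing on later runs.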

Along these lines, it might even be possible to use the rsync algorithm
itself to synchronize the file lists or block hash lists of the two
sides before the transfer of real data begins.
-- 
Matt McCutchen, ``hashproduct''
hashproduct at verizon.net -- http://mysite.verizon.net/hashproduct/