efficient file appends

rsync at ka9q.net
Wed Dec 12 19:35:17 EST 2001


Hi. When I discovered rsync, it immediately became one of my most
indispensable utilities. It's a real godsend on bandwidth-limited
links, especially digital cellular.

It works remarkably well in the general case, but I think the
algorithm could be improved for one very important special case.

Many (or even most) of the updated files I transfer with rsync change
only by having data appended to the end. Examples of such files
include system logs and (especially) email archives in mbox format.

Rsync handles these files correctly, of course, but I think it could
do so more efficiently. Right now, the receiver sends back a list of
checksums for the blocks it has, and this list grows in proportion to
the size of the file. I often see transfers of large mailboxes where
appending one small email message to the sender's copy results in a
reverse transfer of block checksums that is much larger than the new
message.
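
To give a rough feel for the numbers, here is a back-of-the-envelope
estimate in Python. The block size and the number of checksum bytes
sent per block are illustrative assumptions chosen for this sketch, not
values taken from any particular rsync version, and the function name
is made up.

```python
import math


def checksum_list_bytes(file_size, block_size, bytes_per_block):
    """Estimate the size of the receiver's checksum list.

    All three parameters are illustrative inputs; the real values
    depend on the rsync version and its options.
    """
    blocks = math.ceil(file_size / block_size)
    return blocks * bytes_per_block


# Assumed example: a 100 MB mailbox, 700-byte blocks, 6 checksum
# bytes per block, compared against one 2 KB appended message.
overhead = checksum_list_bytes(100_000_000, 700, 6)
new_message = 2_000
print(overhead, "bytes of checksums vs", new_message, "bytes of new data")
```

Under those assumed figures the checksum list is hundreds of times
larger than the appended message.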

It seems to me that this situation is common enough that the rsync
protocol should look for it as a special case. Once the protocol has
determined from differing timestamps and/or lengths that a file needs
to be synchronized, the receiver should return a hash (and length) of
its copy of the entire file to the sender.  The sender then computes
the hash for the corresponding leading segment of its copy. If they
match, the sender simply sends the newly appended data and instructs
the receiver to append it to its copy. If they don't match, the
transfer falls back to the normal algorithm.
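
The exchange described above can be sketched in Python roughly as
follows. This is only an illustration of the idea, not rsync code: the
function names are hypothetical, SHA-256 is an arbitrary hash choice,
and both "sides" are just local files in a temporary directory.

```python
import hashlib
import os
import tempfile

CHUNK = 64 * 1024  # read size; arbitrary choice for this sketch


def prefix_digest(path, length):
    """Hash the first `length` bytes of the file at `path`."""
    h = hashlib.sha256()  # hash choice is an assumption of this sketch
    remaining = length
    with open(path, "rb") as f:
        while remaining > 0:
            data = f.read(min(CHUNK, remaining))
            if not data:
                break  # file is shorter than `length`
            h.update(data)
            remaining -= len(data)
    return h.hexdigest()


def receiver_summary(path):
    """Receiver side: report (length, hash) of its entire copy."""
    length = os.path.getsize(path)
    return length, prefix_digest(path, length)


def sender_tail(path, recv_len, recv_digest):
    """Sender side: return the appended bytes, or None when the file
    is not a pure append and the normal algorithm should be used."""
    if os.path.getsize(path) < recv_len:
        return None  # sender's copy is shorter than the receiver's
    if prefix_digest(path, recv_len) != recv_digest:
        return None  # leading segment differs
    with open(path, "rb") as f:
        f.seek(recv_len)
        return f.read()


def receiver_append(path, tail):
    """Receiver side: append the new data to its copy."""
    with open(path, "ab") as f:
        f.write(tail)


def demo():
    """Run one exchange on two small made-up mbox files."""
    with tempfile.TemporaryDirectory() as d:
        src = os.path.join(d, "sender.mbox")
        dst = os.path.join(d, "receiver.mbox")
        with open(dst, "wb") as f:
            f.write(b"From a\nhello\n")
        with open(src, "wb") as f:
            f.write(b"From a\nhello\nFrom b\nworld\n")
        tail = sender_tail(src, *receiver_summary(dst))
        receiver_append(dst, tail)
        with open(src, "rb") as a, open(dst, "rb") as b:
            return tail, a.read() == b.read()
```

In this sketch the only data sent in the reverse direction is one
length and one hash, regardless of how large the file is.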

I just joined this list, and I couldn't find any obvious discussion of
this issue in the archives. My apologies if it has already been
discussed.

Phil Karn



