Extremely poor rsync performance on very large files (near 100GB and larger)

Matt McCutchen hashproduct+rsync at gmail.com
Mon Oct 8 03:29:34 GMT 2007


On 10/7/07, Wayne Davison <wayned at samba.org> wrote:
> On Mon, Jan 08, 2007 at 10:16:01AM -0800, Wayne Davison wrote:
> > And one final thought that occurred to me:  it would also be possible
> > for the sender to segment a really large file into several chunks,
> > handling each one without overlap, all without the generator or the
> > receiver knowing that it was happening.
>
> I have a patch that implements this:
>
> http://rsync.samba.org/ftp/unpacked/rsync/patches/segment_large_hash.diff
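For anyone who hasn't read the diff, the idea as I understand it is
roughly this (a toy sketch with made-up names and numbers, not the
patch's actual code):

#include <stdint.h>

#define SEGMENT_BLOCKS 131072   /* hypothetical cap on hashed checksums */

/* Stand-ins for rsync's existing hashing and matching machinery. */
static void build_hash_table(int64_t first, int64_t n) { (void)first; (void)n; }
static void match_range(int64_t start, int64_t end) { (void)start; (void)end; }

/* The sender hashes one bounded slice of the generator's checksum list
 * at a time and runs the normal matching pass over just the
 * corresponding byte range, so the hash table stays small no matter
 * how large the file gets, and neither the generator nor the receiver
 * can tell. */
void match_in_segments(int64_t nblocks, int32_t block_len)
{
    for (int64_t b = 0; b < nblocks; b += SEGMENT_BLOCKS) {
        int64_t n = nblocks - b;
        if (n > SEGMENT_BLOCKS)
            n = SEGMENT_BLOCKS;
        build_hash_table(b, n);   /* hash this segment's checksums only */
        match_range(b * block_len, (b + n) * block_len);
    }
}

One fixed-size table now suffices for any file size; the cost is a hard
bound on how far data can migrate and still be matched.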

I like the better performance, but I'm not entirely happy with a fixed
upper limit on the distance that data can migrate and still be matched
by the delta-transfer algorithm: if someone copies an image of an
entire hard disk and rearranges the partitions within the disk, rsync
will needlessly retransmit all of the partition data.  An alternative
would be to use several different block sizes, spaced by a factor of
16 or so, and keep a separate hash table for each.  Each hash table
would hold checksums for a sliding window of 8/10*TABLESIZE blocks
around the current position.  This way, small blocks could be matched
across small distances without overloading any hash table, and large
blocks could still be matched across large distances.
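
Concretely, I'm picturing something like this (again a toy sketch with
made-up names and numbers; the windows only slide forward here, which
matches how the sender scans the file):

#include <stdio.h>
#include <stdint.h>

#define NLEVELS   4
#define BASE_LEN  700LL                /* base block length in bytes */
#define TABLESIZE 65536
#define WINDOW    (TABLESIZE * 8 / 10) /* blocks hashed per level */

struct level {
    int64_t block_len;  /* BASE_LEN * 16^i */
    int64_t lo, hi;     /* half-open range of hashed block indexes */
};

/* Stand-ins for per-level hash table maintenance. */
static void hash_insert(int lvl, int64_t blk) { (void)lvl; (void)blk; }
static void hash_remove(int lvl, int64_t blk) { (void)lvl; (void)blk; }

/* Re-centre every level's window of hashed blocks on the current file
 * offset, evicting old checksums and inserting new ones. */
static void slide_windows(struct level *lv, int64_t offset)
{
    for (int i = 0; i < NLEVELS; i++) {
        int64_t cur = offset / lv[i].block_len;
        int64_t lo = cur > WINDOW / 2 ? cur - WINDOW / 2 : 0;
        int64_t hi = lo + WINDOW;

        if (lv[i].hi <= lo) {
            while (lv[i].lo < lv[i].hi)
                hash_remove(i, lv[i].lo++); /* flush the stale window */
            lv[i].lo = lv[i].hi = lo;       /* jumped past the old one */
        } else {
            while (lv[i].lo < lo)
                hash_remove(i, lv[i].lo++); /* evict blocks left behind */
        }
        while (lv[i].hi < hi)
            hash_insert(i, lv[i].hi++);     /* hash newly covered blocks */
    }
}

int main(void)
{
    struct level lv[NLEVELS];
    int64_t mult = 1;
    for (int i = 0; i < NLEVELS; i++, mult *= 16) {
        lv[i].block_len = BASE_LEN * mult;
        lv[i].lo = lv[i].hi = 0;
    }
    slide_windows(lv, 0);            /* hash the initial windows */
    slide_windows(lv, 100LL << 30);  /* then slide to 100 GB into the file */
    for (int i = 0; i < NLEVELS; i++)
        printf("level %d: blocks %lld..%lld of %lld bytes\n", i,
               (long long)lv[i].lo, (long long)lv[i].hi,
               (long long)lv[i].block_len);
    return 0;
}

With these numbers each level hashes about 52K blocks at a time, so
the level-0 window covers only ~37 MB, but the level-3 window (2.8 MB
blocks) covers roughly 150 GB.  A partition moved across a disk image
would still match at the coarse granularities even after the
fine-grained tables have slid past its old location.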

Matt

