rsync performance on large files strongly depends on the files' (dis)similarity

Thomas Knauth thomas.knauth at gmx.de
Fri Apr 11 05:35:44 MDT 2014


Hi list,

I've found this post on rsync's expected performance for large files:

https://lists.samba.org/archive/rsync/2007-January/017033.html

I have a related but different observation to share: with files in the
multi-gigabyte range, I've noticed that rsync's runtime also depends
on how much the source and destination diverge, i.e., synchronization
is faster if the files are similar. However, this is not just because
less data must be transferred.

For example, on an 8 GiB file with 10% updates, rsync takes 390
seconds. With 50% updates, it takes about 1400 seconds, and at 90%
updates about 2400 seconds.
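
For reference, here is a minimal sketch of how test files like these
could be generated. The 1 MiB block size, the random update pattern,
and the file names are assumptions for illustration, not necessarily
the exact setup behind the numbers above:

    # Hypothetical helper: create an 8 GiB source file, then overwrite a
    # given fraction of randomly chosen blocks in a copy of it.
    import os
    import random
    import shutil

    BLOCK = 1 << 20          # 1 MiB update granularity (assumption)
    SIZE = 8 * (1 << 30)     # 8 GiB source file, as in the measurements

    def make_source(path):
        with open(path, "wb") as f:
            remaining = SIZE
            while remaining:
                chunk = min(remaining, BLOCK)
                f.write(os.urandom(chunk))
                remaining -= chunk

    def make_updated_copy(src, dst, update_fraction):
        # Copy the source, then replace update_fraction of its blocks
        # with fresh random data.
        shutil.copyfile(src, dst)
        nblocks = SIZE // BLOCK
        changed = random.sample(range(nblocks), int(nblocks * update_fraction))
        with open(dst, "r+b") as f:
            for b in changed:
                f.seek(b * BLOCK)
                f.write(os.urandom(BLOCK))

    if __name__ == "__main__":
        make_source("src.bin")
        make_updated_copy("src.bin", "dst_10pct.bin", 0.10)  # "10% updates"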

My current explanation, and it would be great if someone more
knowledgeable than me could confirm it, is this: with very large
files, we'd expect a certain rate of false alarms, i.e., cases where
the weak checksum matches but the strong checksum does not. With
large files that are very similar, a weak match is likely to be
confirmed by a matching strong checksum. Conversely, with large files
that are very dissimilar, a weak match is much less likely to be
confirmed by the strong checksum, exactly because the files differ so
much. rsync ends up computing lots of strong checksums that never
produce a match, and that extra hashing adds to the runtime.
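
To make the mechanism concrete, here is a minimal sketch (in Python,
not rsync's actual C code) of the sender-side matching loop: the weak
rolling checksum only nominates candidate blocks, and every weak hit
must be confirmed by a strong checksum over the full block. The fixed
block size and the Adler-style/MD5 checksum pair are simplifications
for illustration:

    # Simplified delta matching: weak checksum selects candidates, strong
    # checksum confirms them. A weak hit whose strong checksum fails is a
    # "false alarm" that costs a full strong-checksum computation.
    import hashlib
    from collections import defaultdict

    BLOCK = 700  # rsync derives the block size from the file size; fixed here

    def weak(data):
        # Adler-style weak checksum (computed from scratch here; rsync
        # updates it incrementally as the window rolls).
        a = sum(data) & 0xFFFF
        b = sum((len(data) - i) * x for i, x in enumerate(data)) & 0xFFFF
        return (b << 16) | a

    def strong(data):
        return hashlib.md5(data).digest()

    def match(old, new):
        # "Receiver" side: checksum every block of the old file.
        table = defaultdict(list)
        for off in range(0, len(old), BLOCK):
            blk = old[off:off + BLOCK]
            table[weak(blk)].append(strong(blk))

        hits = false_alarms = 0
        i = 0
        # "Sender" side: roll a window over the new file.
        while i + BLOCK <= len(new):
            window = new[i:i + BLOCK]
            w = weak(window)
            if w in table:
                s = strong(window)            # strong checksum computed here
                if s in table[w]:
                    hits += 1
                    i += BLOCK                # confirmed: skip a whole block
                    continue
                false_alarms += 1             # weak hit, strong mismatch
            i += 1                            # no confirmed match: roll by one byte
        return hits, false_alarms

Counting hits versus false_alarms for source/destination pairs of
varying similarity should show the relationship described above: the
more the files diverge, the more strong checksums are computed
without ever producing a match.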

Is this a valid/reasonable explanation? Can someone else confirm this
relationship between rsync's computational overhead and the files'
(dis)similarity?

Thanks,
Thomas.
