rsync performance on large files strongly depends on file's (dis)similarity

Thomas Knauth thomas.knauth at
Fri Apr 11 07:09:35 MDT 2014

Maybe an alternative explanation is that a high degree of similarity
allows the sender to skip more bytes. For each matched block, the
sender does not need to compute any checksums, weak or strong, for
the next S bytes, where S is the block size.

As the number of matched blocks decreases, i.e., as dissimilarity
increases, the number of computed checksums grows. This relationship
is especially apparent for large files, where many strong (and
expensive) checksums must be computed due to the many false alarms.

On Fri, Apr 11, 2014 at 1:35 PM, Thomas Knauth <thomas.knauth at> wrote:
> Hi list,
> I've found this post on rsync's expected performance for large files:
> I have a related but different observation to share: with files in the
> multi-gigabyte-range, I've noticed that rsync's runtime also depends
> on how much the source/destination diverge, i.e., synchronization is
> faster if the files are similar. However, this is not just because
> less data must be transferred.
> For example, on an 8 GiB file with 10% updates, rsync takes 390
> seconds. With 50% updates, it takes about 1400 seconds, and at 90%
> updates about 2400 seconds.
> My current explanation, and it would be awesome if someone more
> knowledgeable than me could confirm, is this: with very large files,
> we'd expect a certain level of false alarms, i.e., weak checksum
> matches, but strong checksum does not. However, with large files that
> are very similar, a weak match is much more likely to be confirmed
> with a matching strong checksum. Conversely, with large files that are
> very dissimilar a weak match is much less likely to be confirmed with
> a strong checksum, exactly because the files are very different from
> each other. rsync ends up computing lots of strong checksums, which do
> not result in a match.
> Is this a valid/reasonable explanation? Can someone else confirm this
> relationship between rsync's computational overhead and the file's
> (dis)similarity?
> Thanks,
> Thomas.
