rsync algorithm for large files

Carlos Carvalho carlos at fisica.ufpr.br
Fri Sep 4 20:57:16 MDT 2009


Matthias Schniedermeyer (ms at citd.de) wrote on 5 September 2009 00:34:
 >On 04.09.2009 18:00, eharvey at lyricsemiconductors.com wrote:
 >> 
 >> Why does it take longer the 3rd time I run it?  Shouldn't the performance
 >> always be **at least** as good as the initial sync?
 >
 >Not per se.
 >
 >First you have to determine THAT the file has changed, then the file is 
 >synced if there was a change. At least that's what you have to do when 
 >the file-size is unchanged and only the timestamp differs.
 >(Which is unfortunately often the case for Virtual Machine Images)
 >
 >Worst case: Takes double the time if the change is at the end of the file.

No, rsync assumes that the file has changed if either the size or the
timestamp differs, and syncs it immediately.
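
A rough sketch of that decision in Python, just to make the point that
no file contents are read at this stage. This is not rsync's code;
needs_transfer and its modify_window parameter are names I made up for
illustration, and comparing whole seconds is an assumption.

```python
import os

def needs_transfer(src_path, dst_path, modify_window=0):
    """Hypothetical helper: decide from size and mtime alone, never from contents."""
    try:
        dst = os.stat(dst_path)
    except FileNotFoundError:
        return True  # nothing on the destination yet
    src = os.stat(src_path)
    if src.st_size != dst.st_size:
        return True
    # Assumption: timestamps are compared at whole-second granularity.
    return abs(int(src.st_mtime) - int(dst.st_mtime)) > modify_window
```

Note that two files with the same size and timestamp but different
contents are treated as identical by a check like this.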

For a new file transfer it's read once on the source and written once
on the destination. For an update it's still read once on the source
but read twice and written once on the destination, no matter how
numerous or extensive the changes are. The source also has to do the
sliding checksumming. This is usually faster than reading the file, so
it'll only slow down the process if the source is very slow or the CPU
is busy with other tasks. OTOH, the I/O on the destination is
significantly higher for big files; this is often the cause of a
slower transfer rate than a full copy.
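
To put numbers on the accounting above, here is a small Python sketch.
It only restates the read/write counts from this paragraph; io_bytes
and the 20 GiB file size are made up for the example.

```python
def io_bytes(file_size, update):
    """Disk I/O per side, following the counts described above.

    update=False: new file (read once on source, written once on destination).
    update=True:  existing file (destination also reads the old file twice).
    """
    return {
        "source_read": file_size,
        "dest_read": 2 * file_size if update else 0,
        "dest_write": file_size,
    }

GIB = 1024 ** 3
size = 20 * GIB  # hypothetical VM image size

for label, update in (("new file", False), ("update", True)):
    io = io_bytes(size, update)
    dest_total = io["dest_read"] + io["dest_write"]
    print(f"{label}: destination moves {dest_total // GIB} GiB through its disks")
```

So for an update the destination does three times the disk I/O of a
plain copy, whatever the amount of data sent over the wire.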

 >There are also some other options that may or may not have a speed 
 >impact for you:
 >--inplace, so that rsync doesn't create a tmp-copy that is later moved over 
 >the previous file on the target-site.

Yes, this is useful because it avoids both a second reading and the
full write on the destination (in principle; I didn't bother to check
the actual implementation). For large files with small changes this
option is probably the best. The problem is that if the update aborts
for any reason the destination file is left partly updated and you
lose your backup. One might want to keep at least two days of backups
in this case.
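
The difference in failure behaviour can be shown with a toy example in
Python. This is not how rsync implements either mode (as I said, I
didn't check); update_via_temp and update_in_place are hypothetical
helpers that only illustrate temp-copy-then-rename versus overwriting
the existing file.

```python
import os
import tempfile

def update_via_temp(path, new_chunks):
    """Write a temporary copy next to the file, then rename it over the old one."""
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, "wb") as out:
            for chunk in new_chunks:
                out.write(chunk)
        os.replace(tmp, path)  # the old file is only replaced at the very end
    except BaseException:
        os.unlink(tmp)  # aborted: drop the partial copy, old file untouched
        raise

def update_in_place(path, new_chunks):
    """Overwrite the existing file directly; an abort leaves it half-updated."""
    with open(path, "r+b") as out:
        for chunk in new_chunks:
            out.write(chunk)
        out.truncate()
```

If the stream of chunks dies halfway, the first variant still has the
old file intact while the second is left with a mix of new and old
data, which is why I'd keep an older backup around with --inplace.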

 >--whole-file, so that rsync doesn't use delta-transfer but rather copies 
 >the whole file.

Yes, but it causes a lot of network traffic. He mentions an average
transfer rate of about 11MB/s, so for a 100Mb/s network (which carries
at most 12.5MB/s) --whole-file is probably not suitable. If however he
has a free gigabit link it'll be the best choice if --inplace is not
acceptable.
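
A back-of-the-envelope calculation in Python, ignoring protocol
overhead and assuming the link is the only bottleneck. The 20 GB file
size is my own example, not a figure from the original post.

```python
def whole_file_seconds(file_bytes, link_megabits_per_s):
    """Best-case time to push the whole file over the link (no overhead)."""
    bytes_per_s = link_megabits_per_s * 1_000_000 / 8
    return file_bytes / bytes_per_s

file_bytes = 20_000_000_000  # hypothetical 20 GB image

for link in (100, 1000):
    minutes = whole_file_seconds(file_bytes, link) / 60
    print(f"{link} Mb/s link: about {minutes:.1f} minutes")
```

On the 100Mb/s link the whole file can never go faster than the
11MB/s or so he already gets, while gigabit leaves plenty of room.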

 >Also you may want to separate the small from the large files with:
 >--min-size
 >--max-size
 >So you can use different options for the small/large file(s).

Agreed.
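
Something like the following, sketched in Python so that nothing is
actually executed. The 1G threshold, the paths and the choice of
--inplace for the big files are only an example; split_runs is a
made-up helper that builds the two command lines.

```python
def split_runs(src, dst, threshold="1G"):
    """Build two rsync command lines (argv lists); nothing is run here."""
    # First pass: everything up to the threshold, default delta-transfer.
    small = ["rsync", "-a", f"--max-size={threshold}", src, dst]
    # Second pass: only the big files, updated in place.
    large = ["rsync", "-a", "--inplace", f"--min-size={threshold}", src, dst]
    return small, large

# Hypothetical paths, for illustration only.
small_cmd, large_cmd = split_runs("/vm/images/", "backup:/vm/images/")
print(" ".join(small_cmd))
print(" ".join(large_cmd))
```

As far as I read the man page, a file of exactly the threshold size
matches both passes, which is harmless since the second pass finds it
already up to date.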

I'd also suggest using rsync v3 because it limits the blocksize.
Previous versions will use quite a large block for big files, and if
changes are scattered they'll transfer much more than v3.
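
To see why the block size matters, here is a simplified model in
Python. The assumptions are mine and should be checked against the
source of your version: block size of roughly the square root of the
file size, a 700-byte minimum, and a 128 KiB cap standing in for v3's
limit. It is not rsync's exact heuristic.

```python
import math

ASSUMED_V3_CAP = 128 * 1024  # assumption, not taken from this thread

def block_size(file_bytes, cap=None):
    """Roughly sqrt(file size), with an assumed 700-byte floor and optional cap."""
    size = max(700, math.isqrt(file_bytes))
    return min(size, cap) if cap else size

def bytes_resent(file_bytes, scattered_changes, cap=None):
    """Worst case: each change dirties a different block, resent in full."""
    return scattered_changes * block_size(file_bytes, cap)

size = 100 * 1024 ** 3  # hypothetical 100 GiB image
for label, cap in (("uncapped", None), ("capped", ASSUMED_V3_CAP)):
    sent = bytes_resent(size, 1000, cap)
    print(f"{label}: block {block_size(size, cap)} bytes, {sent} bytes resent")
```

Under these assumptions the uncapped block is 2.5 times larger, so the
same 1000 scattered changes cost 2.5 times the traffic.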

