rsync speedup - how ?

Carlos Carvalho carlos at fisica.ufpr.br
Sat Aug 8 18:57:59 MDT 2009


devzero at web.de (devzero at web.de) wrote on 6 August 2009 20:15:
 >i`m using rsync to sync large virtual machine files from one esx server to another. 
 >the speed is "reasonable", but i guess it`s not the optimum - at least i don't know where the bottleneck is.

That's vague and subjective, so it's difficult to answer.

 >i read that rsync would be not very efficient with ultra-large files (i`m syncing files with up to 80gb size)

The larger the file, the longer it takes to locate the matches and
differences, but the more transfer you save. It also depends on the
rsync version you're using; recent versions are better at it.
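To give a rough idea of the scaling (only a sketch: I'm assuming the
block size is chosen near the square root of the file size and clamped
to a range, which is roughly what recent versions do, though the exact
rounding and limits vary between versions):

    import math

    def estimate_blocks(file_size, min_block=700, max_block=128 * 1024):
        # Assumed heuristic: block size ~ sqrt(file size), clamped;
        # real rsync rounds and limits this differently per version.
        block = min(max(int(math.sqrt(file_size)), min_block), max_block)
        return block, -(-file_size // block)  # ceiling division

    for size in (1 << 30, 80 << 30):  # 1 GiB and 80 GiB
        block, nblocks = estimate_blocks(size)
        print(f"{size >> 30} GiB -> {block}-byte blocks, ~{nblocks} checksums")

So an 80 GB file does not mean 80x more checksum work per byte; the
block size grows with the file, and the checksum list stays manageable.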

 >regarding the bottleneck:  neither cpu, network or disk is at their limits - neither on the source nor on the destination system.
 >i don`t see 100% cpu, i don`t see 100% network or 100% disk i/o usage

There may be other factors interfering. To see how fast rsync really
is you need to compare it with another transfer program. To see the
effect of the delta-transfer overhead you can use the -W option.
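For example (paths are hypothetical), time the same copy with and
without the delta algorithm:

    rsync -av  /vmfs/volumes/ds1/bigvm-flat.vmdk otherhost:/vmfs/volumes/ds1/
    rsync -avW /vmfs/volumes/ds1/bigvm-flat.vmdk otherhost:/vmfs/volumes/ds1/

-W (--whole-file) skips the rolling-checksum machinery entirely, so the
difference between the two runs is roughly the cost of the delta
algorithm on your data.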

devzero at web.de (devzero at web.de) wrote on 7 August 2009 18:44:
 >so, the question is: is rsync rolling checksum algorithm the perfect
 >(i.e. fastest) algorithm to match changed blocks at fixed locations
 >between source and destination files ?

No. If you know where the differences are you can optimize for them:
that would avoid discovering the differences and matches, avoid
reading the identical portions on the sender, and avoid reading them
twice on the destination. Updating in place would even avoid reading
them at all. If you know that much about the files, and they are so
big and the differences so small, why don't you just sync the variable
portions and merge them?
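For instance, if you knew the guest only touched one region of the
image, something as simple as this sketch would copy just that region
and patch it into the destination file in place (paths and offsets are
made up, and it assumes both copies are reachable from one machine,
say over NFS; over the network you would ship just that region):

    def patch_region(src_path, dst_path, offset, length, chunk=1 << 20):
        # Copy only the bytes in [offset, offset+length) from the source
        # image into the same position of the destination image.
        with open(src_path, "rb") as src, open(dst_path, "r+b") as dst:
            src.seek(offset)
            dst.seek(offset)
            remaining = length
            while remaining > 0:
                data = src.read(min(chunk, remaining))
                if not data:
                    break
                dst.write(data)
                remaining -= len(data)

    # e.g. patch_region("/vm/bigvm.vmdk", "/backup/bigvm.vmdk",
    #                   10 * 2**30, 512 * 2**20)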

 >but what i`m unsure about is, if rsync isn`t doing too much work
 >with detecting the differences. it doesn`t need to "look forth and
 >back" (as i read somewhere it would) ,

It doesn't. Everything is determined in a single pass over the file,
at both ends.
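The trick that makes the single pass cheap is the rolling property of
the weak checksum: sliding the window one byte forward costs only a
couple of additions. Roughly like this sketch (a simplified version of
the checksum described in the rsync technical report, not the exact
code rsync uses):

    M = 1 << 16

    def weak_checksum(block):
        # Two running sums; b weights earlier bytes more heavily,
        # which is what makes rolling possible.
        a = b = 0
        n = len(block)
        for i, x in enumerate(block):
            a = (a + x) % M
            b = (b + (n - i) * x) % M
        return a, b

    def roll(a, b, old_byte, new_byte, n):
        # Slide an n-byte window one position: drop old_byte, add new_byte.
        a = (a - old_byte + new_byte) % M
        b = (b - n * old_byte + a) % M
        return a, b

    # Sanity check of the rolling identity:
    # weak_checksum(data[1:n+1]) == roll(*weak_checksum(data[0:n]),
    #                                    data[0], data[n], n)

So the sender walks forward once, rolling the window byte by byte, and
never needs to back up.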

 >> > besides that, for transferring complete files i know faster methods than rsync.
...
 >here is some example: http://communities.vmware.com/thread/29721

All of this can be done with rsync too (with the -W option). On the
destination side I think it's unlikely that anything goes faster; all
programs should behave similarly. On the source side it's possible to
go faster if the sender uses sendfile() (or its equivalent on other
operating systems).
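A minimal sketch of the sendfile() idea (Python's os.sendfile wraps
the system call; the function name, socket setup and path here are
just for illustration). The point is that the kernel moves the file
data straight to the socket without copying it through user space:

    import os
    import socket

    def send_whole_file(path, sock):
        # Push the file to an already-connected socket using sendfile(),
        # so the data never passes through a user-space buffer.
        with open(path, "rb") as f:
            size = os.fstat(f.fileno()).st_size
            offset = 0
            while offset < size:
                sent = os.sendfile(sock.fileno(), f.fileno(), offset,
                                   size - offset)
                if sent == 0:
                    break
                offset += sent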

 >> Assuming the rsync algorithm works correctly, I don't
 >> see any difference between the end result of copying
 >> a 100gb file with the rsync algorithm or without it.
 >> The only difference is the amount of disk and network
 >> I/O that must occur.
 >
 >the rsync algorithm is using checksumming to find differences.
 >checksums are sort of "data reduction" which create a hash from
 >a larger amount of data. i just want to understand what makes
 >sure that there are no hash collisions which break the algorithm.

There are several checksums. The ones used to find differences are
weak, but at the end rsync checks the md5sum of the whole file, and if
the sums don't match it transfers the file once more (tweaking things
so the same mismatch cannot happen again). So the probability of
failure is that of two different files accidentally having the same
md5sum, which is 2^(-128).
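Schematically, the receiver's per-block signatures look roughly like
this (a sketch, using zlib.adler32 as a stand-in for rsync's own weak
rolling sum): a weak match is only a hint, the per-block strong hash
has to confirm it, and the whole-file checksum at the end catches
anything that still slips through.

    import hashlib
    import zlib

    def block_signatures(data, block_size):
        # Map weak checksum -> [(offset, strong digest), ...] for every
        # block of the receiver's copy of the file.
        sigs = {}
        for off in range(0, len(data), block_size):
            block = data[off:off + block_size]
            weak = zlib.adler32(block)            # cheap, collision-prone
            strong = hashlib.md5(block).digest()  # confirms a weak match
            sigs.setdefault(weak, []).append((off, strong))
        return sigs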

 >mind that rsync exists for some time and by that time file sizes
 >transferred with rsync may have grown by a factor of 100 or 
 >even 1000.  

It used to use an md4sum. Version 3 uses md5sum, which is safer.
