Extremely poor rsync performance on very large files (near 100GB and larger)

Shachar Shemesh shachar at shemesh.biz
Fri Jan 12 12:39:23 GMT 2007


Evan Harris wrote:
> Would it make more sense just to make rsync pick a more sane blocksize
> for very large files?  I say that without knowing how rsync selects
> the blocksize, but I'm assuming that if a 65k entry hash table is
> getting overloaded, it must be using something way too small.
rsync picks a block size that is roughly the square root of the file
size. As I didn't write this code, I can safely say that it seems like a
very good compromise between block sizes that are too small (too many
hash lookups) and too large (a decreased chance of finding matches).
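Just to illustrate the idea, the selection looks roughly like the sketch
below. This is not the actual generator code; the 700-byte floor matches
rsync's documented default block size, but the 128KB cap and the rounding
are assumptions of mine for the sketch:

    #include <stdint.h>
    #include <math.h>

    /*
     * Sketch of the square-root heuristic, not rsync's actual code.
     * The block length grows with the square root of the file size,
     * so the block count (file_size / block_len) grows at the same
     * rate instead of exploding for huge files.
     */
    static int32_t pick_block_len(int64_t file_size)
    {
        int32_t blen = (int32_t)sqrt((double)file_size);

        if (blen < 700)
            blen = 700;         /* never below the default block size */
        if (blen > (1 << 17))
            blen = 1 << 17;     /* assumed cap of 128KB, for the sketch */
        return blen;
    }

Note that for a 100GB file the square root is already around 316KB,
which means on the order of 300,000 blocks - several times more than a
65536-entry hash table, which is where the overload you are seeing
comes from.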
> Should it be scaling the blocksize with a power-of-2 algorithm rather
> than the hash table (based on filesize)?
If Wayne intends to make the hash size a power of 2, selecting smaller
block sizes may start to make sense. We'll see how 3.0 comes along.
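For what it's worth, the usual appeal of a power-of-two table size is
that the bucket index becomes a bit mask instead of a modulo. A minimal
sketch, where the table size and the incoming checksum are placeholders
of mine rather than anything from rsync:

    #include <stdint.h>

    #define TABLE_BITS 17                 /* e.g. 131072 entries */
    #define TABLE_SIZE (1u << TABLE_BITS)

    static uint32_t bucket_index(uint32_t weak_checksum)
    {
        /* with a power-of-two size, "% TABLE_SIZE" reduces to a mask */
        return weak_checksum & (TABLE_SIZE - 1);
    }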
> I haven't tested to see if that would work.  Will -B accept a value of
> something large like 16meg?
It should. That's about ten times the block size you need to avoid
overflowing the hash table, though; for a 100GB file, a block size of
around 2MB would seem more appropriate to me.
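The arithmetic behind that number, as a quick check (the 65536-entry
table size is the one discussed above, and the round 100GB figure is
just the example from this thread):

    #include <stdio.h>
    #include <stdint.h>

    /*
     * Smallest block size such that a file of the given size produces
     * no more blocks than there are hash-table entries.
     */
    int main(void)
    {
        int64_t file_size = 100LL * 1000 * 1000 * 1000;  /* ~100GB */
        int64_t table_entries = 65536;                   /* 65k table */
        int64_t min_block =
            (file_size + table_entries - 1) / table_entries;

        printf("minimum block size: %lld bytes (~%.2f MB)\n",
               (long long)min_block, min_block / (1024.0 * 1024.0));
        return 0;
    }

That comes out to roughly 1.5MB, so 2MB leaves some headroom, and 16MB
is indeed about ten times more than necessary.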
>   At my data rates, that's about a half a second of network bandwidth,
> and seems entirely reasonable.
> Evan
I would just like to note that since I submitted the "large hash table"
patch, I have seen no feedback from anyone actually testing it. If you
can compile a patched rsync and report how it goes, that would be very
valuable to me.

Shachar

