Extremely poor rsync performance on very large files (near 100GB and larger)
Evan Harris
eharris at puremagic.com
Mon Jan 8 23:30:26 GMT 2007
On Mon, 8 Jan 2007, Wayne Davison wrote:
> On Mon, Jan 08, 2007 at 01:37:45AM -0600, Evan Harris wrote:
>
>> I've been playing with rsync and very large files approaching and
>> surpassing 100GB, and have found that rsync has extremely poor
>> performance on these very large files; the performance appears to
>> degrade the larger the file gets.
>
> Yes, this is caused by the current hashing algorithm that the sender
> uses to find matches for moved data. The current hash table has a fixed
> size of 65536 slots, and can get overloaded for really large files.
>...
Would it make more sense just to have rsync pick a saner blocksize for
very large files? I say that without knowing how rsync selects the
blocksize, but I'm assuming that if a 65k-entry hash table is getting
overloaded, the blocksize must be far too small. Should rsync be scaling
the blocksize (based on file size) with a power-of-2 algorithm, rather
than scaling the hash table?
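
To put rough numbers on that (just a back-of-the-envelope sketch, not
rsync's actual block-size logic, using the 65536-slot figure from
Wayne's description above):

    /* Illustrative only: for a 100GB file, how many blocks does a
     * given block size produce, and how loaded does a fixed
     * 65536-slot hash table get on average? */
    #include <stdio.h>

    int main(void)
    {
        const double slots = 65536.0;
        const double file  = 100.0 * (1LL << 30);   /* 100GB file */
        long long blocksizes[] = { 8192, 65536, 1 << 20, 16 << 20 };

        for (int i = 0; i < 4; i++) {
            double blocks = file / blocksizes[i];
            printf("block size %8lld: %12.0f blocks, %8.2f per slot\n",
                   blocksizes[i], blocks, blocks / slots);
        }
        return 0;
    }

With 16MB blocks a 100GB file breaks into only about 6400 blocks, so the
table is nowhere near overloaded; with 64KB blocks it is already around
25 entries per slot.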
I know that may mean more network traffic, since a bigger block
containing a difference is considered "changed" and must be resent in
full where a smaller block would have sufficed, but in some
circumstances wasting a little more network bandwidth may be wholly
warranted. Then the hash table size may not matter, since there are far
fewer blocks to check.
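
For example, with 16MB blocks a hundred scattered one-byte changes in a
100GB file would force roughly 1.6GB to be resent, where 64KB blocks
would have covered the same changes in well under 10MB.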
I haven't tested whether that would work. Will -B accept a value as
large as 16MB? At my data rates, that's about half a second of network
bandwidth, which seems entirely reasonable.
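
If it does, the invocation to try would be something like the following
(the paths are placeholders, and whether a value this large is accepted
or silently capped probably depends on the rsync version, so the
transfer statistics are worth checking):

    rsync -av --stats --block-size=16777216 /data/bigfile remotehost:/data/

--block-size is the long form of -B, with 16MB spelled out in bytes in
case a size suffix isn't accepted.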
Evan