Extremely poor rsync performance on very large files (near 100GB and larger)
Evan Harris
eharris at puremagic.com
Mon Jan 8 23:30:26 GMT 2007
On Mon, 8 Jan 2007, Wayne Davison wrote:
> On Mon, Jan 08, 2007 at 01:37:45AM -0600, Evan Harris wrote:
>
>> I've been playing with rsync and very large files approaching and
>> surpassing 100GB, and have found that rsync has extremely poor
>> performance on these very large files; the performance appears to
>> degrade the larger the file gets.
>
> Yes, this is caused by the current hashing algorithm that the sender
> uses to find matches for moved data. The current hash table has a fixed
> size of 65536 slots, and can get overloaded for really large files.
>...
Would it make more sense just to have rsync pick a saner blocksize for
very large files? I say that without knowing how rsync selects the
blocksize, but I'm assuming that if a 65k-entry hash table is getting
overloaded, the blocksize must be far too small. Should rsync be scaling
the blocksize (based on file size) with a power-of-2 algorithm, rather
than scaling the hash table?
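
To put rough numbers on that (just a back-of-the-envelope sketch, not
rsync's actual block-size logic, using the 65536-slot figure from
Wayne's description above):

    /* Illustrative only: for a 100GB file, how many blocks does a
     * given block size produce, and how loaded does a fixed
     * 65536-slot hash table get on average? */
    #include <stdio.h>

    int main(void)
    {
        const double slots = 65536.0;
        const double file  = 100.0 * (1LL << 30);   /* 100GB file */
        long long blocksizes[] = { 8192, 65536, 1 << 20, 16 << 20 };

        for (int i = 0; i < 4; i++) {
            double blocks = file / blocksizes[i];
            printf("block size %8lld: %12.0f blocks, %8.2f per slot\n",
                   blocksizes[i], blocks, blocks / slots);
        }
        return 0;
    }

With 16MB blocks a 100GB file breaks into only about 6400 blocks, so the
table is nowhere near overloaded; with 64KB blocks it is already around
25 entries per slot.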
I know that may mean more network traffic, since a bigger block
containing a difference is considered "changed" and must be resent in
full where a smaller block would have sufficed, but in some
circumstances wasting a little more network bandwidth may be wholly
warranted. Then the hash table size may not matter, since there are far
fewer blocks to check.
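
For example, with 16MB blocks a hundred scattered one-byte changes in a
100GB file would force roughly 1.6GB to be resent, where 64KB blocks
would have covered the same changes in well under 10MB.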
I haven't tested whether that would work. Will -B accept a value as
large as 16MB? At my data rates, that's about half a second of network
bandwidth, which seems entirely reasonable.
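
If it does, the invocation to try would be something like the following
(the paths are placeholders, and whether a value this large is accepted
or silently capped probably depends on the rsync version, so the
transfer statistics are worth checking):

    rsync -av --stats --block-size=16777216 /data/bigfile remotehost:/data/

--block-size is the long form of -B, with 16MB spelled out in bytes in
case a size suffix isn't accepted.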
Evan