Rsyncing really large files

Shachar Shemesh rsync at shemesh.biz
Tue Mar 1 07:31:06 GMT 2005


Kevin Day wrote:

> I would *strongly* recommend that you dig into the thesis a bit (just 
> the section that describes the rsync algorithm itself).

I tried a few weeks ago. I started to print it, and my printer ran out 
of ink :-). I will read it electronically eventually (I hope).

> Now, if you have huge files, then the 16 bit checksum may not be 
> sufficient to keep the hit rate down.  At that point, you may want to 
> consider a few algorithmic alternatives:
>
> 1.  Break up your file into logical blocks and rsync each block 
> individually.  If the file is an "append only" file, then this is 
> fine.  However, if the contents of the file get re-ordered across 
> block boundaries, then the efficiency of the rsync algorithm would be 
> seriously degraded.
>
> 2.  Use a larger hash table.  Instead of 16 bits, expand it to 20 bits 
> - it will require 16 times as much memory for the hash table, but that 
> may not be an issue for you - you are probably working with some 
> relatively beefy hardware anyway, so what the heck.

Now, here's the part where I don't get it. We have X blocks checksummed, 
covering Y bytes each (in fact, with the default block-size heuristic the 
block size is roughly the square root of the file size, so we have X 
blocks of X bytes each, but that's not important). This means we actually 
know, before we get the list of checksums, how many we will have.
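
In other words, the count arrives ahead of the checksums themselves, so 
the table can be sized before the first insertion. A minimal sketch in C 
of that ordering (the struct, its field names and the wire format here 
are hypothetical, purely to illustrate the point):

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical sum header: the block count precedes the
     * per-block checksums on the wire. */
    struct sum_head {
        int32_t count;    /* number of blocks (X), known up front */
        int32_t blength;  /* bytes covered by each block          */
    };

    int main(void)
    {
        /* Pretend these two integers were just read off the wire. */
        struct sum_head h = { 320, 320 };

        /* The hash table can be allocated for h.count entries
         * here, before the first of the checksums arrives. */
        printf("expecting %d checksums of %d-byte blocks\n",
               (int)h.count, (int)h.blength);
        return 0;
    }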

As for the hash table size - that's standard engineering. Alpha, the load 
factor, is defined as the ratio of the number of entries stored in the 
table to the number of buckets; around 0.8 is considered a good value.

What I can propose is to make the hash table size a function of X. If we 
take Lars' case, he has a 500GB file, which (with the square-root 
heuristic) means roughly 700,000 blocks, so you ideally need about 1 
million buckets in the hash to have reasonable performance. We only have 
65 thousand, so his alpha is around 11 instead of 0.8. No wonder he is 
getting abysmal performance.

On the other hand, if I'm syncing a 100K file, I'm only going to have 
320 blocks, so a good hash table size for me is about 400 buckets 
(320 / 0.8). Having 65536 buckets instead means my alpha is about 0.005 
rather than 0.8: the table is mostly empty and no longer fits in the CPU 
cache, so I'm less likely to have memory cache hits, and performance 
suffers again.

If my proposal is accepted, we become adaptive in the CPU/memory 
trade-off.
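
To make the proposal concrete, here is a minimal sketch in C (the names 
and the power-of-two rounding are my choices, nothing from the rsync 
sources): size the table to the smallest power of two that keeps alpha 
at or below 0.8.

    #include <stdio.h>
    #include <stddef.h>

    /* Target load factor: entries per bucket. */
    #define TARGET_ALPHA 0.8

    /* Smallest power of two >= want. A power-of-two size lets
     * the bucket index be a bit-mask instead of a division. */
    static size_t next_pow2(size_t want)
    {
        size_t n = 1;
        while (n < want)
            n <<= 1;
        return n;
    }

    /* The proposed rule: derive the table size from X, the
     * known block count, instead of hard-coding 65536 buckets. */
    static size_t buckets_for(size_t block_count)
    {
        return next_pow2((size_t)(block_count / TARGET_ALPHA) + 1);
    }

    int main(void)
    {
        printf("%zu\n", buckets_for(700000)); /* Lars: 1048576 */
        printf("%zu\n", buckets_for(320));    /* 100K file: 512 */
        return 0;
    }

For the two cases above this yields 1,048,576 and 512 buckets - the 
"about 1 million" and "about 400" figures, give or take the power-of-two 
rounding.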

>  I'll leave the exercise of converting the full rsync 32-bit rolling 
> checksum into a 20-bit value to you.

A simple modulo ought to do: if the checksum is 42891 and I have 320 
buckets, it goes into bucket 42891 mod 320 = 11. Assuming the checksums 
are sufficiently close to uniformly distributed, that is good enough.
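
A sketch of that mapping (not rsync code; 42891 and 320 are just the 
numbers from the example above):

    #include <stdio.h>
    #include <stdint.h>

    /* Map a 32-bit rolling checksum into one of nbuckets slots.
     * Plain modulo is fine as long as the checksums are spread
     * reasonably uniformly. */
    static uint32_t bucket_of(uint32_t checksum, uint32_t nbuckets)
    {
        return checksum % nbuckets;
    }

    int main(void)
    {
        printf("%u\n", bucket_of(42891, 320)); /* prints 11 */
        return 0;
    }

If the table size is a power of two, as in the sizing sketch above, 
checksum & (nbuckets - 1) gives the same bucket without the division.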

> Cheers,
>
> Kevin

       Shachar

-- 
Shachar Shemesh
Lingnu Open Source Consulting ltd.
Have you backed up today's work? http://www.lingnu.com/backup.html


