Rsyncing really large files

Shachar Shemesh rsync at
Sat Mar 5 18:07:20 GMT 2005

Wayne Davison wrote:

>On Thu, Mar 03, 2005 at 10:18:01AM +0200, Shachar Shemesh wrote:
>>And I'm suggesting making it static, by adjusting the hash table's
>>size according to the number of blocks.
>The block-size? 
Definitely not! I was talking about the hash table load. I.e. - the 
ratio between the number of blocks and the number of hash table buckets.

I.e. - after determining the number of blocks, only then decide on a 
hash table size, and work accordingly. This means you use little memory 
for small files, and more memory for big files - should be an acceptable 
trade off.
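
A minimal sketch of that sizing rule, not rsync's actual code: the function name `buckets_for` and the `load_percent` parameter are hypothetical, and the 80% default is simply the load ratio used later in this message.

```python
def buckets_for(num_blocks, load_percent=80):
    """Return a bucket count that keeps the table at or below the target load.

    load_percent is the target ratio of blocks to buckets, in percent.
    Ceiling division on integers avoids floating-point rounding.
    """
    if load_percent <= 0:
        raise ValueError("load_percent must be positive")
    # -(-a // b) is ceiling division for positive integers.
    return max(1, -(-num_blocks * 100 // load_percent))


# Small files get small tables, big files get big ones.
for blocks in (100, 10_000, 740_000):
    print(blocks, "blocks ->", buckets_for(blocks), "buckets")
```

The table is sized once, after the block count is known, so the load stays fixed no matter how large the file is.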

>Since it only needs to note a
>found/not-found state, the table can be a single bit per node, and a
>19-bit lookup only needs 64k of memory.
But that only works if the checksum function and the hash table are 
exactly the same size. Also, you still need to store the verify value 
somewhere, and efficiently find it. I'm not sure that's optimal.
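
For reference, the one-bit-per-node table Wayne describes could look roughly like the sketch below. The class name `WeakBitmap` and the choice of the low 19 bits of the weak checksum as the index are assumptions made for illustration only. Note that it answers only "definitely absent" or "possibly present"; the verify value still has to live in a separate structure.

```python
class WeakBitmap:
    """One bit per index value: records only found / not-found."""

    def __init__(self, index_bits=19):
        self.mask = (1 << index_bits) - 1
        # 2**19 bits packed 8 per byte = 65536 bytes (64KB).
        self.bits = bytearray((1 << index_bits) // 8)

    def _slot(self, weak_sum):
        # Assumption: the low index_bits of the weak checksum pick the bit.
        idx = weak_sum & self.mask
        return idx >> 3, 1 << (idx & 7)

    def add(self, weak_sum):
        byte, bit = self._slot(weak_sum)
        self.bits[byte] |= bit

    def might_contain(self, weak_sum):
        # False means definitely absent; True may be a false positive.
        byte, bit = self._slot(weak_sum)
        return bool(self.bits[byte] & bit)


bm = WeakBitmap()
bm.add(0x12345678)
print(len(bm.bits), bm.might_contain(0x12345678), bm.might_contain(0x12345679))
```

Two weak checksums that differ only above bit 18 share a bit, which is why a positive answer still needs the full lookup.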

If we take a 500GB file, as is Lars' case, and assuming we don't touch 
the block size (i.e. - we use the default, which gives about 740K blocks 
of about 740KB each), we will need about 900 thousand buckets in the 
hash table at an alpha (load) ratio of 80%, which means about 4MB in 
pointers. I hardly think this is enough memory consumption (for 
efficiently transferring a 500GB file) to justify further complicated 
bit operations.

(On the flip side, 64KB fits into the CPU's data cache, while 4MB usually 
will not. I'm not sure how crucial that is going to turn out to be.)
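
The two memory figures can be checked with a few lines of arithmetic. This is only a sketch: the 4-byte pointer size is an assumption (it is what the 4MB estimate implies), and the function names are hypothetical.

```python
POINTER_BYTES = 4  # assumption: 32-bit pointers, as the 4MB estimate implies


def pointer_table_bytes(num_blocks, load_percent=80, pointer_bytes=POINTER_BYTES):
    """Memory for a bucket array of pointers at the given load ratio."""
    buckets = -(-num_blocks * 100 // load_percent)  # ceiling division
    return buckets * pointer_bytes


def bitmap_bytes(index_bits):
    """Memory for a table holding one bit per possible index value."""
    return (1 << index_bits) // 8


print(pointer_table_bytes(740_000))  # 3700000 bytes, i.e. roughly 4MB
print(bitmap_bytes(19))              # 65536 bytes, i.e. 64KB
```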

>  This allows a rapid yes/no
>pre-check for the weak value before we look-up the actual strong
>checksum value in the hash table and should result in less searching
>for values that aren't there.
But how will you find it there? If you are going to have 740K blocks 
(i.e. - 740,000 strong hashes) in a 16-bit hash table, you are going to 
have lots of collisions there (about 11 per bucket, on average), and you 
gained nothing.
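
The average chain length is just the number of entries divided by the number of buckets. A small sketch for checking such estimates (the function name is hypothetical): 740,000 entries in 2^16 buckets works out to roughly 11 entries per bucket.

```python
def average_chain_length(num_entries, index_bits):
    """Mean number of entries per bucket in a table with 2**index_bits buckets."""
    return num_entries / (1 << index_bits)


# Fixed 16-bit table versus a table with four more index bits.
print(round(average_chain_length(740_000, 16), 1))  # 11.3
print(round(average_chain_length(740_000, 20), 2))  # 0.71
```

With a fixed 16-bit table, every hit on the pre-check still leaves a chain of about eleven candidates to walk, whereas a table sized from the block count keeps the average below one.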


Shachar Shemesh
Lingnu Open Source Consulting ltd.
Have you backed up today's work?
