Rsyncing really large files
Shachar Shemesh
rsync at shemesh.biz
Sat Mar 5 18:07:20 GMT 2005
Wayne Davison wrote:
>On Thu, Mar 03, 2005 at 10:18:01AM +0200, Shachar Shemesh wrote:
>
>
>>And I'm suggesting making it static, by adjusting the hash table's
>>size according to the number of blocks.
>>
>>
>
>The block-size?
>
>
Definitely not! I was talking about the hash table load, i.e. the
ratio between the number of blocks and the number of hash table buckets.
In other words: determine the number of blocks first, and only then
decide on a hash table size and work accordingly. This means you use
little memory for small files and more memory for big files, which
should be an acceptable trade-off.
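
A minimal sketch of that sizing rule, in Python for brevity. The name
`buckets_for` and the 80% default load are illustrative assumptions, not
anything taken from rsync's source:

```python
def buckets_for(num_blocks, load_percent=80):
    """Smallest bucket count that keeps num_blocks / buckets at or
    below load_percent / 100. Integer math avoids float rounding."""
    if num_blocks <= 0:
        return 1  # keep at least one bucket for an empty file
    # Ceiling division: ceil(num_blocks * 100 / load_percent)
    return (num_blocks * 100 + load_percent - 1) // load_percent

# The table grows with the file: small files stay cheap.
for blocks in (10, 1_000, 740_000):
    print(blocks, "blocks ->", buckets_for(blocks), "buckets")
```

The table size is fixed once per file, after the block count is known,
so no resizing is needed while the file is being scanned.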
>Since it only needs to note a
>found/not-found state, the table can be a single bit per node, and a
>19-bit lookup only needs 64k of memory.
>
But that only works if the checksum function and the hash table are
exactly the same size. Also, you still need to store the verify value
somewhere, and efficiently find it. I'm not sure that's optimal.
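
To make the objection concrete, here is a hedged sketch of such a
one-bit-per-value pre-check. Folding the weak checksum to 19 bits by
masking is an assumption made for illustration, as are the function
names:

```python
BITS = 19                               # index width from the quoted suggestion
bitmap = bytearray((1 << BITS) // 8)    # 2**19 bits = 65536 bytes

def index_of(weak_sum):
    # Assumption: reduce the weak checksum to BITS bits by masking.
    return weak_sum & ((1 << BITS) - 1)

def mark(weak_sum):
    """Record that some block has this (folded) weak checksum."""
    i = index_of(weak_sum)
    bitmap[i >> 3] |= 1 << (i & 7)

def maybe_present(weak_sum):
    """False means definitely absent; True only means 'go and look'."""
    i = index_of(weak_sum)
    return bool(bitmap[i >> 3] & (1 << (i & 7)))

mark(0x12345678)
print(maybe_present(0x12345678))               # marked value
print(maybe_present(0x12345678 ^ (1 << 20)))   # differs only above bit 19
```

A set bit says nothing about which block matched, so the verify value
still has to live in a separate structure that can be searched quickly.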
If we take a 500GB file, as is Lars' case, and assuming we don't touch
the block size (i.e. we use the default, which gives roughly 740K
blocks of roughly 740K bytes each), we will need about 900 thousand
buckets in the hash table at a load factor (alpha) of 80%, which means
about 4MB in pointers. I hardly think this is enough memory consumption
(for efficiently transferring a 500GB file) to justify further
complicated bit operations.
(On the flip side, 64KB fits into the CPU's data cache, while 4MB
usually will not. I'm not sure how crucial that is going to turn out to
be.)
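
The arithmetic behind these figures can be checked with a few lines of
Python. The block count and load factor come from the paragraph above;
the 4-byte pointer size is an assumption (32-bit pointers):

```python
num_blocks = 740_000       # block count quoted for the 500GB file
load_percent = 80          # alpha = 0.8
pointer_bytes = 4          # assumption: 32-bit pointers

# Buckets needed to stay at or below the target load (ceiling division).
buckets = (num_blocks * 100 + load_percent - 1) // load_percent
table_bytes = buckets * pointer_bytes

# The single-bit table from the quoted suggestion, for comparison.
bitmap_bytes = (1 << 19) // 8

print("buckets:", buckets)
print("pointer table: %.2f MiB" % (table_bytes / 2**20))
print("19-bit bitmap: %d KiB" % (bitmap_bytes // 1024))
```

With 8-byte pointers the table doubles, which makes the cache argument
in the parenthetical above somewhat stronger.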
> This allows a rapid yes/no
>pre-check for the weak value before we look-up the actual strong
>checksum value in the hash table and should result in less searching
>for values that aren't there.
>
But how will you find it there? If you are going to have 740K blocks
(i.e. 740,000 strong hashes) in a 16-bit hash table, you are going to
have lots of collisions there (about 11 per bucket, on average), and
you have gained nothing.
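
The average chain length follows directly from the block count and the
table size. A quick check, using only the figures already given (the
925,000-bucket table is the 80% load case discussed earlier):

```python
num_blocks = 740_000

buckets_16bit = 1 << 16        # fixed 16-bit table
buckets_sized = 925_000        # table sized for an 80% load

avg_chain_16bit = num_blocks / buckets_16bit
avg_chain_sized = num_blocks / buckets_sized

print("16-bit table: %.1f entries per bucket" % avg_chain_16bit)
print("sized table:  %.1f entries per bucket" % avg_chain_sized)
```

Every weak-checksum hit that passes a pre-check would still walk a chain
of that average length in the fixed 16-bit table.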
>..wayne..
>
>
Shachar
--
Shachar Shemesh
Lingnu Open Source Consulting ltd.
Have you backed up today's work? http://www.lingnu.com/backup.html