Rsyncing really large files

Shachar Shemesh rsync at shemesh.biz
Sat Mar 5 12:18:01 GMT 2005


Kevin Day wrote:

> As a quick FYI, the block size absolutely has an impact on the 
> effectiveness of the checksum table - larger blocks means fewer 
> blocks, which means fewer hash colissions.

Since you wouldn't expect that many blocks in the file, a 32bit weak 
checksum would only produce about 4 or 5 real collisions. Mind you, this 
is me doing educated guesses. I haven't worked out the actual math yet. 
I don't think we need to worry about this particular problem just yet. 
Hash table collisions, however, are much more likely, which is what I'm 
trying to solve here.  

> That said, however, I completely agree that for very large files, the 
> number of buckets in the current implementation is not optimal.  
> Perhaps having a mode for really large files would be appropriate.

I don't see why such a mode would be necessary.

> One caution on increasing the size of the hash:  The current 
> implementation gives 16 bits of spread, so modding that value with the 
> desired bucket count would work fine.

That's not what I read into it. It seems to me that the checksum 
function gives a 32bit result, and we are squashing that into a 16bit 
hash table. Can you point me to the code? Wayne?

>   However, if you choose to start with the 32 bit rolling hash and mod 
> that, you will have problems.  The rolling checksum has two distinct 
> parts, and modding will only pull info from the low order bits,

Why? This may be something I missed within the code.

> which will likely get you considerably less than even the 16 bits that 
> the current implementation gives.

If the source is 16 bit, doing any hash table size bigger than 65536 
buckets would make no sense, true. Is it 16bit?

> I'd recommend using a real 32 bit hashing function to mix the two 
> rolling checksum components,

What two parts? If I understand rsync correctly, we have a rolling 
checksum, and a real checksum. The rolling checksum is used to single 
out potential matches, and the real checksum makes sure these are, 
indeed, real matches. We only need to put the first one into the hash, 
as we are never doing mass-lookups on the second.

Am I missing something basic here?

          Shachar

-- 
Shachar Shemesh
Lingnu Open Source Consulting ltd.
Have you backed up today's work? http://www.lingnu.com/backup.html



More information about the rsync mailing list