Rsyncing really large files
Shachar Shemesh
rsync at shemesh.biz
Thu Mar 10 06:31:57 GMT 2005
Kevin Day wrote:
> Shachar-
>
> I think Wayne is mostly pointing you to the correct location here. If
> you look at the code where the checksum is computer, you'll find that
> the rolling checksum actually consists of two parts - one is the sum
> of all of the byte values in the window, the other is the offset
> weighted sum of all of the byte values in the window. The first of
> these is then left shifted 16 bits and added to the other to come up
> with the official "32 bit" rolling checksum. This works fine as long
> as you aren't counting on a random distribution of bits among the 32 -
> if you mod the value, you are giving much greater importance to the
> lower XX bits, effectively dropping the distribution of the high order
> bits...
>
> Anyway, the two 16 bit values may be random enough that my concern is
> not founded, but it should be tested before assuming that the rolling
> checksum is really a 32 bit value that can easily be divided up into
> buckets.
>
> PS - none of the above has anything to do with the strong signature of
> the window - just the rolling check sum.
>
> Cheers!
>
> - K
Some modern algebra for you, then.
We have two numbers. One is always multiplied by 2^16. We want both
numbers to be able to totally affect the bucket into which the eventual
checksum arrives. If we will choose a hash table size that is co-prime
to the bits we want to remain significant, then we achieve that.
Well, guess what? Factorization of any and each bit in the combined
checksums yields only twos. In other words, any hash table size that
will be odd (i.e. - two is not in it's factorization primes) will be
co-prime to our shifted checksums, thus promising that they will get an
equal chance of affecting what bucket our checksum actually falls into.
It therefor follows that I have to amend my previously proposed hash
table size choosing formula. The new formula is:
(numblocks/8+1)*10+1
And you're done. Of course, this can also be written as:
(numblocks/8)*10+11
Which is slightly more efficient.
Shachar
--
Shachar Shemesh
Lingnu Open Source Consulting ltd.
Have you backed up today's work? http://www.lingnu.com/backup.html
More information about the rsync
mailing list