Problem with checksum failing on large files
Donovan Baarda
abo at minkirri.apana.org.au
Sun Oct 13 02:33:00 EST 2002
On Sat, Oct 12, 2002 at 11:13:50AM -0700, Derek Simkowiak wrote:
> > My theory is that this is expected behavior given the check sum size.
>
> Craig,
> Excellent analysis!
I was a bit concerned about his maths at first, but I redid it myself from
scratch using a different approach and got the same figures...
For those who are interested, a good approximation for the maths turns out
to be:
p = (c^2)/2^(b+1)
where:
p is probability of "collision"
c is number of blocks
b is number of bits in checksum.
This holds provided c is significantly less than 2^b. As c approaches 2^b,
p becomes too large using this approximation (when c = 2^b, p should be 1).
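As a quick sketch of the approximation above (the file and block sizes here
are hypothetical, just chosen for round numbers), a 4 GiB file split into
1 KiB blocks gives c = 2^22 blocks, and you can evaluate p for a few
checksum sizes directly:

```python
def collision_probability(c, b):
    """Birthday-style approximation p = c^2 / 2^(b+1).

    Only valid while c is well below 2^(b/2); outside that range the
    approximation overshoots, as noted above.
    """
    return (c * c) / 2 ** (b + 1)

# Hypothetical example: 4 GiB file, 1 KiB blocks -> c = 2^22 blocks.
c = 2 ** 22
for b in (32, 64, 128):
    print(b, collision_probability(c, b))
```

Note that for b = 32 the formula returns a value greater than 1, which is
exactly the breakdown described above: with c = 2^22 blocks, c^2 already
exceeds 2^33, so a 32-bit sum is far too small for a file this size.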
> Assuming your hypothesis is correct, I like the adaptive checksum
> idea. But how much extra processor overhead is there with a larger
> checksum bit size? Is it worth the extra code and testing to use an
> adaptive algorithm?
There is no extra processor overhead, because the full md4sums are currently
calculated anyway, before they are truncated down to minimise the signature
size.
> I'd be more inclined to say "This ain't the 90's anymore", realize
> that overall filesizes have increased (MP3, MS-Office, CD-R .iso, and DV)
> and that people are moving from dialup to DSL/Cable, and then make either
> the default (a) initial checksum size, or (b) block size, a bit larger.
librsync has options to specify the "strong sum" size, though I think it
currently just uses the full md4sum size. pysync also uses the full md4sum
size.
I'm not sure what the best approach is, but there should be a relationship
between the block size, signature size, and file size.
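One way such a relationship could look (this is just a sketch built by
inverting the approximation above, not anything librsync or pysync actually
does): given a file size, a block size, and a target collision probability,
solve p = c^2 / 2^(b+1) for the number of strong-sum bits b.

```python
import math

def required_checksum_bits(file_size, block_size, target_p):
    """Smallest b such that c^2 / 2^(b+1) <= target_p,
    where c is the number of blocks in the file."""
    c = math.ceil(file_size / block_size)
    return math.ceil(math.log2(c * c / target_p) - 1)

# Hypothetical example: 4 GiB file, 1 KiB blocks, one-in-a-billion target.
print(required_checksum_bits(2 ** 32, 2 ** 10, 1e-9))
```

Under these example numbers the answer comes out well under the full 128-bit
md4sum, which is consistent with the idea that truncating the strong sum can
save signature space as long as the truncation is scaled to the file size.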
--
----------------------------------------------------------------------
ABO: finger abo at minkirri.apana.org.au for more info, including pgp key
----------------------------------------------------------------------