Problem with checksum failing on large files

Donovan Baarda abo at minkirri.apana.org.au
Sun Oct 13 02:33:00 EST 2002


On Sat, Oct 12, 2002 at 11:13:50AM -0700, Derek Simkowiak wrote:
> > My theory is that this is expected behavior given the check sum size.
> 
>      Craig,
> 	Excellent analysis!

I was a bit concerned about his maths at first, but I did it myself from
scratch using a different approach and got the same figures...

For those who are interested, an OK approximation for the maths turns out to
be:

   p = (c^2)/2^(b+1)

where:
   p is probability of "collision"
   c is number of blocks
   b is number of bits in checksum.

This holds provided c is significantly less than 2^b. As c approaches 2^b,
the approximation gives values of p that are too large (when c = 2^b, p
should be 1).
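As a rough illustration (mine, not from Craig's analysis), here is a short Python
sketch of that approximation next to the standard birthday-problem estimate
p = 1 - exp(-c^2 / 2^(b+1)); the 4 GB file / 1 KB block example numbers are just
made up for the demo.

    import math

    def collision_prob_approx(c, b):
        """p = c^2 / 2^(b+1): fine while c << 2^b, overshoots as c approaches 2^b."""
        return (c ** 2) / 2 ** (b + 1)

    def collision_prob_birthday(c, b):
        """1 - exp(-c^2 / 2^(b+1)): stays in [0, 1] even for large c."""
        return 1.0 - math.exp(-(c ** 2) / 2 ** (b + 1))

    # Example: a 4 GB file split into 1 KB blocks, for a few checksum sizes.
    c = (4 * 2 ** 30) // 1024
    for b in (32, 48, 64):
        print(b, collision_prob_approx(c, b), collision_prob_birthday(c, b))

The b = 32 row shows the overshoot (the approximation exceeds 1); at 48 and 64
bits the two estimates agree closely.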

> 	Assuming your hypothesis is correct, I like the adaptive checksum
> idea.  But how much extra processor overhead is there with a larger
> checksum bit size?  Is it worth the extra code and testing to use an
> adaptive algorithm?

There is no extra processor overhead: the full md4sums are already calculated,
and are only truncated afterwards to minimise the signature size.
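A minimal sketch of that truncation idea (not librsync's actual code; MD5 from
Python's hashlib stands in for MD4, since MD4 support via hashlib depends on the
underlying OpenSSL build, and strong_sum / sum_bytes are made-up names):

    import hashlib

    def strong_sum(block, sum_bytes=8):
        """Compute the full digest, then keep only the first sum_bytes for the signature."""
        full = hashlib.md5(block).digest()   # the full digest is computed either way
        return full[:sum_bytes]              # truncation only shrinks the stored signature

    print(strong_sum(b"example block data", sum_bytes=8).hex())

Using a larger strong sum therefore only costs signature bytes, not hashing time.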

> 	I'd be more inclined to say "This ain't the 90's anymore", realize
> that overall filesizes have increased (MP3, MS-Office, CD-R .iso, and DV)
> and that people are moving from dialup to DSL/Cable, and then make either
> the default (a) initial checksum size, or (b) block size, a bit larger.

librsync has options to specify the "strong sum" size, though I think it
currently just uses the full md4sum size. pysync also uses the full md4sum
size.

I'm not sure what the best approach is, but there should be a relationship
between block size, signature size, and file size.
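One hypothetical form of that relationship, obtained by solving the approximation
above (p = c^2 / 2^(b+1)) for b given a target collision probability; this is a
sketch of the idea, not any tool's actual policy:

    import math

    def strong_sum_bits(file_size, block_size, max_collision_prob=1e-9):
        """Smallest whole number of checksum bits keeping the estimated collision
        probability for file_size/block_size blocks under the given target."""
        c = max(1, math.ceil(file_size / block_size))
        return math.ceil(2 * math.log2(c) - math.log2(max_collision_prob) - 1)

    # Example: 4 GB file, 1 KB blocks, one-in-a-billion collision budget.
    print(strong_sum_bits(4 * 2 ** 30, 1024))   # 73 bits, i.e. round up to a 10-byte strong sum

Bigger files (more blocks) or a tighter probability budget push the required
signature size up, which is the kind of adaptive behaviour being discussed.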


-- 
----------------------------------------------------------------------
ABO: finger abo at minkirri.apana.org.au for more info, including pgp key
----------------------------------------------------------------------
