Problem with checksum failing on large files

jw schultz jw at pegasys.ws
Mon Oct 14 13:50:00 EST 2002


On Sun, Oct 13, 2002 at 12:32:42PM +1000, Donovan Baarda wrote:
> On Sat, Oct 12, 2002 at 11:13:50AM -0700, Derek Simkowiak wrote:
> > > My theory is that this is expected behavior given the check sum size.
> > 
> >      Craig,
> > 	Excellent analysis!
> 
> I was a bit concerned about his maths at first, but I did it myself from
> scratch using a different approach and got the same figures...
> 
> for those who are interested, an OK approximation for the maths turns
> out to be:
> 
>    p = (c^2)/2^(b+1)
> 
> where:
>    p is probability of "collision"
>    c is number of blocks
>    b is number of bits in checksum.
> 
> Provided c is significantly less than 2^b. As c approaches 2^b, p becomes
> too large using this approximation (when c=2^b, p should be 1).
> 
[snip --jw]
> 
> I'm not sure what the best approach is, but there should be a relationship
> between block size, signature size, and file size.
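
To put some concrete numbers on the approximation above (the block
size and signature width here are just illustrative guesses on my
part, not figures from the thread), a small Python sketch:

    # Birthday-style approximation quoted above: p ~= c^2 / 2^(b+1),
    # reasonable only while c is much smaller than 2^b.
    def collision_prob(num_blocks, checksum_bits):
        return (num_blocks ** 2) / 2.0 ** (checksum_bits + 1)

    # Hypothetical example: a 4GB file cut into 8KB blocks gives
    # c = 2^19 = 524288 blocks; with a 48-bit per-block signature,
    # p ~= 2^38 / 2^49 = 2^-11, i.e. roughly 0.0005 for the file.
    print(collision_prob(2 ** 19, 48))

Since the estimate grows quadratically with the number of blocks,
larger files get worse much faster than you might expect, which is
the concern in the quoted analysis.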

Another thought just came to me.  I haven't looked at the
code for this yet, but anyway...

Instead of messing with block and signature size, what about
restricting locality?  Only allow a block checksum to be
compared within an offset window.  As an example, the
rolling checksum would only match against those blocks that
are within 1MB of that block's adjusted offset.  That would
considerably reduce the number of blocks that apply to
your formula and hence reduce the collisions.
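
A minimal sketch of what that check might look like on the
receiving side (hypothetical names and data layout, not the actual
rsync match code):

    # Assumes a table mapping each weak (rolling) checksum to a list
    # of (block_index, block_offset) pairs taken from the signature.
    WINDOW = 1024 * 1024      # 1MB locality window, as in the example

    def find_match(weak_sum, current_offset, checksum_table):
        # Only consider blocks whose recorded offset lies within
        # WINDOW bytes of where the rolling checksum currently sits;
        # anything further away is rejected before the strong
        # checksum is even computed.
        for block_index, block_offset in checksum_table.get(weak_sum, ()):
            if abs(block_offset - current_offset) <= WINDOW:
                return block_index
        return None           # no candidate in range; send literal data

In effect the c in the formula above shrinks from the total number
of blocks in the file to only those blocks that fall inside the
window.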

That would not require any change to the protocol, nor any
heuristics to calculate block or signature sizes.  I'm
guessing that the code change could be relatively small.

-- 
________________________________________________________________
	J.W. Schultz            Pegasystems Technologies
	email address:		jw at pegasys.ws

		Remember Cernan and Schmitt


