Problem with checksum failing on large files

Donovan Baarda abo at
Tue Oct 15 00:29:10 EST 2002

On Mon, Oct 14, 2002 at 04:50:27PM -0700, jw schultz wrote:
> On Tue, Oct 15, 2002 at 02:25:00AM +1000, Donovan Baarda wrote:
> > On Mon, Oct 14, 2002 at 06:22:36AM -0700, jw schultz wrote:
> > > On Mon, Oct 14, 2002 at 10:45:44PM +1000, Donovan Baarda wrote:
> > [...]
> > > > Does the first pass signature block checksum really only use 2 bytes of the
> > > > md4sum? That seems pretty damn small to me. For 100M~1G you need at least
> > > > 56bits, for 1G~10G you need 64bits. If you go above 10G you need more than
> > > > 64bits, but you should probably increase the block size as well/instead.
> > > 
> > > It is worth remembering that increasing the block size with
> > > a fixed checksum size increases the likelihood of two
> > > unequal blocks having the same checksums.
> > 
> > I haven't done the maths, but I think the difference this makes is
> > negligible, and is far outweighed by the fact that a larger block size
> > means fewer blocks.
> We've just seen one face of the checksum undersize.  This
> problem is after all because we have unequal blocks that
> have the same (truncated) checksums.  That is with 700 bytes
> being compressed to a 4 byte checksum.  Increasing the block
> size without increasing the checksum size increases the
> chance of unequal blocks having the same checksum.

Yeah, but I think when you do the maths, the variation this causes tapers
to nothing as the block size gets significantly large compared to the
checksum. All the maths we have been doing so far uses the simplifying
assumption that the block size is infinitely large compared to the checksum,
so the probability of a checksum clash is based entirely on the checksum
size.

But I haven't done the maths, so feel free to go through it and prove
otherwise :-)

what the heck... here it is;

  p = 1/2^b - 1/2^n

  p is the probability that any two blocks have the same checksum, but are
  actually different.
  b is the number of bits in the checksum
  n is the number of bits in a block.
The first term is the probability that any two random blocks have the same
checksum. The second term is the probability that any two random blocks are
actually the same. As you can see, as b -> n, p -> 0, but as n gets
significantly larger than b, the second term has less and less effect. We are
talking relative probability changes on the order of 1/2^1000.
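To put some numbers on that, here is a small sketch (the parameter values are
illustrative, not taken from rsync's source) that evaluates p = 1/2^b - 1/2^n
exactly with rational arithmetic, for a 16-bit checksum over a 700-byte block
versus a block twice that size:

```python
from fractions import Fraction

def collision_prob(b, n):
    """Probability that two random n-bit blocks differ but share the
    same b-bit checksum: p = 1/2^b - 1/2^n (exact rational result)."""
    return Fraction(1, 2**b) - Fraction(1, 2**n)

# 16-bit checksum over a 700-byte (5600-bit) block
p_small = collision_prob(16, 700 * 8)

# same 16-bit checksum over a block twice as large (11200 bits)
p_large = collision_prob(16, 1400 * 8)

# Relative change in p from doubling the block size:
# (1/2^5600 - 1/2^11200) / p_small, roughly 2^-5584 -- vanishingly small.
rel_change = (p_large - p_small) / p_small
```

Both probabilities are indistinguishable from 1/2^16 to hundreds of decimal
places; only the checksum width matters once the block is much larger than
the checksum.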

ABO: finger abo at for more info, including pgp key
