Problem with checksum failing on large files

jw schultz jw at pegasys.ws
Mon Oct 14 23:51:01 EST 2002


On Tue, Oct 15, 2002 at 02:25:00AM +1000, Donovan Baarda wrote:
> On Mon, Oct 14, 2002 at 06:22:36AM -0700, jw schultz wrote:
> > On Mon, Oct 14, 2002 at 10:45:44PM +1000, Donovan Baarda wrote:
> [...]
> > > Does the first pass signature block checksum really only use 2 bytes of the
> > > md4sum? That seems pretty damn small to me. For 100M~1G you need at least
> > > 56bits, for 1G~10G you need 64bits. If you go above 10G you need more than
> > > 64bits, but you should probably increase the block size as well/instead.
> > 
> > It is worth remembering that increasing the block size with
> > a fixed checksum size increases the likelihood of two
> > unequal blocks having the same checksums.
> 
> I haven't done the maths, but I think the difference this makes is
> negligible, and is far outweighed by the fact that a larger block size
> means fewer blocks.

We've just seen one face of an undersized checksum.  This
problem arises, after all, because we have unequal blocks
that have the same (truncated) checksums.  That is with 700
bytes being compressed to a 4-byte checksum.  Increasing the
block size without increasing the checksum size increases
the chance of unequal blocks having the same checksum.
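
As a rough sketch of the scale involved, the following
estimates the chance of at least one false block match in a
single file.  It rests on assumptions that are not from this
thread: truncated checksums are treated as uniformly random
values, every byte offset is compared against every block
checksum, and the function name and parameters are purely
illustrative.

```python
import math


def false_match_probability(file_size, block_size, checksum_bits):
    """Estimate the chance of at least one false block match.

    Assumption: checksums behave like uniform random values of
    checksum_bits bits, so each comparison of unequal data collides
    with probability 2 ** -checksum_bits.
    """
    # Number of block checksums in the signature.
    blocks = math.ceil(file_size / block_size)
    # Number of byte offsets at which a window can be tested.
    offsets = max(file_size - block_size + 1, 1)
    # Expected number of false matches over all comparisons.
    expected = blocks * offsets / 2.0 ** checksum_bits
    # P(at least one) = 1 - exp(-expected), computed stably.
    return -math.expm1(-expected)


if __name__ == "__main__":
    one_gib = 2 ** 30
    for bits in (32, 48, 64):
        p = false_match_probability(one_gib, 700, bits)
        print(f"{bits} bits: {p:.3g}")
```

Under this model a 1 GiB file with 700 byte blocks is
all but certain to hit a false match at 32 bits, while at
64 bits the estimate drops below one in a thousand.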

> > I think we want both the block and checksum sizes to
> > increase with file size.  Just increasing block size gains
> > diminishing returns but just increasing checksum size will
> > cause a non-linear increase in bandwidth requirement.
> > Increasing both in tandem is appropriate.  Larger files call
> > for larger blocks and larger blocks deserve larger
> > checksums.
> > 
> > I do think we want a ceiling on block size unless we layer
> > the algorithm.  The idea of transmitting 300K because a 4K
> > block in a 2GB DB file was modified is unsettling.
> 
> I think that command-line overrides are critical. Just as you can force a
> blocksize, you should be able to force a sigsumsize. However the defaults
> should be reasonable.

We are finding that fixed size defaults are not reasonable
and that the variability really should be per-file.  Under
such conditions the fixed command-line overrides do more
harm than good.  Command-line overrides that modify the
heuristics (--blocksize-gamma, --checksum-threshold) are
more suitable.
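
A minimal sketch of what such per-file heuristics might look
like follows.  Everything in it is hypothetical: the function
name, the default values, the power-law growth of the block
size, and the rule of one extra checksum byte per doubling of
file size are illustrative assumptions, not rsync behaviour.
The gamma and threshold parameters only stand in for the
kind of knobs suggested above.

```python
def choose_sizes(file_size, gamma=0.5, min_block=700, max_block=16384,
                 checksum_threshold=1 << 27, min_sum=2, max_sum=16):
    """Pick a (block_size, checksum_bytes) pair for one file.

    All parameters are hypothetical tuning knobs for illustration.
    """
    # Block size grows as file_size ** gamma, held between a floor
    # and a ceiling so a small change never costs a huge block.
    block = int(file_size ** gamma)
    block = max(min_block, min(block, max_block))

    # Above the threshold, add one checksum byte for every doubling
    # of the file size, up to max_sum bytes.
    extra = 0
    size = file_size
    while size > checksum_threshold and min_sum + extra < max_sum:
        size >>= 1
        extra += 1
    return block, min_sum + extra


if __name__ == "__main__":
    for size in (1000, 10 ** 6, 2 ** 31):
        print(size, choose_sizes(size))
```

With these defaults a small file keeps the 700 byte floor
and a 2 byte checksum, while a 2 GiB file hits the block
size ceiling and gets a wider checksum instead.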

-- 
________________________________________________________________
	J.W. Schultz            Pegasystems Technologies
	email address:		jw at pegasys.ws

		Remember Cernan and Schmitt


