Problem with checksum failing on large files

jw schultz jw at pegasys.ws
Mon Oct 14 13:23:00 EST 2002


On Mon, Oct 14, 2002 at 10:45:44PM +1000, Donovan Baarda wrote:
> In conclusion, a block size of 700 with the current 48-bit signature block
> checksum has an unacceptable failure rate (>5%) for any file larger than
> 100M, unless the file being synced is almost identical.
> 
> Increasing the block size will help, with the following minimum sizes
> recommended for a <5% failure rate:
> 
> file size	block size
>  100M	  1K
>  200M	  3K
>  400M	 12K
>  800M	 48K
>    1G	 75K
>    2G	300K
>    4G	1.2M
> 
> Note that the required block size grows faster than the file size does,
> so the number of blocks in the signature shrinks as the file grows. We
> absolutely need to increase the signature checksum size as the file size
> increases.
> 
> > If my new hypothesis is correct we definitely need to increase the size
> > of the first-pass checksum for files bigger than maybe 50MB.
> 
> Does the first-pass signature block checksum really only use 2 bytes of
> the md4sum? That seems pretty damn small to me. For 100M~1G you need at
> least 56 bits, and for 1G~10G you need 64 bits. If you go above 10G you
> need more than 64 bits, but you should probably increase the block size
> as well/instead.
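
For anyone who wants to check those numbers, here is my reading of
the model behind them (an assumption on my part, not lifted from the
rsync source): the receiver tries the 48-bit signature (32-bit
rolling checksum plus the 2 bytes of md4sum questioned above) at
roughly every byte offset against every block signature.  A quick
Python sketch:

import math

def p_false_match(file_len, block_len, sig_bits=48):
    # Rough model, as described above: about file_len candidate
    # offsets are each compared against file_len/block_len block
    # signatures, and each comparison collides by accident with
    # probability 2**-sig_bits.
    trials = file_len * (file_len // block_len)
    return 1.0 - math.exp(-trials / 2.0 ** sig_bits)

With file_len = 100*2**20 this gives about 5.4% for 700-byte blocks
and about 3.8% for 1K blocks, which matches the >5%/<5% boundary in
the table above.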

It is worth remembering that increasing the block size with
a fixed checksum size increases the likelihood of two
unequal blocks having the same checksum.

I think we want both the block and checksum sizes to
increase with file size.  Just increasing the block size
yields diminishing returns, while just increasing the
checksum size causes a non-linear increase in the bandwidth
requirement.  Increasing both in tandem is appropriate:
larger files call for larger blocks, and larger blocks
deserve larger checksums.
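
To make "in tandem" concrete, here is one hypothetical sizing
rule (the sqrt heuristic and the 5% target are my own assumptions,
just to illustrate the shape of it):

import math

def tandem_params(file_len, max_fail=0.05, floor=700):
    # Hypothetical: grow the block size as the square root of the
    # file size, then widen the signature until the expected number
    # of false matches stays under max_fail.
    block_len = max(floor, int(math.sqrt(file_len)))
    trials = file_len * (file_len // block_len)
    sig_bits = math.ceil(math.log2(trials / max_fail))
    return block_len, sig_bits

For a 2GB file that comes out near a 45K block size and a 51-bit
signature, well under the 300K block size that worries me below.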

I do think we want a ceiling on block size unless we layer
the algorithm.  The idea of transmitting 300K because a 4K
block in a 2GB DB file was modified is unsettling.

Note for rsync2 or superlifter:
	We may want to layer the algorithm so that large
	files get a first pass with large blocks, and any
	blocks found to be modified are then resynced in a
	second pass using smaller blocks.
	e.g. a 2GB file is checked with 500KB blocks, and a
	500KB block that changed is rechecked with 700-byte
	blocks, so rsyncing the file would be almost like
	rsyncing a directory with -c.
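
Something like this toy sketch, say.  It only hashes blocks at
fixed offsets (hashlib.md5 standing in for whatever strong checksum
we would actually use), so it shows the layering but not the
rolling-checksum matching the real algorithm would still do inside
a dirty region:

import hashlib

def changed_offsets(old, new, size):
    # Yield offsets of fixed-position blocks whose strong
    # checksums differ between the two byte strings.
    for off in range(0, max(len(old), len(new)), size):
        a = hashlib.md5(old[off:off + size]).digest()
        b = hashlib.md5(new[off:off + size]).digest()
        if a != b:
            yield off

def layered_bytes_sent(old, new, big=500 * 1024, small=700):
    # Pass 1: find the big blocks that changed.  Pass 2: inside
    # each dirty big block, count only the small blocks that
    # actually changed as data to be sent.
    sent = 0
    for off in changed_offsets(old, new, big):
        o, w = old[off:off + big], new[off:off + big]
        for soff in changed_offsets(o, w, small):
            sent += len(w[soff:soff + small])
    return sent

On a 2GB file with one modified 4K stretch, pass 1 flags a single
500KB block and pass 2 ships only the half-dozen or so 700-byte
blocks covering the change, instead of the whole 500KB.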

-- 
________________________________________________________________
	J.W. Schultz            Pegasystems Technologies
	email address:		jw at pegasys.ws

		Remember Cernan and Schmitt


