Problem with checksum failing on large files
jw schultz
jw at pegasys.ws
Mon Oct 14 13:23:00 EST 2002
On Mon, Oct 14, 2002 at 10:45:44PM +1000, Donovan Baarda wrote:
> In conclusion, a blocksize of 700 with the current 48-bit signature block
> checksum has an unacceptable failure rate (>5%) for any file larger than
> 100M, unless the file being synced is almost identical.
>
> Increasing the blocksize will help, with the following minimum sizes being
> recommended for a <5% failure rate:
>
>   file   block
>   100M      1K
>   200M      3K
>   400M     12K
>   800M     48K
>     1G     75K
>     2G    300K
>     4G    1.2M
>
> Note that the required block size is growing faster than the file size is,
> so the number of blocks in the signature is shrinking as the file grows. We
> absolutely need to increase the signature checksum size as the filesize
> increases.
>
> > If my new hypothesis is correct we definitely need to increase the size
> > of the first-pass checksum for files bigger than maybe 50MB.
>
> Does the first pass signature block checksum really only use 2 bytes of the
> md4sum? That seems pretty damn small to me. For 100M~1G you need at least
> 56 bits, for 1G~10G you need 64 bits. If you go above 10G you need more than
> 64 bits, but you should probably increase the block size as well/instead.
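For what it's worth, those numbers line up with a simple
birthday-style estimate. Here is a rough sketch in Python (the
model is my assumption, not something stated in Donovan's mail):
a file of F bytes hashed in B-byte blocks yields F/B signatures,
each effectively compared against ~F rolling-window offsets, so
P(false match) ~= 1 - exp(-(F/B)*F / 2^bits).

    import math

    def fail_prob(file_size, block_size, bits=48):
        # Estimated false-match probability for the whole transfer.
        comparisons = (file_size / block_size) * file_size
        return 1.0 - math.exp(-comparisons / 2.0 ** bits)

    def min_block(file_size, bits=48, target=0.05):
        # Smallest block size keeping the estimate under target:
        # (F/B)*F / 2^bits <= -ln(1 - target)
        return file_size ** 2 / (-math.log(1.0 - target) * 2.0 ** bits)

    def min_bits(file_size, block_size=700, target=0.05):
        # Checksum bits needed to keep a 700-byte blocksize viable.
        comparisons = (file_size / block_size) * file_size
        return math.log2(comparisons / -math.log(1.0 - target))

    for f in (100e6, 1e9, 2e9, 4e9):
        print("%5dM: block >= %.0fB at 48 bits, or %.0f bits at 700B"
              % (f / 1e6, min_block(f), min_bits(f)))

This reproduces the table above to within rounding (100M -> ~700B,
1G -> ~68K, 4G -> ~1.1M) and gives ~55 bits for a 1G file at a
700-byte blocksize, close to Donovan's 56-bit figure.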
It is worth remembering that increasing the block size with
a fixed checksum size increases the likelihood of two
unequal blocks having the same checksum.

I think we want both the block and checksum sizes to
increase with file size. Just increasing the block size
yields diminishing returns, while just increasing the
checksum size causes a non-linear increase in bandwidth
requirements. Increasing both in tandem is appropriate:
larger files call for larger blocks, and larger blocks
deserve larger checksums.
I do think we want a ceiling on block size unless we layer
the algorithm. The idea of transmitting 300K because a 4K
block in a 2GB DB file was modified is unsettling.
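To make the tandem idea concrete, here is one hypothetical
policy (illustration only, not a proposal for rsync's actual
heuristic): grow the block size roughly with the square root of
the file size, capped at a ceiling as argued above, then add
checksum bits until the estimated failure rate from the model
above stays under 5%.

    import math

    def pick_parameters(file_size, target=0.05,
                        min_block=700, max_block=512 * 1024):
        # Block size ~ sqrt(file size), clamped to a floor and ceiling.
        block = max(min_block, min(max_block, int(math.sqrt(file_size))))
        # Bits needed so (F/B)*F / 2^bits <= -ln(1 - target),
        # rounded up to whole bytes as sent on the wire.
        comparisons = (file_size / block) * file_size
        bits = math.ceil(math.log2(comparisons / -math.log(1.0 - target)))
        return block, 8 * ((bits + 7) // 8)

With sqrt scaling the signature holds a roughly constant number
of blocks, so the checksum only needs about five extra bits per
tenfold increase in file size.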
Note for rsync2 or superlifter:

We may want to layer the algorithm so that large files get a
first pass with large blocks, and modified blocks are then
handled by a second pass using smaller blocks. I.e. a 2GB file
is checked with 500KB blocks, and a 500KB block that changed is
re-checked with 700-byte blocks, so rsyncing the file would be
almost like rsyncing a directory with -c.
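A minimal sketch of that layering (hypothetical code, not an
actual rsync2/superlifter design; it compares aligned blocks of
two equal-length files only, whereas real rsync also uses a
rolling weak checksum to match at arbitrary offsets):

    import hashlib

    def signatures(data, block_size):
        # One strong checksum per block; md5 stands in for md4 here.
        return [hashlib.md5(data[i:i + block_size]).digest()
                for i in range(0, len(data), block_size)]

    def changed_regions(old, new, big=500 * 1024, small=700):
        # First pass: find big blocks whose checksums disagree.
        for i, (a, b) in enumerate(zip(signatures(old, big),
                                       signatures(new, big))):
            if a == b:
                continue
            # Second pass: narrow each changed big block down to
            # the small blocks that actually differ.
            base = i * big
            for j, (c, d) in enumerate(zip(
                    signatures(old[base:base + big], small),
                    signatures(new[base:base + big], small))):
                if c != d:
                    yield base + j * small, small

Only the differing 700-byte spans would then need to cross the
wire, instead of whole 500KB blocks.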
--
________________________________________________________________
J.W. Schultz Pegasystems Technologies
email address: jw at pegasys.ws
Remember Cernan and Schmitt