Problem with checksum failing on large files

Donovan Baarda abo at minkirri.apana.org.au
Mon Oct 14 12:46:00 EST 2002


On Mon, Oct 14, 2002 at 12:36:36AM -0700, Craig Barratt wrote:
> craig> My theory is that this is expected behavior given the checksum size.
[...]
> But it just occurred to me that checksum collisions are only part of
> the story.  Things are really a lot worse.
> 
> Let's assume that the block checksums are unique (so we don't get
> tripped up from this first problem).  Let's also assume that the old
> file is completely different to the new one, ie: no blocks at any
> offset really match.  So rsync will compare the checksum at every
> byte offset in the file looking for any match.  If there are nBlocks
> blocks, each check has an nBlocks / 2^48 chance of a false match.
> 
> Since this test is repeated at every byte offset, the probability
> that the file has no false matches is:
> 
>     p = (1 - nBlocks / (2^48)) ^ fSize
> 
> where fSize is the size of the file (more precisely the exponent should
> be (fSize - 700)).
> 
> Now for some numbers for 700 byte blocks:
> 
>   - 100MB file (104857600 bytes).  nBlocks = 149797.  p = 0.945.
> 
>   - 500MB file (524288000 bytes).  nBlocks = 748983.  p = 0.248.
> 
>   - 1000MB file (1048576000 bytes).  nBlocks = 1497966.  p = 0.003.
> 
> So, on average, if you have a random "new" 1GB file and a random "old"
> 1GB file, and you rsync them, the 1st phase will fail 99.7% of the
> time.
> 
> Someone could test this theory: generate two random 500MB files and
> rsync them.  Try it a few times.  I claim that on average the first
> pass will fail around 75% of the time.
> 
> Things get a lot better when the files are very similar.  For each
> block that matches, rsync skips the whole block (eg, 700 bytes)
> before it starts looking for matching checksums.  So for a file
> that is identical it only does nBlocks checks, not fSize checks
> (700 times fewer).
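
That skip is worth making concrete. Below is a rough model of the
first-pass search loop (illustrative only, not rsync's actual code: it uses
zlib.adler32 as a stand-in for the rolling checksum, recomputes it at every
offset instead of rolling it, omits the strong-checksum confirmation, and
ignores the final partial block). Counting the lookups shows the
nBlocks-versus-fSize difference:

  import zlib

  def make_signatures(old_file, block_size=700):
      # one weak checksum per non-overlapping block of the old file
      return {zlib.adler32(old_file[i:i + block_size])
              for i in range(0, len(old_file) - block_size + 1, block_size)}

  def count_checks(new_file, signatures, block_size=700):
      offset, checks = 0, 0
      while offset + block_size <= len(new_file):
          checks += 1
          window = new_file[offset:offset + block_size]
          if zlib.adler32(window) in signatures:
              offset += block_size   # block matched: skip the whole block
          else:
              offset += 1            # no match: slide forward one byte
      return checks

With two identical 1MB buffers this does about 1,500 lookups; with two
unrelated random buffers it does close to a million.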


Wow... scary. This can be generalised to:

   p = 1 - (1 - c/(2^b))^s
   
where:

  p is the probability of a false match
  c is the number of blocks in the signature
  b is the number of bits in the checksum
  s is the number of "different" bytes (byte offsets that actually get searched)
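
As a quick sanity check, the same formula in a few lines of Python (purely
illustrative, nothing from rsync itself) reproduces Craig's figures for
700-byte blocks and the current 48-bit checksum:

  def false_match_probability(file_size, block_size=700, bits=48):
      c = (file_size + block_size - 1) // block_size  # blocks in signature
      s = file_size                                   # offsets searched (worst case)
      return 1 - (1 - c / 2.0 ** bits) ** s

  for size in (100 * 2**20, 500 * 2**20, 1000 * 2**20):
      print("%11d bytes: p = %.3f" % (size, false_match_probability(size)))
  # prints roughly 0.054, 0.752 and 0.996, the complements of Craig's
  # 0.945, 0.248 and 0.003 figures for the first pass succeeding.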

In conclusion, a block size of 700 with the current 48-bit signature block
checksum has an unacceptable failure rate (>5%) for any file larger than
100M, unless the file being synced is almost identical.

Increasing the block size will help; the following minimum block sizes are
recommended for a <5% failure rate (see the sketch after the table):

file size    min block size
     100M            1K
     200M            3K
     400M           12K
     800M           48K
       1G           75K
       2G          300K
       4G          1.2M
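
These figures are consistent with the approximation p ~= s*c/2^b (with
s ~= f and c = f/B), which gives a minimum block size of roughly
B = f^2 / (0.05 * 2^48). A small illustrative calculation, before rounding
to convenient sizes:

  def min_block_size(file_size, bits=48, max_p=0.05):
      # from file_size * (file_size / block_size) / 2^bits < max_p
      return file_size ** 2 / (max_p * 2.0 ** bits)

  for mb in (100, 200, 400, 800, 1000, 2000, 4000):
      print("%5dM -> %9.0f bytes" % (mb, min_block_size(mb * 2**20)))
  # roughly 781, 3125, 12500, 50000, 78125, 312500 and 1250000 bytes,
  # which correspond to the 1K .. 1.2M figures above.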

Note that the required block size grows roughly with the square of the file
size, i.e. faster than the file size itself, so the number of blocks in the
signature actually shrinks as the file grows. We absolutely need to increase
the signature checksum size as the file size increases.

> If my new hypothesis is correct we definitely need to increase the size
> of the first-pass checksum for files bigger than maybe 50MB.

Does the first-pass signature block checksum really only use 2 bytes of the
md4sum? That seems pretty damn small to me. For 100M~1G you need at least
56 bits, for 1G~10G you need 64 bits. If you go above 10G you need more than
64 bits, but you should probably increase the block size as well/instead.
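
For what it's worth, a rough way to arrive at those bit counts (an
illustrative sketch assuming the same <5% criterion, a 700-byte block, and
a 32-bit rolling checksum plus whole bytes of the md4sum on top of it):

  import math

  def bits_needed(file_size, block_size=700, max_p=0.05):
      # smallest b with file_size * (file_size / block_size) / 2^b < max_p,
      # rounded up so the md4 part is a whole number of bytes on top of
      # the 32-bit rolling checksum.
      raw = math.log2(file_size * (file_size / block_size) / max_p)
      strong_bytes = max(0, math.ceil((raw - 32) / 8))
      return 32 + 8 * strong_bytes

  for size in (100 * 2**20, 2**30, 10 * 2**30):
      print("%12d bytes -> %d bits" % (size, bits_needed(size)))
  # roughly 56, 56 and 64 bits for 100M, 1G and 10G respectively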

-- 
----------------------------------------------------------------------
ABO: finger abo at minkirri.apana.org.au for more info, including pgp key
----------------------------------------------------------------------


