Problem with checksum failing on large files

Craig Barratt craig at atheros.com
Mon Oct 14 07:38:01 EST 2002


craig> My theory is that this is expected behavior given the checksum size.

derek>      Craig,
derek> 	Excellent analysis!

donovan> I was a bit concerned about his maths at first, but I did
donovan> it myself from scratch using a different approach and got
donovan> the same figures...

Ok, so the chance that two (different) blocks have the same first-pass
48-bit checksum is small but significant (at least 6% for a 4GB file
with 700-byte blocks).  This probably isn't enough to explain Terry's
problem.
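
For anyone who wants to check that 6% figure, it is just the birthday
bound on roughly 6.1 million 48-bit block checksums.  A quick Python
sketch of the calculation (not rsync code, just the approximation):

    import math

    def collision_prob(fsize, block_size=700, bits=48):
        # Birthday-bound estimate: chance that at least two of the
        # nBlocks first-pass block checksums collide in a 2^bits space.
        n_blocks = math.ceil(fsize / block_size)
        return 1.0 - math.exp(-n_blocks * (n_blocks - 1) / (2.0 ** (bits + 1)))

    print(collision_prob(4 * 2**30))    # ~0.065 for a 4GB file, 700-byte blocks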

But it just occurred to me that checksum collisions are only part of
the story.  Things are really a lot worse.

Let's assume that the block checksums are unique (so we don't get
tripped up by this first problem).  Let's also assume that the old
file is completely different from the new one, ie: no blocks at any
offset really match.  So rsync will compare the rolling checksum at
every byte offset in the file, looking for any match.  If there are
nBlocks blocks, each check has an nBlocks / 2^48 chance of a false
match.

Since this test is repeated at every byte offset, the probability
that the file has no false matches is:

    p = (1 - nBlocks / (2^48)) ^ fSize

where fSize is the size of the file (more precisely the exponent should
be (fSize - 700)).

Now for some numbers for 700-byte blocks:

  - 100MB file (104857600 bytes).  nBlocks = 149797.  p = 0.945.

  - 500MB file (524288000 bytes).  nBlocks = 748983.  p = 0.248.

  - 1000MB file (1048576000 bytes).  nBlocks = 1497966.  p = 0.003.
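
Here is a small Python sketch of that formula, in case anyone wants to
reproduce these numbers (the block count is just fSize/700 rounded up,
and the last digit can differ slightly depending on rounding and on
whether you use the exact exponent):

    import math

    def p_no_false_match(fsize, block_size=700, bits=48):
        # (1 - nBlocks / 2^48) ^ fSize: chance that none of the ~fSize
        # rolling-checksum lookups hits a false 48-bit match.
        n_blocks = math.ceil(fsize / block_size)
        return (1.0 - n_blocks / 2.0 ** bits) ** fsize

    for mb in (100, 500, 1000):
        fsize = mb * 1024 * 1024
        print("%5d MB  p = %.3f" % (mb, p_no_false_match(fsize)))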

So, on average, if you have a random "new" 1GB file and a random "old"
1GB file, and you rsync them, the 1st phase will fail 99.7% of the
time.

Someone could test this theory: generate two random 500MB files and
rsync them.  Try it a few times.  I claim that on average the first
pass will fail around 75% of the time.
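
Something along these lines should do it (a rough sketch, untested;
the file names are made up, it needs about 1GB of scratch space, and
--no-whole-file is there because newer rsyncs default to whole-file
copies when both paths are local):

    import os, subprocess

    CHUNK = 1024 * 1024            # write in 1MB chunks
    SIZE  = 500 * 1024 * 1024      # 500MB per file

    def make_random_file(path, size=SIZE):
        # Fill the file with random data so no block really matches.
        with open(path, "wb") as f:
            for _ in range(size // CHUNK):
                f.write(os.urandom(CHUNK))

    make_random_file("newfile")    # the sender's copy
    make_random_file("oldfile")    # a completely different "old" copy
    # Force the delta algorithm and update oldfile from newfile; if the
    # first pass produces a corrupt result, rsync should report a failed
    # verification and retry with its second-pass checksums.
    subprocess.run(["rsync", "-v", "--stats", "--no-whole-file",
                    "newfile", "oldfile"], check=True)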

Things get a lot better when the files are very similar.  For each
block that matches, rsync skips the whole block (eg, 700 bytes)
before it starts looking for matching checksums.  So for a file
that is identical it only does nBlocks checks, not fSize checks
(700 times fewer).

I recall from Terry's output that the number of bytes transferred after
the two attempts was roughly the same as the file size, so about half
the file is different.  In this case, about fSize/2 lookups will be
done:

    p = (1 - nBlocks / (2^48)) ^ (fSize/2)

which is about 0.06 (ie: a 94% chance the 1st pass fails).

For a 1GB file with 4096-byte blocks and about half the file
changed, the probability of the first pass working is about 62%,
which is still not great.  So just doing a single test with a
4096 block size might not confirm or contradict my hypothesis.
The probability does go up to about 97% with a 64K block size.
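
The same formula with fSize/2 checks, for the three block sizes above
(again just a Python sketch of the arithmetic):

    import math

    def p_first_pass_ok(fsize, block_size, bits=48):
        # About half the file changed, so roughly fSize/2 lookups, each
        # with an nBlocks / 2^48 chance of a false match.
        n_blocks = math.ceil(fsize / block_size)
        return (1.0 - n_blocks / 2.0 ** bits) ** (fsize / 2)

    fsize = 1000 * 1024 * 1024
    for bsize in (700, 4096, 65536):
        print("%6d byte blocks  p = %.2f" % (bsize, p_first_pass_ok(fsize, bsize)))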

If my new hypothesis is correct, we definitely need to increase the size
of the first-pass checksum for files bigger than maybe 50MB.

Craig


