Problem with checksum failing on large files

Craig Barratt craig at atheros.com
Sat Oct 12 06:15:02 EST 2002


terry> I'm having a problem with large files being rsync'd twice 
terry> because of the checksum failing.

terry> Is there a different checksum mechanism used on the second
terry> pass (e.g., different length)?  If so, perhaps there is an
terry> issue with large files for what is used by default for the
terry> first pass?

The first pass block checksum is 48 bits: the 32-bit adler32 (rolling)
checksum, plus the first 2 bytes of the MD4 block checksum.  The second
pass is 160 bits: the same 32-bit adler32 (rolling) checksum plus the
entire 128-bit MD4 block checksum.
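
As a rough sketch of the sizes involved (purely illustrative, not
rsync's actual code), the first pass signature for a block can be
viewed in Python as a 48-bit value built from those two pieces:

    # Purely illustrative -- not rsync's real code or data layout.
    # First pass: 32-bit rolling (adler32-style) sum plus the first 2 bytes
    # of the 16-byte MD4 block digest, 48 bits per block in total.
    def first_pass_signature(rolling_sum, md4_digest):
        assert 0 <= rolling_sum < 2**32 and len(md4_digest) == 16
        return (rolling_sum << 16) | int.from_bytes(md4_digest[:2], "big")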

donovan> I wonder if this is related to the rsync md4sum not producing
donovan> "correct" md4sums for files larger than 512M?
donovan>
donovan> The existing rsync md4sum implementation does not produce
donovan> md4sums the same as the RSA implementation for files larger
donovan> than 512M... but I thought it was consistent with itself so
donovan> this didn't affect anything.

I doubt this matters, for just the reason you mention: it is consistent
with itself, and statistically it is still well behaved.

My theory is that this is expected behavior given the checksum size.
Now, 48 bits sounds like a lot.

Let's start with an analogy.  If I have 23 (randomly-selected) people
in a room, what is the probability that some pair of people have the
same birthday?  You might guess it is quite small, maybe 23/365.  But
that's wrong.  It's actually more than 50%.  The probability that 3
people all have different birthdays is:

    364/365 * 363/365.

Similarly, the probability that 23 people all have unique birthdays is

    364/365 * 363/365 * .... * 343/365,

which is less than 0.5 (50%), so the chance that at least one pair
shares a birthday is more than 50%.
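
You can check the 23-person figure with a few lines of Python:

    # Probability that 23 people all have distinct birthdays.
    p_unique = 1.0
    for k in range(1, 23):
        p_unique *= (365 - k) / 365.0
    print(p_unique)        # about 0.493
    print(1 - p_unique)    # about 0.507: a shared birthday is more likely than not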

So, back to our first pass checksum.  A 4GB file has 2^32 / 700 blocks.
(The blocks are like the people, each block's checksum is its birthday,
and the 2^48 possible checksums are like the 365 days in the year.)
Let's assume the 48-bit checksums are random.  What's the chance that at
least two blocks have the same checksum?  It sounds very unlikely, but
the chance is around 6.5%.  For an 8GB file it's 23%.  In reality, the
block checksums are not completely random, so the real probability of a
collision will be higher.
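
Those figures come from the usual birthday-bound approximation
1 - exp(-n*(n-1)/(2*m)), where n is the number of blocks and m = 2^48
is the number of possible checksums.  In Python (again assuming the
checksums are uniformly random):

    import math

    m = 2.0**48                  # possible 48-bit checksum values
    n = 2**32 // 700             # blocks in a 4GB file, about 6.1 million
    print(1 - math.exp(-n * (n - 1) / (2 * m)))    # ~0.065, i.e. 6.5%

    n = 2**33 // 700             # blocks in an 8GB file
    print(1 - math.exp(-n * (n - 1) / (2 * m)))    # ~0.23, i.e. 23%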

If we increase the block size to 2048, the probabilities drop to
0.8% for a 4GB file and 3% for an 8GB file.  For a block size of
4096 we get 0.2% for a 4GB file and 0.8% for an 8GB file.
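
The same approximation reproduces those numbers too:

    import math

    m = 2.0**48                                  # possible 48-bit checksums
    for block in (2048, 4096):
        for gb in (4, 8):
            n = gb * 2**30 // block              # number of blocks
            p = 1 - math.exp(-n * (n - 1) / (2 * m))
            print(block, gb, round(p, 4))        # 0.0078, 0.0308, 0.002, 0.0078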

To test this theory, try a bigger --block-size (eg: 4096).  If you
still see a similar number of files needing a second pass, then my
theory is wrong and a bug could be the cause.
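
For example (the paths below are just placeholders), something like:

    rsync -av --block-size=4096 /path/to/bigfiles/ remotehost:/path/to/dest/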

If the theory is supported by your tests (ie: most/all files work on the
first pass), then rsync could use an adaptive-length first-pass checksum:
use one or two more bytes from the MD4 block checksum (ie: 56 or 64 bits
total) for files bigger than, say, 256MB and 2GB respectively.  Both
sides know the file size, so both can pick the same checksum length.
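
A minimal sketch of that policy, in Python (the thresholds and the
function name are just my illustration, not a patch):

    # Hypothetical: how many MD4 bytes to put in the first-pass block
    # checksum, based on the file size.  Sender and receiver both know
    # the file size, so both would compute the same length.
    def first_pass_md4_bytes(file_size):
        if file_size > 2 * 2**30:        # > 2GB: 64 bits total (32 rolling + 32 MD4)
            return 4
        if file_size > 256 * 2**20:      # > 256MB: 56 bits total
            return 3
        return 2                         # default: 48 bits total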

Craig


