Problem with checksum failing on large files
burley at tabq.com.au
Mon Oct 14 00:07:00 EST 2002
This is an interesting problem ;-)
I think I understand Craig'a theory but what I don't understand is why
the second time rsync is applied to Terry's large file that the transfer
is successful? Aren't the two blocks that are actually different that
matched by chance going to match every time. In addition to this,
Craig's analysis works for detecting blocks that are the
_same_. ie. really the same, the same data in the block. Maybe the
probabliity calcs need to consider the chance of a same checksum on two
different blocks which is the only way to end up with different files
after a successful rsync. Isn't it, I'm not sure?
Craig Barratt <craig at atheros.com> writes:
> terry> I'm having a problem with large files being rsync'd twice
> terry> because of the checksum failing.
> terry> Is there a different checksum mechanism used on the second
> terry> pass (e.g., different length)? If so, perhaps there is an
> terry> issue with large files for what is used by default for the
> terry> first pass?
> The first pass block checksum is 48 bits: the 32 bit adler32 (rolling)
> checksum, plus the first 2 bytes of the MD4 block checksum. The second
> pass is 160 bits: the same 32 bit adler32 (rolling) plus the entire 128
> bit MD4 block checksum.
> donovan> I wonder if this is related to the rsync md4sum not producing
> donovan> "correct" md4sums for files larger than 512M?
> donovan> The existing rsync md4sum implementation does not produce
> donovan> md4sums the same as the RSA implementation for files larger
> donovan> than 512M... but I thought it was consistant with itself so
> donovan> this didn't affect anything.
> I doubt this matters, for just the reason you mention: it is consistent
> and statistically it is still well behaved, so it won't matter.
> My theory is that this is expected behavior given the check sum size.
> Now, 48 bits sounds like a lot.
> Let's start with an analogy. If I have 23 (randomly-selected) people
> in a room, what is the probability that some pair of people have the
> same birthday? You might guess it is quite small, maybe 23/365. But
> that's wrong. It's actually more than 50%. The probability that 3
> people have different birthdays is:
> 364/365 * 363/365.
> Similarly, the probability that 23 people all have unique birthdays is
> 364/365 * 363/365 * .... * 343/365,
> which is less than 0.5 (50%).
> So, back to our first pass checksum. A 4GB file has 2^32 / 700 blocks.
> (The blocks are like the people, each birthday is the checksum, and the
> 2^48 possible checksums are like the 365 days in the year.) Let's assume
> the 48 bit checksums are random. What's the chance that two blocks have
> the same checksum? It sounds very unlikely, but the chance is around
> 6.5%. For an 8GB file it's 23%. In reality, the block checksums are
> not completely random, so the real probabilities of a collision will
> be higher.
> If we increase the block size to 2048, the probabilities drop to
> 0.8% for a 4GB file and 3% for an 8GB file. For a block size of
> 4096 we get 0.2% for a 4GB file and 0.8% for an 8GB file.
> To test this theory, try a bigger --block-size (eg: 4096). If you
> still see a similar number of files needing a repeat then my theory
> is wrong, and a bug could be the cause.
> If the theory is supported by your tests (ie: most/all files work on the
> first pass) then rsync could use an adaptive-length first pass checksum:
> use one or two more bytes from the MD4 block checksum (ie: 56 or 64 bits
> total) for files bigger than, say, 256MB and 2GB. Both sides know the
> file size.
> To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
> Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html
More information about the rsync