[ccache] Why not using MD5?

tridge at samba.org tridge at samba.org
Mon Nov 15 09:31:46 GMT 2004


Gunnar,

 > while reading the implementation notes for ccache, I came to the
 > question why it uses MD4 and not MD5. MD4 has a significantly
 > higher collision probability (2^20 instead of 2^64, cf. Menezes/
 > Oorschot/Vanstone, 'Handbook of Applied Cryptography', p. 339,
 > <http://www.cacr.math.uwaterloo.ca/hac/> chapter 9). This means
 > that one can expect a 50 % collision probability after just one
 > million different compilations with MD4; it becomes so much less
 > probable with MD5 that I don't know how to express it in English.

You are badly misinterpreting that table in HAC. The table gives you
an idea of how much computation (in terms of number of hash
calculations) would be required for an _attacker_ to find a
collision. This is totally different from the number of collisions
found with data chosen by a non-attack method.

As an example, think of a trivial, but long, hash based on a simple
CRC. It would take an enormous number of random pieces of data to
find a collision in a long CRC, but it would take an attacker with
knowledge of cryptanalysis only a very few cycles.

In case you don't believe me, I have in fact tried this. See
http://samba.org/ftp/unpacked/junkcode/md4-collision/ for my test code
that will tell you if you get a collision in the first approximately
ten million random md4 calculations (beyond that there is a chance of
a bucket overflow, and fixing that would require more than the 80
lines of code I have used). I have run this code and found no
collisions.

I believe that it is well established that MD4 is quite collision
resistant against random data, and I imagine that the guys at RSA
would have hung themselves with embarrassment if they had released a
message digest with a random collision rate of just 2^20. As far as I
know, the expected number of pairs of random data to produce a
collision is 2^64, which is the same as MD5.

What has been found is that MD4 is lousy as a _cryptographic_
cipher. But in ccache we aren't using its cryptographic properties. We
are just using it as a convenient, fairly collision resistant hash.

If you think you need the cryptographic properties of a good message
digest function in ccache, then I wonder what C code you are
compiling! It would have to be hand-crafted C from a crypt-analyst
hell bent on causing you trouble. In that case I recommend not
compiling their code for them.

Now, despite all of the above, we could switch ccache over to MD5 if
it gives some people a warm-fuzzy feeling, but we should do it when
there is some need to change the hash function ccache is using anyway,
to minimize disruption to existing caches. That tends to happen every
few revisions when some fundamental change is made to what goes into
the hash.

As a final warm-fuzzy for all you skeptics out there, ccache also
includes the number of bytes in the hash input in the hash filename,
like this:

  .ccache/0/2/fa9002c30d7c9ef159664ba94bdaec-355123

the 355123 is the number of bytes being hashed. So even if MD4 did
collide, it would have to collide on exactly the same length of cpp
output (plus associated hash input data). 

Cheers, Tridge

PS: If someone wants to extend my md4-collision finder to not be
memory bound then please send me a patch


More information about the ccache mailing list