[ccache] Why not using MD5?

Sun Nov 14 23:49:58 GMT 2004

Hi,

while reading the implementation notes for ccache, I came to the
question why it uses MD4 and not MD5. MD4 has a significantly
higher collision probability (2^20 instead of 2^64, cf. Menezes/
Oorschot/Vanstone, 'Handbook of Applied Cryptography', p. 339,
<http://www.cacr.math.uwaterloo.ca/hac/> chapter 9). This means
that one can expect a 50 % collision probability after just one
million different compilations with MD4; it becomes so much less
probable with MD5 that I don't know how to express it in English.

But a million lies within reasonable bounds. For example, compiling
a Linux kernel produces about 1000 objects in a typical setup. Then
if one uses a cache for two years for twenty-five projects of that
size in the average, changing the compiler version four times within
that period, ten versions of each project that change a global header
will suffice to produce one million different objects. A compilation
server for a Linux distribution with that usage pattern will then
reach a 50 % probability of a ccache collision within these two years.
(This analysis might not be totally correct because ccache does not
pass randomly generated input to MD4. However, it seems clear that
the risk of a collision cannot be neglected here.)

Using MD5 comes with a speed penalty (cf. p. 345 of the named book)
but it seems absolutely negligible for ccache. A quick profiling run
shows that MD4 consumes about 1/100 of the time the preprocessor needs
with Linux/glibc. If MD5 consumes 1/70 of its time, there is still no
user-visible difference. I have tested ccache with both MD4 and MD5
and could not measure any resulting difference without profiling again.

So I would suggest that future ccache versions use MD5 instead of MD4.

Changing ccache to use MD5 is trivial. I can send a patch if someone
really wishes, though.

	Gunnar

-- 
http://omnibus.ruf.uni-freiburg.de/~gritter