[ccache] Questions about two hot functions in ccache

Joel Rosdahl joel at rosdahl.net
Tue Oct 19 15:15:00 MDT 2010

On 2010-10-18 03:44, Justin Lebar wrote:
> So it appears that 13% of my CPU time is spent computing md4 hashes,
> while another 25% is spent in hash_source_code_string but outside the
> MD4 code.
> To someone new to the code like me, it appears that there's some room
> for optimization here.

Indeed. (This was by the way mentioned on the list a couple of weeks
ago: http://www.mail-archive.com/ccache@lists.samba.org/msg00532.html)

> [...]
> If all we really care about is finding the strings "__DATE__" and
> "__TIME__", there are faster algorithms than a character-by-character
> search.

Yes, the current algorithm isn't particularly good, but it was easy
enough to implement quickly when the direct mode was more or less just a
proof of concept. Also, I wanted as few false negatives as possible,
which is why I decided to care about being inside or outside strings.

> (Note also that the current implementation copies the whole file into
> hashbuf one character at a time.  Again, do the benefits of stripping
> out comments really offset this?)

While I think that stripping out comments is nice, I agree that it has
to be weighed against the cost, and the same thing for the "false
negatives in comments" thing. So I guess my answer is "it depends on the

Do you want to work on this? That would be awesome! I currently don't
have much ccache time, and when I get some, I would like to work on
other things first.

> * Why does ccache still use MD4? Surely there's a better / faster hash
> out there. I noticed that ccache includes murmurhash, but it doesn't
> seem like it's used in too many places. There's probably a good reason
> for this, but it's not too apparent to me.
> [...]

I added murmurhash to get a good hash function for some hash tables I
introduced when implementing the direct mode, which is a relatively late
invention. I'm far from an expert on hash functions, but surely
murmurhash and similar functions (that are designed to be good hash
functions for a hash table) aren't suitable to use for identifying
(without verification) arbitrary content like a cryptograhic hash
function is. Even the 64-bit version of murmurhash has way too high
collision rate.

MD4 has been there from the start and neither Tridge or I have seen any
reason to switch it. MD5, SHA1 and other even more modern cryptograhic
hash functions are indeed stronger but also slower, and the increased
resistance against various crypto attacks doesn't seem necessary in a
tool like ccache. That said, I'm sure there nowadays may exist hash
functions that are both better (i.e., with lower collision rate) AND
faster than MD4. Do you (or anyone else) know of any with properties
that would be a good fit for ccache?

-- Joel

More information about the ccache mailing list