[ccache] Using git file hashes for ccache

Fri Dec 31 06:12:25 MST 2010

On Fri, Dec 31, 2010 at 4:27 AM, Wilson Snyder <wsnyder at wsnyder.org> wrote:
> I also think this is a good approach, though having been
> down the road before, mtime isn't always enough as you
> noted, but including the size also makes it *almost*
> perfect.  Most edits change the size.
>
> Note several tools like scons use this technique, and some
> store the hashes in a single hash file inside each source
> directory.  That has the nice advantage of allowing sharing,
> though the downside of polluting the source areas so I don't
> really like it.  I think putting it into the ccache
> infrastructure is nicer; but you may still want multiple
> hashes to be stored under a hash of the directory name,
> instead of a hash of the filename, because that allows
> reading fewer files.  (Otherwise reading the hundreds of
> hash files will become the new bottleneck.)

I actually see 3 different variants being discussed in this thread:

A) index based on hash of file name + attributes instead of hash of
file contents
B) index based on hash of file contents, but have ccache maintain a
database of (file name + attributes) -> (hash of file contents) pairs
C) index based on hash of file contents, and use git index for looking
up (file name + attributes) -> (hash of file contents) pairs

A is simplest, and would probably work well enough for system include
files. Not so much for project files though, especially if we want to
support CCACHE_BASEDIR (ctime/mtime probably won't match across
checked out versions).
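
To make A concrete, here is a rough sketch (Python for brevity; the
function name and the choice of attributes are my own assumptions, not
anything ccache does today). The key is derived only from the path and
stat data, so the file is never read:

```python
import hashlib
import os


def attribute_key(path):
    """Hypothetical variant-A key: hash of path + size + mtime + ctime.

    The file contents are never read, only the stat data.
    """
    st = os.stat(path)
    fields = (os.path.abspath(path), st.st_size, st.st_mtime_ns, st.st_ctime_ns)
    return hashlib.sha1(repr(fields).encode("utf-8")).hexdigest()
```

Two checkouts of the same file in different directories get different
keys here even when the contents are identical, which is the
CCACHE_BASEDIR problem mentioned above.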

B could work pretty well, I think. There is the question of where to
store that new database, but it's probably doable - the database is
only a cache, so it's always OK to expire entries if it grows too
much.
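
A minimal sketch of what B might look like, assuming an in-memory
table with oldest-first eviction (the class name, the max_entries
limit and the use of SHA-1 are all placeholders of mine; a real
version would have to live on disk somewhere in the ccache directory):

```python
import hashlib
import os
from collections import OrderedDict


class StatHashCache:
    """Hypothetical map of (path, size, mtime, ctime) -> content hash."""

    def __init__(self, max_entries=10000):
        self.max_entries = max_entries
        self.entries = OrderedDict()
        self.misses = 0  # number of times a file actually had to be read

    def content_hash(self, path):
        st = os.stat(path)
        key = (os.path.abspath(path), st.st_size, st.st_mtime_ns, st.st_ctime_ns)
        if key in self.entries:
            # Attributes unchanged: reuse the stored hash without reading the file.
            self.entries.move_to_end(key)
            return self.entries[key]
        self.misses += 1
        with open(path, "rb") as f:
            digest = hashlib.sha1(f.read()).hexdigest()
        self.entries[key] = digest
        # The table is only a cache, so dropping the oldest entries is always safe.
        while len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)
        return digest
```

The compile cache itself is still indexed by content hash, so a stale
or evicted entry only costs one extra read of the file.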

C benefits people who frequently switch their git workspace between
multiple branches. When switching back to a previously compiled
branch, the file mtimes will be updated, but the git index shows that
the contents haven't changed. This type of operation is the source of many
ccache hits for me (after all, the compiler wouldn't even get invoked
by make if no mtimes had changed).

Making C work seems complicated, as we'd need to be able to read the
git index. OTOH, this also nicely solves the problem of expiring
database entries: git is in charge of maintaining the index so we
don't need to care about it for project files, and out-of-project
files such as system headers shouldn't change nearly as often so we'd
hardly ever need to expire them from the ccache database. We could
even avoid any problems of concurrent database updates by just never
having ccache update any (file name + attributes) -> (hash of file
contents) database - git would be in charge of updating its index for
in-project files, and we could have an out-of-line ccache option to do
it for infrequently-modified system files...
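
For what it's worth, reading the index may not be that bad. Below is a
rough sketch that handles only version 2 of the index format and
ignores extensions, the trailing checksum and merge stages; the
function names are made up for illustration and nothing here is
existing ccache code. It pulls out the (file name + attributes) ->
(blob hash) pairs and does a lookup against them:

```python
import struct

# Fixed part of a v2 index entry: ten 32-bit stat fields (ctime, mtime, dev,
# ino, mode, uid, gid, size), the 20-byte SHA-1, and 16 bits of flags.
ENTRY_FIXED = struct.Struct(">10I20sH")


def parse_git_index_v2(data):
    """Return {path: (size, mtime_sec, mtime_nsec, blob_sha1_hex)} from raw index bytes."""
    signature, version, count = struct.unpack_from(">4sII", data, 0)
    if signature != b"DIRC" or version != 2:
        raise ValueError("only version 2 index files are handled in this sketch")
    entries = {}
    pos = 12  # entries start right after the 12-byte header
    for _ in range(count):
        (_ctime_s, _ctime_ns, mtime_s, mtime_ns, _dev, _ino, _mode, _uid, _gid,
         size, sha1, _flags) = ENTRY_FIXED.unpack_from(data, pos)
        name_start = pos + ENTRY_FIXED.size
        name_end = data.index(b"\0", name_start)
        path = data[name_start:name_end].decode("utf-8", "surrogateescape")
        entries[path] = (size, mtime_s, mtime_ns, sha1.hex())
        # Each entry is padded with 1 to 8 NULs to a multiple of 8 bytes.
        pos += (ENTRY_FIXED.size + (name_end - name_start) + 8) & ~7
    return entries


def lookup_blob(entries, path, size, mtime_s):
    """Return the blob hash if name and attributes match the index, else None."""
    entry = entries.get(path)
    if entry is not None and entry[0] == size and entry[1] == mtime_s:
        return entry[3]
    return None
```

One catch: git's blob hash is SHA-1 over a "blob <size>\0" header
followed by the contents, not over the raw contents alone, so ccache
would have to hash out-of-project files the same way for the two
sources to agree.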

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.