[ccache] Using git file hashes for ccache

Martin Pool mbp at sourcefrog.net
Thu Dec 30 15:18:03 MST 2010


On December 2010 16:43, Justin Lebar <justin.lebar at gmail.com> wrote:
>> It is my understanding that in the ccache hit case, a significant
>> fraction of the running time is spent computing hashes of the original
>> source files.
>
> Yes, ccache spends most of its time hashing when it gets a direct mode
> cache hit, at least according to my measurements.  I wrote a patch a
> little while ago which uses a less-secure hash function which speeds
> up ccache somewhat; you may want to try applying it and see if it
> speeds up your builds.  (Interestingly, the ccache speed improvement
> didn't translate into faster Firefox builds for me -- I haven't had a
> chance to investigate why.)
>
>> git is also frequently used for development, makes use of file hashes,
>> and is extremely fast. When doing operations such as git diff, in the
>> common case where the source file has not been modified, git will
>> notice that the file's attributes (including mtime) match those
>> stored in the git index file, and thus it won't have to actually read
>> the file to conclude that the contents have not changed.
>
> Maybe the right thing to do would be to have ccache keep track of the
> source files' attributes.  If some environment variable was set,
> ccache would treat a file with unchanged attributes as unchanged.
> (ccache could maintain a new index into its cache, indexed on absolute
> path, or it could hash a string "magic-bitstring | file-path | file
> attributes" and use the current cache infrastructure.)  This seems a
> lot simpler than trying to interface with git.

I think that is a better approach too.  It's probably enough to just
store the mtime and (on Unix) ctime.  There are a couple of tricks to
doing this safely: if the time == the current time, you can't trust it,
because the file could be modified again before the end of the current
second.  On some filesystems (e.g. vfat) the fs only stores timestamps
at 2-second granularity.  On some Linux systems you get sub-second
accuracy while the file is in cache but not once it's been flushed to
disk (this might have been fixed).
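
To make the "can't trust a too-recent timestamp" rule concrete, here is
a rough sketch in Python.  It is only an illustration, not ccache code:
the names snapshot, is_racy and looks_unchanged are made up, and the
2-second window is an assumption chosen to cover the vfat case.

```python
import os

# Assumed worst-case timestamp granularity (vfat: 2 seconds), in ns.
GRANULARITY_NS = 2 * 1_000_000_000


def snapshot(path):
    """Record the attributes to compare later: (mtime, ctime) in ns."""
    st = os.stat(path)
    return (st.st_mtime_ns, st.st_ctime_ns)


def is_racy(snap, now_ns, granularity_ns=GRANULARITY_NS):
    """True if the newest timestamp is so close to now_ns that a later
    write could leave exactly the same timestamp behind."""
    return max(snap) + granularity_ns >= now_ns


def looks_unchanged(path, saved, saved_at_ns):
    """Trust the saved snapshot only if it was not racy at the moment
    it was taken and the attributes on disk still match it."""
    if is_racy(saved, saved_at_ns):
        return False  # fall back to hashing the file contents
    return snapshot(path) == saved
```

A caller would store time.time_ns() alongside each snapshot and pass
it back in as saved_at_ns; whenever looks_unchanged returns False the
file simply gets hashed as before.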

On the other hand, it's conceivable that reading the git index could
be faster (because of better readahead) than actually stat()ing all the
files.  But it does seem more complex, and more likely to be buggy.
Perhaps it's best to do this first by stat()ing the files, and then
later by pulling the attributes out of the git index.
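
For the stat-first variant, the key layout Justin describes above
("magic-bitstring | file-path | file attributes") could be sketched
roughly like this.  Everything here is an assumption for illustration:
the MAGIC value and the attribute_key name are invented, and SHA-256
just stands in for whatever hash the cache already uses.

```python
import hashlib
import os

# Hypothetical marker so these keys can't collide with content hashes.
MAGIC = b"stat-key-v1"


def attribute_key(path):
    """Hash magic | absolute path | mtime | ctime into a cache key."""
    st = os.stat(path)
    fields = [
        MAGIC,
        os.fsencode(os.path.abspath(path)),
        str(st.st_mtime_ns).encode(),
        str(st.st_ctime_ns).encode(),
    ]
    h = hashlib.sha256()
    for field in fields:
        # Length-prefix each field so field boundaries are unambiguous.
        h.update(len(field).to_bytes(8, "big"))
        h.update(field)
    return h.hexdigest()
```

The key changes whenever the path, mtime or ctime changes, so a stale
entry is just never looked up again rather than needing invalidation.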

-- 
Martin

