[ccache] Duplicate object files in the ccache - possible optimization?

Mon Nov 7 01:53:53 MST 2011

On 5 November 2011 11:12, Frank Klotz <frank.klotz at alcatel-lucent.com> wrote:
> I used ccache at my previous employer, and was very convinced of its value.
> Now that I have started a new job, I am in the process of trying to bring
> the new shop on board with ccache, so I have been doing lots of test runs
> and looking at things.  Here is one thing I am thinking could add some
> value.
>
> Looking through the ccache, I find many pairs of files which have different
> names (different hashes), but exactly identical content.  This actually
> makes sense, as each file would have an index hash and a preprocessed hash,
> and since ccache needs to be able to find a match on either, then both need
> to be in the cache.

What is an index hash?

> (Actually, thinking about it, I'm a little surprised
> that there are any files in the ccache that DON'T appear twice - shouldn't
> EVERY compilation have 2 hashes?)

I don't understand why you would expect that.

It seems like you expect another indirection layer by which ccache
tries to find jobs that produce identical output.  I don't think there
is one at present.  I also don't think different jobs produce
identical output very often in reality, except perhaps in trivial
cases such as compiling empty files.  Those cases are not important to
accelerate, and they will not use up much disk space.
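
One way to check how often this really happens in a given cache is to
group the cached files by a digest of their content.  The following is
only a rough sketch, not anything ccache ships: the function name
find_duplicate_files and the choice of SHA-256 are my own assumptions,
and it simply walks whatever directory you point it at.

```python
import hashlib
import os
from collections import defaultdict


def find_duplicate_files(cache_dir):
    """Group regular files under cache_dir by SHA-256 of their content.

    Returns {digest: [paths]} for every digest seen more than once.
    Names that are already hard links to one inode are reported too,
    since this sketch compares content only.
    """
    by_digest = defaultdict(list)
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            path = os.path.join(root, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            h = hashlib.sha256()
            with open(path, "rb") as f:
                # Read in 1 MiB chunks so large objects are not loaded whole.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            by_digest[h.hexdigest()].append(path)
    return {d: sorted(p) for d, p in by_digest.items() if len(p) > 1}
```

Running it over the cache directory and summing the sizes of the
reported groups would show how much space the duplicates actually cost.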

If you're getting duplicated cache files because, for instance, you
do builds in different directories or from different trees that
produce identical output, you could change the ccache options to make
its matching less stringent.

>
> But it seems to me that it would make a lot of sense to store the data of
> these 2 files only once, by hard-linking the 2 names to the same inode.
> (For filesystems that support hard links, of course!)  Every time ccache
> does an actual compilation and stores a file in the cache, it should store
> it under hard links for BOTH hashes - the indexed hash and the preprocessed
> hash.  And if it gets a hash miss on the indexed hash but a hit on the
> preprocessed hash, then it should add the missed index hash as a hard link
> to the file found.  So a given file (inode) in the cache could actually be
> referenced by MANY directory entries: one preprocessed hash, and multiple
> index hashes for various different combinations of source files and header
> files which end up producing the same output when passed through the
> preprocessor.
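
For concreteness, the hard-linking step proposed above could look
roughly like this.  It is a sketch under my own assumptions, not how
ccache stores anything today: link_duplicate and the ".tmp-link"
suffix are made-up names, and the paths are just two cache entries
already known to be candidates.

```python
import filecmp
import os


def link_duplicate(existing, duplicate):
    """Replace `duplicate` with a hard link to `existing` if contents match.

    Returns True if both names refer to one inode afterwards,
    False if the contents differ (nothing is changed in that case).
    """
    a, b = os.stat(existing), os.stat(duplicate)
    if (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino):
        return True  # already the same inode, nothing to do
    # Byte-by-byte comparison; shallow=False ignores the stat shortcut.
    if not filecmp.cmp(existing, duplicate, shallow=False):
        return False
    tmp = duplicate + ".tmp-link"  # hypothetical temporary name
    os.link(existing, tmp)         # raises OSError where links are unsupported
    os.replace(tmp, duplicate)     # rename over the old copy
    return True
```

Linking to a temporary name and renaming it over the old entry means a
reader never sees the name missing, which matters if several ccache
processes share the cache.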

This mail is the first time Google has heard of "ccache indexed hash"...

>
> This could increase the storage efficiency of the ccache.
>
> Of course, since not every filesystem supports hard links, the simplest
> solution was of course just to have multiple file copies.  So I guess adding
> code to do this would require some way to determine if the filesystem the
> cache is on can in fact support hardlinks.
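
Probing for hard link support is probably easiest done by just trying
it in the cache directory.  A minimal sketch, assuming a hypothetical
helper supports_hard_links and a throwaway probe file (neither exists
in ccache):

```python
import os
import tempfile


def supports_hard_links(directory):
    """Return True if a hard link can be created inside `directory`."""
    # Create a throwaway file in the directory being probed.
    fd, src = tempfile.mkstemp(dir=directory, prefix="linkprobe-")
    os.close(fd)
    dst = src + ".lnk"
    try:
        os.link(src, dst)
    except (OSError, NotImplementedError):
        # The filesystem or platform refused to create the link.
        return False
    else:
        os.unlink(dst)
        return True
    finally:
        # Always remove the probe file, whatever the outcome.
        os.unlink(src)
```

The result could be checked once per run, with plain copies used as
the fallback when it returns False.
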
>
> If you think this sounds like a good idea, but don't have bandwidth to do
> it, I would be willing to give it a try.  Any hints on where to start would
> of course be welcome.
>
> Thanks,
> Frank Klotz