[ccache] Duplicate object files in the ccache - possible optimization?

Mon Nov 7 10:49:57 MST 2011

  Hi Martin,

Thanks for your responses.

s/index hash/direct mode hash/g

Apologies - I had a brain burp and was using the wrong terminology.

That aside, however, with  the advent of direct mode, there ARE two 
hashes possible for any given object file - the direct mode hash 
(hashing all the sources that go into the compilation) and the 
preprocessed hash (hashing the result of running all those sources 
through the preprocessor).  And any time there is a cache miss, ccache 
has computed both those hashes, hasn't it?  (Or maybe not - if not, see 
discussion below.)  And it appears to me that in many cases, the 
resulting object file occurs twice in the cache, once under each hash.  
And currently, those two occurrences are two separate files, which could 
be combined into a single inode with two hard-linked directory entries.

Or am I confused about how direct mode interacts with preprocessed 
mode?  If running in direct mode, does ccache never compute the 
preprocessed hash?  If not, it obviously could, and I would recommend 
that it should.  Why?  Because when changes are made to a widely-used 
header file, it very commonly occurs that those changes only actually 
modify the preprocessor output of a small subset of the sources that 
include that header file, while many other sources don't use the changes 
(say, definition of new macros or new constants), so end up with the 
same preprocessed output, and the same object file, even though the 
input header files and direct mode hash did change).  In that case, 
ccache could still find hits in the cache with the preprocessed mode, 
even if it's a miss with the direct mode hash.  If ccache does not get a 
direct mode hit, it certainly will have to RUN the preprocessor to 
recompile the file - how much extra cost to compute the preprocessed 
hash, look it up (to avoid recompiling if it is found with THAT hash), 
and if a compile is still needed, store the resulting object file inode 
with 2 directory entries rather than just one?

The way I read the doc about how direct mode works, I thought it would 
compute the direct mode hash, and if no hit, "fall back to preprocessed 
mode".  I thought that meant it would compute the preprocessed hash and 
look for that too.   Is that incorrect - does it only compute ONE hash 
in all cases - a direct mode hash if running in direct mode and a 
preprocessed hash if not in direct mode?  If so, then let's modify my 
suggested enhancement to be that in direct mode, calculate and use the 
preprocessed hash whenever there is no hit with direct mode, and create 
hard links using all computed hashes to the one single object file inode 
that eventually exists in the ccache.  I don't think direct mode and 
preprocessed mode HAVE to be mutually exclusive - when direct mode gets 
a miss, preprocessed mode can still often provide a hit.

And if no preprocessed hash gets computed/stored when running in direct 
mode, then I suspect that the reason I see so many pairs of identical 
object files in my ccache is because of the situation I describe above, 
where a header file change has triggered a direct mode hash miss, but 
preprocessing the sources has resulted in an identical preprocessed file 
which was then passed to the compiler which produced an identical object 
file.  But ccache didn't KNOW that they were identical, because it 
didn't compute the preprocessed hash.

Looking at a ccache with about 40,000 .o files in it (created with 
direct mode turned on); of the 55 largest files, I found 11 pairs and 
one triplet of identical object files.  That's almost 25% of redundant 
storage that could have been avoided by looking at the preprocessed hash 
when there is no hit in direct mode.

Thanks,
Frank

On 11/07/2011 12:53 AM, Martin Pool wrote:
> On 5 November 2011 11:12, Frank Klotz<frank.klotz at alcatel-lucent.com>  wrote:
>>   I used ccache at my previous employer, and was very convinced of its value.
>>   Now that I have started a new job, I am in the process of trying to bring
>> the new shop on board with ccache, so I have been doing lots of test runs
>> and looking at things.  Here is one thing I am thinking could add some
>> value.
>>
>> Looking through the ccache, I find many pairs of files which have different
>> names (different hashes), but exactly identical content.  This actually
>> makes sense, as each file would have an index hash and a preprocessed hash,
>> and since ccache needs to be able to find a match on either, then both need
>> to be in the cache.
> What is an index hash?
>
>>   (Actually, thinking about it, I'm a little surprised
>> that there are any files in the ccache that DON'T appear twice - shouldn't
>> EVERY compilation have 2 hashes?)
> I don't understand why you would expect that.
>
> It seems like you expect there is another indirection layer by which
> ccache tries to find jobs that produce identical output.  I don't
> think there is one at present.  I don't think this would happen very
> often in reality, except perhaps for trivial cases like compiling
> empty files, and that's not so important to accelerate, and it will
> not use up much disk space.
>
> If you're getting duplicated cache files due to for instance doing
> builds in different directories or from different trees that produce
> identical output you could change the ccache options to make it less
> stringent.
>
>> But it seems to me that it would make a lot of sense to store the data of
>> these 2 files only once, by hard-linking the 2 names to the same inode.
>>   (For filesystems that support hard links, of course!)  Every time ccache
>> does an actual compilation and stores a file in the cache, it should store
>> it under hard links for BOTH hashes - the indexed hash and the proprocessed
>> hash.  And if it gets a hash miss on the indexed hash but a hit on the
>> preprocessed hash, then it should add the missed index hash as a hard link
>> to the file found.  So a given file (inode) in the cache could actually be
>> referenced by MANY directory entries: one preprocessed hash, and multiple
>> index hashes for various different combinations of source files and header
>> files which end up producing the same output when passed through the
>> preprocessor.
> This mail is the first time google has heard of "ccache indexed hash"...
>
>> This could increase the storage efficiency of the ccache.
>>
>> Of course, since not every filesystem supports hard links, the simplest
>> solution was of course just to have multiple file copies.  So I guess adding
>> code to do this would require some way to determine if the filesystem the
>> cache is on can in fact support hardlinks.
>>
>> If you think this sounds like a good idea, but don't have bandwidth to do
>> it, I would be willing to give it a try.  Any hints on where to start would
>> of course be welcome.
>>
>> Thanks,
>> Frank Klotz
>> _______________________________________________
>> ccache mailing list
>> ccache at lists.samba.org
>> https://lists.samba.org/mailman/listinfo/ccache
>>
>>