[ccache] Duplicate object files in the ccache - possible optimization?

Mon Nov 7 11:55:43 MST 2011

> Looking at a ccache with about 40,000 .o files in it (created with direct
> mode turned on); of the 55 largest files, I found 11 pairs and one triplet
> of identical object files.  That's almost 25% of redundant storage that
> could have been avoided by looking at the preprocessed hash when there is no
> hit in direct mode.

It's much more interesting to look at the whole cache, I think.

$ find -name '*.o' -type f | wc -l
39312
jlebar at turing:~/.ccache$ find -name '*.o' -type f | xargs -P16 sha1sum
| cut -f 1 -d ' ' | sort | uniq -d | wc -l
1230

So it looks like there's some duplication on my machine, but not a
ton.  I'd be curious if you got significantly different numbers.

On Mon, Nov 7, 2011 at 12:49 PM, Frank Klotz
<frank.klotz at alcatel-lucent.com> wrote:
>  Hi Martin,
>
> Thanks for your responses.
>
> s/index hash/direct mode hash/g
>
> Apologies - I had a brain burp and was using the wrong terminology.
>
> That aside, however, with  the advent of direct mode, there ARE two hashes
> possible for any given object file - the direct mode hash (hashing all the
> sources that go into the compilation) and the preprocessed hash (hashing the
> result of running all those sources through the preprocessor).  And any time
> there is a cache miss, ccache has computed both those hashes, hasn't it?
>  (Or maybe not - if not, see discussion below.)  And it appears to me that
> in many cases, the resulting object file occurs twice in the cache, once
> under each hash.  And currently, those two occurrences are two separate
> files, which could be combined into a single inode with two hard-linked
> directory entries.
>
> Or am I confused about how direct mode interacts with preprocessed mode?  If
> running in direct mode, does ccache never compute the preprocessed hash?  If
> not, it obviously could, and I would recommend that it should.  Why?
>  Because when changes are made to a widely-used header file, it very
> commonly occurs that those changes only actually modify the preprocessor
> output of a small subset of the sources that include that header file, while
> many other sources don't use the changes (say, definition of new macros or
> new constants), so end up with the same preprocessed output, and the same
> object file, even though the input header files and direct mode hash did
> change).  In that case, ccache could still find hits in the cache with the
> preprocessed mode, even if it's a miss with the direct mode hash.  If ccache
> does not get a direct mode hit, it certainly will have to RUN the
> preprocessor to recompile the file - how much extra cost to compute the
> preprocessed hash, look it up (to avoid recompiling if it is found with THAT
> hash), and if a compile is still needed, store the resulting object file
> inode with 2 directory entries rather than just one?
>
> The way I read the doc about how direct mode works, I thought it would
> compute the direct mode hash, and if no hit, "fall back to preprocessed
> mode".  I thought that meant it would compute the preprocessed hash and look
> for that too.   Is that incorrect - does it only compute ONE hash in all
> cases - a direct mode hash if running in direct mode and a preprocessed hash
> if not in direct mode?  If so, then let's modify my suggested enhancement to
> be that in direct mode, calculate and use the preprocessed hash whenever
> there is no hit with direct mode, and create hard links using all computed
> hashes to the one single object file inode that eventually exists in the
> ccache.  I don't think direct mode and preprocessed mode HAVE to be mutually
> exclusive - when direct mode gets a miss, preprocessed mode can still often
> provide a hit.
>
> And if no preprocessed hash gets computed/stored when running in direct
> mode, then I suspect that the reason I see so many pairs of identical object
> files in my ccache is because of the situation I describe above, where a
> header file change has triggered a direct mode hash miss, but preprocessing
> the sources has resulted in an identical preprocessed file which was then
> passed to the compiler which produced an identical object file.  But ccache
> didn't KNOW that they were identical, because it didn't compute the
> preprocessed hash.
>
> Looking at a ccache with about 40,000 .o files in it (created with direct
> mode turned on); of the 55 largest files, I found 11 pairs and one triplet
> of identical object files.  That's almost 25% of redundant storage that
> could have been avoided by looking at the preprocessed hash when there is no
> hit in direct mode.
>
> Thanks,
> Frank
>
>
> On 11/07/2011 12:53 AM, Martin Pool wrote:
>>
>> On 5 November 2011 11:12, Frank Klotz<frank.klotz at alcatel-lucent.com>
>>  wrote:
>>>
>>>  I used ccache at my previous employer, and was very convinced of its
>>> value.
>>>  Now that I have started a new job, I am in the process of trying to
>>> bring
>>> the new shop on board with ccache, so I have been doing lots of test runs
>>> and looking at things.  Here is one thing I am thinking could add some
>>> value.
>>>
>>> Looking through the ccache, I find many pairs of files which have
>>> different
>>> names (different hashes), but exactly identical content.  This actually
>>> makes sense, as each file would have an index hash and a preprocessed
>>> hash,
>>> and since ccache needs to be able to find a match on either, then both
>>> need
>>> to be in the cache.
>>
>> What is an index hash?
>>
>>>  (Actually, thinking about it, I'm a little surprised
>>> that there are any files in the ccache that DON'T appear twice -
>>> shouldn't
>>> EVERY compilation have 2 hashes?)
>>
>> I don't understand why you would expect that.
>>
>> It seems like you expect there is another indirection layer by which
>> ccache tries to find jobs that produce identical output.  I don't
>> think there is one at present.  I don't think this would happen very
>> often in reality, except perhaps for trivial cases like compiling
>> empty files, and that's not so important to accelerate, and it will
>> not use up much disk space.
>>
>> If you're getting duplicated cache files due to for instance doing
>> builds in different directories or from different trees that produce
>> identical output you could change the ccache options to make it less
>> stringent.
>>
>>> But it seems to me that it would make a lot of sense to store the data of
>>> these 2 files only once, by hard-linking the 2 names to the same inode.
>>>  (For filesystems that support hard links, of course!)  Every time ccache
>>> does an actual compilation and stores a file in the cache, it should
>>> store
>>> it under hard links for BOTH hashes - the indexed hash and the
>>> proprocessed
>>> hash.  And if it gets a hash miss on the indexed hash but a hit on the
>>> preprocessed hash, then it should add the missed index hash as a hard
>>> link
>>> to the file found.  So a given file (inode) in the cache could actually
>>> be
>>> referenced by MANY directory entries: one preprocessed hash, and multiple
>>> index hashes for various different combinations of source files and
>>> header
>>> files which end up producing the same output when passed through the
>>> preprocessor.
>>
>> This mail is the first time google has heard of "ccache indexed hash"...
>>
>>> This could increase the storage efficiency of the ccache.
>>>
>>> Of course, since not every filesystem supports hard links, the simplest
>>> solution was of course just to have multiple file copies.  So I guess
>>> adding
>>> code to do this would require some way to determine if the filesystem the
>>> cache is on can in fact support hardlinks.
>>>
>>> If you think this sounds like a good idea, but don't have bandwidth to do
>>> it, I would be willing to give it a try.  Any hints on where to start
>>> would
>>> of course be welcome.
>>>
>>> Thanks,
>>> Frank Klotz
>>> _______________________________________________
>>> ccache mailing list
>>> ccache at lists.samba.org
>>> https://lists.samba.org/mailman/listinfo/ccache
>>>
>>>
>
> _______________________________________________
> ccache mailing list
> ccache at lists.samba.org
> https://lists.samba.org/mailman/listinfo/ccache
>