[ccache] Duplicate object files in the ccache - possible optimization?

Mon Nov 7 19:03:43 MST 2011

  On 11/07/2011 10:55 AM, Justin Lebar wrote:
>> Looking at a ccache with about 40,000 .o files in it (created with direct
>> mode turned on); of the 55 largest files, I found 11 pairs and one triplet
>> of identical object files.  That's almost 25% of redundant storage that
>> could have been avoided by looking at the preprocessed hash when there is no
>> hit in direct mode.
> It's much more interesting to look at the whole cache, I think.
>
> $ find -name '*.o' -type f | wc -l
> 39312
> jlebar at turing:~/.ccache$ find -name '*.o' -type f | xargs -P16 sha1sum
> | cut -f 1 -d ' ' | sort | uniq -d | wc -l
> 1230
>
> So it looks like there's some duplication on my machine, but not a
> ton.  I'd be curious if you got significantly different numbers.
Hi Justin,

Here's what I got:

 > find -name '*.o' -type f | wc -l
43507
 > find -name '*.o' -type f | xargs -P16 sha1sum | cut -f 1 -d ' ' | 
sort | uniq -d | wc -l
13087

I would say the difference is significant - you got 3%, while I got 30%.

This is in a project environment where we have fairly fast churn in a 
number of widely-used header files, and I am guessing that the changes 
in those files often consist of addition of or changes to macros and 
constants that are not used in all the files where the header files get 
included.

Now I am more than ready to agree that there is room for improvement in 
the design and implementation of the header file layouts here, and we 
are working on that; but at the same time it looks to me like a fairly 
straightforward enhancement to ccache could give us some performance 
boost here.  Since ccache already knows how to do preprocessed hashes, I 
think all I am looking to do is to have it use that existing code when 
the direct mode hash doesn't get a hit.  Then put both hash names on the 
resulting object file, and voila: better hit rates and more unique files 
in the cache.

It would be interesting to know if many others see anything at all 
similar to the numbers I have here (i.e., lots of people could use the 
enhancement I am suggesting), of if there is something unique about this 
environment that leads to this much duplication in the cache.

Thanks,
Frank
> On Mon, Nov 7, 2011 at 12:49 PM, Frank Klotz
> <frank.klotz at alcatel-lucent.com>  wrote:
>>   Hi Martin,
>>
>> Thanks for your responses.
>>
>> s/index hash/direct mode hash/g
>>
>> Apologies - I had a brain burp and was using the wrong terminology.
>>
>> That aside, however, with  the advent of direct mode, there ARE two hashes
>> possible for any given object file - the direct mode hash (hashing all the
>> sources that go into the compilation) and the preprocessed hash (hashing the
>> result of running all those sources through the preprocessor).  And any time
>> there is a cache miss, ccache has computed both those hashes, hasn't it?
>>   (Or maybe not - if not, see discussion below.)  And it appears to me that
>> in many cases, the resulting object file occurs twice in the cache, once
>> under each hash.  And currently, those two occurrences are two separate
>> files, which could be combined into a single inode with two hard-linked
>> directory entries.
>>
>> Or am I confused about how direct mode interacts with preprocessed mode?  If
>> running in direct mode, does ccache never compute the preprocessed hash?  If
>> not, it obviously could, and I would recommend that it should.  Why?
>>   Because when changes are made to a widely-used header file, it very
>> commonly occurs that those changes only actually modify the preprocessor
>> output of a small subset of the sources that include that header file, while
>> many other sources don't use the changes (say, definition of new macros or
>> new constants), so end up with the same preprocessed output, and the same
>> object file, even though the input header files and direct mode hash did
>> change).  In that case, ccache could still find hits in the cache with the
>> preprocessed mode, even if it's a miss with the direct mode hash.  If ccache
>> does not get a direct mode hit, it certainly will have to RUN the
>> preprocessor to recompile the file - how much extra cost to compute the
>> preprocessed hash, look it up (to avoid recompiling if it is found with THAT
>> hash), and if a compile is still needed, store the resulting object file
>> inode with 2 directory entries rather than just one?
>>
>> The way I read the doc about how direct mode works, I thought it would
>> compute the direct mode hash, and if no hit, "fall back to preprocessed
>> mode".  I thought that meant it would compute the preprocessed hash and look
>> for that too.   Is that incorrect - does it only compute ONE hash in all
>> cases - a direct mode hash if running in direct mode and a preprocessed hash
>> if not in direct mode?  If so, then let's modify my suggested enhancement to
>> be that in direct mode, calculate and use the preprocessed hash whenever
>> there is no hit with direct mode, and create hard links using all computed
>> hashes to the one single object file inode that eventually exists in the
>> ccache.  I don't think direct mode and preprocessed mode HAVE to be mutually
>> exclusive - when direct mode gets a miss, preprocessed mode can still often
>> provide a hit.
>>
>> And if no preprocessed hash gets computed/stored when running in direct
>> mode, then I suspect that the reason I see so many pairs of identical object
>> files in my ccache is because of the situation I describe above, where a
>> header file change has triggered a direct mode hash miss, but preprocessing
>> the sources has resulted in an identical preprocessed file which was then
>> passed to the compiler which produced an identical object file.  But ccache
>> didn't KNOW that they were identical, because it didn't compute the
>> preprocessed hash.
>>
>> Looking at a ccache with about 40,000 .o files in it (created with direct
>> mode turned on); of the 55 largest files, I found 11 pairs and one triplet
>> of identical object files.  That's almost 25% of redundant storage that
>> could have been avoided by looking at the preprocessed hash when there is no
>> hit in direct mode.
>>
>> Thanks,
>> Frank
>>
>>
>> On 11/07/2011 12:53 AM, Martin Pool wrote:
>>> On 5 November 2011 11:12, Frank Klotz<frank.klotz at alcatel-lucent.com>
>>>   wrote:
>>>>   I used ccache at my previous employer, and was very convinced of its
>>>> value.
>>>>   Now that I have started a new job, I am in the process of trying to
>>>> bring
>>>> the new shop on board with ccache, so I have been doing lots of test runs
>>>> and looking at things.  Here is one thing I am thinking could add some
>>>> value.
>>>>
>>>> Looking through the ccache, I find many pairs of files which have
>>>> different
>>>> names (different hashes), but exactly identical content.  This actually
>>>> makes sense, as each file would have an index hash and a preprocessed
>>>> hash,
>>>> and since ccache needs to be able to find a match on either, then both
>>>> need
>>>> to be in the cache.
>>> What is an index hash?
>>>
>>>>   (Actually, thinking about it, I'm a little surprised
>>>> that there are any files in the ccache that DON'T appear twice -
>>>> shouldn't
>>>> EVERY compilation have 2 hashes?)
>>> I don't understand why you would expect that.
>>>
>>> It seems like you expect there is another indirection layer by which
>>> ccache tries to find jobs that produce identical output.  I don't
>>> think there is one at present.  I don't think this would happen very
>>> often in reality, except perhaps for trivial cases like compiling
>>> empty files, and that's not so important to accelerate, and it will
>>> not use up much disk space.
>>>
>>> If you're getting duplicated cache files due to for instance doing
>>> builds in different directories or from different trees that produce
>>> identical output you could change the ccache options to make it less
>>> stringent.
>>>
>>>> But it seems to me that it would make a lot of sense to store the data of
>>>> these 2 files only once, by hard-linking the 2 names to the same inode.
>>>>   (For filesystems that support hard links, of course!)  Every time ccache
>>>> does an actual compilation and stores a file in the cache, it should
>>>> store
>>>> it under hard links for BOTH hashes - the indexed hash and the
>>>> proprocessed
>>>> hash.  And if it gets a hash miss on the indexed hash but a hit on the
>>>> preprocessed hash, then it should add the missed index hash as a hard
>>>> link
>>>> to the file found.  So a given file (inode) in the cache could actually
>>>> be
>>>> referenced by MANY directory entries: one preprocessed hash, and multiple
>>>> index hashes for various different combinations of source files and
>>>> header
>>>> files which end up producing the same output when passed through the
>>>> preprocessor.
>>> This mail is the first time google has heard of "ccache indexed hash"...
>>>
>>>> This could increase the storage efficiency of the ccache.
>>>>
>>>> Of course, since not every filesystem supports hard links, the simplest
>>>> solution was of course just to have multiple file copies.  So I guess
>>>> adding
>>>> code to do this would require some way to determine if the filesystem the
>>>> cache is on can in fact support hardlinks.
>>>>
>>>> If you think this sounds like a good idea, but don't have bandwidth to do
>>>> it, I would be willing to give it a try.  Any hints on where to start
>>>> would
>>>> of course be welcome.
>>>>
>>>> Thanks,
>>>> Frank Klotz
>>>> _______________________________________________
>>>> ccache mailing list
>>>> ccache at lists.samba.org
>>>> https://lists.samba.org/mailman/listinfo/ccache
>>>>
>>>>
>> _______________________________________________
>> ccache mailing list
>> ccache at lists.samba.org
>> https://lists.samba.org/mailman/listinfo/ccache
>>