[ccache] Duplicate object files in the ccache - possible optimization?

Tue Nov 8 02:36:38 MST 2011

Maybe you should try adding an option to check both hashes, and see how
that performs.
On Nov 8, 2011 1:03 PM, "Frank Klotz" <frank.klotz at alcatel-lucent.com>
wrote:

>  On 11/07/2011 10:55 AM, Justin Lebar wrote:
>
>> Looking at a ccache with about 40,000 .o files in it (created with direct
>>> mode turned on); of the 55 largest files, I found 11 pairs and one
>>> triplet
>>> of identical object files.  That's almost 25% of redundant storage that
>>> could have been avoided by looking at the preprocessed hash when there
>>> is no
>>> hit in direct mode.
>>>
>> It's much more interesting to look at the whole cache, I think.
>>
>> $ find -name '*.o' -type f | wc -l
>> 39312
>> jlebar at turing:~/.ccache$ find -name '*.o' -type f | xargs -P16 sha1sum
>> | cut -f 1 -d ' ' | sort | uniq -d | wc -l
>> 1230
>>
>> So it looks like there's some duplication on my machine, but not a
>> ton.  I'd be curious if you got significantly different numbers.
>>
> Hi Justin,
>
> Here's what I got:
>
> > find -name '*.o' -type f | wc -l
> 43507
> > find -name '*.o' -type f | xargs -P16 sha1sum | cut -f 1 -d ' ' | sort |
> uniq -d | wc -l
> 13087
>
> I would say the difference is significant - you got 3%, while I got 30%.
>
> This is in a project environment where we have fairly fast churn in a
> number of widely-used header files, and I am guessing that the changes in
> those files often consist of addition of or changes to macros and constants
> that are not used in all the files where the header files get included.
>
> Now I am more than ready to agree that there is room for improvement in
> the design and implementation of the header file layouts here, and we are
> working on that; but at the same time it looks to me like a fairly
> straightforward enhancement to ccache could give us some performance boost
> here.  Since ccache already knows how to do preprocessed hashes, I think
> all I am looking to do is to have it use that existing code when the direct
> mode hash doesn't get a hit.  Then put both hash names on the resulting
> object file, and voila: better hit rates and more unique files in the cache.
>
> It would be interesting to know if many others see anything at all similar
> to the numbers I have here (i.e., lots of people could use the enhancement
> I am suggesting), of if there is something unique about this environment
> that leads to this much duplication in the cache.
>
> Thanks,
> Frank
>
>> On Mon, Nov 7, 2011 at 12:49 PM, Frank Klotz
>> <frank.klotz at alcatel-lucent.**com <frank.klotz at alcatel-lucent.com>>
>>  wrote:
>>
>>>  Hi Martin,
>>>
>>> Thanks for your responses.
>>>
>>> s/index hash/direct mode hash/g
>>>
>>> Apologies - I had a brain burp and was using the wrong terminology.
>>>
>>> That aside, however, with  the advent of direct mode, there ARE two
>>> hashes
>>> possible for any given object file - the direct mode hash (hashing all
>>> the
>>> sources that go into the compilation) and the preprocessed hash (hashing
>>> the
>>> result of running all those sources through the preprocessor).  And any
>>> time
>>> there is a cache miss, ccache has computed both those hashes, hasn't it?
>>>  (Or maybe not - if not, see discussion below.)  And it appears to me
>>> that
>>> in many cases, the resulting object file occurs twice in the cache, once
>>> under each hash.  And currently, those two occurrences are two separate
>>> files, which could be combined into a single inode with two hard-linked
>>> directory entries.
>>>
>>> Or am I confused about how direct mode interacts with preprocessed mode?
>>>  If
>>> running in direct mode, does ccache never compute the preprocessed hash?
>>>  If
>>> not, it obviously could, and I would recommend that it should.  Why?
>>>  Because when changes are made to a widely-used header file, it very
>>> commonly occurs that those changes only actually modify the preprocessor
>>> output of a small subset of the sources that include that header file,
>>> while
>>> many other sources don't use the changes (say, definition of new macros
>>> or
>>> new constants), so end up with the same preprocessed output, and the same
>>> object file, even though the input header files and direct mode hash did
>>> change).  In that case, ccache could still find hits in the cache with
>>> the
>>> preprocessed mode, even if it's a miss with the direct mode hash.  If
>>> ccache
>>> does not get a direct mode hit, it certainly will have to RUN the
>>> preprocessor to recompile the file - how much extra cost to compute the
>>> preprocessed hash, look it up (to avoid recompiling if it is found with
>>> THAT
>>> hash), and if a compile is still needed, store the resulting object file
>>> inode with 2 directory entries rather than just one?
>>>
>>> The way I read the doc about how direct mode works, I thought it would
>>> compute the direct mode hash, and if no hit, "fall back to preprocessed
>>> mode".  I thought that meant it would compute the preprocessed hash and
>>> look
>>> for that too.   Is that incorrect - does it only compute ONE hash in all
>>> cases - a direct mode hash if running in direct mode and a preprocessed
>>> hash
>>> if not in direct mode?  If so, then let's modify my suggested
>>> enhancement to
>>> be that in direct mode, calculate and use the preprocessed hash whenever
>>> there is no hit with direct mode, and create hard links using all
>>> computed
>>> hashes to the one single object file inode that eventually exists in the
>>> ccache.  I don't think direct mode and preprocessed mode HAVE to be
>>> mutually
>>> exclusive - when direct mode gets a miss, preprocessed mode can still
>>> often
>>> provide a hit.
>>>
>>> And if no preprocessed hash gets computed/stored when running in direct
>>> mode, then I suspect that the reason I see so many pairs of identical
>>> object
>>> files in my ccache is because of the situation I describe above, where a
>>> header file change has triggered a direct mode hash miss, but
>>> preprocessing
>>> the sources has resulted in an identical preprocessed file which was then
>>> passed to the compiler which produced an identical object file.  But
>>> ccache
>>> didn't KNOW that they were identical, because it didn't compute the
>>> preprocessed hash.
>>>
>>> Looking at a ccache with about 40,000 .o files in it (created with direct
>>> mode turned on); of the 55 largest files, I found 11 pairs and one
>>> triplet
>>> of identical object files.  That's almost 25% of redundant storage that
>>> could have been avoided by looking at the preprocessed hash when there
>>> is no
>>> hit in direct mode.
>>>
>>> Thanks,
>>> Frank
>>>
>>>
>>> On 11/07/2011 12:53 AM, Martin Pool wrote:
>>>
>>>> On 5 November 2011 11:12, Frank Klotz<frank.klotz at alcatel-**lucent.com<frank.klotz at alcatel-lucent.com>
>>>> >
>>>>  wrote:
>>>>
>>>>>  I used ccache at my previous employer, and was very convinced of its
>>>>> value.
>>>>>  Now that I have started a new job, I am in the process of trying to
>>>>> bring
>>>>> the new shop on board with ccache, so I have been doing lots of test
>>>>> runs
>>>>> and looking at things.  Here is one thing I am thinking could add some
>>>>> value.
>>>>>
>>>>> Looking through the ccache, I find many pairs of files which have
>>>>> different
>>>>> names (different hashes), but exactly identical content.  This actually
>>>>> makes sense, as each file would have an index hash and a preprocessed
>>>>> hash,
>>>>> and since ccache needs to be able to find a match on either, then both
>>>>> need
>>>>> to be in the cache.
>>>>>
>>>> What is an index hash?
>>>>
>>>>   (Actually, thinking about it, I'm a little surprised
>>>>> that there are any files in the ccache that DON'T appear twice -
>>>>> shouldn't
>>>>> EVERY compilation have 2 hashes?)
>>>>>
>>>> I don't understand why you would expect that.
>>>>
>>>> It seems like you expect there is another indirection layer by which
>>>> ccache tries to find jobs that produce identical output.  I don't
>>>> think there is one at present.  I don't think this would happen very
>>>> often in reality, except perhaps for trivial cases like compiling
>>>> empty files, and that's not so important to accelerate, and it will
>>>> not use up much disk space.
>>>>
>>>> If you're getting duplicated cache files due to for instance doing
>>>> builds in different directories or from different trees that produce
>>>> identical output you could change the ccache options to make it less
>>>> stringent.
>>>>
>>>>  But it seems to me that it would make a lot of sense to store the data
>>>>> of
>>>>> these 2 files only once, by hard-linking the 2 names to the same inode.
>>>>>  (For filesystems that support hard links, of course!)  Every time
>>>>> ccache
>>>>> does an actual compilation and stores a file in the cache, it should
>>>>> store
>>>>> it under hard links for BOTH hashes - the indexed hash and the
>>>>> proprocessed
>>>>> hash.  And if it gets a hash miss on the indexed hash but a hit on the
>>>>> preprocessed hash, then it should add the missed index hash as a hard
>>>>> link
>>>>> to the file found.  So a given file (inode) in the cache could actually
>>>>> be
>>>>> referenced by MANY directory entries: one preprocessed hash, and
>>>>> multiple
>>>>> index hashes for various different combinations of source files and
>>>>> header
>>>>> files which end up producing the same output when passed through the
>>>>> preprocessor.
>>>>>
>>>> This mail is the first time google has heard of "ccache indexed hash"...
>>>>
>>>>  This could increase the storage efficiency of the ccache.
>>>>>
>>>>> Of course, since not every filesystem supports hard links, the simplest
>>>>> solution was of course just to have multiple file copies.  So I guess
>>>>> adding
>>>>> code to do this would require some way to determine if the filesystem
>>>>> the
>>>>> cache is on can in fact support hardlinks.
>>>>>
>>>>> If you think this sounds like a good idea, but don't have bandwidth to
>>>>> do
>>>>> it, I would be willing to give it a try.  Any hints on where to start
>>>>> would
>>>>> of course be welcome.
>>>>>
>>>>> Thanks,
>>>>> Frank Klotz
>>>>> ______________________________**_________________
>>>>> ccache mailing list
>>>>> ccache at lists.samba.org
>>>>> https://lists.samba.org/**mailman/listinfo/ccache<https://lists.samba.org/mailman/listinfo/ccache>
>>>>>
>>>>>
>>>>>  ______________________________**_________________
>>> ccache mailing list
>>> ccache at lists.samba.org
>>> https://lists.samba.org/**mailman/listinfo/ccache<https://lists.samba.org/mailman/listinfo/ccache>
>>>
>>>
>