[ccache] Duplicate object files in the ccache - possible optimization?

Tue Nov 8 10:55:41 MST 2011

  On 11/08/2011 01:36 AM, Martin Pool wrote:
>
> Maybe you should try adding an option to check both hashes, and see 
> how that performs.
>

Yes, it looks like that's where I'm headed.  I was hoping I might find 
someone else who had already tried this, or thought about it; but it 
looks like I'm the pioneer here, so if/as I get time to experiment with 
it I will try it out.

Thanks much,
Frank

> On Nov 8, 2011 1:03 PM, "Frank Klotz" <frank.klotz at alcatel-lucent.com 
> <mailto:frank.klotz at alcatel-lucent.com>> wrote:
>
>      On 11/07/2011 10:55 AM, Justin Lebar wrote:
>
>             Looking at a ccache with about 40,000 .o files in it
>             (created with direct
>             mode turned on); of the 55 largest files, I found 11 pairs
>             and one triplet
>             of identical object files.  That's almost 25% of redundant
>             storage that
>             could have been avoided by looking at the preprocessed
>             hash when there is no
>             hit in direct mode.
>
>         It's much more interesting to look at the whole cache, I think.
>
>         $ find -name '*.o' -type f | wc -l
>         39312
>         jlebar at turing:~/.ccache$ find -name '*.o' -type f | xargs -P16
>         sha1sum
>         | cut -f 1 -d ' ' | sort | uniq -d | wc -l
>         1230
>
>         So it looks like there's some duplication on my machine, but not a
>         ton.  I'd be curious if you got significantly different numbers.
>
>     Hi Justin,
>
>     Here's what I got:
>
>     > find -name '*.o' -type f | wc -l
>     43507
>     > find -name '*.o' -type f | xargs -P16 sha1sum | cut -f 1 -d ' '
>     | sort | uniq -d | wc -l
>     13087
>
>     I would say the difference is significant - you got 3%, while I
>     got 30%.
>
>     This is in a project environment where we have fairly fast churn
>     in a number of widely-used header files, and I am guessing that
>     the changes in those files often consist of addition of or changes
>     to macros and constants that are not used in all the files where
>     the header files get included.
>
>     Now I am more than ready to agree that there is room for
>     improvement in the design and implementation of the header file
>     layouts here, and we are working on that; but at the same time it
>     looks to me like a fairly straightforward enhancement to ccache
>     could give us some performance boost here.  Since ccache already
>     knows how to do preprocessed hashes, I think all I am looking to
>     do is to have it use that existing code when the direct mode hash
>     doesn't get a hit.  Then put both hash names on the resulting
>     object file, and voila: better hit rates and more unique files in
>     the cache.
>
>     It would be interesting to know if many others see anything at all
>     similar to the numbers I have here (i.e., lots of people could use
>     the enhancement I am suggesting), of if there is something unique
>     about this environment that leads to this much duplication in the
>     cache.
>
>     Thanks,
>     Frank
>
>         On Mon, Nov 7, 2011 at 12:49 PM, Frank Klotz
>         <frank.klotz at alcatel-lucent.com
>         <mailto:frank.klotz at alcatel-lucent.com>>  wrote:
>
>              Hi Martin,
>
>             Thanks for your responses.
>
>             s/index hash/direct mode hash/g
>
>             Apologies - I had a brain burp and was using the wrong
>             terminology.
>
>             That aside, however, with  the advent of direct mode,
>             there ARE two hashes
>             possible for any given object file - the direct mode hash
>             (hashing all the
>             sources that go into the compilation) and the preprocessed
>             hash (hashing the
>             result of running all those sources through the
>             preprocessor).  And any time
>             there is a cache miss, ccache has computed both those
>             hashes, hasn't it?
>              (Or maybe not - if not, see discussion below.)  And it
>             appears to me that
>             in many cases, the resulting object file occurs twice in
>             the cache, once
>             under each hash.  And currently, those two occurrences are
>             two separate
>             files, which could be combined into a single inode with
>             two hard-linked
>             directory entries.
>
>             Or am I confused about how direct mode interacts with
>             preprocessed mode?  If
>             running in direct mode, does ccache never compute the
>             preprocessed hash?  If
>             not, it obviously could, and I would recommend that it
>             should.  Why?
>              Because when changes are made to a widely-used header
>             file, it very
>             commonly occurs that those changes only actually modify
>             the preprocessor
>             output of a small subset of the sources that include that
>             header file, while
>             many other sources don't use the changes (say, definition
>             of new macros or
>             new constants), so end up with the same preprocessed
>             output, and the same
>             object file, even though the input header files and direct
>             mode hash did
>             change).  In that case, ccache could still find hits in
>             the cache with the
>             preprocessed mode, even if it's a miss with the direct
>             mode hash.  If ccache
>             does not get a direct mode hit, it certainly will have to
>             RUN the
>             preprocessor to recompile the file - how much extra cost
>             to compute the
>             preprocessed hash, look it up (to avoid recompiling if it
>             is found with THAT
>             hash), and if a compile is still needed, store the
>             resulting object file
>             inode with 2 directory entries rather than just one?
>
>             The way I read the doc about how direct mode works, I
>             thought it would
>             compute the direct mode hash, and if no hit, "fall back to
>             preprocessed
>             mode".  I thought that meant it would compute the
>             preprocessed hash and look
>             for that too.   Is that incorrect - does it only compute
>             ONE hash in all
>             cases - a direct mode hash if running in direct mode and a
>             preprocessed hash
>             if not in direct mode?  If so, then let's modify my
>             suggested enhancement to
>             be that in direct mode, calculate and use the preprocessed
>             hash whenever
>             there is no hit with direct mode, and create hard links
>             using all computed
>             hashes to the one single object file inode that eventually
>             exists in the
>             ccache.  I don't think direct mode and preprocessed mode
>             HAVE to be mutually
>             exclusive - when direct mode gets a miss, preprocessed
>             mode can still often
>             provide a hit.
>
>             And if no preprocessed hash gets computed/stored when
>             running in direct
>             mode, then I suspect that the reason I see so many pairs
>             of identical object
>             files in my ccache is because of the situation I describe
>             above, where a
>             header file change has triggered a direct mode hash miss,
>             but preprocessing
>             the sources has resulted in an identical preprocessed file
>             which was then
>             passed to the compiler which produced an identical object
>             file.  But ccache
>             didn't KNOW that they were identical, because it didn't
>             compute the
>             preprocessed hash.
>
>             Looking at a ccache with about 40,000 .o files in it
>             (created with direct
>             mode turned on); of the 55 largest files, I found 11 pairs
>             and one triplet
>             of identical object files.  That's almost 25% of redundant
>             storage that
>             could have been avoided by looking at the preprocessed
>             hash when there is no
>             hit in direct mode.
>
>             Thanks,
>             Frank
>
>
>             On 11/07/2011 12:53 AM, Martin Pool wrote:
>
>                 On 5 November 2011 11:12, Frank
>                 Klotz<frank.klotz at alcatel-lucent.com
>                 <mailto:frank.klotz at alcatel-lucent.com>>
>                  wrote:
>
>                      I used ccache at my previous employer, and was
>                     very convinced of its
>                     value.
>                      Now that I have started a new job, I am in the
>                     process of trying to
>                     bring
>                     the new shop on board with ccache, so I have been
>                     doing lots of test runs
>                     and looking at things.  Here is one thing I am
>                     thinking could add some
>                     value.
>
>                     Looking through the ccache, I find many pairs of
>                     files which have
>                     different
>                     names (different hashes), but exactly identical
>                     content.  This actually
>                     makes sense, as each file would have an index hash
>                     and a preprocessed
>                     hash,
>                     and since ccache needs to be able to find a match
>                     on either, then both
>                     need
>                     to be in the cache.
>
>                 What is an index hash?
>
>                      (Actually, thinking about it, I'm a little surprised
>                     that there are any files in the ccache that DON'T
>                     appear twice -
>                     shouldn't
>                     EVERY compilation have 2 hashes?)
>
>                 I don't understand why you would expect that.
>
>                 It seems like you expect there is another indirection
>                 layer by which
>                 ccache tries to find jobs that produce identical
>                 output.  I don't
>                 think there is one at present.  I don't think this
>                 would happen very
>                 often in reality, except perhaps for trivial cases
>                 like compiling
>                 empty files, and that's not so important to
>                 accelerate, and it will
>                 not use up much disk space.
>
>                 If you're getting duplicated cache files due to for
>                 instance doing
>                 builds in different directories or from different
>                 trees that produce
>                 identical output you could change the ccache options
>                 to make it less
>                 stringent.
>
>                     But it seems to me that it would make a lot of
>                     sense to store the data of
>                     these 2 files only once, by hard-linking the 2
>                     names to the same inode.
>                      (For filesystems that support hard links, of
>                     course!)  Every time ccache
>                     does an actual compilation and stores a file in
>                     the cache, it should
>                     store
>                     it under hard links for BOTH hashes - the indexed
>                     hash and the
>                     proprocessed
>                     hash.  And if it gets a hash miss on the indexed
>                     hash but a hit on the
>                     preprocessed hash, then it should add the missed
>                     index hash as a hard
>                     link
>                     to the file found.  So a given file (inode) in the
>                     cache could actually
>                     be
>                     referenced by MANY directory entries: one
>                     preprocessed hash, and multiple
>                     index hashes for various different combinations of
>                     source files and
>                     header
>                     files which end up producing the same output when
>                     passed through the
>                     preprocessor.
>
>                 This mail is the first time google has heard of
>                 "ccache indexed hash"...
>
>                     This could increase the storage efficiency of the
>                     ccache.
>
>                     Of course, since not every filesystem supports
>                     hard links, the simplest
>                     solution was of course just to have multiple file
>                     copies.  So I guess
>                     adding
>                     code to do this would require some way to
>                     determine if the filesystem the
>                     cache is on can in fact support hardlinks.
>
>                     If you think this sounds like a good idea, but
>                     don't have bandwidth to do
>                     it, I would be willing to give it a try.  Any
>                     hints on where to start
>                     would
>                     of course be welcome.
>
>                     Thanks,
>                     Frank Klotz
>                     _______________________________________________
>                     ccache mailing list
>                     ccache at lists.samba.org <mailto:ccache at lists.samba.org>
>                     https://lists.samba.org/mailman/listinfo/ccache
>
>
>             _______________________________________________
>             ccache mailing list
>             ccache at lists.samba.org <mailto:ccache at lists.samba.org>
>             https://lists.samba.org/mailman/listinfo/ccache
>
>