[ccache] Duplicate object files in the ccache - possible optimization?
Frank Klotz
frank.klotz at alcatel-lucent.com
Tue Nov 8 10:55:41 MST 2011
On 11/08/2011 01:36 AM, Martin Pool wrote:
>
> Maybe you should try adding an option to check both hashes, and see
> how that performs.
>
Yes, it looks like that's where I'm headed. I was hoping I might find
someone else who had already tried this, or thought about it; but it
looks like I'm the pioneer here, so if/as I get time to experiment with
it I will try it out.
Thanks much,
Frank
> On Nov 8, 2011 1:03 PM, "Frank Klotz" <frank.klotz at alcatel-lucent.com
> <mailto:frank.klotz at alcatel-lucent.com>> wrote:
>
> On 11/07/2011 10:55 AM, Justin Lebar wrote:
>
> Looking at a ccache with about 40,000 .o files in it
> (created with direct
> mode turned on); of the 55 largest files, I found 11 pairs
> and one triplet
> of identical object files. That's almost 25% of redundant
> storage that
> could have been avoided by looking at the preprocessed
> hash when there is no
> hit in direct mode.
>
> It's much more interesting to look at the whole cache, I think.
>
> $ find -name '*.o' -type f | wc -l
> 39312
> jlebar at turing:~/.ccache$ find -name '*.o' -type f | xargs -P16
> sha1sum
> | cut -f 1 -d ' ' | sort | uniq -d | wc -l
> 1230
>
> So it looks like there's some duplication on my machine, but not a
> ton. I'd be curious if you got significantly different numbers.
>
> Hi Justin,
>
> Here's what I got:
>
> > find -name '*.o' -type f | wc -l
> 43507
> > find -name '*.o' -type f | xargs -P16 sha1sum | cut -f 1 -d ' '
> | sort | uniq -d | wc -l
> 13087
>
> I would say the difference is significant - you got 3%, while I
> got 30%.
>
> This is in a project environment where we have fairly fast churn
> in a number of widely-used header files, and I am guessing that
> the changes in those files often consist of addition of or changes
> to macros and constants that are not used in all the files where
> the header files get included.
>
> Now I am more than ready to agree that there is room for
> improvement in the design and implementation of the header file
> layouts here, and we are working on that; but at the same time it
> looks to me like a fairly straightforward enhancement to ccache
> could give us some performance boost here. Since ccache already
> knows how to do preprocessed hashes, I think all I am looking to
> do is to have it use that existing code when the direct mode hash
> doesn't get a hit. Then put both hash names on the resulting
> object file, and voila: better hit rates and more unique files in
> the cache.
>
> It would be interesting to know if many others see anything at all
> similar to the numbers I have here (i.e., lots of people could use
> the enhancement I am suggesting), of if there is something unique
> about this environment that leads to this much duplication in the
> cache.
>
> Thanks,
> Frank
>
> On Mon, Nov 7, 2011 at 12:49 PM, Frank Klotz
> <frank.klotz at alcatel-lucent.com
> <mailto:frank.klotz at alcatel-lucent.com>> wrote:
>
> Hi Martin,
>
> Thanks for your responses.
>
> s/index hash/direct mode hash/g
>
> Apologies - I had a brain burp and was using the wrong
> terminology.
>
> That aside, however, with the advent of direct mode,
> there ARE two hashes
> possible for any given object file - the direct mode hash
> (hashing all the
> sources that go into the compilation) and the preprocessed
> hash (hashing the
> result of running all those sources through the
> preprocessor). And any time
> there is a cache miss, ccache has computed both those
> hashes, hasn't it?
> (Or maybe not - if not, see discussion below.) And it
> appears to me that
> in many cases, the resulting object file occurs twice in
> the cache, once
> under each hash. And currently, those two occurrences are
> two separate
> files, which could be combined into a single inode with
> two hard-linked
> directory entries.
>
> Or am I confused about how direct mode interacts with
> preprocessed mode? If
> running in direct mode, does ccache never compute the
> preprocessed hash? If
> not, it obviously could, and I would recommend that it
> should. Why?
> Because when changes are made to a widely-used header
> file, it very
> commonly occurs that those changes only actually modify
> the preprocessor
> output of a small subset of the sources that include that
> header file, while
> many other sources don't use the changes (say, definition
> of new macros or
> new constants), so end up with the same preprocessed
> output, and the same
> object file, even though the input header files and direct
> mode hash did
> change). In that case, ccache could still find hits in
> the cache with the
> preprocessed mode, even if it's a miss with the direct
> mode hash. If ccache
> does not get a direct mode hit, it certainly will have to
> RUN the
> preprocessor to recompile the file - how much extra cost
> to compute the
> preprocessed hash, look it up (to avoid recompiling if it
> is found with THAT
> hash), and if a compile is still needed, store the
> resulting object file
> inode with 2 directory entries rather than just one?
>
> The way I read the doc about how direct mode works, I
> thought it would
> compute the direct mode hash, and if no hit, "fall back to
> preprocessed
> mode". I thought that meant it would compute the
> preprocessed hash and look
> for that too. Is that incorrect - does it only compute
> ONE hash in all
> cases - a direct mode hash if running in direct mode and a
> preprocessed hash
> if not in direct mode? If so, then let's modify my
> suggested enhancement to
> be that in direct mode, calculate and use the preprocessed
> hash whenever
> there is no hit with direct mode, and create hard links
> using all computed
> hashes to the one single object file inode that eventually
> exists in the
> ccache. I don't think direct mode and preprocessed mode
> HAVE to be mutually
> exclusive - when direct mode gets a miss, preprocessed
> mode can still often
> provide a hit.
>
> And if no preprocessed hash gets computed/stored when
> running in direct
> mode, then I suspect that the reason I see so many pairs
> of identical object
> files in my ccache is because of the situation I describe
> above, where a
> header file change has triggered a direct mode hash miss,
> but preprocessing
> the sources has resulted in an identical preprocessed file
> which was then
> passed to the compiler which produced an identical object
> file. But ccache
> didn't KNOW that they were identical, because it didn't
> compute the
> preprocessed hash.
>
> Looking at a ccache with about 40,000 .o files in it
> (created with direct
> mode turned on); of the 55 largest files, I found 11 pairs
> and one triplet
> of identical object files. That's almost 25% of redundant
> storage that
> could have been avoided by looking at the preprocessed
> hash when there is no
> hit in direct mode.
>
> Thanks,
> Frank
>
>
> On 11/07/2011 12:53 AM, Martin Pool wrote:
>
> On 5 November 2011 11:12, Frank
> Klotz<frank.klotz at alcatel-lucent.com
> <mailto:frank.klotz at alcatel-lucent.com>>
> wrote:
>
> I used ccache at my previous employer, and was
> very convinced of its
> value.
> Now that I have started a new job, I am in the
> process of trying to
> bring
> the new shop on board with ccache, so I have been
> doing lots of test runs
> and looking at things. Here is one thing I am
> thinking could add some
> value.
>
> Looking through the ccache, I find many pairs of
> files which have
> different
> names (different hashes), but exactly identical
> content. This actually
> makes sense, as each file would have an index hash
> and a preprocessed
> hash,
> and since ccache needs to be able to find a match
> on either, then both
> need
> to be in the cache.
>
> What is an index hash?
>
> (Actually, thinking about it, I'm a little surprised
> that there are any files in the ccache that DON'T
> appear twice -
> shouldn't
> EVERY compilation have 2 hashes?)
>
> I don't understand why you would expect that.
>
> It seems like you expect there is another indirection
> layer by which
> ccache tries to find jobs that produce identical
> output. I don't
> think there is one at present. I don't think this
> would happen very
> often in reality, except perhaps for trivial cases
> like compiling
> empty files, and that's not so important to
> accelerate, and it will
> not use up much disk space.
>
> If you're getting duplicated cache files due to for
> instance doing
> builds in different directories or from different
> trees that produce
> identical output you could change the ccache options
> to make it less
> stringent.
>
> But it seems to me that it would make a lot of
> sense to store the data of
> these 2 files only once, by hard-linking the 2
> names to the same inode.
> (For filesystems that support hard links, of
> course!) Every time ccache
> does an actual compilation and stores a file in
> the cache, it should
> store
> it under hard links for BOTH hashes - the indexed
> hash and the
> proprocessed
> hash. And if it gets a hash miss on the indexed
> hash but a hit on the
> preprocessed hash, then it should add the missed
> index hash as a hard
> link
> to the file found. So a given file (inode) in the
> cache could actually
> be
> referenced by MANY directory entries: one
> preprocessed hash, and multiple
> index hashes for various different combinations of
> source files and
> header
> files which end up producing the same output when
> passed through the
> preprocessor.
>
> This mail is the first time google has heard of
> "ccache indexed hash"...
>
> This could increase the storage efficiency of the
> ccache.
>
> Of course, since not every filesystem supports
> hard links, the simplest
> solution was of course just to have multiple file
> copies. So I guess
> adding
> code to do this would require some way to
> determine if the filesystem the
> cache is on can in fact support hardlinks.
>
> If you think this sounds like a good idea, but
> don't have bandwidth to do
> it, I would be willing to give it a try. Any
> hints on where to start
> would
> of course be welcome.
>
> Thanks,
> Frank Klotz
> _______________________________________________
> ccache mailing list
> ccache at lists.samba.org <mailto:ccache at lists.samba.org>
> https://lists.samba.org/mailman/listinfo/ccache
>
>
> _______________________________________________
> ccache mailing list
> ccache at lists.samba.org <mailto:ccache at lists.samba.org>
> https://lists.samba.org/mailman/listinfo/ccache
>
>
More information about the ccache
mailing list