[ccache] Duplicate object files in the ccache - possible optimization?

Frank Klotz frank.klotz at alcatel-lucent.com
Tue Nov 8 15:57:51 MST 2011


  On 11/08/2011 02:24 PM, Joel Rosdahl wrote:
> On 7 November 2011 18:49, Frank Klotz <frank.klotz at alcatel-lucent.com> wrote:
>> [...] That aside, however, with the advent of direct mode, there ARE two
>> hashes possible for any given object file - the direct mode hash (hashing all
>> the sources that go into the compilation) and the preprocessed hash (hashing
>> the result of running all those sources through the preprocessor).
> Well, yes and no. There's only one hash for a given object file: the hash of
> the output of the preprocessor. This hash is used to look up the object file in
> the cache (i.e., the object file is named after the hash).
>
> Then, for the direct mode, there is one hash for each combination of source
> code files (i.e., the file to compile and all its include files) and compiler
> flags that results in the same preprocessor output. The mapping between
> different source code hashes and the resulting preprocessor hashes is stored in
> .manifest files in the cache. A manifest file is looked up using (and thus
> named after) a hash of only the main file and compilation flags.
>
>> And any time there is a cache miss, ccache has computed both those hashes,
>> hasn't it?
> As mentioned above, it starts by computing a hash of the input source file and
> the command line options. It then looks up the manifest file, continues hashing
> include file sets found in the manifest and compares them with the actual
> include files. If there's a match, the object file name (i.e., the preprocessor
> hash) can be read in the manifest.
>
> This is documented in the manual under "The direct mode":
> http://ccache.samba.org/manual.html#_the_direct_mode
>
> If it's hard to understand, I would be happy for any suggestions on how to
> improve it. :-)

Umm, well, the fact that I didn't get it doesn't mean there is a problem 
with the documentation - maybe just that I am not too good at 
understanding it!

I guess I would ask/suggest that it be made clearer that the 'data 
structure called “manifest”' is just another file in the cache, named 
after the hash of the main source file and compiler flags plus the 
suffix ".manifest"; and also that the "references to cached compilation 
results" stored in the manifest files ARE the preprocessor hashes (that 
is, if in fact they ARE - I'm still not 100% sure).

It's good to know that any object file stored in the cache IS 
identified/named by the hash of its preprocessor output - the direct 
mode is just a quick way of deciding that a given set of source files 
would produce the same preprocessor output as a previous compilation, 
without actually having to run cpp. (Am I starting to get it now?)
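
And to check that understanding further, here is roughly how I picture 
the direct-mode lookup going (a Python sketch of my mental model only; 
the cache layout, manifest format, and function names are all invented 
for illustration and are not ccache's actual code):

  import hashlib
  import os
  import pickle

  CACHE_DIR = os.path.expanduser("~/.ccache")   # hypothetical flat layout

  def content_hash(data):
      return hashlib.md5(data).hexdigest()

  def direct_mode_lookup(source_file, flags):
      # The manifest is looked up using a hash of only the main source
      # file and the compiler flags.
      with open(source_file, "rb") as f:
          manifest_key = content_hash(f.read() + " ".join(flags).encode())
      manifest_path = os.path.join(CACHE_DIR, manifest_key + ".manifest")
      if not os.path.exists(manifest_path):
          return None        # direct-mode miss; fall back to running cpp

      with open(manifest_path, "rb") as f:
          entries = pickle.load(f)   # entries shaped like the sketch above

      for entry in entries:
          # If the current contents of every recorded include file still
          # hash the same, the preprocessor output would be identical to
          # a previous run, so the cached object can be used without
          # actually running cpp.
          if all(os.path.exists(path)
                 and content_hash(open(path, "rb").read()) == digest
                 for path, digest in entry["include_hashes"].items()):
              return os.path.join(CACHE_DIR, entry["object_hash"] + ".o")
      return None            # no matching include set; fall back to cpp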

>> [...] And it appears to me that in many cases, the resulting object file
>> occurs twice in the cache, once under each hash.
> Well, the object file is only stored once for a given preprocessor hash.
>
>> And currently, those two occurrences are two separate files, which could be
>> combined into a single inode with two hard-linked directory entries.
> If there are multiple object files in the cache with the same content, then
> that's because different preprocessor outputs have resulted in identical object
> files.

Hmmm. Shouldn't that be hard to do? Evidently it's not, given that 30% 
of the files in my cache have twins (or triplets, or more). On 
reflection it makes sense: unused macros and preprocessor constants 
are dropped during preprocessing, but unused structure definitions and 
other language constructs are not, so different preprocessed files can 
quite easily compile to identical .o files.

And of course in that case, ccache itself has no way of knowing that the 
resultant files are identical.

>   I can imagine two ways of storing identical object files only once:
>
> - Introduce an object file store indexed by the object file hash. Entries in
>    the manifest files would then refer directly to those file names and
>    the files would also be stored under their preprocessor hash name. However,
>    on a cache miss, there will be an extra performance penalty since the hash
>    of the object file needs to be calculated as well. That's probably
>    measurably bad.
> - Or: Create a compaction tool which can be run on the cache once in a while.
>    I think a good search engine term for this would be "data deduplication".

Agreed.
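
For what it's worth, a deduplication pass along those lines could be 
fairly simple - something like this rough Python sketch (illustration 
only, and it assumes the whole cache lives on one filesystem so hard 
links are possible):

  import hashlib
  import os

  CACHE_DIR = os.path.expanduser("~/.ccache")   # hypothetical location

  seen = {}   # content hash -> first path seen with that content
  for dirpath, _, filenames in os.walk(CACHE_DIR):
      for name in filenames:
          if not name.endswith(".o"):
              continue                   # only touch cached object files
          path = os.path.join(dirpath, name)
          with open(path, "rb") as f:
              digest = hashlib.md5(f.read()).hexdigest()
          original = seen.setdefault(digest, path)
          if (original != path
                  and os.stat(original).st_ino != os.stat(path).st_ino):
              # Byte-identical twin: replace it with a hard link so both
              # directory entries share a single inode.
              os.remove(path)
              os.link(original, path)

A real tool would of course want to double-check the actual contents 
before linking and take care not to race with a running ccache, but 
that's the general idea.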

Thanks!
Frank
> -- Joel


