[ccache] Duplicate object files in the ccache - possible optimization?
Joel Rosdahl
joel at rosdahl.net
Tue Nov 8 15:24:48 MST 2011
On 7 November 2011 18:49, Frank Klotz <frank.klotz at alcatel-lucent.com> wrote:
> [...] That aside, however, with the advent of direct mode, there ARE two
> hashes possible for any given object file - the direct mode hash (hashing all
> the sources that go into the compilation) and the preprocessed hash (hashing
> the result of running all those sources through the preprocessor).
Well, yes and no. There's only one hash for a given object file: the hash of
the output of the preprocessor. This hash is used to look up the object file in
the cache (i.e., the object file is named after the hash).
Then, for the direct mode, there is one hash for each combination of source
code files (i.e., the file to compile and all its include files) and compiler
flags that results in the same preprocessor output. The mapping between
different source code hashes and the resulting preprocessor hashes is stored in
.manifest files in the cache. A manifest file is looked up using (and thus
named after) a hash of only the main file and compilation flags.
> And any time there is a cache miss, ccache has computed both those hashes,
> hasn't it?
As mentioned above, it starts by computing a hash of the input source file and
the command line options. It then looks up the manifest file and, for each set
of include files recorded there, hashes the actual include files and compares
the results with the recorded hashes. If there's a match, the object file name
(i.e., the preprocessor hash) can be read from the manifest.
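As a rough illustration of the lookup described above, here is a minimal Python
sketch. It is not ccache's actual code: the function names, the in-memory
dictionaries standing in for the cache and the file system, and the use of
SHA-256 are all assumptions made for the example.

```python
import hashlib


def digest(data):
    # Stand-in hash function (assumption; not necessarily what ccache uses).
    return hashlib.sha256(data).hexdigest()


def manifest_key(main_source, flags):
    # A manifest is looked up by a hash of only the main file and the flags.
    return digest(main_source + b"\0" + "\0".join(flags).encode())


def direct_lookup(manifests, files, main_path, flags):
    """Return the preprocessor hash (the object file name), or None on a miss.

    manifests: dict mapping manifest key -> list of
               (include_hashes, preprocessor_hash) entries, where
               include_hashes maps an include path to its recorded hash.
    files:     dict mapping path -> bytes (hypothetical stand-in for disk).
    """
    key = manifest_key(files[main_path], flags)
    for include_hashes, preprocessor_hash in manifests.get(key, []):
        # Hash the actual include files and compare with the recorded hashes.
        if all(
            path in files and digest(files[path]) == recorded
            for path, recorded in include_hashes.items()
        ):
            return preprocessor_hash
    return None
```

In this sketch, a miss (None) would mean falling back to running the
preprocessor and hashing its output.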
This is documented in the manual under "The direct mode":
http://ccache.samba.org/manual.html#_the_direct_mode
If it's hard to understand, I would be happy for any suggestions on how to
improve it. :-)
> [...] And it appears to me that in many cases, the resulting object file
> occurs twice in the cache, once under each hash.
Well, the object file is only stored once for a given preprocessor hash.
> And currently, those two occurrences are two separate files, which could be
> combined into a single inode with two hard-linked directory entries.
If there are multiple object files in the cache with the same content, then
that's because different preprocessor outputs have resulted in identical object
files. I can imagine two ways of storing identical object files only once:
- Introduce an object file store indexed by the object file hash. Entries in
  the manifest files would then refer directly to those file names and
  the files would also be stored under their preprocessor hash name. However,
  on a cache miss, there will be an extra performance penalty since the hash
  of the object file needs to be calculated as well. That's probably
  measurably bad.
- Or: Create a compaction tool which can be run on the cache once in a while.
  I think a good search engine term for this would be "data deduplication".
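The second option could look roughly like the following Python sketch, which
walks a directory, groups files by content hash, and replaces duplicates with
hard links to the first copy seen. This is only an illustration under stated
assumptions: the function names and the ".o" suffix filter are hypothetical,
SHA-256 is an arbitrary choice, and no locking is done, so it would not be safe
to run while ccache is using the cache.

```python
import hashlib
import os


def file_digest(path):
    # Hash the file contents in chunks to avoid loading large files at once.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def compact(cache_dir, suffix=".o"):
    """Hard-link files with identical content; return the number replaced."""
    first_seen = {}  # content hash -> path of the first file with that content
    replaced = 0
    for root, _dirs, names in os.walk(cache_dir):
        for name in sorted(names):
            if not name.endswith(suffix):
                continue
            path = os.path.join(root, name)
            keeper = first_seen.setdefault(file_digest(path), path)
            # Skip the first copy and anything already sharing its inode.
            if keeper == path or os.path.samefile(keeper, path):
                continue
            # Create the link under a temporary name, then rename it over the
            # duplicate so the entry is never missing.
            tmp = path + ".tmp-link"
            os.link(keeper, tmp)
            os.replace(tmp, path)
            replaced += 1
    return replaced
```

Hard links only work within a single file system, and the duplicate keeps the
first copy's timestamps afterwards, which matters for any cleanup that depends
on file age.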
-- Joel