[ccache] direct mode design bug

Mon Nov 5 06:53:30 MST 2012

On 04/11/12 19:10, Joel Rosdahl wrote:
> The direct mode, which was introduced in version 3.0 almost three years
> ago, has a design bug. The essence of the problem is that in the direct
> mode, ccache records header files that were used by the compiler, but it
> doesn't record header files that were not used but could have been used if
> they existed. So, when ccache checks if a result could be taken from
> the cache, it can't check if the existence of a new header file should
> invalidate the result.

My first reaction to this issue, rightly or wrongly, is that it's more 
of a documentation issue than a real bug. I mean, it can only occur if 
two people share a cache, or if the user installs new software and then 
reuses an old cache. If the documentation simply says that you have to 
wipe your cache whenever you do that sort of thing then does that solve 
the problem?

A similar issue, albeit not so interesting, perhaps, is what happens 
when a user changes some part of the toolchain, but does not alter the 
"gcc" binary. Ccache won't notice a new back-end compiler, a new 
assembler, a new linker, a new default specs file or anything like that. 
Chances are that any differences in the output are harmless, but the 
cached objects are technically invalid.

Having said all that, if Ccache Just Worked, that would be no bad thing.

[In fact, I have a use-case in which I have multiple users sharing a 
cache, and I wanted to be able to uniquely identify the same toolchain 
across all the installations. The mtime etc. varies from machine to 
machine, as might the exact tool mix, so I have some local patches to do 
a much deeper hash of the toolchain binaries, and include those in the 
object hashes. Even then, in the interests of performance, those 
toolchain IDs are cached according to the location and mtime, so 
changing the binutils will cause temporarily wrong toolchain hashes. 
Would you be interested in such a feature upstream?]
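
A rough sketch of that kind of toolchain ID cache, under one reading of
the description above: the ID is a deep hash of the tool binaries, and
it is cached under the compiler driver's location and mtime. The names
deep_hash and toolchain_id are hypothetical, and this is not the actual
local patch.

```python
import hashlib
import os


def deep_hash(paths):
    """Hash the contents of every listed binary into one hex digest."""
    h = hashlib.sha256()
    for p in paths:
        with open(p, "rb") as f:
            # Hash each file separately so file boundaries matter.
            h.update(hashlib.sha256(f.read()).digest())
    return h.hexdigest()


def toolchain_id(driver, tools, cache):
    """Return the toolchain ID, recomputing only if the driver moved or changed.

    `cache` is a plain dict keyed on (driver location, driver mtime).
    """
    st = os.stat(driver)
    key = (os.path.abspath(driver), st.st_mtime_ns)
    if key not in cache:
        cache[key] = deep_hash([driver, *tools])
    return cache[key]
```

Because the key only looks at the driver, replacing the assembler or
linker leaves the cached ID untouched until the entry is refreshed,
which is the "temporarily wrong" behaviour mentioned above.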

> 1. ccache could use strace or similar ways of monitoring the compiler and
> tracing the performed system calls to find out where headers were probed. I
> haven't measured, but I suspect that this would be slow.

ptrace is quite easy to use, but it would be slow, and not terribly 
portable, plus you'd have to ignore all the other stat gubbins that a 
toolchain indulges in.

> 2. ccache could override strategic functions using LD_PRELOAD, thus
> snooping on system calls without involving the kernel. This should be
> possible and quite fast, but it's tricky to get right, and it's not very
> portable. (By the way: This is what
> http://audited-objects.sourceforge.net does, although I don't know if
> it monitors and acts on probes of nonexistent files.)

Faster, but more fragile, and I still don't like it.

> 3. ccache could try to imitate what the preprocessor does. That is, read
> the source code file and follow #include statements instead of looking at
> the preprocessor output. This essentially means implementing a dumbed down
> version of a preprocessor, a task that doesn't sound trivial: It has to be
> significantly faster than the real preprocessor to make a difference, it
> will be more coupled to the behavior of the compiler and its various
> options (-I, -idirafter, -isystem, etc), and it probably has to know the
> compiler's default include directories.

Yuck. If you can program a faster preprocessor I'm sure the GCC folks 
would love to see it. You wouldn't get to dumb much down unless you're 
fine with running both your own preprocessor and then the real one for 
the preprocessor mode cache check. Even if you only wanted to look for 
#include statements you'd still need to evaluate all the #if directives. 
You could make it faster by ignoring the tokenization pass, but then 
you'd get other subtle bugs.

> Anybody got other ideas?

Running the compiler with -v prints the header search directories. You 
could use that to do your own scan. It would be difficult to 
differentiate files specified by the user with absolute paths from files 
found by the compiler.
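
A minimal sketch of such a scan, assuming the search directories are
already known and in search order. The function resolve_include is
hypothetical. It returns the header that would be picked up and also the
candidates that were probed and found missing, which is exactly the
information the direct mode manifest does not record today.

```python
import os


def resolve_include(name, search_dirs):
    """Probe each search directory in order for `name`.

    Returns (found, missing): the first existing candidate (or None) and
    the list of candidate paths that were checked before it and did not
    exist.
    """
    missing = []
    for d in search_dirs:
        candidate = os.path.join(d, name)
        if os.path.isfile(candidate):
            return candidate, missing
        missing.append(candidate)
    return None, missing
```

If any path in the missing list exists at lookup time, the cached result
could no longer be trusted.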

I suggest it would be better to do just the minimum to determine if a 
cached file is unsafe. Perhaps you could hash the directory stat for the 
include directories listed by "gcc -v"? (I've checked, and there doesn't 
seem to be a "-print-..." option for the include path.)

E.g. "gcc -v -c hello.c" gives:
.....
ignoring nonexistent directory "/usr/local/include/x86_64-linux-gnu"
ignoring nonexistent directory "/usr/lib/gcc/x86_64-linux-gnu/4.7/../../../../x86_64-linux-gnu/include"
#include "..." search starts here:
#include <...> search starts here:
  /usr/lib/gcc/x86_64-linux-gnu/4.7/include
  /usr/local/include
  /usr/lib/gcc/x86_64-linux-gnu/4.7/include-fixed
  /usr/include/x86_64-linux-gnu
  /usr/include
End of search list.
......

so, you could stat the directories listed, and disallow direct mode if 
the mtime has changed since the manifest was last written. The paths to 
stat could be cached in the manifest.
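
The check could look roughly like this sketch. It assumes the "gcc -v"
text has already been captured from stderr, and it compares each
directory's current mtime against a recorded one rather than against
the manifest's write time, which is a slight variation on the idea. The
names parse_search_dirs, snapshot_dirs and direct_mode_allowed are
hypothetical.

```python
import os


def parse_search_dirs(verbose_text):
    """Extract the include search directories, in order, from `gcc -v` output."""
    dirs = []
    in_list = False
    for line in verbose_text.splitlines():
        if line.startswith("#include ") and line.endswith("search starts here:"):
            in_list = True
        elif line.startswith("End of search list."):
            in_list = False
        elif in_list and line.startswith(" "):
            # Directory entries are indented; strip the indentation.
            dirs.append(line.strip())
    return dirs


def snapshot_dirs(dirs):
    """Map each directory to its mtime in nanoseconds, or None if it is absent."""
    snap = {}
    for d in dirs:
        try:
            snap[d] = os.stat(d).st_mtime_ns
        except OSError:
            snap[d] = None
    return snap


def direct_mode_allowed(recorded):
    """True if no recorded directory has appeared, vanished or changed mtime."""
    return snapshot_dirs(recorded) == recorded
```

One caveat: on typical POSIX filesystems a directory's mtime changes
only when an entry is added, removed or renamed directly inside it, so
a new header in a subdirectory (say sys/foo.h) would not be noticed by
a check on /usr/include alone.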

Extra points if direct mode only fails when a path *earlier* on the 
search path is changed.
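
A sketch of that refinement, assuming the manifest records which search
directory each header was found in. The function needs_invalidation and
its parameters are hypothetical. A changed directory only matters if it
sits before the last directory that actually supplied a header, since
only then could a new file shadow one of the recorded headers.

```python
def needs_invalidation(search_dirs, header_dirs, changed_dirs):
    """Decide whether changed directories could shadow a recorded header.

    search_dirs:  include directories in search order
    header_dirs:  directories in which the recorded headers were found
    changed_dirs: directories whose mtime differs from the recorded one
    """
    index = {d: i for i, d in enumerate(search_dirs)}
    found = [index[d] for d in header_dirs if d in index]
    if not found:
        # No header came from the search path, so nothing can be shadowed.
        return False
    deepest = max(found)
    return any(index[d] < deepest for d in changed_dirs if d in index)
```

With search directories /a, /b, /c and a header found in /b, a change in
/a invalidates the result while a change in /c does not.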

BTW, gcc has an option "--trace-includes" that might be faster than 
scanning the preprocessor output, although the compiler still has to do 
all the same work. Like this: "gcc -E --trace-includes hello.c -o /dev/null".
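
As far as I know "--trace-includes" is the long spelling of "-H", which
writes one line per header to stderr, prefixed by dots showing the
nesting depth. A small parser for that text might look like the sketch
below. The function name parse_trace_includes is hypothetical, and lines
that do not start with dots (such as the include guard hints printed at
the end) are simply skipped.

```python
import re

# One or more dots, a space, then the header path.
_TRACE_LINE = re.compile(r"^(\.+) (.+)$")


def parse_trace_includes(stderr_text):
    """Return [(depth, path), ...] for each header line in `gcc -H` output."""
    entries = []
    for line in stderr_text.splitlines():
        m = _TRACE_LINE.match(line)
        if m:
            entries.append((len(m.group(1)), m.group(2)))
    return entries
```

The depth is not needed for hashing, but it makes it easy to tell
headers included directly by the source file from those pulled in
transitively.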

> Since a quick fix likely isn't possible in the short term, and I would like
> to release ccache 3.2 soon, we have to decide whether the direct mode
> should default to off or on. Please share any opinions!

Please leave it on. The difference is like night and day, and the bug is 
rare and avoidable.

Andrew