[ccache] gmake and ccache conspiring together in creating gremlins

Joel Rosdahl joel at rosdahl.net
Sun Feb 7 12:59:12 UTC 2021


On Sun, 7 Feb 2021 at 12:26, Sam Varshavchik via ccache
<ccache at lists.samba.org> wrote:

> I've been chasing down non-deterministic, random build fails.

Thanks for the thorough investigation!

> So, now what? Both ccache, and the linker are contributing to this situation.
> ccache leaks the file descriptor (to distcc apparently?).

Yes, or rather for anyone who wants to use it. Icecream also uses it.

> The linker invokes make for its own use, but it doesn't sanitize the
> environment. make closes the job server file descriptors, but not removing
> them from MAKEFLAGS.

>From my perspective, I don't see anything wrong with ccache dup-ing stderr – I
guess the bug can be triggered by the user running something like

  exec 3>&2

in the shell as well.

I can think of an easy workaround for the problem: only dup2 stderr when running
the preprocessor or compiler. Done in
<https://github.com/ccache/ccache/commit/c4284c78e9ff286ee5208f449af4c2aaba062d37>.

-- Joel

On Sun, 7 Feb 2021 at 12:26, Sam Varshavchik via ccache
<ccache at lists.samba.org> wrote:
>
> I've been chasing down non-deterministic, random build fails. make
> randomly blows up with
>
> write error: stdout
>
> A lot of stracing, experimentation, and pains, determined that this is
> due to a sequence of the following events:
>
> 1) how make uses the job server file descriptor
>
> 2) an intentional leak of the standard error file descriptor in ccache
>
> 3) LTO linking results in linker recursively invoking make (that was a
> new one to me)
>
> 4) A rather nuanced behavior of the Linux PTY driver that can be best
> explained with the following short demo:
>
> #include <sys/types.h>
> #include <fcntl.h>
> #include <stdio.h>
> #include <unistd.h>
> #include <sys/stat.h>
>
> int main()
> {
>         struct stat state_buf;
>
>         for (int i=0; i<8; ++i)
>         {
>                 if (fstat(i, &state_buf) == 0)
>                 {
>                         printf("%d: %d/%d\n", i,
>                                (int)state_buf.st_dev,
>                                (int)state_buf.st_ino);
>                 }
>         }
>
>         printf("NON BLOCKING: %s\n", fcntl(0, F_GETFL) & O_NONBLOCK ? "y":"n");
>
>         fcntl(2, F_SETFL, O_NONBLOCK);
>
>         printf("NON BLOCKING: %s\n", fcntl(0, F_GETFL) & O_NONBLOCK ? "y":"n");
>         return 0;
> }
>
> I'm sshed into my server, and the output of the above is, for me:
>
> 0: 25/3
> 1: 25/3
> 2: 25/3
> NON BLOCKING: n
> NON BLOCKING: y
>
> My stdin, stdout, and stderr is the same dev/ino, and setting non-blocking
> mode one any one of them puts them all in non-blocking mode. I'm setting non-
> blocking on standard error, and my standard input also becomes non-blocking.
>
> After sifting through sources, and straces, this is what I see is happening.
>
> Starting with make, and how it sets up its job file descriptors:
>
> jobserver_parse_auth (const char *auth)
> {
>   /* Given the command-line parameter, parse it.  */
>   if (sscanf (auth, "%d,%d", &job_fds[0], &job_fds[1]) != 2)
>
>   // …
>
>   /* Make sure our pipeline is valid, and (possibly) create a duplicate pipe,
>      that will be closed in the SIGCHLD handler.  If this fails with EBADF,
>      the parent has closed the pipe on us because it didn't think we were a
>      submake.  If so, warn and default to -j1.  */
>
>   if (!FD_OK (job_fds[0]) || !FD_OK (job_fds[1]) || make_job_rfd () < 0)
> }
>
> The TLDR of the above: make reads the job server's file descriptors
> from the MAKEFLAGS environment variable, then checks here if they
> actually exist. If they don't exist, make will create the job server
> pipe. Important: by default they will be file descriptors 3 and 4.
> This becomes a key player in this mystery, a little bit later.
>
> When make spawns a child job (other than a recursive make) the job
> server file descriptors get closed (marked O_CLOEXEC before the actual
> execve), but MAKEFLAGS remains in the environment. They remain open
> for a recursive make call, and the recursive instance picks up the
> ball from here.
>
> The problem with this approach is that if, somehow, a child process
> runs make, and there happens to be a pair of file descriptors here,
> make will assume that they're the job server file descriptors.
>
> Another thing that make does is that it sets one of the file
> descriptors to non-blocking mode. This is in jobserver_setup:
>
>   /* When using pselect() we want the read to be non-blocking.  */
>   set_blocking (job_fds[0], 0);
>
> Now we get to ccache which, amongst things, does the following, in its
> ccache.cpp:
>
> // Make a copy of stderr that will not be cached, so things like distcc can
> // send networking errors to it.
> static void
> set_up_uncached_err()
> {
>   int uncached_fd =
>     dup(STDERR_FILENO); // The file descriptor is intentionally leaked.
>   if (uncached_fd == -1) {
>     log("dup(2) failed: {}", strerror(errno));
>     throw Failure(Statistic::internal_error);
>   }
>
>   Util::setenv("UNCACHED_ERR_FD", fmt::format("{}", uncached_fd));
> }
>
> TLDR: it intentionally leaks a file descriptor. A basic standalone
> strace of a standalone cache invocation showed ccache leaking file
> descriptor 2 to 3. Now, somewhere else I was also getting file
> descriptor 4, I haven't tracked down where, exactly. But it was there,
>
> The final ingredient to this perfect storm, is LTO linking. I have
> verified that -flto=auto -ffat-lto-objects results in ld running make.
> If you compile helloworld, or something, with -flto=auto
> -ffat-lto-objects, and link it:
>
> strace -s 256 -o z -f gcc -flto=auto -ffat-lto-objects -o t t.o
>
> strace will show that, somehow, make gets involved:
>
> 123302 execve("/usr/bin/make", ["make", "-f", "/tmp/ccyRcNmv.mk", "-j32",
> "all"], 0x234ed40 /* 47 vars */ <unfinished …>
>
> And now, let's go back to what make is doing, here:
>
> static void
> close_stdout (void)
> {
>   int prev_fail = ferror (stdout);
>   int fclose_fail = fclose (stdout);
>
>   if (prev_fail || fclose_fail)
>     {
>       if (fclose_fail)
>         perror_with_name (_("write error: stdout"), "");
>       else
>         O (error, NILF, _("write error: stdout"));
>       exit (MAKE_TROUBLE);
>     }
> }
>
> So, the grand total, this is what was happening when I was building a
> large hairball, using make, distcc, cache, and when using LTO:
>
> 1) The original instance of make came up, and set up a job server on
> file descriptors 3 and 4, and put them into MAKEFLAGS.
>
> 2) This was a large build, in recursive directories, with recursive
> make invocations.
>
> 3) A link command was executed, invoking ccache. Before exec-ing it,
> make set CLOEXEC on both file descriptors 3 and 4, and ccache did not
> inherit them.
>
> 4) ccache duped file descriptor 2 to 3, and invoked distcc, which then
> invoked the real gcc to finally do the LTO link. At some point along
> the line something got leaked to file descriptor 4. That's the only
> piece of the puzzle I've yet to track down, but it didn't seem like it
> was very important to figure outwhere it came from.
>
> 5) gcc invoked the linker, which invoking make again for LTO.
>
> 6) make reads MAKEFLAGS, and sees that it should have job server file
> descriptors. make sees something on file descriptors 3 and 4, thinks its the
> original job server file descriptors (spoiler: they aren't).
>
> 7) this clusterfark wasn't fatal enough to blow up this link. It succeeded,
> but:
>
>  set_blocking (job_fds[0], 0);
>
> put the fake job file descriptor in non-blocking mode. This file
> descriptor was really the original terminal standard error, courtesy
> of ccache. Now, both standard input, output, and error, are in
> non-blocking mode.
>
> 8) My original build was -j 40. They rolled merrily along, spewing
> tons of output, a good chunk of it was from make itself. One of make's
> write()s to standard output was EAGAIN-ed. This resulted in the
> original "write error: stdout" that prompted me to burn almost a month
> chasing this down.
>
> So, now what? Both ccache, and the linker are contributing to this
> situation. ccache leaks the file descriptor (to distcc apparently?).
> The linker invokes make for its own use, but it doesn't sanitize the
> environment. make closes the job server file descriptors, but not
> removing them from MAKEFLAGS.
>
> In the meantime, -Orecursive seems to be a temporary bandaid. I think
> it works around this frankenbug.
>
> _______________________________________________
> ccache mailing list
> ccache at lists.samba.org
> https://lists.samba.org/mailman/listinfo/ccache



More information about the ccache mailing list