[distcc] distccd creates zombie gcc processes, which are never reaped

George Cox george.cox at gmail.com
Tue Mar 28 23:53:14 UTC 2023


Hello,

I am using distcc 3.4 (compiled by me from source) on CentOS Linux
release 7.9.2009 (Core). Compilations that run to completion work
fine, but interrupted compilations (where one presses Ctrl-C on the
client machine, interrupting make or whatever process is driving the
build) lead to errors in the server-side distccd log and leave zombie
compiler processes on the servers. This is concerning because the
zombies appear to occupy worker slots permanently, which would
eventually leave no slots free and make remote compilation impossible.
I am *not* using "distcc-pump" mode.

I am configuring distcc like this:
    export DISTCC_HOSTS="build01.example.com/40,lzo
build03.example.com/40,lzo build05.example.com/40,lzo
build06.example.com/32,lzo build07.example.com/32,lzo"
    export DISTCC_DIR="/var/tmp/distcc.${LOGNAME}"

I am running distcc like this:
    /opt/distcc/3.4/bin/distcc /opt/gcc/7.3.0/bin/g++ [...compiler
arguments elided...]

I am starting distccd like this:
    /opt/distcc/3.4/bin/distccd --no-detach --enable-tcp-insecure
--allow 10.101.201.0/24 --daemon --log-file
/var/tmp/distccd.log --log-level debug

I am running distccd in Docker, but I see the same behaviour when I
run it under systemd.

What I'm seeing in distccd.log is:
    distccd[17] compile from RuntimeInfo.cpp to RuntimeInfo.cpp.o
    distccd[17] (dcc_run_job) output file
CMakeFiles/lib_all_objects.dir/project/foobar/RuntimeInfo.cpp.o
    distccd[17] (dcc_input_tmpnam) input file
/ssd_r0/user/gjvc/project/foobar/RuntimeInfo.cpp
    distccd[17] (dcc_r_token_int) got DOTI001175cd
    distccd[17] (dcc_r_bulk_lzo1x) decompressed 1144269 bytes to
4869619 bytes: 23%
    distccd[17] (dcc_r_file) received 1144269 bytes to file
/tmp/distccd_fcf8c291.ii
    distccd[17] (dcc_r_file_timed) 1144269 bytes received in
0.015365s, rate 72727kB/s
    distccd[17] (dcc_set_input) changed input from
"/ssd_r0/user/gjvc/project/foobar/RuntimeInfo.cpp" to
"/tmp/distccd_fcf8c291.ii"
    distccd[17] (dcc_set_input) command after: /opt/gcc/7.3.0/bin/g++
-g -O0 -pipe -fconcepts -fpermissive -Wno-narrowing -std=c++1z -o
CMakeFiles/lib_all_objects.dir/project/foobar/RuntimeInfo.cpp.o -c
/tmp/distccd_fcf8c291.ii
    distccd[17] (dcc_set_output) changed output from
"CMakeFiles/lib_all_objects.dir/project/foobar/RuntimeInfo.cpp.o" to
"/tmp/distccd_fcbcc291.o"
    distccd[17] (dcc_set_output) command after: /opt/gcc/7.3.0/bin/g++
-g -O0 -pipe -fconcepts -fpermissive -Wno-narrowing -std=c++1z -o
/tmp/distccd_fcbcc291.o -c /tmp/distccd_fcf8c291.ii
    distccd[17] (dcc_spawn_child) forking to execute:
/opt/gcc/7.3.0/bin/g++ -g -O0 -pipe -fconcepts -fpermissive
-Wno-narrowing -std=c++1z -o /tmp/distccd_fcbcc291.o -c
/tmp/distccd_fcf8c291.ii
    distccd[17] (dcc_spawn_child) child started as pid72
    distccd[17] (dcc_collect_child) ERROR: Client fd disconnected, killing job
    distccd[17] (dcc_x_token_int) send DONE00000002
    distccd[17] (dcc_x_token_int) send STAT00006b00
    distccd[17] (dcc_writex) ERROR: failed to write: Broken pipe
    distccd[17] /opt/gcc/7.3.0/bin/g++
/ssd_r0/user/gjvc/project/foobar/RuntimeInfo.cpp on localhost failed
with exit code 107
    distccd[17] job complete
    distccd[17] (dcc_cleanup_tempfiles_inner) deleted 5 temporary files
    distccd[17] (dcc_job_summary) client: 10.101.201.171:51212
CLI_DISCONN exit:107 sig:0 core:0 ret:107 time:6545ms
    distccd[17] (dcc_cleanup_tempfiles_inner) deleted 0 temporary files

What I see on the remote hosts is:
    root     15995  0.0  0.0 712432  6440 ?        Sl   18:49   0:00
/usr/bin/containerd-shim-runc-v2 -namespace moby -id
ab40c598131e195767b36c9795c964e9ae477a1a86bda39c43aba8376a674519
-address /run/containerd/containerd.sock
    nobody   16016  0.0  0.0   1120     4 ?        Ss   18:49   0:00
\_ /sbin/docker-init -- /opt/distcc/3.4/bin/distccd --no-detach
--enable-tcp-insecure --allow 10.101.201.0/24 --allow 10.101.100.0/24
--daemon --log-file /var/tmp/distccd.log --log-level debug
    nobody   16110  0.0  0.0   7052   772 ?        SN   18:49   0:00
   \_ /opt/distcc/3.4/bin/distccd --no-detach --enable-tcp-insecure
--allow 10.101.201.0/24 --allow 10.101.100.0/24 --daemon --log-file
/var/tmp/distccd.log --log-level debug
    nobody   16111  0.0  0.0  20440  8604 ?        SN   18:49   0:00
       \_ /opt/distcc/3.4/bin/distccd --no-detach
--enable-tcp-insecure --allow 10.101.201.0/24 --allow 10.101.100.0/24
--daemon --log-file /var/tmp/distccd.log --log-level debug
    nobody   16195  0.0  0.0      0     0 ?        ZN   18:49   0:00
       |   \_ [g++] <defunct>
    nobody   17479  0.0  0.0      0     0 ?        ZN   18:55   0:00
       |   \_ [g++] <defunct>
    nobody   20346  0.0  0.0      0     0 ?        ZN   19:12   0:00
       |   \_ [g++] <defunct>
    nobody   16112  0.0  0.0  20436  8604 ?        SN   18:49   0:00
       \_ /opt/distcc/3.4/bin/distccd --no-detach
--enable-tcp-insecure --allow 10.101.201.0/24 --allow 10.101.100.0/24
--daemon --log-file /var/tmp/distccd.log --log-level debug
    nobody   17486  0.0  0.0      0     0 ?        ZN   18:55   0:00
       |   \_ [g++] <defunct>
    nobody   20335  0.0  0.0      0     0 ?        ZN   19:12   0:00
       |   \_ [g++] <defunct>
    nobody   16113  0.0  0.0  22096 10608 ?        SN   18:49   0:00
       \_ /opt/distcc/3.4/bin/distccd --no-detach
--enable-tcp-insecure --allow 10.101.201.0/24 --allow 10.101.100.0/24
--daemon --log-file /var/tmp/distccd.log --log-level debug
    nobody   16204  0.0  0.0      0     0 ?        ZN   18:49   0:00
       |   \_ [g++] <defunct>
    nobody   16114  0.0  0.0  22920 11380 ?        SN   18:49   0:00
       \_ /opt/distcc/3.4/bin/distccd --no-detach
--enable-tcp-insecure --allow 10.101.201.0/24 --allow 10.101.100.0/24
--daemon --log-file /var/tmp/distccd.log --log-level debug
    nobody   17539  0.0  0.0      0     0 ?        ZN   18:56   0:00
       |   \_ [g++] <defunct>
    nobody   20369  0.0  0.0      0     0 ?        ZN   19:12   0:00
       |   \_ [g++] <defunct>

Note the STIME field on the zombie processes -- this shows they have
been lingering for a while.

From "man distcc" and the code, I can see that exit code 107 is "I/O
Error", which is fair enough -- the client process went away
unexpectedly, but whatever happens, the child process should be
reaped.
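
To make "reaped" concrete, here is a minimal sketch of the pattern I
would expect after a job is killed: signal the child, then wait on it
so the kernel can drop its process-table entry. This is not distccd's
code (distccd is written in C). The helper names spawn_sleeper and
kill_and_reap are made up for illustration, and the sleeping child
only stands in for a long-running g++.

```python
import os
import signal
import time


def spawn_sleeper(seconds=60):
    """Fork a child that just sleeps; it stands in for a long-running compiler."""
    pid = os.fork()
    if pid == 0:
        # Child process: sleep, then exit without running parent cleanup code.
        try:
            time.sleep(seconds)
        finally:
            os._exit(0)
    return pid


def kill_and_reap(pid, sig=signal.SIGKILL):
    """Signal a child, then wait for it so it does not linger as a zombie."""
    os.kill(pid, sig)
    # Without this waitpid() the dead child stays in state Z (<defunct>)
    # for as long as the parent process keeps running.
    _, status = os.waitpid(pid, 0)
    return status


if __name__ == "__main__":
    child = spawn_sleeper()
    status = kill_and_reap(child)
    print("child", child, "terminated by signal", os.WTERMSIG(status))
```

If the waitpid() call is left out, the killed child shows up as
<defunct> under its parent, which matches what I see under the distccd
workers.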

After doing this a few times, one can see the number of zombie
compiler processes increasing (as seen in the above excerpt from the
output of "ps faux").  The fact that there are multiple zombies under
a single distccd process suggests that I should not be concerned about
running out of slots as mentioned above, but it is clear that these
compiler processes are not being reaped as they should be.  At the
very least, it looks messy in the output of "ps faux" :-)
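
To track the growth over time without reading the "ps faux" tree by
eye, something like the following sketch could count zombies per
parent PID. It assumes procps "ps" on Linux and the output of
"ps -eo pid=,ppid=,stat=,comm=" (four columns, no header). The
function names zombies_by_parent and snapshot are hypothetical.

```python
import subprocess
from collections import Counter


def zombies_by_parent(ps_text):
    """Count zombie processes per parent PID.

    Expects lines of "PID PPID STAT COMMAND", as printed by
    `ps -eo pid=,ppid=,stat=,comm=`.
    """
    counts = Counter()
    for line in ps_text.splitlines():
        # COMMAND may contain spaces (e.g. "g++ <defunct>"), so split at most 3 times.
        fields = line.split(None, 3)
        if len(fields) < 3:
            continue
        _pid, ppid, stat = fields[:3]
        # A STAT value beginning with "Z" marks a zombie (e.g. "ZN").
        if stat.startswith("Z"):
            counts[int(ppid)] += 1
    return counts


def snapshot():
    """Run ps on the local host and return the zombie count per parent PID."""
    out = subprocess.run(
        ["ps", "-eo", "pid=,ppid=,stat=,comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    return zombies_by_parent(out)


if __name__ == "__main__":
    for ppid, n in sorted(snapshot().items()):
        print(f"parent {ppid}: {n} zombie(s)")
```

Running it before and after a few interrupted builds should show
whether the counts under each distccd worker keep increasing.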

Any and all suggestions welcome.  Thank you very much!



gjvc


