[distcc] distccd creates zombie gcc processes, which are never reaped
George Cox
george.cox at gmail.com
Tue Mar 28 23:53:14 UTC 2023
Hello,
I am using distcc 3.4 (built from source) on CentOS Linux release
7.9.2009 (Core). Successful compilations work fine, but interrupted
compilations (where one presses Ctrl-C on the client machine,
interrupting make or whatever process is driving the build) lead to
errors in the server-side distccd log and leave zombie compiler
processes on the servers. This is concerning because they appear to
use up worker slots permanently, which would eventually leave none
free and make remote compilation impossible. I am *not* using
"distcc-pump" mode.
I am configuring distcc like this:
export DISTCC_HOSTS="build01.example.com/40,lzo
build03.example.com/40,lzo build05.example.com/40,lzo
build06.example.com/32,lzo build07.example.com/32,lzo"
export DISTCC_DIR="/var/tmp/distcc.${LOGNAME}"
I am running distcc like this:
/opt/distcc/3.4/bin/distcc /opt/gcc/7.3.0/bin/g++ [...compiler
arguments elided...]
I am starting distccd like this:
/opt/distcc/3.4/bin/distccd --no-detach --enable-tcp-insecure
--allow 10.101.201.0/24 --daemon --log-file
/var/tmp/distccd.log --log-level debug
I am running distccd in Docker, but I see the same behaviour when I
run it under systemd.
What I'm seeing in distccd.log is:
distccd[17] compile from RuntimeInfo.cpp to RuntimeInfo.cpp.o
distccd[17] (dcc_run_job) output file
CMakeFiles/lib_all_objects.dir/project/foobar/RuntimeInfo.cpp.o
distccd[17] (dcc_input_tmpnam) input file
/ssd_r0/user/gjvc/project/foobar/RuntimeInfo.cpp
distccd[17] (dcc_r_token_int) got DOTI001175cd
distccd[17] (dcc_r_bulk_lzo1x) decompressed 1144269 bytes to
4869619 bytes: 23%
distccd[17] (dcc_r_file) received 1144269 bytes to file
/tmp/distccd_fcf8c291.ii
distccd[17] (dcc_r_file_timed) 1144269 bytes received in
0.015365s, rate 72727kB/s
distccd[17] (dcc_set_input) changed input from
"/ssd_r0/user/gjvc/project/foobar/RuntimeInfo.cpp" to
"/tmp/distccd_fcf8c291.ii"
distccd[17] (dcc_set_input) command after: /opt/gcc/7.3.0/bin/g++
-g -O0 -pipe -fconcepts -fpermissive -Wno-narrowing -std=c++1z -o
CMakeFiles/lib_all_objects.dir/project/foobar/RuntimeInfo.cpp.o -c
/tmp/distccd_fcf8c291.ii
distccd[17] (dcc_set_output) changed output from
"CMakeFiles/lib_all_objects.dir/project/foobar/RuntimeInfo.cpp.o" to
"/tmp/distccd_fcbcc291.o"
distccd[17] (dcc_set_output) command after: /opt/gcc/7.3.0/bin/g++
-g -O0 -pipe -fconcepts -fpermissive -Wno-narrowing -std=c++1z -o
/tmp/distccd_fcbcc291.o -c /tmp/distccd_fcf8c291.ii
distccd[17] (dcc_spawn_child) forking to execute:
/opt/gcc/7.3.0/bin/g++ -g -O0 -pipe -fconcepts -fpermissive
-Wno-narrowing -std=c++1z -o /tmp/distccd_fcbcc291.o -c
/tmp/distccd_fcf8c291.ii
distccd[17] (dcc_spawn_child) child started as pid72
distccd[17] (dcc_collect_child) ERROR: Client fd disconnected, killing job
distccd[17] (dcc_x_token_int) send DONE00000002
distccd[17] (dcc_x_token_int) send STAT00006b00
distccd[17] (dcc_writex) ERROR: failed to write: Broken pipe
distccd[17] /opt/gcc/7.3.0/bin/g++
/ssd_r0/user/gjvc/project/foobar/RuntimeInfo.cpp on localhost failed
with exit code 107
distccd[17] job complete
distccd[17] (dcc_cleanup_tempfiles_inner) deleted 5 temporary files
distccd[17] (dcc_job_summary) client: 10.101.201.171:51212
CLI_DISCONN exit:107 sig:0 core:0 ret:107 time:6545ms
distccd[17] (dcc_cleanup_tempfiles_inner) deleted 0 temporary files
What I see on the remote hosts is:
root 15995 0.0 0.0 712432 6440 ? Sl 18:49 0:00
/usr/bin/containerd-shim-runc-v2 -namespace moby -id
ab40c598131e195767b36c9795c964e9ae477a1a86bda39c43aba8376a674519
-address /run/containerd/containerd.sock
nobody 16016 0.0 0.0 1120 4 ? Ss 18:49 0:00
\_ /sbin/docker-init -- /opt/distcc/3.4/bin/distccd --no-detach
--enable-tcp-insecure --allow 10.101.201.0/24 --allow 10.101.100.0/24
--daemon --log-file /var/tmp/distccd.log --log-level debug
nobody 16110 0.0 0.0 7052 772 ? SN 18:49 0:00
\_ /opt/distcc/3.4/bin/distccd --no-detach --enable-tcp-insecure
--allow 10.101.201.0/24 --allow 10.101.100.0/24 --daemon --log-file
/var/tmp/distccd.log --log-level debug
nobody 16111 0.0 0.0 20440 8604 ? SN 18:49 0:00
\_ /opt/distcc/3.4/bin/distccd --no-detach
--enable-tcp-insecure --allow 10.101.201.0/24 --allow 10.101.100.0/24
--daemon --log-file /var/tmp/distccd.log --log-level debug
nobody 16195 0.0 0.0 0 0 ? ZN 18:49 0:00
| \_ [g++] <defunct>
nobody 17479 0.0 0.0 0 0 ? ZN 18:55 0:00
| \_ [g++] <defunct>
nobody 20346 0.0 0.0 0 0 ? ZN 19:12 0:00
| \_ [g++] <defunct>
nobody 16112 0.0 0.0 20436 8604 ? SN 18:49 0:00
\_ /opt/distcc/3.4/bin/distccd --no-detach
--enable-tcp-insecure --allow 10.101.201.0/24 --allow 10.101.100.0/24
--daemon --log-file /var/tmp/distccd.log --log-level debug
nobody 17486 0.0 0.0 0 0 ? ZN 18:55 0:00
| \_ [g++] <defunct>
nobody 20335 0.0 0.0 0 0 ? ZN 19:12 0:00
| \_ [g++] <defunct>
nobody 16113 0.0 0.0 22096 10608 ? SN 18:49 0:00
\_ /opt/distcc/3.4/bin/distccd --no-detach
--enable-tcp-insecure --allow 10.101.201.0/24 --allow 10.101.100.0/24
--daemon --log-file /var/tmp/distccd.log --log-level debug
nobody 16204 0.0 0.0 0 0 ? ZN 18:49 0:00
| \_ [g++] <defunct>
nobody 16114 0.0 0.0 22920 11380 ? SN 18:49 0:00
\_ /opt/distcc/3.4/bin/distccd --no-detach
--enable-tcp-insecure --allow 10.101.201.0/24 --allow 10.101.100.0/24
--daemon --log-file /var/tmp/distccd.log --log-level debug
nobody 17539 0.0 0.0 0 0 ? ZN 18:56 0:00
| \_ [g++] <defunct>
nobody 20369 0.0 0.0 0 0 ? ZN 19:12 0:00
| \_ [g++] <defunct>
Note the START column (18:49, 18:55, 19:12) on the zombie processes --
it shows they have been lingering for a while.
From "man distcc" and the code, I can see that exit code 107 is "I/O
Error", which is fair enough, since the client process went away
unexpectedly. Whatever happens, though, the child process should be
reaped.
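To make "reaped" concrete, here is a minimal sketch of the
kill-then-wait pattern. It is not distcc's code: the function names
spawn_sleeping_child and kill_and_reap are made up for illustration,
and a sleeping child stands in for the compiler. The point is only
that sending the signal is not enough; the parent must also collect
the exit status with waitpid(), otherwise the child stays <defunct>.

```python
import os
import signal
import time


def spawn_sleeping_child(seconds=60):
    """Fork a child that just sleeps (hypothetical stand-in for g++)."""
    pid = os.fork()
    if pid == 0:
        # Child: sleep, then exit without running the parent's cleanup.
        time.sleep(seconds)
        os._exit(0)
    return pid


def kill_and_reap(pid, sig=signal.SIGTERM):
    """Signal the child, then block until its exit status is collected.

    os.kill() alone would leave a zombie behind; the os.waitpid() call
    is what removes the entry from the process table.
    """
    os.kill(pid, sig)
    _, status = os.waitpid(pid, 0)
    return status


if __name__ == "__main__":
    child = spawn_sleeping_child()
    status = kill_and_reap(child)
    print("child", child, "terminated by signal", os.WTERMSIG(status))
```

A second waitpid() on the same pid fails with ECHILD
(ChildProcessError in Python), which is a simple way to confirm that
the child really has been reaped.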
After doing this a few times, one can see the number of zombie
compiler processes increasing (as seen in the above excerpt from the
output of "ps faux"). The fact that there are multiple zombies under
a single distccd process suggests that I should not be concerned about
running out of slots as mentioned above: if each zombie tied up its
worker for good, that worker could not have gone on to collect more
than one. It is clear, though, that these compiler processes are not
being reaped as they should be. At the very least, it looks messy in
the output of "ps faux" :-)
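For anyone wanting to track this over time rather than eyeball the
process tree, a rough sketch follows. It assumes the column order of
"ps -eo pid,ppid,stat,comm" (procps); the helper names
zombies_per_parent and current_zombies are invented for this example
and are not part of distcc.

```python
import subprocess
from collections import Counter


def zombies_per_parent(ps_output):
    """Count zombie processes per parent PID.

    Expects text in the column order pid, ppid, stat, comm.
    """
    counts = Counter()
    for line in ps_output.splitlines():
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue
        _pid, ppid, stat, _comm = fields
        if not ppid.isdigit():
            continue  # header row or anything else unparsable
        if stat.startswith("Z"):
            counts[int(ppid)] += 1
    return counts


def current_zombies():
    """Run ps on this host and count zombies per parent PID."""
    out = subprocess.run(
        ["ps", "-eo", "pid,ppid,stat,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    return zombies_per_parent(out)
```

Running current_zombies() on a build server after each interrupted
build should show the counts for the distccd worker PIDs going up if
the children are not being reaped.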
Any and all suggestions welcome. Thank you very much!
gjvc