[distcc] distccd creates zombie gcc processes, which are never reaped

George Cox george.cox at gmail.com
Wed Mar 29 22:00:54 UTC 2023


Hello,

Adding a call to dcc_reap_kids() at the end of the main loop in
dcc_preforked_child() seems to fix the problem.

$ git --no-pager diff --no-prefix src/prefork.c

diff --git src/prefork.c src/prefork.c
index d4d70d5..39f3c8a 100644
--- src/prefork.c
+++ src/prefork.c
@@ -196,6 +196,9 @@ static int dcc_preforked_child(int listen_fd)

         dcc_close(acc_fd);
         now = time(NULL);
+
+        /* wait for any children to exit */
+        dcc_reap_kids(FALSE);
     }

     rs_log_info("worn out");


On Wed, Mar 29, 2023 at 12:53 AM George Cox <george.cox at gmail.com> wrote:
>
> Hello,
>
> I am using distcc 3.4 (compiled by me from source) on CentOS (CentOS
> Linux release 7.9.2009 (Core)). Successful compilations work OK, but
> interrupted compilations (where one presses ctrl-C on the client
> machine, interrupting the make or whatever process), lead to errors in
> the server-side distccd log, and zombie compiler processes remaining
> on the servers. This is concerning because they appear to be
> permanently using up worker slots, eventually leading to a situation
> where none are left and no remote compilation is possible. I am *not*
> using "distcc-pump" mode.
>
> I am configuring distcc like this:
>     export DISTCC_HOSTS="build01.example.com/40,lzo
> build03.example.com/40,lzo build05.example.com/40,lzo
> build06.example.com/32,lzo build07.example.com/32,lzo"
>     export DISTCC_DIR="/var/tmp/distcc.${LOGNAME}"
>
> I am running distcc like this:
>     /opt/distcc/3.4/bin/distcc /opt/gcc/7.3.0/bin/g++ [...compiler
> arguments elided...]
>
> I am starting distccd like this:
>     /opt/distcc/3.4/bin/distccd --no-detach --enable-tcp-insecure
> --allow 10.101.201.0/24 --daemon --log-file
> /var/tmp/distccd.log --log-level debug
>
> I am running distccd in Docker, but I see the same behaviour when I
> run it under systemd.
>
> What I'm seeing in the distccd.log is
>     distccd[17] compile from RuntimeInfo.cpp to RuntimeInfo.cpp.o
>     distccd[17] (dcc_run_job) output file
> CMakeFiles/lib_all_objects.dir/project/foobar/RuntimeInfo.cpp.o
>     distccd[17] (dcc_input_tmpnam) input file
> /ssd_r0/user/gjvc/project/foobar/RuntimeInfo.cpp
>     distccd[17] (dcc_r_token_int) got DOTI001175cd
>     distccd[17] (dcc_r_bulk_lzo1x) decompressed 1144269 bytes to
> 4869619 bytes: 23%
>     distccd[17] (dcc_r_file) received 1144269 bytes to file
> /tmp/distccd_fcf8c291.ii
>     distccd[17] (dcc_r_file_timed) 1144269 bytes received in
> 0.015365s, rate 72727kB/s
>     distccd[17] (dcc_set_input) changed input from
> "/ssd_r0/user/gjvc/project/foobar/RuntimeInfo.cpp" to
> "/tmp/distccd_fcf8c291.ii"
>     distccd[17] (dcc_set_input) command after: /opt/gcc/7.3.0/bin/g++
> -g -O0 -pipe -fconcepts -fpermissive -Wno-narrowing -std=c++1z -o
> CMakeFiles/lib_all_objects.dir/project/foobar/RuntimeInfo.cpp.o -c
> /tmp/distccd_fcf8c291.ii
>     distccd[17] (dcc_set_output) changed output from
> "CMakeFiles/lib_all_objects.dir/project/foobar/RuntimeInfo.cpp.o" to
> "/tmp/distccd_fcbcc291.o"
>     distccd[17] (dcc_set_output) command after: /opt/gcc/7.3.0/bin/g++
> -g -O0 -pipe -fconcepts -fpermissive -Wno-narrowing -std=c++1z -o
> /tmp/distccd_fcbcc291.o -c /tmp/distccd_fcf8c291.ii
>     distccd[17] (dcc_spawn_child) forking to execute:
> /opt/gcc/7.3.0/bin/g++ -g -O0 -pipe -fconcepts -fpermissive
> -Wno-narrowing -std=c++1z -o /tmp/distccd_fcbcc291.o -c
> /tmp/distccd_fcf8c291.ii
>     distccd[17] (dcc_spawn_child) child started as pid72
>     distccd[17] (dcc_collect_child) ERROR: Client fd disconnected, killing job
>     distccd[17] (dcc_x_token_int) send DONE00000002
>     distccd[17] (dcc_x_token_int) send STAT00006b00
>     distccd[17] (dcc_writex) ERROR: failed to write: Broken pipe
>     distccd[17] /opt/gcc/7.3.0/bin/g++
> /ssd_r0/user/gjvc/project/foobar/RuntimeInfo.cpp on localhost failed
> with exit code 107
>     distccd[17] job complete
>     distccd[17] (dcc_cleanup_tempfiles_inner) deleted 5 temporary files
>     distccd[17] (dcc_job_summary) client: 10.101.201.171:51212
> CLI_DISCONN exit:107 sig:0 core:0 ret:107 time:6545ms
>     distccd[17] (dcc_cleanup_tempfiles_inner) deleted 0 temporary files
>
> What I see on the remote hosts is:
>     root     15995  0.0  0.0 712432  6440 ?        Sl   18:49   0:00
> /usr/bin/containerd-shim-runc-v2 -namespace moby -id
> ab40c598131e195767b36c9795c964e9ae477a1a86bda39c43aba8376a674519
> -address /run/containerd/containerd.sock
>     nobody   16016  0.0  0.0   1120     4 ?        Ss   18:49   0:00
> \_ /sbin/docker-init -- /opt/distcc/3.4/bin/distccd --no-detach
> --enable-tcp-insecure --allow 10.101.201.0/24 --allow 10.101.100.0/24
> --daemon --log-file /var/tmp/distccd.log --log-level debug
>     nobody   16110  0.0  0.0   7052   772 ?        SN   18:49   0:00
>    \_ /opt/distcc/3.4/bin/distccd --no-detach --enable-tcp-insecure
> --allow 10.101.201.0/24 --allow 10.101.100.0/24 --daemon --log-file
> /var/tmp/distccd.log --log-level debug
>     nobody   16111  0.0  0.0  20440  8604 ?        SN   18:49   0:00
>        \_ /opt/distcc/3.4/bin/distccd --no-detach
> --enable-tcp-insecure --allow 10.101.201.0/24 --allow 10.101.100.0/24
> --daemon --log-file /var/tmp/distccd.log --log-level debug
>     nobody   16195  0.0  0.0      0     0 ?        ZN   18:49   0:00
>        |   \_ [g++] <defunct>
>     nobody   17479  0.0  0.0      0     0 ?        ZN   18:55   0:00
>        |   \_ [g++] <defunct>
>     nobody   20346  0.0  0.0      0     0 ?        ZN   19:12   0:00
>        |   \_ [g++] <defunct>
>     nobody   16112  0.0  0.0  20436  8604 ?        SN   18:49   0:00
>        \_ /opt/distcc/3.4/bin/distccd --no-detach
> --enable-tcp-insecure --allow 10.101.201.0/24 --allow 10.101.100.0/24
> --daemon --log-file /var/tmp/distccd.log --log-level debug
>     nobody   17486  0.0  0.0      0     0 ?        ZN   18:55   0:00
>        |   \_ [g++] <defunct>
>     nobody   20335  0.0  0.0      0     0 ?        ZN   19:12   0:00
>        |   \_ [g++] <defunct>
>     nobody   16113  0.0  0.0  22096 10608 ?        SN   18:49   0:00
>        \_ /opt/distcc/3.4/bin/distccd --no-detach
> --enable-tcp-insecure --allow 10.101.201.0/24 --allow 10.101.100.0/24
> --daemon --log-file /var/tmp/distccd.log --log-level debug
>     nobody   16204  0.0  0.0      0     0 ?        ZN   18:49   0:00
>        |   \_ [g++] <defunct>
>     nobody   16114  0.0  0.0  22920 11380 ?        SN   18:49   0:00
>        \_ /opt/distcc/3.4/bin/distccd --no-detach
> --enable-tcp-insecure --allow 10.101.201.0/24 --allow 10.101.100.0/24
> --daemon --log-file /var/tmp/distccd.log --log-level debug
>     nobody   17539  0.0  0.0      0     0 ?        ZN   18:56   0:00
>        |   \_ [g++] <defunct>
>     nobody   20369  0.0  0.0      0     0 ?        ZN   19:12   0:00
>        |   \_ [g++] <defunct>
>
> Note the START column on the zombie processes -- this shows they have
> been lingering for a while.
>
> From "man distcc" and the code, I can see that exit code 107 is "I/O
> Error", which is fair enough -- the client process went away
> unexpectedly, but whatever happens, the child process should be
> reaped.
>
> After doing this a few times, one can see the number of zombie
> compiler processes increasing (as seen in the above excerpt from the
> output of "ps faux").  The fact that there are multiple zombies under
> a single distccd process suggests that I should not be concerned about
> running out of slots as mentioned above, but it is clear that these
> compiler processes are not being reaped as they should be.  At the
> very least, it looks messy in the output of "ps faux" :-)
>
> Any and all suggestions welcome.  Thank you very much!
>
>
>
> gjvc
