[distcc] Using distcc with a new build system
nadim
nadim at khemir.net
Sat Dec 18 11:42:06 GMT 2004
Hi,
There seems to be little traffic on this mailing list and more on the
newsgroup. If this is not the appropriate way to communicate with the distcc
community, please let me know.
I'll start by giving you some context that will make the rest easier to
follow. Anders Lindgren and I (Nadim Khemir) have written a "rather" advanced
build system, similar to cons/scons/cook etc. The system is written in Perl
and is fun to use (some of you who have been on the cons mailing list might
remember me).
Because of the lack of "real" threads in Perl, we implemented the
parallelization of the build with processes that communicate through
socketpairs. We consider the parallelization code to be a prototype, but it's
short and not very complex, so it "should" be OK.
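The process-plus-socketpair scheme described above can be sketched roughly as
follows. This is not the build system's actual code. It is a minimal
illustration that assumes one forked worker per job slot. The worker reads a
shell command per line from its end of the socketpair, runs it with `system`,
and writes the exit code back. The names `start_worker` and `run_job`, and the
line-based protocol, are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Socket;      # AF_UNIX, SOCK_STREAM, PF_UNSPEC
use IO::Handle;  # autoflush() on lexical handles

# Fork one worker process connected to the parent by a socketpair.
# Hypothetical protocol: one shell command per line in, one exit code
# per line out. Returns ($pid, $parent_end).
sub start_worker {
    socketpair(my $parent_end, my $child_end, AF_UNIX, SOCK_STREAM, PF_UNSPEC)
        or die "socketpair: $!";
    $parent_end->autoflush(1);
    $child_end->autoflush(1);

    defined(my $pid = fork()) or die "fork: $!";
    if ($pid == 0) {                     # worker
        close $parent_end;
        while (my $cmd = <$child_end>) {
            chomp $cmd;
            my $status = system($cmd);
            # exit code, or 255 if the command could not be started
            my $code = $status == -1 ? 255 : $status >> 8;
            print {$child_end} "$code\n";
        }
        exit 0;                          # parent closed its end: done
    }
    close $child_end;                    # parent keeps only its own end
    return ($pid, $parent_end);
}

# Send one command to a worker and block until its exit code comes back.
sub run_job {
    my ($sock, $cmd) = @_;
    print {$sock} "$cmd\n";
    my $reply = <$sock>;
    die "worker exited unexpectedly\n" unless defined $reply;
    chomp $reply;
    return $reply;
}
```

A real scheduler would start several workers and wait on all their sockets at
once, for example with `IO::Select`, instead of blocking in `run_job`.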
My test setup at home is three Linux boxes, but I'd like to use the build
system with distcc at my job, with a compilation farm and a heterogeneous
machine park. For example, I am interested to know what to expect when many
users compile on the farm at the same time. Today we have 300 developers with
2 GHz P4 boxes. They don't compile all the time, but when they do, they'd like
to have some serious CPU available. I wonder whether so many people sharing a
compile farm of, say, 20-30 boxes will find it faster or slower than their own
box. Their own computers run Windows, and installing Cygwin (which is slow)
and gcc locally on 300 computers seems to me like an administrative
nightmare.
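The question of whether a shared farm will feel faster or slower can be framed
as a back-of-envelope calculation. The sketch below assumes the farm's CPU is
split evenly among the developers compiling at the same moment. It ignores
several things: preprocessing and linking still run on each developer's own
box, network transfer takes time, and a single build can only keep as many
boxes busy as it has parallel jobs. The function name `farm_share` and all
numbers in it are hypothetical, not measurements.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Back-of-envelope: how many "own boxes" worth of compile CPU a developer
# gets from a shared farm while compiling.
#   farm_boxes       - number of farm machines
#   relative_speed   - farm box speed divided by the developer box speed
#   concurrent_users - developers compiling at the same moment
# Assumes an even split; ignores local preprocessing/linking, network
# transfer and per-build job limits.
sub farm_share {
    my (%p) = @_;
    die "concurrent_users must be positive\n" unless $p{concurrent_users} > 0;
    return $p{farm_boxes} * $p{relative_speed} / $p{concurrent_users};
}

# Hypothetical numbers for a 25-box farm of machines as fast as a desktop.
for my $users (5, 25, 50) {
    printf "%2d concurrent users: %.2f x own box\n",
        $users,
        farm_share(farm_boxes => 25, relative_speed => 1,
                   concurrent_users => $users);
}
```

With these made-up inputs it prints 5.00, 1.00 and 0.50. A value above 1
suggests the farm should feel faster than the developer's own box, and a value
below 1 suggests it will feel slower.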
A/ We use colorgcc:
We sometimes get this output from distcc:
>>>>>>>>
#------------------------------------------------------------------------------
Node './lib/msm_terminated.o':
distcc -O2 -Wall -Wshadow -Wpointer-arith -Wcast-qual -Wcast-align
-Wwrite-strings -Wstrict-prototypes -Wmissing-prototypes
-Wmissing-declarations -Wredundant-decls -Wnested-externs -Winline -D_MODULE
-g -pg -ffunction-sections -I/devel/q04c/obigo/msf
-I/devel/q04c/obigo/msf/msf_lib -I/devel/q04c/obigo/msf/msf_lib/config
-I/devel/q04c/obigo/msf/msf_lib/export -I/devel/q04c/obigo/msf/msf_lib/intgr
-I/devel/q04c/obigo/msf/lib
-o /devel/q04c/obigo/msf/out_nadim_2/lib/msm_terminated.o
-c /devel/q04c/obigo/msf/lib/msm_terminated.c
cc1: warning: -ffunction-sections disabled; it makes profiling impossible
/home/nakh/tmp/distccd_ae3078de.i:1: error: stray '\33' in program
/home/nakh/tmp/distccd_ae3078de.i:1: error: parse error before '[' token
/home/nakh/tmp/distccd_ae3078de.i:1:3: invalid suffix "m" on integer constant
/home/nakh/tmp/distccd_ae3078de.i:1: error: syntax error at '#' token
/home/nakh/tmp/distccd_ae3078de.i:2: error: stray '\33' in program
/home/nakh/tmp/distccd_ae3078de.i:2:3: invalid suffix "m" on integer constant
/home/nakh/tmp/distccd_ae3078de.i:2: error: syntax error at '#' token
/home/nakh/tmp/distccd_ae3078de.i:3: error: stray '\33' in program
/home/nakh/tmp/distccd_ae3078de.i:3:3: invalid suffix "m" on integer constant
<<<<<<<<
We think this has to do with the color codes that 'colorgcc' adds, but we
don't understand what the problem is. Does distcc use 'gcc -E' for
preprocessing? 'colorgcc' happily mixes STDOUT and STDERR, which would be an
obvious 'colorgcc' bug if the preprocessed code is written to STDOUT. If you
tell us how the preprocessor is invoked, we can look into modifying 'colorgcc'
so that all distcc users can use it too.
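If the preprocessed code does go to STDOUT, one way to keep the streams apart
is to let the child's STDOUT pass straight through to the wrapper's STDOUT and
colorize only STDERR. The sketch below does this under the assumption that
coloring only the compiler's diagnostics is enough. It is not colorgcc's code.
`colorize` and `run_colored` are hypothetical names. The coloring rule (red
for lines containing `error:`, yellow for `warning:`) is also an assumption.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IPC::Open3;          # open3: spawn with separate stdin/stdout/stderr
use Symbol qw(gensym);   # open3 does not autovivify the stderr handle

# Hypothetical coloring rule: red for errors, yellow for warnings.
sub colorize {
    my ($line) = @_;
    my $color = $line =~ /\berror:/   ? 31
              : $line =~ /\bwarning:/ ? 33
              : undef;
    return $line unless defined $color;
    my $nl = ($line =~ s/\n\z//) ? "\n" : "";   # reset before the newline
    return "\e[${color}m$line\e[0m$nl";
}

# Run @cmd, passing its STDOUT through untouched and coloring only STDERR.
# Returns the child's exit code.
sub run_colored {
    my @cmd = @_;
    my $err = gensym;
    # '>&STDOUT': the child writes directly to our STDOUT (a dup, not a
    # pipe), so output such as that of `gcc -E` never gets escape codes.
    my $pid = open3(my $in, '>&STDOUT', $err, @cmd);
    close $in;
    while (my $line = <$err>) {
        print STDERR colorize($line);
    }
    waitpid($pid, 0);
    return $? >> 8;
}

exit run_colored(@ARGV) if !caller && @ARGV;
```

Used as a wrapper, `perl colorwrap.pl gcc -E foo.c > foo.i` (script name
hypothetical) would leave `foo.i` free of escape codes while still coloring
any diagnostics on the terminal.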
B/ Sometimes distcc or our build system hangs in a few strange ways
(this is a problem I have fixed in our code, but I still wonder why I got
these messages before).
We get this:
Mon Dec 13 14:18:06 2004
31080 Receive hdi_widgettextinputcreate.c 172.31.4.103[0]
Mon Dec 13 14:18:07 2004
31080 Receive hdi_widgettextinputcreate.c 172.31.4.103[0]
Mon Dec 13 14:18:08 2004
31080 Receive hdi_widgettextinputcreate.c 172.31.4.103[0]
Mon Dec 13 14:18:10 2004
Mon Dec 13 14:18:11 2004
but the build system was still waiting for the command to finish!
Again, this was an error in the communication between the build system and
its build processes. What I don't understand is why distccmon-text shows
'Receive' here.
This was also surprising:
[ali@khemir obigo]$ ps aux | grep cc
ali 1740 0.0 0.1 1972 516 pts0 SN 12:55 0:00 distcc -O2
-Wall -Wshadow -Wpointer-arith -I/devel/q04c/obigo/msf/msf_lib/intgr
-I/devel/q04c/obigo/msf/lib
-o /devel/q04c/obigo/projects/ali_grisar_runt/out_ali/msf/lib/hdi_widgetbargetvalues.o
-c /devel/q04c/obigo/msf/lib/hdi_widgetbargetvalues.c
ali 1741 0.0 0.0 0 0 pts0 ZN 12:55 0:00 [cc] <defunct>
ali 2845 0.0 0.1 1944 668 pts1 SN 12:57 0:00 grep cc
Is distcc waiting for a zombie process here? Our first idea was that SIGCHLD
wasn't being handled properly. But while we were discussing it, convinced the
build was dead in the water, we were surprised to see the build complete!
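For the build system's own worker processes, a common Perl pattern keeps
exited children from lingering as `<defunct>` entries. It uses a non-blocking
reap loop with `waitpid(-1, WNOHANG)`, called from a `SIGCHLD` handler. The
loop matters because several children can exit while only one signal is
delivered. The sketch below assumes the parent only needs each child's exit
code. `reap_finished` and `%finished` are hypothetical names.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX ":sys_wait_h";   # WNOHANG

# pid => exit code of every child reaped so far.
my %finished;

# Reap every child that has already exited, without blocking.
# Returns pid => exit code for the children collected by this call.
sub reap_finished {
    local ($!, $?);    # do not clobber the interrupted code's $! and $?
    my %done;
    while ((my $pid = waitpid(-1, WNOHANG)) > 0) {
        $done{$pid} = $? >> 8;
    }
    return %done;
}

# On SIGCHLD, collect all exited children (several may share one signal).
$SIG{CHLD} = sub {
    my %done = reap_finished();
    @finished{keys %done} = values %done;
};
```

The main loop then only has to check `exists $finished{$pid}` for the jobs it
is waiting on. It never blocks on a child that has already been reaped.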
After fixing the communication between PBS (our build system) and the build
processes, I made a test run on my three boxes, compiling around 200K lines of
code in 500 source files. It built in 45 s on my little cluster (3 GHz + 1 GHz
+ 700 MHz), compared to 70 s on the 3 GHz box alone. I ran the test 200 times
while looking at the news, and I did get one surprise: monitoring the CPU
loads, I saw that one of the boxes stopped compiling after a while. distcc did
the right thing, as it would not use that node on the next build. The problem
was that distcc had filled the $TMP directory on that box with 4000 rti files
(those produced by gcc). Once I removed the files, compilation ran smoothly
again. This didn't happen on the other two computers. Does anyone have an idea
why this occurred?
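To catch this before a node silently drops out, a small check like the one
below could run on each box. It is a sketch that assumes two things: the
leftover files sit directly in `$TMPDIR` (falling back to `/tmp`), and they
can be recognized by a name pattern. The `distccd_` prefix comes from the
temporary file names in the error log above. Both the pattern and the
threshold are assumptions to adjust to whatever files actually pile up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count regular files in $dir whose names match $pattern.
sub count_matching {
    my ($dir, $pattern) = @_;
    opendir(my $dh, $dir) or die "opendir $dir: $!";
    my $n = grep { /$pattern/ && -f "$dir/$_" } readdir($dh);
    closedir($dh);
    return $n;
}

if (!caller) {
    my $dir       = $ENV{TMPDIR} || '/tmp';
    my $pattern   = qr/^distccd_/;   # assumption: prefix seen in the log above
    my $threshold = 1000;            # hypothetical alarm level
    my $n = count_matching($dir, $pattern);
    print "$n matching files in $dir\n";
    warn "warning: $dir may be filling up\n" if $n > $threshold;
}
```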
Our build system can also distribute builds with the help of a distributed
filesystem. I saw that you have no figures to show in the FAQ; I could provide
some if you wish.
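As a rough check on the three-box figures quoted above, the measured speedup
can be compared with a crude model in which compile throughput scales with
clock frequency. This is an assumption: the model ignores per-CPU differences,
network transfer, and the preprocessing and linking that stay on the 3 GHz
box.

```latex
S_{\text{measured}} = \frac{70\ \text{s}}{45\ \text{s}} \approx 1.56,
\qquad
S_{\text{clock}} = \frac{(3 + 1 + 0.7)\ \text{GHz}}{3\ \text{GHz}} = \frac{4.7}{3} \approx 1.57
```

By this rough measure, the cluster is already being used about as well as its
combined clock speed allows.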
Cheers, Nadim and Anders.
PS: Our monitoring tool :-)
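#!/usr/bin/perl
use strict;
use warnings;

$| = 1;    # flush each snapshot immediately

# Print a GMT timestamp followed by one distccmon-text snapshot, every second.
while (1) {
    print scalar(gmtime()), "\n";
    print `distccmon-text`;
    sleep(1);
}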