[distcc] Contributing to distCC / Massively parallel compilations

Tue Dec 14 15:18:11 GMT 2004

Assaf, et al.,

Some time back I posted a message about "My big plans for distcc".  I have
implemented most of those plans, and they are in use here at Marconi.  I have a
compilation farm of linux and solaris boxes with varying CPU speeds and
varying numbers of cpus each.  I have also designated one solaris box as my
"host server" and have a program running there that gives out hosts to
compilations, keeps track of which machines ("hosts") in the compilation farm
are available/accessible, gives out status information to other programs, etc.

If you look at that previous posting, you'll see my goals for the project,
which include (as I recall):

o having a heterogeneous system (solaris, linux, etc.)
o having the system be load-balancing.
o being able to add and remove hosts from the compilation according to the
machine's load average, whether or not it is in use as a desktop, etc.
o giving out the fastest machines first, so that compilations are the fastest
possible.
o supporting many compilations simultaneously from multiple machines.

These goals have all been met by my implementation.

Now, here is the crucial part that may interest you, Assaf and friends: the
code is all written in python, which is a wonderful language, IMO.  But, as Dan
Kegel has suggested, the system may be more widely used, and easier to install
if it were written in C.  I agree with him.

So, perhaps you would like to take this python code and rewrite it as C code. 
In some cases, that will be no small task.  A nice line in python like this:

random.shuffle(avail_cpu_tiers[t]) 

would probably require hundreds of lines of C code...

I've attached the python code for the various programs.  The big part is
host-server, which I'll explain in some detail here:

o host-server: the main server.  This code listens on 4 TCP ports, one each
for:
   o Host requests: these come from the "gethost" program.  The code picks the
fastest available cpu and returns it to the requestor, marking that cpu as "in
use" and unavailable.  The cpu becomes available again when the TCP connection
from the requestor is closed.  
     The host server keeps all CPUs in a database, organized into "tiers",
where a tier is a list of available (unassigned) CPUs on machines that have
roughly equivalent compilation speed.  (I've found that compilation speed
roughly tracks CPU speed, i.e., MHz, but it does differ somewhat based on the
OS running on the machine.  This is another matter for research: to document
what impacts a machine's compilation speed.)
     The host server gets the speed of a machine and the number of CPUs on a
machine from a static hosts-info file.  I've included my sample hosts-info file
as well.  Also, the machine's speed and number of CPUs can be provided to the
host server when the machine first becomes available as a compilation host, via
the addhost script.  You'll see that below.
     Now, back to the 2nd of the 4 TCP ports the host-server listens on.
   o It listens for load average reports from compilation hosts.  Each host
that is running the distccd daemons also runs the program "loadavg", which
reports the current load averages for the machine to the host-server.  The host
server can then move heavily-used hosts down in the tiers, so that fewer
compilations are sent there.  This is necessary because we do not have an
exclusive set of machines that are compilation machines -- they are generally
available for running xterms, xemacs, etc., for software engineers.  Sometimes
people log in to a compilation host and run a non-distributed compilation. 
This adds load to the machine, making it less attractive as a compilation host.

      An open issue for research here is what the algorithm should look like for
moving a host down in the tiers when the host's load average goes up.  You can
look at the code in host-server (see routine getTier()) to see how I do it. 
This is very much an open issue.
   o It listens for status messages.  There are two programs, addhost and
remhost, which send small status messages to the host server indicating the
availability or unavailability of a compilation host.  As I noted above, the
addhost routine also may send the number of cpus and the "power index" for the
new host.  The "power index" is a value >= 1 that indicates a machine's speed
in doing compiles.  
     I've been computing a machine's power index by running some compilations
on the machine and timing the results, trying to subtract the time taken for
overhead (the time taken by gmake (or in our case here at Marconi, pcons) to
determine the file dependencies, etc.).  I then take all the timing results and
make our slowest compilation machine have the power index of 1.  The rest of
the machines then get values of 1 or higher, depending on how many times faster
than the slowest machine they are.  In our case, our slowest Solaris box gets
the value 1, and our fastest (linux) box gets the value 15 -- as it compiles
files 15 times faster than our slowest machine.  (!)
   o It listens for "monitor" requests.  I have attached mon-host-server, which
is a management application.  It connects to the host-server and requests the
cpu availability information.  Then, it prints it out in a nice form.  So, I
can monitor how many machines are in use, how many are available, which
machines have been moved down to a lower tier because they are busy, etc.
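
To make the tier idea concrete, here is a minimal sketch of handing out the
fastest free CPU and taking it back again.  It is not the attached host-server
code.  The class name TierPool, its method names, and the choice to key each
tier directly by power index are assumptions made for illustration only.

```python
import random
from collections import defaultdict


class TierPool:
    """Free CPUs grouped into tiers, keyed here by power index (an assumption)."""

    def __init__(self):
        # power index -> list of hostnames, one entry per free CPU
        self.free = defaultdict(list)

    def add_host(self, host, ncpus, power_index):
        """Make every CPU of a host available in its tier."""
        tier = self.free[power_index]
        tier.extend([host] * ncpus)
        random.shuffle(tier)  # spread requests across equally fast hosts

    def acquire(self):
        """Hand out a CPU from the fastest non-empty tier, or None if all are busy."""
        for power_index in sorted(self.free, reverse=True):
            if self.free[power_index]:
                return power_index, self.free[power_index].pop()
        return None

    def release(self, power_index, host):
        """Return a CPU to its tier, e.g. when the requestor's connection closes."""
        self.free[power_index].append(host)
```

In this sketch a host with 2 cpus simply appears twice in its tier, so it can
be handed out to two compilations at once.  Removing a host (what remhost
asks for) and moving a busy host to a lower tier are left out.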

Next: how distcc comes into play:

In your Makefile/SConstruct/Construct file, you replace 

CC = gcc

with

CC = gethost distcc gcc

"gethost" is a python script that connects to the host-server, gets a host from
the server, puts it in the DISTCC_HOSTS environment variable, and then runs
"distcc gcc <args>".  Thus, distcc gets its list of hosts from the environment
variable.  What I like about this solution is that it required no changes to
distcc at all.
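
For readers who want the shape of gethost without opening the attachment, here
is a rough sketch of the wrapper idea.  The server address, the port, and the
one-line reply format are assumptions, not the real protocol; only the
DISTCC_HOSTS variable and the "connection stays open while the cpu is in use"
rule come from the description above.

```python
import os
import socket
import subprocess

# Hypothetical address of the host-server; the real one is site-specific.
HOST_SERVER = ("hostserver.example.com", 9000)


def build_env(host, base_env):
    """Copy the environment and point distcc at the single assigned host."""
    env = dict(base_env)
    env["DISTCC_HOSTS"] = host
    return env


def main(argv):
    """Run argv (e.g. ["distcc", "gcc", "-c", "foo.c"]) on an assigned host."""
    with socket.create_connection(HOST_SERVER) as conn:
        # Assumed reply format: one line containing the hostname.
        host = conn.makefile("r").readline().strip()
        # The connection is held open until the compile finishes, because
        # closing it is what tells the host-server the cpu is free again.
        return subprocess.call(argv, env=build_env(host, os.environ))
```

A script built on this would end with something like
sys.exit(main(sys.argv[1:])), so that the compiler's exit status reaches
make.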

Here are short explanations of the attached files:

o addhost and remhost: you can use these to dynamically add or remove
hosts from the compilation farm.

o gethost: explained above.

o loadavg: run this on each compilation host.  It is typically launched via
enable-host and killed via disable-host.

o mon-host-server: monitors the host-server.

o hosts-info: my hosts-info file.  The format is simple: "hostname"
"number-of-cpus", "power-index".  Note: my swbuild-linux1 machine has only 4
cpus, but it is SO much faster than the other machines that I have listed it
with 10.

o enable-host: calls addhost to add the host to the host-server.  Then launches
distccd's and runs loadavg.

o disable-host: calls remhost to remove the host from the host-server.  Then
kills the distccd's and loadavg programs.  This program has some problems, and
outputs some error messages.  But, it seems to work.

o watch-ssaver: this connects to the xscreensaver that may be running.  When
the screensaver comes on, it runs enable-host, thus making an unused desktop
available for use.  When the user comes back to his/her machine, it runs
disable-host, and the desktop machine is no longer available for compilations. 
Cool, huh?!
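
One way the screensaver hook could be written is sketched below.  It assumes
the events are read from "xscreensaver-command -watch", which prints lines
beginning with BLANK, LOCK, UNBLANK or RUN; the attached watch-ssaver may
connect to xscreensaver differently.  The function names are made up for this
example.

```python
import subprocess


def transitions(lines):
    """Yield "enable-host" or "disable-host" whenever the desktop changes state."""
    enabled = False
    for line in lines:
        parts = line.split()
        event = parts[0] if parts else ""
        if event in ("BLANK", "LOCK") and not enabled:
            enabled = True
            yield "enable-host"      # screensaver came on: desktop is idle
        elif event == "UNBLANK" and enabled:
            enabled = False
            yield "disable-host"     # user is back: stop offering this machine


def watch():
    """Follow xscreensaver events forever and run the matching script."""
    proc = subprocess.Popen(
        ["xscreensaver-command", "-watch"],
        stdout=subprocess.PIPE,
        text=True,
    )
    for script in transitions(proc.stdout):
        subprocess.call([script])
```

Tracking the enabled flag keeps a BLANK followed by a LOCK from running
enable-host twice.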

NOTE: of course, all code is AS IS and without warranty.  I should probably
have this in each file, but I don't...  :-)

BIG NOTE: I am running host-server on my Solaris Ultra 5 box -- a very slow
desktop machine.  But, it works fine.  The reason I did this is that I had to
change the max number of fd's that can be open by a process.  This is done by
adding

set rlim_fd_cur = 1024
set rlim_fd_max = 1024 

in the /etc/system file, and then rebooting.  (I'm not sure if both lines were
needed, but I added them both, and everything seems to work.)
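
Since every host that has been handed out holds a TCP connection open on the
host-server, the fd limit presumably has to cover the number of cpus that can
be in use at once.  A small check like the following, using Python's standard
resource module, can confirm what limit a process actually got.  The helper
names are invented for this example.

```python
import resource


def fd_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def enough_fds(needed, soft):
    """True if a soft limit allows at least `needed` open descriptors."""
    return soft == resource.RLIM_INFINITY or soft >= needed
```

For example, enough_fds(1024, fd_limits()[0]) tells you whether the current
soft limit already allows 1024 descriptors.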

NOTE: please, again, read the old posting I made about goals, etc.

So, in summary, here are some items you might do/research:

o rewrite the system in C/C++.
o investigate how to handle hosts when their load averages go up: how should
they be moved down in the tier system?  Or is there a better way than the tier
system?  Remember: the goal is always to produce the fastest compile, which
generally means using the fastest cpus that are available.
o investigate how to compute a machine's power index.  How much difference does
the OS make?  You could try compiling on a machine running solaris, and then
the same machine running linux (gentoo for sparc, e.g.).  What difference does
it make?
o investigate how much network bandwidth is used in the system.  Does it affect
a compilation's time?
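
As a starting point for the power-index and load-average items, here is a
small sketch.  The first function follows the normalization described earlier
(slowest machine gets 1, the others get "how many times faster").  The second
is only one possible heuristic for a loaded host, not what getTier() in
host-server does; its formula and both function names are assumptions.

```python
def power_indexes(seconds_by_host):
    """Map each host to slowest_time / its_time, so the slowest host gets 1.0."""
    slowest = max(seconds_by_host.values())
    return {host: slowest / secs for host, secs in seconds_by_host.items()}


def effective_power(power_index, load_avg, ncpus):
    """Hypothetical demotion rule: scale the index by the idle share of the cpus.

    A host whose load average equals or exceeds its cpu count is treated
    like the slowest machine (index 1.0).
    """
    busy = min(load_avg / ncpus, 1.0)
    return max(1.0, power_index * (1.0 - busy))
```

With timings of 150 seconds and 10 seconds for the same build,
power_indexes gives 1.0 and 15.0.  A 4-cpu host with index 15 and a load
average of 2.0 would then be treated as index 7.5 under this rule.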

Vic
