[distcc] Results: massively parallel builds, load balancing, etc.

Victor Norman vtnpgh at yahoo.com
Mon Oct 11 14:12:03 GMT 2004


All,

Over the last couple of weeks I've been testing distcc (with pcons (parallel
cons)) to see how it performs when multiple builds are done at once using a
single, shared, heterogeneous compilation farm, with varying levels of
parallelism (-j values) and varying numbers of simultaneous builds, and varying
numbers of hosts in the compilation farm.  This posting contains my results. 
But, first, some info about my setup.

o My tree contains 2480 .c/.cc files, in 148 directories.  97 of the
directories have 20 or fewer files each; the rest have more than 20, with 2
directories having over 200 each, and one directory having 548 files in it.

o My compilation farm has 40 CPUs in it, on 18 machines.  The machines all run
Solaris 2.x, where x is 6, 7, or 8.  One machine has 8 CPUs, one has 4, and the
rest have 2 or 1.  They vary from pretty fast to pretty slow.

o For one of the tests, I used only my fastest machines: 23 CPUs in 8 machines.

o I use gcc 2.95.2 -- a very old version of the compiler.  Don't ask why we
don't upgrade.  It is a long boring story.  :-(

o I used distcc 2.16, essentially without changes.  (The one exception: we
always call "gcc -x c++", so I changed distcc to allow that.)

o pcons spawns parallel builds per-library.  In other words, it parallelizes
the compilations of all files that will be collected into a library, before
starting on the next library.

o When I do a single plain build with -j1 and NO distcc, the time to compile
is 4:02:11 (4 hrs, 2 minutes, 11 seconds).

OK: there is the setup.  Now, some data.  I've attached an Excel spreadsheet
which contains some numbers and very informative graphs.  If someone wants the
data in a different format (that I can generate), please let me know.  I'd be
happy to do it, for the good of the cause.

Explanations of the data/graphs:

o "6 simultaneous" means that I ran 6 independent compiles, each from a
different machine, each with their own DISTCC_DIR set, all using the same
compilation farm.

o "Parallelism" is the -j value for the build.

o "1 at a time" means I ran only 1 compilation simultaneously, using the
compilation farm.

o "6 with only fastest hosts" means I ran the builds with only the fastest host
set.  Dan Kegel has been suggesting that I should just be rid of the slower
build hosts.  I wanted to test his opinion.  :-)

My conclusions from the top set of data:

o all the builds are MUCH faster using parallelization than without: recall
that with -j1, the time was about 4 hours.  Thank you, Martin Pool, for distcc.

o -j20 produced the fastest builds, of the -j values that I tried.  This result
may counter some opinions that say that "massively parallel" builds (with -j >
10) do not produce any benefit.  For me, it did.  See the FAQ:
http://distcc.samba.org/faq.html#dash-j for more info.

o -j24 was slower than -j20 because more time was spent blocked waiting for
CPUs to become free.  This implies that it is good to have more hosts in the
compilation farm.

o Comparing the "6 with only fastest hosts" to "6 simultaneously" data, we see
that if you are guaranteed to be the only person using the compilation farm,
then indeed it is best to have only the fastest hosts in the farm.  However, as
soon as more than 1 compilation is run at once, it is better to have the slower
hosts in the compilation farm.  This is because with only the fastest hosts in
the farm, there is significant contention for CPUs with multiple simultaneous
builds.


Looking at the second set of data:

o Having concluded that, for me, -j20 produced the best compilation times, I
wanted to see if I could make multiple simultaneous builds cooperate in
sharing the compilation farm's CPUs, so that contention for CPUs would be
minimized.  The only way I could figure out how to do this was to follow
<<<>>>>'s suggestion and build a single "host-server" that keeps track of the
hosts in the compilation farm and hands them out to distcc runs.  So, I built
a system similar to what <<<>>> described in his email, and for which he gave
Tcl code.  But I've worked with Tcl for many years and am not fond of it, so I
rewrote it in my new favorite language, Python.  My host server reads the
"hosts" file and a "hosts-info" file to get the list of hosts, how many CPUs
each has, and how fast each host is (what I call a "power index").  It also
listens on 4 well-known TCP ports: 1) for reports from the hosts indicating
their current load average; 2) for indications that a host is up or down
(available or unavailable); 3) for requests from a host-server monitoring
program, to which it replies with its current host database; and 4) for
requests for hosts from compilation runs, to which it replies with the
hostname of an available host (a host with a free CPU).
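To make that concrete, here is a stripped-down sketch of just the host-request
port (port 4 above).  This is NOT the real server (I'll post that code later):
the port number and the "hosts-info" format (one "hostname ncpus power" line
per host) are placeholders, and it doesn't handle the other three ports, load
averages, or returning a CPU when a build finishes.

    # Sketch of the host-request port only -- placeholders, not the real server.
    import socketserver

    HOSTS_INFO = "hosts-info"    # placeholder path
    REQUEST_PORT = 3634          # placeholder port

    def load_cpus(path):
        """Return one (power, hostname) entry per CPU, fastest first."""
        cpus = []
        with open(path) as f:
            for line in f:
                host, ncpus, power = line.split()
                cpus.extend([(float(power), host)] * int(ncpus))
        cpus.sort(reverse=True)
        return cpus

    class HostRequest(socketserver.StreamRequestHandler):
        def handle(self):
            # Hand out the fastest free CPU; "" tells the caller to build locally.
            free = self.server.free_cpus
            host = free.pop(0)[1] if free else ""
            self.wfile.write((host + "\n").encode())

    if __name__ == "__main__":
        server = socketserver.TCPServer(("", REQUEST_PORT), HostRequest)
        server.free_cpus = load_cpus(HOSTS_INFO)
        server.serve_forever()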

o Each compilation is run as "gethost.py distcc gcc -x c++ args args args".
gethost.py connects to the host server and gets a hostname in response.  It
puts that hostname in DISTCC_HOSTS and calls "distcc gcc -x c++ args args
args ...".  When the host server has no CPUs available, gethost.py gets ""
back, and distcc then runs the compilation on localhost.
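For the curious, gethost.py boils down to something like the sketch below.
The server name "buildmaster" and the port number are placeholders, and the
real script (which I'll post) is longer and uglier.

    #!/usr/bin/env python
    # Simplified sketch of gethost.py; "buildmaster" and the port are placeholders.
    import os, socket, sys

    HOST_SERVER = ("buildmaster", 3634)    # where the host server listens

    def get_host():
        """Ask the host server for a free compilation host; "" if none."""
        try:
            conn = socket.create_connection(HOST_SERVER, timeout=5)
            host = conn.makefile().readline().strip()
            conn.close()
            return host
        except OSError:
            return ""          # can't reach the server: just build on localhost

    if __name__ == "__main__":
        # "" here makes distcc run the compilation on localhost, as noted above.
        os.environ["DISTCC_HOSTS"] = get_host()
        os.execvp(sys.argv[1], sys.argv[1:])    # e.g. distcc gcc -x c++ args ...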

o The host server always hands out the fastest available compilation hosts
(actually CPUs) first.  It uses the load-average reports from the compilation
hosts to adjust the power index of each host's CPUs in its database.  When a
host is running at full capacity (load average / number of CPUs > 1.0), that
host is avoided for a while, until its load average shows the machine is able
to handle some more compilations again.
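The selection rule is roughly the following (the field names are made up for
illustration; the real code works from the host database and load reports
described above):

    # Rough shape of the selection rule; field names are made up for illustration.
    def pick_host(hosts):
        """Pick the fastest host that has a free CPU and isn't saturated."""
        candidates = [h for h in hosts
                      if h["free_cpus"] > 0
                      and h["loadavg"] / h["ncpus"] <= 1.0]  # skip saturated hosts
        if not candidates:
            return ""          # nothing available: the build runs on localhost
        best = max(candidates, key=lambda h: h["power"])
        best["free_cpus"] -= 1
        return best["name"]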

o I've found that my algorithm keeps the fastest hosts very busy -- with
adjusted load averages hanging just below capacity for the most part.  I'm
very happy with that result.  The algorithm also takes into account machines
that are being used for other purposes -- our compilation machines are
generally available for people to use for running Xvnc, xterms, emacs, etc.

OK: on to the results.  I wanted to see how scalable my solution was.  I'm
thrilled with my early results:

o you can see from the graph that the basic build, with one simultaneous
compilation at -j20, takes about the same time as in the previous graph (~45
minutes vs. 46 minutes), so this confirms (in my mind, at least) that the data
is pretty solid.

o As I scale up from 1 simultaneous build to 6 simultaneous builds, the times
go up consistently, as I expected.  Again, this confirms that my testing setup
is pretty reliable.  And, good news: the times go up quite slowly.  Yeah!

o The scalability of the system seems really good to me.  ~45 minutes when
there is one person using the compilation farm vs. 57 minutes when there are 6
simultaneous builds.  That is better than I expected.

o The "number of times no machine is available" data seem to indicate that
adding more machines to the compilation farm, especially when the farm is quite
heavily loaded, would benefit build times.  This would reduce the number of
times that each CPU in the system is in use, and would reduce the number of
localhost builds.

o (Note: I did one build on the localhost, with -j10, and it took 1:18:43, so
this gives you some feel for how fast the localhost is.  (It has 4
processors.))

o I would still like to try more scalability tests, with more than 6
simultaneous builds happening.  That might show us some more interesting data.


I'd love to hear what others think of this data, and my conclusions.  And I
will be posting my Python code as soon as I iron out some ugliness, etc.  If
you want it with its ugliness intact, please let me know.

Vic


		
-------------- next part --------------
A non-text attachment was scrubbed...
Name: parallel-build-times.xls
Type: application/vnd.ms-excel
Size: 19456 bytes
Desc: parallel-build-times.xls
Url : http://lists.samba.org/archive/distcc/attachments/20041011/9bb186c1/parallel-build-times.xls

