[distcc] Results: massively parallel builds, load balancing, etc.
Victor Norman
vtnpgh at yahoo.com
Mon Oct 11 14:15:45 GMT 2004
Oops: substitute "Scott Lystig Fritchie" for "<<<>>>" in the message. Sorry,
Scott.
Vic
--- Victor Norman <vtnpgh at yahoo.com> wrote:
> All,
>
> Over the last couple of weeks I've been testing distcc (with pcons, i.e.
> parallel cons) to see how it performs when multiple builds are done at once
> using a single, shared, heterogeneous compilation farm, with varying levels
> of parallelism (-j values), varying numbers of simultaneous builds, and
> varying numbers of hosts in the compilation farm. This posting contains my
> results. But, first, some info about my setup.
>
> o My tree contains 2480 .c/.cc files in 148 directories. 97 of those
> directories have 20 or fewer files each; the rest have more than 20, with 2
> directories having over 200 each, and one directory having 548 files in it.
>
> o My compilation farm has 40 CPUs in it, on 18 machines. The machines all
> run Solaris 2.x, where x is 6, 7, or 8. One machine has 8 CPUs, one has 4,
> and the rest have 2 or 1. They vary from pretty fast to pretty slow.
>
> o For one of the tests, I used only my fastest machines: 23 CPUs in 8
> machines.
>
> o I use gcc 2.95.2 -- a very old version of the compiler. Don't ask why we
> don't upgrade. It is a long boring story. :-(
>
> o I used distcc 2.16, without changes (not actually true, since we always
> call "gcc -x c++", so I changed distcc to allow that).
>
> o pcons spawns parallel builds per-library. In other words, it parallelizes
> the compilations of all files that will be collected into a library, before
> starting on the next library.
>
> o When I do a single old build with -j1 and NO distcc, the time to compile is
> 4:02:11 (4 hrs, 2 minutes, 11 seconds).
>
> OK: there is the setup. Now, some data. I've attached an Excel spreadsheet
> which contains some numbers and very informative graphs. If someone wants
> the data in a different format (that I can generate), please let me know.
> I'd be happy to do it, for the good of the cause.
>
> Explanations of the data/graphs:
>
> o "6 simultaneous" means that I ran 6 independent compiles, each from a
> different machine, each with their own DISTCC_DIR set, all using the same
> compilation farm.
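[The per-build isolation can be sketched in a few lines of Python. The /tmp path and function name here are my illustration, not Vic's setup; the post only says each build had its own DISTCC_DIR.]

```python
import os

def build_env(build_id, base_env=None):
    """Return an environment for one of several simultaneous builds.

    Each build gets a private DISTCC_DIR so distcc's per-build state
    (lock files, host list) does not collide with the other builds.
    The /tmp location is hypothetical.
    """
    env = dict(base_env if base_env is not None else os.environ)
    env["DISTCC_DIR"] = "/tmp/distcc-build-%d" % build_id
    return env

# Each of the six builds would then be started with its own environment,
# e.g. subprocess.Popen(["pcons", "-j20"], env=build_env(n)).
```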
>
> o "Parallelism" is the -j value for the build.
>
> o "1 at a time" means I ran only 1 compilation simultaneously, using the
> compilation farm.
>
> o "6 with only fastest hosts" means I ran the builds with only the fastest
> host
> set. Dan Kegel has been suggesting that I should just be rid of the slower
> build hosts. I wanted to test his opinion. :-)
>
> My conclusions from the top set of data:
>
> o all the builds are MUCH faster using parallelization than without: recall
> that with -j1, the time was about 4 hours. Thank you, Martin Pool, for
> distcc.
>
> o -j20 produced the fastest builds of the -j values that I tried. This
> result may counter some opinions that say that "massively parallel" builds
> (with -j > 10) do not produce any benefit. For me, it did. See the FAQ:
> http://distcc.samba.org/faq.html#dash-j for more info.
>
> o -j24 was slower than -j20 because more time was spent blocked waiting for
> CPUs to become free. This implies that it is good to have more hosts in the
> compilation farm.
>
> o Comparing the "6 with only fastest hosts" data to the "6 simultaneous"
> data, we see that if you are guaranteed to be the only person using the
> compilation farm, then indeed it is best to have only the fastest hosts in
> the farm. However, as soon as more than 1 compilation is run at once, it is
> better to have the slower hosts in the compilation farm too. This is
> because with only the fastest hosts in the farm, there is significant
> contention for CPUs with multiple simultaneous builds.
>
>
> Looking at the second set of data:
>
> o Having concluded that for me the -j20 value produced the best compilation
> times, I wanted to see if I could make multiple simultaneous builds
> cooperate in sharing the compilation farm CPUs, so that contention for CPUs
> would be minimized. The only way I could figure out how to do this was to
> use Scott Lystig Fritchie's suggestion, and build a single "host-server"
> that would keep track of the hosts in the compilation farm and hand them
> out to distcc runs. So, I built a system similar to what Scott described in
> his email, and for which he gave Tcl code. But, I've worked with Tcl for
> many years and am not fond of it. So, I rewrote it in my new favorite
> language, Python. My host server reads the "hosts" file and a "hosts-info"
> file to get the list of hosts, how many CPUs each has, and how fast each
> host is (what I call a "power index"). It also listens on 4 well-known TCP
> ports: 1) for reports from the hosts indicating their current load average;
> 2) for indications that a host is up or down (available or unavailable); 3)
> for requests by a host-server monitoring program, to which it responds with
> its current host database; and 4) for requests for hosts from compilation
> executions, to which it responds with the hostname of an available host (a
> host with an available CPU).
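[The host database behind those four ports might look roughly like this sketch. The class and method names, and the idea of tracking a per-host busy count, are my assumptions for illustration, not Vic's actual code:]

```python
class HostServer:
    """Sketch of the host database behind the four TCP ports.

    Hosts are handed out fastest-first, one slot per CPU.  All names
    and the power-index scale are illustrative assumptions.
    """

    def __init__(self, hosts):
        # hosts: list of (name, ncpus, power_index), as would be read
        # from the "hosts" and "hosts-info" files.
        self.hosts = {name: {"ncpus": n, "power": p, "busy": 0, "up": True}
                      for name, n, p in hosts}

    def set_up(self, name, up):          # port 2: host up/down reports
        self.hosts[name]["up"] = up

    def database(self):                  # port 3: monitoring requests
        return dict(self.hosts)

    def request_host(self):              # port 4: a compilation wants a CPU
        candidates = [(h["power"], name) for name, h in self.hosts.items()
                      if h["up"] and h["busy"] < h["ncpus"]]
        if not candidates:
            return ""                    # caller falls back to localhost
        _, best = max(candidates)        # fastest available host first
        self.hosts[best]["busy"] += 1
        return best

    def release_host(self, name):        # a compilation finished
        self.hosts[name]["busy"] -= 1
```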
>
> o Each compilation is "gethost.py distcc gcc -x c++ args args args".
> gethost.py connects to the host server and gets a hostname in response. It
> sets the hostname in DISTCC_HOSTS and calls "distcc gcc -x c++ args args
> args...". When the host server has no CPUs available, gethost.py will get ""
> back, and thus, distcc will run the compilation on localhost.
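[What the wrapper does with the host server's reply can be sketched like this. The function name and argument shapes are assumptions; only the command-line form above is from the post:]

```python
import os

def distcc_command(host, compile_args, base_env=None):
    """Sketch of gethost.py's core step: point DISTCC_HOSTS at the
    host granted by the host server and build the distcc command line.
    An empty reply means no remote CPU was free, so the compile runs
    on localhost.
    """
    env = dict(base_env if base_env is not None else os.environ)
    env["DISTCC_HOSTS"] = host if host else "localhost"
    argv = ["distcc", "gcc", "-x", "c++"] + list(compile_args)
    return env, argv

# gethost.py would then run the command, e.g. with
# subprocess.call(argv, env=env).
```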
>
> o The host server always hands out the fastest available compilation hosts
> (actually CPUs) first. The host server uses the load average messages from
> the compilation hosts to adjust the power of the CPU in its database. When
> a host is running at full capacity (load average / number of CPUs > 1.0),
> the host is avoided for a while, until the load average shows the machine
> is able to handle some more compilations again.
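[The capacity test is stated explicitly (load average / number of CPUs > 1.0); the derating of the power index is not, so the second function below is only one plausible formula, not the one Vic used:]

```python
def host_is_available(loadavg, ncpus):
    """Capacity test from the post: a host whose load average per CPU
    exceeds 1.0 is running flat out and is avoided until its load
    drops again."""
    return loadavg / ncpus <= 1.0

def adjusted_power(power_index, loadavg, ncpus):
    """One plausible derating (assumption): scale the static power
    index by the fraction of CPU capacity still free, so a loaded
    host sorts below an idle slower one."""
    free = max(0.0, 1.0 - loadavg / ncpus)
    return power_index * free
```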
>
> o I've found that my algorithm keeps the fastest hosts very busy -- with
> adjusted load averages hanging just below capacity for the most part. I'm
> very happy with that result. The algorithm also takes into account machines
> that are being used for other purposes -- as our compilation machines are
> generally available for people to use for running Xvnc, xterms, emacs, etc.
>
> OK: on to the results. I wanted to see how scalable my solution was. I'm
> thrilled with my early results:
>
> o You can see from the graph that the basic build, with one simultaneous
> compilation at -j20, takes about the same time as in the previous graph
> (~45 minutes vs. 46 minutes). So this confirms (in my mind, at least) that
> the data is pretty solid.
>
> o As I scale up from 1 simultaneous build to 6 simultaneous builds, the
> times go consistently up, as I expected. Again, this confirms that my
> testing setup is pretty reliable. And, good news: the times go up quite
> slowly. Yeah!
>
> o The scalability of the system seems really good to me: ~45 minutes when
> there is one person using the compilation farm vs. 57 minutes when there
> are 6 simultaneous builds. That is better than I expected.
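[As a back-of-the-envelope check, the quoted times translate into these speedups over the 4:02:11 -j1 baseline:]

```python
baseline_s = 4 * 3600 + 2 * 60 + 11      # 4:02:11 with -j1, no distcc
alone_s = 45 * 60                        # ~45 min, one -j20 build alone
shared_s = 57 * 60                       # ~57 min each, six -j20 builds at once

speedup_alone = baseline_s / alone_s     # roughly 5.4x
speedup_shared = baseline_s / shared_s   # roughly 4.2x per build, even with
                                         # five other builds sharing the farm
```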
>
> o The "number of times no machine is available" data seem to indicate that
> adding more machines to the compilation farm, especially when the farm is
> quite
> heavily loaded, would benefit build times. This would reduce the number of
> times that each CPU in the system is in use, and would reduce the number of
> localhost builds.
>
> o (Note: I did one build on the localhost, with -j10, and it took 1:18:43, so
> this gives you some feel for how fast the localhost is. (It has 4
> processors.))
>
> o I would like to try more scalability tests yet, with more than 6
> simultaneous builds happening. This might show us some more interesting
> data.
>
>
> I'd love to hear what others think of this data, and my conclusions. And, I
> will be posting my Python code as soon as I iron out some ugliness, etc. If
> you want it with its ugliness intact, please let me know.
>
> Vic
>
>
>
> ATTACHMENT part 2 application/vnd.ms-excel name=parallel-build-times.xls
> __
> distcc mailing list http://distcc.samba.org/
> To unsubscribe or change options:
> http://lists.samba.org/mailman/listinfo/distcc