[distcc] Results: massively parallel builds, load balancing, etc.
Victor Norman
vtnpgh at yahoo.com
Mon Oct 11 14:15:45 GMT 2004
Oops: substitute "Scott Lystig Fritchie" for "<<<>>>" in the message. Sorry,
Scott.
Vic
--- Victor Norman <vtnpgh at yahoo.com> wrote:
> All,
>
> Over the last couple of weeks I've been testing distcc (with pcons, i.e.
> parallel cons) to see how it performs when multiple builds are done at once
> using a single, shared, heterogeneous compilation farm, with varying levels
> of parallelism (-j values), varying numbers of simultaneous builds, and
> varying numbers of hosts in the compilation farm. This posting contains my
> results. But, first, some info about my setup.
>
> o My tree contains 2480 .c/.cc files in 148 directories. 97 of those
> directories have 20 or fewer files each; the rest have more than 20, with 2
> directories having over 200 each, and one directory having 548 files in it.
>
> o My compilation farm has 40 CPUs in it, on 18 machines. The machines all
> run Solaris 2.x, where x is 6, 7, or 8. One machine has 8 CPUs, one has 4,
> and the rest have 2 or 1. They vary from pretty fast to pretty slow.
>
> o For one of the tests, I used only my fastest machines: 23 CPUs in 8
> machines.
>
> o I use gcc 2.95.2 -- a very old version of the compiler. Don't ask why we
> don't upgrade. It is a long boring story. :-(
>
> o I used distcc 2.16, without changes (not actually true, since we always
> call "gcc -x c++", so I changed distcc to allow that).
>
> o pcons spawns parallel builds per-library. In other words, it parallelizes
> the compilations of all files that will be collected into a library, before
> starting on the next library.
>
> o When I do a single old build with -j1 and NO distcc, the time to compile is
> 4:02:11 (4 hrs, 2 minutes, 11 seconds).
>
> OK: there is the setup. Now, some data. I've attached an Excel spreadsheet
> which contains some numbers and very informative graphs. If someone wants
> the data in a different format (that I can generate), please let me know.
> I'd be happy to do it, for the good of the cause.
>
> Explanations of the data/graphs:
>
> o "6 simultaneous" means that I ran 6 independent compiles, each from a
> different machine, each with their own DISTCC_DIR set, all using the same
> compilation farm.
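[The per-build isolation can be sketched in a few lines of Python. The /tmp path and function name here are my illustration, not Vic's setup; the post only says each build had its own DISTCC_DIR.]

```python
import os

def build_env(build_id, base_env=None):
    """Return an environment for one of several simultaneous builds.

    Each build gets a private DISTCC_DIR so distcc's per-build state
    (lock files, host list) does not collide with the other builds.
    The /tmp location is hypothetical.
    """
    env = dict(base_env if base_env is not None else os.environ)
    env["DISTCC_DIR"] = "/tmp/distcc-build-%d" % build_id
    return env

# Each of the six builds would then be started with its own environment,
# e.g. subprocess.Popen(["pcons", "-j20"], env=build_env(n)).
```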
>
> o "Parallelism" is the -j value for the build.
>
> o "1 at a time" means I ran only 1 compilation simultaneously, using the
> compilation farm.
>
> o "6 with only fastest hosts" means I ran the builds with only the fastest
> host
> set. Dan Kegel has been suggesting that I should just be rid of the slower
> build hosts. I wanted to test his opinion. :-)
>
> My conclusions from the top set of data:
>
> o all the builds are MUCH faster using parallelization than without: recall
> that with -j1, the time was about 4 hours. Thank you, Martin Pool, for
> distcc.
>
> o -j20 produced the fastest builds of the -j values that I tried. This
> result may counter some opinions that say that "massively parallel" builds
> (with -j > 10) do not produce any benefit. For me, it did. See the FAQ:
> http://distcc.samba.org/faq.html#dash-j for more info.
>
> o -j24 was slower than -j20 because more time was spent blocked waiting for
> CPUs to become free. This implies that it is good to have more hosts in the
> compilation farm.
>
> o Comparing the "6 with only fastest hosts" data to the "6 simultaneous"
> data, we see that if you are guaranteed to be the only person using the
> compilation farm, then indeed it is best to have only the fastest hosts in
> the farm. However, as soon as more than 1 compilation is run at once, it is
> better to have the slower hosts in the compilation farm too. This is
> because with only the fastest hosts in the farm, there is significant
> contention for CPUs with multiple simultaneous builds.
>
>
> Looking at the second set of data:
>
> o Having concluded that for me the -j20 value produced the best compilation
> times, I wanted to see if I could make multiple simultaneous builds
> cooperate in sharing the compilation farm CPUs, so that contention for CPUs
> would be minimized. The only way I could figure out how to do this was to
> use Scott Lystig Fritchie's suggestion, and build a single "host-server"
> that would keep track of the hosts in the compilation farm and hand them
> out to distcc runs. So, I built a system similar to what Scott described in
> his email, and for which he gave Tcl code. But, I've worked with Tcl for
> many years and am not fond of it. So, I rewrote it in my new favorite
> language, Python. My host server reads the "hosts" file and a "hosts-info"
> file to get the list of hosts, how many CPUs each has, and how fast each
> host is (what I call a "power index"). It also listens on 4 well-known TCP
> ports: 1) for reports from the hosts indicating their current load average;
> 2) for indications that a host is up or down (available or unavailable); 3)
> for requests by a host-server monitoring program, to which it responds with
> its current host database; and 4) for requests for hosts from compilation
> executions, to which it responds with the hostname of an available host (a
> host with an available CPU).
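[The host database behind those four ports might look roughly like this sketch. The class and method names, and the idea of tracking a per-host busy count, are my assumptions for illustration, not Vic's actual code:]

```python
class HostServer:
    """Sketch of the host database behind the four TCP ports.

    Hosts are handed out fastest-first, one slot per CPU.  All names
    and the power-index scale are illustrative assumptions.
    """

    def __init__(self, hosts):
        # hosts: list of (name, ncpus, power_index), as would be read
        # from the "hosts" and "hosts-info" files.
        self.hosts = {name: {"ncpus": n, "power": p, "busy": 0, "up": True}
                      for name, n, p in hosts}

    def set_up(self, name, up):          # port 2: host up/down reports
        self.hosts[name]["up"] = up

    def database(self):                  # port 3: monitoring requests
        return dict(self.hosts)

    def request_host(self):              # port 4: a compilation wants a CPU
        candidates = [(h["power"], name) for name, h in self.hosts.items()
                      if h["up"] and h["busy"] < h["ncpus"]]
        if not candidates:
            return ""                    # caller falls back to localhost
        _, best = max(candidates)        # fastest available host first
        self.hosts[best]["busy"] += 1
        return best

    def release_host(self, name):        # a compilation finished
        self.hosts[name]["busy"] -= 1
```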
>
> o Each compilation is "gethost.py distcc gcc -x c++ args args args".
> gethost.py connects to the host server and gets a hostname in response. It
> sets the hostname in DISTCC_HOSTS and calls "distcc gcc -x c++ args args
> args...". When the host server has no CPUs available, gethost.py will get ""
> back, and thus, distcc will run the compilation on localhost.
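[What the wrapper does with the host server's reply can be sketched like this. The function name and argument shapes are assumptions; only the command-line form above is from the post:]

```python
import os

def distcc_command(host, compile_args, base_env=None):
    """Sketch of gethost.py's core step: point DISTCC_HOSTS at the
    host granted by the host server and build the distcc command line.
    An empty reply means no remote CPU was free, so the compile runs
    on localhost.
    """
    env = dict(base_env if base_env is not None else os.environ)
    env["DISTCC_HOSTS"] = host if host else "localhost"
    argv = ["distcc", "gcc", "-x", "c++"] + list(compile_args)
    return env, argv

# gethost.py would then run the command, e.g. with
# subprocess.call(argv, env=env).
```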
>
> o The host server always hands out the fastest available compilation hosts
> (actually CPUs) first. The host server uses the load average messages from
> the compilation hosts to adjust the power of the CPU in its database. When
> a host is running at full capacity (load average / number of CPUs > 1.0),
> the host is avoided for a while, until the load average shows the machine
> is able to handle some more compilations again.
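[The capacity test is stated explicitly (load average / number of CPUs > 1.0); the derating of the power index is not, so the second function below is only one plausible formula, not the one Vic used:]

```python
def host_is_available(loadavg, ncpus):
    """Capacity test from the post: a host whose load average per CPU
    exceeds 1.0 is running flat out and is avoided until its load
    drops again."""
    return loadavg / ncpus <= 1.0

def adjusted_power(power_index, loadavg, ncpus):
    """One plausible derating (assumption): scale the static power
    index by the fraction of CPU capacity still free, so a loaded
    host sorts below an idle slower one."""
    free = max(0.0, 1.0 - loadavg / ncpus)
    return power_index * free
```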
>
> o I've found that my algorithm keeps the fastest hosts very busy -- with
> adjusted load averages hanging just below capacity for the most part. I'm
> very happy with that result. The algorithm also takes into account machines
> that are being used for other purposes -- as our compilation machines are
> generally available for people to use for running Xvnc, xterms, emacs, etc.
>
> OK: on to the results. I wanted to see how scalable my solution was. I'm
> thrilled with my early results:
>
> o You can see from the graph that the basic build, with one simultaneous
> compilation at -j20, takes about the same time as in the previous graph
> (~45 minutes vs. 46 minutes). So this confirms (in my mind, at least) that
> the data is pretty solid.
>
> o As I scale up from 1 simultaneous build to 6 simultaneous builds, the
> times go consistently up, as I expected. Again, this confirms that my
> testing setup is pretty reliable. And, good news: the times go up quite
> slowly. Yeah!
>
> o The scalability of the system seems really good to me: ~45 minutes when
> there is one person using the compilation farm vs. 57 minutes when there
> are 6 simultaneous builds. That is better than I expected.
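[As a back-of-the-envelope check, the quoted times translate into these speedups over the 4:02:11 -j1 baseline:]

```python
baseline_s = 4 * 3600 + 2 * 60 + 11      # 4:02:11 with -j1, no distcc
alone_s = 45 * 60                        # ~45 min, one -j20 build alone
shared_s = 57 * 60                       # ~57 min each, six -j20 builds at once

speedup_alone = baseline_s / alone_s     # roughly 5.4x
speedup_shared = baseline_s / shared_s   # roughly 4.2x per build, even with
                                         # five other builds sharing the farm
```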
>
> o The "number of times no machine is available" data seem to indicate that
> adding more machines to the compilation farm, especially when the farm is
> quite
> heavily loaded, would benefit build times. This would reduce the number of
> times that each CPU in the system is in use, and would reduce the number of
> localhost builds.
>
> o (Note: I did one build on the localhost, with -j10, and it took 1:18:43, so
> this gives you some feel for how fast the localhost is. (It has 4
> processors.))
>
> o I would like to try more scalability tests yet, with more than 6
> simultaneous builds happening. This might show us some more interesting
> data.
>
>
> I'd love to hear what others think of this data, and my conclusions. And, I
> will be posting my Python code as soon as I iron out some ugliness, etc. If
> you want it with its ugliness intact, please let me know.
>
> Vic
>
>
>
> ATTACHMENT part 2 application/vnd.ms-excel name=parallel-build-times.xls
> __
> distcc mailing list http://distcc.samba.org/
> To unsubscribe or change options:
> http://lists.samba.org/mailman/listinfo/distcc