[distcc] DISTCC on Scyld cluster

Marcio Teixeira mteixeira at numerica.us
Mon Oct 22 16:24:00 GMT 2007


Hi everyone,

I'm trying to find a good way to run "distcc" on a cluster that is running 
Scyld ClusterWare from Penguin Computing. This architecture consists of 
several compute nodes which are hidden from the external network behind 
a single master node which is responsible for managing a work queue and 
dispatching jobs to appropriate compute nodes. The master and the 
compute nodes are on a private network and can see each other, but the 
only external access is to the master node. The "proper" way to use the 
system is to submit jobs via the queuing system. I managed to come up 
with a job script that does just that: it submits a job which reserves 
several nodes, and when it gets scheduled, it runs "distccd" on the 
assigned nodes and then does a "distcc" compile on the master node. 
This works, but there are several disadvantages. First, it's not all 
that interactive. Submitting a compile job and having to wait some 
indeterminate amount of time for it to execute is sort of perverse... 
developers might as well compile on their own machines. Second, and I 
think this is the most frustrating problem, is that since "distcc" is 
running on the head node, the head node must have access to all the 
source code. Which means developers must upload their code to the head 
node, or put it on an NFS drive. Doing either of these defeats distcc's 
best feature, namely the transport protocol that it provides to allow 
you to compile stuff on your *own* desktop machine using local storage.
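In rough outline, the job script does something like this (the scheduler directives are just PBS-style placeholders, and the node file, network range, and job width are all illustrative, not the real values):

```shell
#!/bin/sh
# Sketch of the queue-based approach described above. Scheduler syntax,
# $PBS_NODEFILE, and the --allow range are assumptions for illustration.
#PBS -l nodes=8

# Start a compile daemon on each node the scheduler assigned to this job:
for node in $(sort -u "$PBS_NODEFILE"); do
    ssh "$node" distccd --daemon --allow 10.0.0.0/8
done

# Then compile on the master node, farming work out to those daemons:
DISTCC_HOSTS="$(sort -u "$PBS_NODEFILE" | tr '\n' ' ')" \
    make -j16 CC="distcc gcc"
```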

So, we've been searching for alternatives. Two ideas came up, but both 
are rather iffy, so I thought I would ping this group before spending a 
lot of time fiddling with them. The first idea was to "daisy-chain" 
distcc. The idea is that we would run "distccd" on each of the worker 
nodes (outside of the queuing system), and run another "distccd" on the 
head node. The daemon on the head node would accept connections from the 
outside world, and when it tried to run "gcc", it would really be 
running "distcc", which would forward the request to a "distccd" on a 
worker node. So, in the outside network, developers would run "distcc", 
but set their DISTCC_HOSTS to only one machine -- the head node of the 
cluster.
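Concretely, the head-node half of this would look something like distcc's usual masquerade trick, except that the masqueraded "gcc" is run by the outside-facing "distccd" rather than by a user (all paths, host names, and network ranges below are made up for illustration):

```shell
# On the head node: put a "gcc" in distccd's PATH that is really distcc,
# so compile requests accepted from outside get forwarded inward.
mkdir -p /usr/local/lib/distcc-chain
ln -sf "$(command -v distcc)" /usr/local/lib/distcc-chain/gcc

# The inner distcc forwards to the real compute nodes (names assumed):
export DISTCC_HOSTS="n0 n1 n2 n3"

# Outside-facing daemon, with the masquerade dir first in its PATH:
PATH=/usr/local/lib/distcc-chain:$PATH \
    distccd --daemon --allow 192.168.1.0/24
```

Developers on the outside would then need nothing more than DISTCC_HOSTS set to the head node's name.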

So that's the first idea. The second idea is similar in that the head 
node runs "distccd", and that developers have only that machine in their 
DISTCC_HOSTS, but now, when the "distccd" runs "gcc", it instead runs a 
wrapper script which calls "gcc" using Scyld ClusterWare "bpsh" wrapper, 
which starts the job on the head node, then migrates it to a compute 
node. My concern with this approach is that there is the possibility of 
there being a lot of overhead in migrating hundreds of small "gcc" tasks 
to the compute nodes one at a time.

So those are the two ideas. Neither seems ideal. Since I doubt 
anyone can comment on the second idea (it's likely something we'll have 
to try), my questions to the group are: 1) is there a third, better 
alternative someone has come up with, and 2) should I even attempt 
the "daisy-chaining" approach (would distcc be able to handle this, 
or would it get hopelessly confused)?

Any thoughts would be very much appreciated! Thank you!

-- Marcio
