[distcc] distcc over slow net links

Martin Pool mbp at sourcefrog.net
Mon Aug 25 02:31:42 GMT 2003


On Sun, 24 Aug 2003 16:01:27 +0200
Timothee Besset <ttimo at idsoftware.com> wrote:

> The transfer issue is not a big deal when I rebuild everything,
> obviously remote doesn't have time to do as much as the local stuff,
> but it still gives me a non-negligible speedup.

That's a very interesting suggestion.  Thanks. 

Could you please post some numbers on 

 - time to build whole tree with only nearby machines
 - ditto with all machines
 - time for one file on a nearby machine
 - ditto with a faraway machine

Just so we know roughly what ratios we're talking about?

As with many scheduling problems, we have the constraint that we do not
know what work will arrive in the future.

Cancelling or ignoring a remote job and running it somewhere else is a
novel way to cope with having made a bad decision.  I hadn't thought of
that before.  Once we've decided that running a job on a remote machine
was a bad idea, cancelling it and running it locally is a relatively
simple matter.

(Though it's not quite straightforward; I'm not completely sure the
extra complexity would be worthwhile, but we'll see what the numbers say.)

Let's assume that we can estimate from historical data how long it would
take to run a job on each machine.  It might be good to split out
transit and processing time.
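
For the sake of argument, the per-host record might look something like
the sketch below; the type and field names are invented for
illustration, not taken from the distcc source.

    /* Invented for illustration; not from the distcc source. */
    struct dcc_host_estimate {
        double avg_transit_secs;   /* moving average: shipping source + object */
        double avg_compile_secs;   /* moving average: time in the compiler itself */
    };

    /* Rough prediction of wall-clock time for a fresh job on this host,
       assuming the historical averages are still representative. */
    static double dcc_estimate_job_secs(const struct dcc_host_estimate *h)
    {
        return h->avg_transit_secs + h->avg_compile_secs;
    }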

I am not really sure that assumption holds, though, because it is hard
to have really up-to-date information on the load of many remote
machines.  If somebody starts heavily using their workstation, it might
suddenly go from being a very good choice to a very bad one...

So it seems like the main question is: how and when can we make that
decision?  It's all about opportunity costs, and we should ignore the
sunk costs.  The client needs to ask itself whether re-running the job
locally or on a nearby machine will be quicker than waiting for it to
complete remotely.  We know how long we expect the remote machine to
take and how long the job has been running, and can therefore estimate
the time left to complete.  We can also estimate the time to complete
on any other machine; if any of those estimates is lower, we ought to
reschedule the job.
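
In rough C, reusing the hypothetical per-host estimate from above, the
check might look something like this; it is only a sketch, and all the
names are made up:

    /* Sketch: should we abandon the remote job and restart it on `other`?
       `elapsed_secs` is how long the remote job has been running so far. */
    static int dcc_should_reschedule(const struct dcc_host_estimate *remote,
                                     const struct dcc_host_estimate *other,
                                     double elapsed_secs)
    {
        double remote_remaining = dcc_estimate_job_secs(remote) - elapsed_secs;

        if (remote_remaining < 0.0)
            remote_remaining = 0.0;   /* overdue: expect the result any moment */

        /* The time already spent is a sunk cost; compare only what is
           still to come. */
        return dcc_estimate_job_secs(other) < remote_remaining;
    }

A real version would probably want to charge the `other` side for
re-sending the preprocessed source, and to widen the estimate when the
remote job is overdue rather than trusting it blindly.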

However, running the job somewhere else will make that other machine
unavailable for jobs that might arrive in the future, so we pay an
opportunity cost.  Unfortunately distcc has no knowledge of what jobs
Make may give us in the future.  We may wind down towards a single task
as we approach a big link phase, or as we finish a directory in a
recursive build.  In that case we start to care much more about latency
than throughput.

However, we might also drop down to a small number of compiler jobs
when, for example, Make is doing non-cc work.  In that case it might be
better to let the remote compile finish slowly than to use up local
cycles that could be put to other uses.

> What I'm thinking, is that once local hosts are starved, distcc should
> find out that there is stuff running on slow hosts, and dupe the
> compile work on the local hosts, sending back whatever finishes first.

At the moment, when distcc uses the term "starved" it means that jobs
cannot get a CPU to run on, rather than that hosts cannot get jobs to
run.  So that is the opposite case: we have plenty of work and want to
spread it across as many machines as possible.

To fit in with the way distcc is invoked, I think we would have to have
the distcc client get "impatient" with a remote server and decide to
run the job locally instead.  There is no per-client scheduler process
that could move the task from outside.  (The process needs to want to
be moved.)
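
Concretely I imagine something like the sketch below inside the client:
wait on the remote connection with a deadline derived from the
estimate, and fall back to compiling locally if it passes.  The two
helpers declared at the top are hypothetical stand-ins, not existing
distcc functions.

    #include <sys/select.h>
    #include <unistd.h>

    /* Hypothetical stand-ins for what the client already knows how to do. */
    int dcc_collect_remote_result(int net_fd);
    int dcc_run_job_locally(char **argv);

    /* Sketch: wait up to `deadline_secs` for the remote result, then
       lose patience, drop the connection and compile locally. */
    static int dcc_wait_or_run_locally(int net_fd, double deadline_secs,
                                       char **argv)
    {
        fd_set readfds;
        struct timeval tv;

        FD_ZERO(&readfds);
        FD_SET(net_fd, &readfds);
        tv.tv_sec  = (long) deadline_secs;
        tv.tv_usec = (long) ((deadline_secs - tv.tv_sec) * 1e6);

        if (select(net_fd + 1, &readfds, NULL, NULL, &tv) > 0)
            return dcc_collect_remote_result(net_fd);

        close(net_fd);                    /* abandon the remote job */
        return dcc_run_job_locally(argv);
    }

A real version would also have to cope with select() being interrupted,
and with the race where the remote result arrives just as we give up;
that is part of the complexity I'm not sure is worthwhile.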

Another approach, perhaps more elegant, is for the client simply to
make a better decision in the first place.  At the moment it starts off
by trying to put one task on each machine, but if some of those
machines are much slower that may be a poor choice.  We might want it
to use a remote machine only when all the nearby ones are fully loaded.
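
As a sketch of that policy, with a made-up host record (distcc does not
currently track "nearby" or per-host slot counts like this):

    #include <stddef.h>

    /* Invented for illustration. */
    struct dcc_hostinfo {
        const char *name;
        int nearby;        /* 1 if on the fast local network */
        int slots_busy;
        int slots_total;
    };

    /* Prefer a nearby host with a free slot; fall back to a faraway
       host only when every nearby host is fully loaded. */
    static struct dcc_hostinfo *dcc_pick_host(struct dcc_hostinfo *hosts, int n)
    {
        struct dcc_hostinfo *faraway = NULL;
        int i;

        for (i = 0; i < n; i++) {
            if (hosts[i].slots_busy >= hosts[i].slots_total)
                continue;                  /* fully loaded, skip */
            if (hosts[i].nearby)
                return &hosts[i];          /* nearby with a free slot: use it */
            if (faraway == NULL)
                faraway = &hosts[i];       /* remember one free faraway host */
        }
        return faraway;   /* NULL means everything is fully loaded */
    }

That keeps the faraway machines as overflow capacity rather than
first-class peers, which seems closer to what you want on a slow link.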

-- 
Martin


