[distcc] distcc over slow net links

Timothee Besset ttimo at idsoftware.com
Mon Aug 25 08:53:53 GMT 2003


I'll need a little time before I can give you some numbers. I've only
set up distccd on one of the remote machines, and I disabled it after I
saw that issue on small builds. I'll go over it again and get some
numbers.

My general idea was to also run the compile on a closer, faster machine,
and to use whichever build is fully done first (compile and transfer).
We use two compile slots instead of one, which is an opportunity cost.
But since I'm the only Linux guy in the company, I have those machines
to myself anyway. They have been nagging me about IncrediBuild on win32
for some time ... I have more CPU power, I think ... but gcc is way
slower than VC7 ... so I'm still losing :-).
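
Here is a rough sketch of what I mean, in case that helps. It's only a
toy, not anything in distcc itself; the host names and file names are
invented, and it just races two plain "distcc gcc" invocations, each
pinned to one host via DISTCC_HOSTS:

/* Toy sketch of the "use whichever finishes first" idea -- not distcc
 * code.  Start the same compile on two hosts (each writing its own
 * output file), keep the winner, kill the loser. */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static pid_t spawn_compile(const char *host, const char *src, const char *obj)
{
    pid_t pid = fork();
    if (pid == 0) {
        /* Restricting DISTCC_HOSTS to one host pins the job there. */
        setenv("DISTCC_HOSTS", host, 1);
        execlp("distcc", "distcc", "gcc", "-c", src, "-o", obj, (char *)NULL);
        _exit(127);
    }
    return pid;
}

int main(void)
{
    pid_t near = spawn_compile("lan-box", "a.cpp", "a.near.o");
    pid_t far  = spawn_compile("wan-box", "a.cpp", "a.far.o");

    int status;
    pid_t winner = wait(&status);             /* whoever finishes first */
    pid_t loser  = (winner == near) ? far : near;

    kill(loser, SIGTERM);                     /* discard the slower build */
    waitpid(loser, NULL, 0);

    printf("kept %s\n", winner == near ? "a.near.o" : "a.far.o");
    return WIFEXITED(status) ? WEXITSTATUS(status) : 1;
}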

TTimo

PS: By the way, IncrediBuild has a very nice monitoring view:

box 1 | a.cpp | d.cpp | ... ->
box 2 | b.cpp | f.cpp | ... ->
box 3 | c.cpp | e.cpp | ... ->
...   |

On Mon, 25 Aug 2003 12:31:42 +1000
Martin Pool <mbp at sourcefrog.net> wrote:

> On Sun, 24 Aug 2003 16:01:27 +0200
> Timothee Besset <ttimo at idsoftware.com> wrote:
> 
> > The transfer issue is not a big deal when I rebuild everything;
> > obviously the remote machines don't get to do as much as the local
> > ones, but it still gives me a non-negligible speedup.
> 
> That's a very interesting suggestion.  Thanks. 
> 
> Could you please post some numbers on 
> 
>  - time to build whole tree with only nearby machines
>  - ditto with all machines
>  - time for one file on a nearby machine
>  - ditto with a faraway machine
> 
> Just so we know roughly what ratios we're talking about?
> 
> As with many scheduling problems, we have the constraint that we do not
> know what work will arrive in the future.
> 
> Cancelling or ignoring a remote job and running it somewhere else is a
> novel way to cope with having made a bad decision.  I hadn't thought of
> that before.  Once we've decided that running a job on a remote machine
> was a bad idea, cancelling it and running it locally is a relatively
> simple matter.
> 
> (Though not entirely straightforward; I'm not completely sure that the
> complexity would be worthwhile, but we'll see what the numbers say.)
> 
> Let's assume that we can estimate from historical data how long it would
> take to run a job on each machine.  It might be good to split out
> transit and processing time.
> 
> I am not really sure that this is true, because it is hard to have
> really up-to-date information on the load of many remote machines.  If
> somebody starts heavily using their workstation, it might suddenly go
> from being a very good to a very bad choice...
> 
> So it seems like the main question is, how and when can we make that
> decision?  It's all about opportunity costs.  We should ignore the sunk
> costs.  The client needs to ask itself, is re-running the job locally or
> on a nearby machine going to be quicker than waiting for it to complete
> remotely?  We know how long we expect the remote machine to take and
> how long it has been running, and therefore can estimate the time to
> complete.  We can also estimate the time to complete on any other
> machine; if any of them is lower, then we ought to reschedule it.
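
To make that comparison concrete, a back-of-envelope check might look
something like this (only a sketch, not distcc code; the numbers and
names are invented):

/* Reschedule when some other machine is expected to finish the job
 * sooner than the time the current remote machine still needs.  The
 * time already spent remotely is sunk cost and stays out of it. */
#include <stdio.h>

static int should_reschedule(double expected_remote_total, double elapsed,
                             double best_other_estimate)
{
    double remaining = expected_remote_total - elapsed;
    return best_other_estimate < remaining;
}

int main(void)
{
    /* faraway box: 30s expected, 5s elapsed -> 25s left;
     * a local slot would take about 8s -> worth pulling the job back */
    double remote_expected = 30.0, elapsed = 5.0, local_estimate = 8.0;

    if (should_reschedule(remote_expected, elapsed, local_estimate))
        printf("re-run locally (est. %.0fs vs %.0fs remaining)\n",
               local_estimate, remote_expected - elapsed);
    else
        printf("keep waiting for the remote result\n");
    return 0;
}
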
> 
> However, running the job somewhere else will make that other machine
> unavailable for jobs that might arrive in the future, so we pay an
> opportunity cost.  distcc unfortunately doesn't have knowledge about
> what jobs Make may give to us in the future.  It can be the case that
> we wind down towards one task as we approach a big link phase, or
> finish a directory in a recursive build.  In that case we start to
> care much more about latency than throughput.
> 
> However, we might also drop down to a small number of compiler jobs
> when, e.g., Make is doing non-cc work.  In those cases, it might be
> better to let the remote compile finish slowly where it is than to
> use up cycles locally that could be put to other use.
> 
> > What I'm thinking, is that once local hosts are starved, distcc should
> > find out that there is stuff running on slow hosts, and dupe the
> > compile work on the local hosts, sending back whatever finishes first.
> 
> At the moment, when distcc uses the term "starved", it means that jobs
> cannot get a CPU to run on, rather than the other way around.  So this
> is the opposite case: we have plenty of work and want to spread it to
> as many machines as possible.
> 
> To fit in with the way distcc is invoked, I think we would have to
> have a distcc client get "impatient" with a remote server and decide
> to run the job locally instead.  There is no per-client process that can move the
> task.  (The process needs to want to be moved.)
> 
> Another way that is perhaps more elegant is for the client just to
> make a better decision in the first place.  At the moment it starts
> off by trying to get one task on all machines, but if some of them are
> much slower, that might be bad.  We might want it to use a remote
> machine only when all the nearby ones are fully loaded.
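
Something like this ordering rule is what that would amount to (again
only a sketch, nothing from distcc's real scheduler; the host list and
slot counts are invented):

/* Pick the first host, nearest first, that still has a free slot;
 * faraway hosts are only reached once the nearby ones are saturated. */
#include <stddef.h>
#include <stdio.h>

struct host {
    const char *name;
    int slots;        /* max concurrent jobs */
    int running;      /* jobs currently assigned */
};

static struct host *pick(struct host *hosts, size_t n)
{
    for (size_t i = 0; i < n; i++)        /* list is ordered nearest first */
        if (hosts[i].running < hosts[i].slots)
            return &hosts[i];
    return NULL;                          /* everything saturated: wait */
}

int main(void)
{
    struct host hosts[] = {
        { "localhost", 2, 2 },   /* both local slots busy */
        { "lan-box",   4, 4 },   /* nearby box also full  */
        { "wan-box",   4, 1 },   /* faraway box has room  */
    };
    struct host *h = pick(hosts, 3);
    printf("next job goes to %s\n", h ? h->name : "(wait)");
    return 0;
}
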
> 
> -- 
> Martin
> 




