[distcc] Remote Fallback

Thu May 22 06:18:53 GMT 2003

On 21 May 2003, Thomas Walker <Thomas.Walker at morganstanley.com> wrote:

> Unfortunately this doesn't really work in our environment (updating things in
> /etc on thousands of potential client machines is very much
> non-trivial).

Can I ask how many server machines you think you'll have?

I suspect you will hit a plateau at ~20 where either the network is
saturated or the client cannot issue enough jobs.  But I'm not really
sure, and of course it will depend on the load.  It would be
interesting to verify (in the same way as verifying "money can't buy
happiness." :-)

> I was looking at making the host list static and just simply removing a server
> for that execution but, with the addition of the backoff, will probably just
> mark the server as bad via the method you've provided.  This could, however,
> cause a problem if it takes you more than <backoff time> (currently 60sec) to go
> through the list completely.  Is there a reason why you picked 60 seconds aside
> from convenience (I don't know about your environment, but in mine, if I lose a
> machine, it takes a lot more than a minute for it to return - even a panic
> followed by an automatic reboot takes at least 2 min to come back completely,
> maybe you were thinking of other sorts of temporary problems).

Yes, if the machine is off then I agree it will take >1min.

I arrived at the number like this: in sixty seconds, if the client is
running flat out, then you can issue on the order of a hundred jobs.
(As always this will vary enormously depending on the particular task
load and hardware, but it ought to be in [15..200].)  If one of the
machines is down, then it is retried about once per backoff period,
so about 1% of the jobs will fail to distribute and need to be
rescheduled locally.  An extra 1% running locally, plus the extra
time to discover that the machine is down, should be no great
burden.

So in your case if it takes 4 minutes to boot then only about four
jobs will go to that machine, which shouldn't be a problem.
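
As a rough sketch of the arithmetic above (not distcc code; the
function names are made up for illustration, and it assumes a dead
host is retried exactly once each time the backoff expires):

```python
import math

def jobs_lost_to_dead_host(downtime_s, backoff_s=60.0):
    """Approximate number of jobs sent to a host while it is down.

    Assumption: the client tries the host once per backoff period,
    notices the failure, and then leaves it alone until the backoff
    expires again.
    """
    return math.ceil(downtime_s / backoff_s)

def failed_fraction(jobs_per_minute, backoff_s=60.0):
    """Fraction of all jobs that fail to distribute while one host is down."""
    jobs_per_backoff = jobs_per_minute * backoff_s / 60.0
    return 1.0 / jobs_per_backoff

# A machine that takes 4 minutes to reboot, with a 60 second backoff:
print(jobs_lost_to_dead_host(240))   # about four jobs
# A client issuing on the order of a hundred jobs per minute:
print(failed_fraction(100))          # about 1% of jobs
```

At the low end of the [15..200] range the same formula gives roughly
7% of jobs falling back to local compilation, and at the high end
about 0.5%.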

If the user discovers the problem and e.g. puts back the kicked-out
network cable or restarts distccd or whatever then we don't want them
to have to wait too long for the machine to be reused.  Making it much
longer than 1min means that when the problem is resolved people would
want to manually reset the timer, which would be ugly.
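
To make the behaviour concrete, here is a minimal sketch of a
fixed-time backoff.  It is only an illustration, not how distcc
itself stores this state; the class and method names are
hypothetical, and falling back to "localhost" stands in for
compiling the job locally.

```python
import time

BACKOFF_S = 60.0  # the one-minute figure discussed above

class HostBackoff:
    """Remember which hosts failed recently and skip them for a while."""

    def __init__(self, backoff_s=BACKOFF_S, clock=time.monotonic):
        self.backoff_s = backoff_s
        self.clock = clock          # injectable so the timer can be tested
        self._bad_until = {}        # host name -> time it may be retried

    def mark_bad(self, host):
        """Record a failure; the host is skipped until the backoff expires."""
        self._bad_until[host] = self.clock() + self.backoff_s

    def usable(self, host):
        """True if the host never failed or its backoff has run out."""
        until = self._bad_until.get(host)
        if until is None:
            return True
        if self.clock() >= until:
            del self._bad_until[host]   # timer expired, no manual reset needed
            return True
        return False

    def pick(self, hosts):
        """Return the first usable host, or "localhost" if none is."""
        for host in hosts:
            if self.usable(host):
                return host
        return "localhost"
```

With a short timer like this, a machine whose cable has been plugged
back in is picked up again within a minute without anyone having to
clear its entry by hand.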

I guess 1min will be annoying if you are doing interactive development
and rebuilding every few minutes, and every time you build you get a
warning.  We might try some kind of decaying average, but I'm not sure
it would be better.
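
One possible reading of "decaying average", purely as a sketch: keep
an exponentially weighted failure score per host, so that recent
failures count for more than old ones.  Nothing like this is
implemented; the function name, the smoothing factor alpha, and the
idea of comparing the score to a threshold are all assumptions made
up for illustration.

```python
def update_failure_score(score, failed, alpha=0.5):
    """Exponentially weighted average of recent failures for one host.

    score  -- previous score in [0, 1]; 0 means no recent failures
    failed -- True if the latest attempt to use the host failed
    alpha  -- weight of the latest attempt (hypothetical tuning knob)
    """
    outcome = 1.0 if failed else 0.0
    return alpha * outcome + (1.0 - alpha) * score

score = 0.0
for failed in (True, True, False):
    score = update_failure_score(score, failed)
print(score)   # two failures then one success leaves a score of 0.375
```

A client could then skip, or stop warning about, any host whose score
is above some threshold, at the cost of one more number that would
need tuning.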

Since I couldn't think of a universally suitable algorithm, one
minute sounded like as good a number as any.

If you have dedicated build machines then it might be good to run
something like Nagios to check that they're all up.

-- 
Martin