It seems that if there is nowhere to execute the job, we want the client program to simply pause, before it uses too many resources, until it is dequeued by a server that is ready to do the job (or until a local slot becomes available).<br><br><div class="gmail_quote">On Thu Oct 16 2014 at 2:43:35 AM Łukasz Tasz <<a href="mailto:lukasz@tasz.eu">lukasz@tasz.eu</a>> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi Martin,<br>

<br>

Let's assume that you can trigger more compilation tasks than you have executors.<br>

In this scenario you face a situation where the cluster is saturated.<br>

When such a compilation is triggered by two developers, or by two CI<br>

(e.g. Jenkins) jobs, the cluster is saturated twice over...<br>

<br>

The default behaviour is to lock a slot locally and try to connect three<br>

times; if that fails, the client falls back to local compilation, and if fallback is disabled the CI build<br>

fails (fallback is not an option anyway, since the local machine cannot handle -j<br>

$(distcc -j)).<br>

<br>

Consider this scenario: I have 1000 objects and 500 executors,<br>

- a clean build on one machine takes<br>

  1000 * 20 sec (one obj) = 20000 sec / 16 processors = 1250 sec,<br>

- on the cluster it takes (1000/500) * 20 sec = 40 sec<br>
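As a rough check of the arithmetic above, here is a minimal Python sketch. The function name build_time_sec is hypothetical, and it assumes perfect parallelism with no network or scheduling overhead, which a real build will not achieve. Note that 20000 / 16 works out to 1250 sec.<br>

```python
def build_time_sec(objects, sec_per_object, parallel_slots):
    """Idealised wall-clock build time, assuming perfect parallelism.

    Hypothetical helper: total compile work divided evenly over the slots.
    """
    return objects * sec_per_object / parallel_slots


# Numbers taken from the scenario in this message.
local = build_time_sec(1000, 20, 16)     # one machine with 16 processors
cluster = build_time_sec(1000, 20, 500)  # cluster with 500 executors

print(f"local: {local:.0f} sec, cluster: {cluster:.0f} sec")
print(f"ideal speed-up: {local / cluster:.2f}x")
```

The gap between the two figures is why falling back to the local machine is so costly when the cluster is only briefly saturated.<br>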

<br>

Saturating the cluster was impossible without pump mode, but now, with pump<br>

mode and after the "warm-up" effect, pump can dispatch many tasks, and I have faced<br>

a situation where a saturated cluster destroys almost every compilation.<br>

<br>

My expectation is that the cluster won't reject my connection, or that a rejection will<br>

be handled, either by the client or by the server.<br>

<br>

By the server:<br>

- accept every connection,<br>

- fork a child if the connection was not accepted by a child,<br>

- in pump mode, prepare the local directory structure and receive headers,<br>

- --critical section starts here-- acquire a counting semaphore with initial value maxchild,<br>

  - execute the job,<br>

- release the semaphore.<br>
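The server-side steps above can be sketched roughly as follows. This is only an illustration in Python, not distccd code: run_queue and handle_connection are hypothetical names, threads stand in for forked children, and a short sleep stands in for the compile job. Every connection is accepted, but only max_child handlers may be inside the critical section at once.<br>

```python
import threading
import time


def run_queue(num_jobs, max_child):
    """Accept num_jobs connections, but let only max_child compile at once."""
    slots = threading.BoundedSemaphore(max_child)  # counting semaphore
    state_lock = threading.Lock()
    stats = {"running": 0, "peak": 0, "done": 0}

    def handle_connection(job_id):
        # Outside the critical section: in pump mode the directory structure
        # would be prepared and the headers received here.
        with slots:  # blocks (queues) while max_child jobs are compiling
            with state_lock:
                stats["running"] += 1
                stats["peak"] = max(stats["peak"], stats["running"])
            time.sleep(0.01)  # placeholder for executing the job
            with state_lock:
                stats["running"] -= 1
                stats["done"] += 1
        # Leaving the "with slots" block releases the semaphore.

    threads = [threading.Thread(target=handle_connection, args=(i,))
               for i in range(num_jobs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return stats["done"], stats["peak"]


done, peak = run_queue(num_jobs=10, max_child=3)
print(f"completed {done} jobs, at most {peak} compiling at once")
```

With this shape no connection is rejected; clients beyond the limit simply wait, and how long they are willing to wait stays a client-side decision.<br>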

<br>

<br>

Also, what you suggested may be an even better solution, since the client would<br>

pick the first available executor instead of entering a queue, so distcc<br>

could make the connection already in the function dcc_lock_one().<br>
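A rough sketch of that client-side idea, in Python for brevity: ask every host in turn and take the first one with a free slot, pausing briefly when none has one. All names here (pick_first_available, has_free_slot, retry_delay, max_wait) are hypothetical; distcc has no such probe today, and the real change would live in the C client.<br>

```python
import time


def pick_first_available(hosts, has_free_slot, retry_delay=0.05,
                         max_wait=1.0, sleep=time.sleep):
    """Return the first host reporting a free slot, or None after max_wait.

    has_free_slot is a hypothetical callable standing in for a cheap
    "do you have space?" query to a server.
    """
    waited = 0.0
    while True:
        for host in hosts:
            if has_free_slot(host):
                return host
        if waited >= max_wait:
            return None  # caller decides: keep waiting, fail, or fall back
        sleep(retry_delay)  # pause instead of consuming local resources
        waited += retry_delay


# Usage with a fake probe: only host "b" reports a free slot.
print(pick_first_available(["a", "b", "c"], lambda host: host == "b"))
```

The client is never tied to one particular busy server, at the price of one extra round trip per candidate host.<br>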

<br>

I already tried setting DISTCC_DIR on a common NFS share, but when<br>

you trigger so many jobs, this becomes a bottleneck... Not to<br>

mention locking on NFS, and the scenario where somebody<br>

takes a lock on NFS and then the machine crashes - that will not work by<br>

design :)<br>

<br>

I know that this scenario does not happen very often, and that it has a more or<br>

less peak-like characteristic, but we should be happy when the distcc cluster<br>

is saturated, and this case should be handled.<br>

<br>

Hope it's clearer now!<br>

br<br>

LT<br>

<br>


Łukasz Tasz<br>

<br>

<br>

2014-10-16 1:39 GMT+02:00 Martin Pool <<a href="mailto:mbp@sourcefrog.net" target="_blank">mbp@sourcefrog.net</a>>:<br>

> Can you try to explain more clearly what difference in queueing behavior you<br>

> expect from this change?<br>

><br>

> I think probably the main change that's needed is for the client to ask all<br>

> masters if they have space, to avoid needing to effectively poll by<br>

> retrying, or getting stuck waiting for a particular server.<br>

><br>

> On Wed, Oct 15, 2014 at 12:53 PM, Łukasz Tasz <<a href="mailto:lukasz@tasz.eu" target="_blank">lukasz@tasz.eu</a>> wrote:<br>

>><br>

>> Hi Guys,<br>

>><br>

>> please correct me if I'm wrong,<br>

>> - currently distcc tries to connect to the server 3 times, with a small delay,<br>

>> - the server forks x children, and all of them try to accept incoming<br>

>> connections.<br>

>> If the server runs out of children (all of them are busy), the client will<br>

>> fall back, and will not try this machine for the next 60 sec.<br>

>><br>

>> What do you think about redesigning distcc so that the master server<br>

>> always accepts an incoming connection and forks a child, but at any one<br>

>> time only x of them are able to enter the compilation<br>

>> task (dcc_spawn_child)? (Maybe preforking could still be used?)<br>

>><br>

>> This would create a kind of queue. The client can always decide on its own<br>

>> whether it can wait some time (the maximum being DISTCC_IO_TIMEOUT), but it is still<br>

>> faster to wait, since on the cluster side it is probably just a peak of<br>

>> saturation, than to fall back to the local machine.<br>

>><br>

>> Currently I'm facing a situation where many jobs fall back, and<br>

>> the local machine is being killed by make's -j calculated for distccd...<br>

>><br>

>> Another trick may be to pick a different machine if the current one is busy, but<br>

>> this may be much more complex in my opinion.<br>

>><br>

>> what do you think?<br>

>> regards<br>

>> Łukasz Tasz<br>

>> __<br>

>> distcc mailing list            <a href="http://distcc.samba.org/" target="_blank">http://distcc.samba.org/</a><br>

>> To unsubscribe or change options:<br>

>> <a href="https://lists.samba.org/mailman/listinfo/distcc" target="_blank">https://lists.samba.org/mailman/listinfo/distcc</a><br>

><br>

><br>

><br>

><br>

> --<br>

> Martin<br>

</blockquote></div>