[distcc] small redesign...

Łukasz Tasz lukasz at tasz.eu
Fri Oct 31 05:36:19 MDT 2014


Hi Guys,

I'm very happy: the reason for my failures has been identified.
The issue is here:
--- src/srvnet.c        (revision 177)
+++ src/srvnet.c        (working copy)
@@ -99,7 +99,7 @@
     rs_log_info("listening on %s", sa_buf ? sa_buf : "UNKNOWN");
     free(sa_buf);

-    if (listen(fd, 10)) {
+    if (listen(fd, 256)) {
         rs_log_error("listen failed: %s", strerror(errno));
         close(fd);
         return EXIT_BIND_FAILED;
Index: src/io.c

The listen backlog for new connections was limited to 10, which is why
many connections get reset when the cluster is overloaded.
The aim is to let a client wait for cluster availability, even up to
5 minutes, before compiling locally.
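
Note: if I read the kernel behaviour correctly, the backlog value passed
to listen() is additionally capped by the net.core.somaxconn sysctl
(historically 128 on Linux), so for the larger backlog to take effect it
may also be necessary to raise that limit on the distccd machine:

    sysctl -w net.core.somaxconn=256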

@Jarek, thanks for the support!

Let's discuss whether we should fix it or not.

regards
Lukasz


Łukasz Tasz


2014-10-24 10:27 GMT+02:00 Łukasz Tasz <lukasz at tasz.eu>:
> Hi Martin
>
> Here is what I have noticed.
> The client tries to connect to distccd 3 times, with a 500 ms delay in
> between. The Linux kernel by default accepts 128 pending connections.
> If the client creates a connection, even when no executors are
> available, the connection is accepted and queued by the kernel running
> distccd. This leads to a situation where the client thinks that a
> distccd slot is reserved, while in fact the connection is still waiting
> to be accepted by the distccd server.
> I suspect that the client then starts communicating too early, distccd
> never receives the DIST token, and both sides wait: communication is
> broken, and then the timeouts apply (the client has a default timeout,
> the server has none).
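>
> To illustrate the effect (a standalone sketch, not distcc code; the
> loopback address and the port are just stand-ins): connect() succeeds
> as soon as the kernel completes the handshake, even if distccd never
> gets around to calling accept():
>
>     #include <arpa/inet.h>
>     #include <netinet/in.h>
>     #include <stdio.h>
>     #include <string.h>
>     #include <sys/socket.h>
>     #include <unistd.h>
>
>     int main(void)
>     {
>         int fd = socket(AF_INET, SOCK_STREAM, 0);
>         struct sockaddr_in sa;
>         memset(&sa, 0, sizeof(sa));
>         sa.sin_family = AF_INET;
>         sa.sin_port = htons(3632);            /* distccd port */
>         inet_pton(AF_INET, "127.0.0.1", &sa.sin_addr);
>
>         /* Returns 0 even when all distccd children are busy: the
>          * connection merely sits in the kernel's accept queue. */
>         if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) == 0)
>             printf("connected - possibly only to the kernel queue\n");
>         close(fd);
>         return 0;
>     }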
>
> The failure scenario is: one distccd and two distcc users, both trying
> to compile with DISTCC_HOSTS=distccd/1,cpp,lzo. Both users have a lot
> of big objects, so the cluster is overloaded by a factor of 2.
> It should still be OK for a third and a fourth user to join the
> cluster.
>
> An easy reproducer is to set up one distccd and set
> DISTCC_HOSTS=distccd/20. This is a broken configuration, but it
> simulates an overload factor of 20 - as if 20 developers were using the
> cluster at the same time.
> Please remember that these are exceptional situations, but a developer
> can start a compilation with -j 1000 from his laptop; the cluster will
> time out, and then receiving 1000 jobs on one laptop will end with the
> OOM killer :D
> These are exceptional situations, but the cluster should handle them
> somehow.
>
> In the attachment, next to some pump changes, you can find a change
> that moves connection establishment to the very beginning: when distcc
> picks a host, the remote connection is made as well. If this fails,
> distcc follows the default behaviour, sleeps for one second, and picks
> a host again (sketched below). But this requires an additional
> administrative change on the distccd machine:
> iptables -I INPUT -p tcp --dport 3632 -m connlimit --connlimit-above
> <NUMBER OF DISTCCD> --connlimit-mask 0 -j REJECT --reject-with
> tcp-reset
> which accepts only as many connections as there are executors.
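>
> The allocation loop then looks roughly like this (a sketch only;
> dcc_pick_host() and dcc_connect_to() are hypothetical stand-ins for the
> distcc internals, not the real function names):
>
>     #include <unistd.h>
>
>     extern const char *dcc_pick_host(void);      /* hypothetical */
>     extern int dcc_connect_to(const char *host); /* hypothetical */
>
>     int acquire_executor(void)
>     {
>         for (;;) {
>             const char *host = dcc_pick_host();
>             int fd = dcc_connect_to(host);
>             if (fd >= 0)
>                 return fd;   /* this host had a free executor */
>             /* connlimit answered with a TCP reset, so the host is
>              * full: back off for a second and pick again. */
>             sleep(1);
>         }
>     }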
>
> So far so good!
> One remark: the patch is made on top of arankine_distcc_issue16-r335,
> since his pump changes make pump mode work in my environment. But I
> tested the distccd allocation on the latest official distcc release as
> well.
>
> let me know what you think!
>
> with best regards
> Lukasz
>
>
>
> Łukasz Tasz
>
>
> 2014-10-24 2:42 GMT+02:00 Martin Pool <mbp at sourcefrog.net>:
>> It seems like if there's nowhere to execute the job, we want the client
>> program to just pause, before using too many resources, until it gets
>> unqueued by a server ready to do the job. (Or, by a local slot being
>> available.)
>>
>>
>> On Thu Oct 16 2014 at 2:43:35 AM Łukasz Tasz <lukasz at tasz.eu> wrote:
>>>
>>> Hi Martin,
>>>
>>> Let's assume that you can trigger more compilation tasks than you
>>> have executors.
>>> In this scenario you face a situation where the cluster is saturated.
>>> When such a compilation is triggered by two developers, or by two CI
>>> (e.g. Jenkins) jobs, the cluster is saturated twice over...
>>>
>>> The default behaviour is to lock a slot locally and try to connect
>>> three times; if that fails, fall back, and if fallback is disabled,
>>> CI gets a failed build (fallback is not an option anyway, since the
>>> local machine cannot handle -j $(distcc -j)).
>>>
>>> Consider a scenario where I have 1000 objects and 500 executors:
>>> - a clean build on one machine takes
>>>   1000 * 20 sec (one obj) = 20000 sec / 16 processors = 1250 sec,
>>> - on the cluster: (1000/500) * 20 sec = 40 sec.
>>>
>>> Saturating the cluster was impossible without pump mode, but now,
>>> after the pump "warm-up" effect, pump can dispatch many tasks, and I
>>> have faced situations where a saturated cluster destroys almost every
>>> compilation.
>>>
>>> My expectation is that the cluster won't reject my connection, or
>>> that a rejection will be handled, either by the client or by the
>>> server.
>>>
>>> On the server side:
>>> - accept every connection,
>>> - fork a child if the connection is not accepted by an existing
>>> child,
>>> - in the pump case, prepare the local dir structure and receive the
>>> headers,
>>> - --critical section starts here-- a counting semaphore with value
>>> maxchild
>>>   - execute the job
>>> - release the semaphore
>>> (see the sketch below)
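>>>
>>> A minimal sketch of that critical section, assuming a POSIX counting
>>> semaphore shared by the forked children (run_compile_job() and the
>>> semaphore name are hypothetical; this is not actual distccd code):
>>>
>>>     #include <fcntl.h>
>>>     #include <semaphore.h>
>>>
>>>     #define MAXCHILD 16                  /* number of executors */
>>>
>>>     extern void run_compile_job(int fd); /* hypothetical */
>>>
>>>     static sem_t *slots;
>>>
>>>     /* parent, once at startup */
>>>     void init_slots(void)
>>>     {
>>>         slots = sem_open("/distccd_slots", O_CREAT, 0600, MAXCHILD);
>>>     }
>>>
>>>     /* in each forked child, after accept() and after the request
>>>      * headers have been received */
>>>     void serve(int fd)
>>>     {
>>>         sem_wait(slots);   /* block until an executor slot is free */
>>>         run_compile_job(fd);
>>>         sem_post(slots);   /* release the slot */
>>>     }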
>>>
>>>
>>> Also, what you suggested may be an even better solution, since the
>>> client would pick the first available executor instead of entering a
>>> queue; distcc could make the connection already in the function
>>> dcc_lock_one().
>>>
>>> I already tried to put DISTCC_DIR on a common NFS share, but when you
>>> trigger that many jobs, it becomes a bottleneck... and I won't even go
>>> into locking on NFS, or the scenario where somebody takes a lock on
>>> NFS and the machine crashes - that will not work, by design :)
>>>
>>> I know this scenario does not happen very often, and it is more or
>>> less a load-peak situation, but we should be happy when the distcc
>>> cluster is saturated, and this case should be handled.
>>>
>>> I hope it's clearer now!
>>> br
>>> LT
>>>
>>> Łukasz Tasz
>>>
>>>
>>> 2014-10-16 1:39 GMT+02:00 Martin Pool <mbp at sourcefrog.net>:
>>> > Can you try to explain more clearly what difference in queueing behavior
>>> > you
>>> > expect from this change?
>>> >
>>> > I think probably the main change that's needed is for the client to ask
>>> > all
>>> > masters if they have space, to avoid needing to effectively poll by
>>> > retrying, or getting stuck waiting for a particular server.
>>> >
>>> > On Wed, Oct 15, 2014 at 12:53 PM, Łukasz Tasz <lukasz at tasz.eu> wrote:
>>> >>
>>> >> Hi Guys,
>>> >>
>>> >> please correct me if I'm wrong:
>>> >> - currently distcc tries to connect to the server 3 times, with a
>>> >> small delay,
>>> >> - the server forks x children, and all of them try to accept the
>>> >> incoming connection.
>>> >> If the server runs out of children (all of them are busy), the
>>> >> client falls back, and for the next 60 sec it will not try this
>>> >> machine.
>>> >>
>>> >> What do you think about redesigning distcc so that the master
>>> >> server always accepts the incoming connection and forks a child,
>>> >> but at the same time only x of them can enter the compilation
>>> >> task (dcc_spawn_child)? (Maybe preforking could still be used?)
>>> >>
>>> >> This would create a kind of queue; the client can always decide on
>>> >> its own how long it can wait (at most DISTCC_IO_TIMEOUT), but it is
>>> >> still faster to wait than to fall back to the local machine, since
>>> >> on the cluster side it is probably just a peak of saturation.
>>> >>
>>> >> Currently I am facing a situation where many jobs fall back, and
>>> >> the local machine is being killed by make's -j, which was
>>> >> calculated for distccd...
>>> >>
>>> >> Another trick might be to pick a different machine if the current
>>> >> one is busy, but that may be much more complex, in my opinion.
>>> >>
>>> >> what do you think?
>>> >> regards
>>> >> Łukasz Tasz
>>> >> __
>>> >> distcc mailing list            http://distcc.samba.org/
>>> >> To unsubscribe or change options:
>>> >> https://lists.samba.org/mailman/listinfo/distcc
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > Martin


More information about the distcc mailing list