<p dir="ltr"><br>

On 1 Nov 2014 15:08, "Łukasz Tasz" <<a href="mailto:lukasz@tasz.eu">lukasz@tasz.eu</a>> wrote:<br>

><br>

> Sure, I just made a quick fix to check my test case, and shared it with you immediately.</p>

<p dir="ltr">Sure, understood -- that's great, thanks.</p>

<p dir="ltr">> I will try to send a more polished fix :)<br>

> Regards<br>

> lt<br>

><br>

> On 1 Nov 2014 09:06, "Fergus Henderson" <<a href="mailto:fergus@google.com">fergus@google.com</a>> wrote:<br>

><br>

>> Well, perhaps it would be a good idea to add a distccd flag or environment variable to control the queue length rather than hard-coding 10 or 256?<br>
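A minimal sketch of the suggestion above, assuming a hypothetical environment variable (distccd does not define one for this; both the variable name used in the note below and the helper dcc_listen_backlog() are made up for illustration). It parses the value with strtol() and falls back to a default when the variable is unset or invalid:<br>

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper, not part of distcc: returns the listen() backlog
 * taken from the environment variable `name`, or `fallback` when the
 * variable is unset, empty, not a number, or not a positive int. */
int dcc_listen_backlog(const char *name, int fallback)
{
    const char *val;
    char *end = NULL;
    long n;

    assert(name != NULL);

    val = getenv(name);
    if (val == NULL || *val == '\0')
        return fallback;

    errno = 0;
    n = strtol(val, &end, 10);
    /* Reject overflow, trailing garbage, and out-of-range values. */
    if (errno != 0 || *end != '\0' || n <= 0 || n > INT_MAX)
        return fallback;

    return (int) n;
}
```

In srvnet.c the call could then read something like listen(fd, dcc_listen_backlog("DISTCCD_LISTEN_BACKLOG", 10)), where the variable name is an assumption. On Linux the kernel silently caps the backlog at net.core.somaxconn, so a larger value may also need a sysctl change.<br>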

>><br>

>> On 31 Oct 2014 11:37, "Łukasz Tasz" <<a href="mailto:lukasz@tasz.eu">lukasz@tasz.eu</a>> wrote:<br>

>>><br>

>>> Hi Guys,<br>

>>><br>

>>> I'm very, very happy: the cause of my failures has been identified.<br>

>>> The issue is in:<br>

>>> --- src/srvnet.c        (revision 177)<br>

>>> +++ src/srvnet.c        (working copy)<br>

>>> @@ -99,7 +99,7 @@<br>

>>>      rs_log_info("listening on %s", sa_buf ? sa_buf : "UNKNOWN");<br>

>>>      free(sa_buf);<br>

>>><br>

>>> -    if (listen(fd, 10)) {<br>

>>> +    if (listen(fd, 256)) {<br>

>>>          rs_log_error("listen failed: %s", strerror(errno));<br>

>>>          close(fd);<br>

>>>          return EXIT_BIND_FAILED;<br>

>>> Index: src/io.c<br>

>>><br>

>>> The queue for new connections was limited to 10, which is why many<br>

>>> connections are reset when the cluster is overloaded.<br>

>>> The aim is to wait as long as 5 min for cluster availability, and only then compile locally.<br>

>>><br>

>>> @Jarek, thanks for support!<br>

>>><br>

>>> Let's discuss whether we should fix it or not.<br>

>>><br>

>>> regards<br>

>>> Lukasz<br>

>>><br>

>>><br>

>>> Łukasz Tasz<br>

>>><br>

>>><br>

>>> 2014-10-24 10:27 GMT+02:00 Łukasz Tasz <<a href="mailto:lukasz@tasz.eu">lukasz@tasz.eu</a>>:<br>

>>> > Hi Martin<br>

>>> ><br>

>>> > Here is what I have noticed.<br>

>>> > The client tries to connect to distccd 3 times, with 500 ms delays in between.<br>

>>> > The Linux kernel accepts 128 connections by default.<br>

>>> > If a client opens a connection, it is accepted and queued by the kernel on the<br>

>>> > distccd machine even if no executors are available.<br>

>>> > This leads to a situation where the client thinks that distccd is reserved,<br>

>>> > but in fact the connection is still waiting to be accepted by the distccd server.<br>

>>> > I suspect that the client then starts communicating too early, distccd never<br>

>>> > receives the DIST token, and both sides wait. Communication is broken, and<br>

>>> > then timeouts kick in: the client has a default timeout, the server has<br>

>>> > none.<br>
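The client-side retry behaviour described here (3 attempts, 500 ms apart) can be sketched as below. This is not distcc's actual client code: connect_with_retry(), attempt_fn and fake_attempt() are hypothetical names, and the real connect step is replaced by a callback so the loop can be run in isolation.<br>

```c
#include <assert.h>
#include <time.h>

/* Callback that makes one connection attempt; returns 0 on success. */
typedef int (*attempt_fn)(void *ctx);

/* Sleep for `ms` milliseconds (nanosleep is POSIX; EINTR is ignored here). */
static void sleep_ms(long ms)
{
    struct timespec ts;
    ts.tv_sec = ms / 1000;
    ts.tv_nsec = (ms % 1000) * 1000000L;
    nanosleep(&ts, NULL);
}

/* Try `attempt` up to `max_tries` times, sleeping `delay_ms` between
 * failed attempts. Returns the 1-based number of the attempt that
 * succeeded, or 0 if all of them failed. */
int connect_with_retry(attempt_fn attempt, void *ctx, int max_tries, long delay_ms)
{
    int i;

    assert(attempt != NULL);
    for (i = 1; i <= max_tries; i++) {
        if (attempt(ctx) == 0)
            return i;
        if (i < max_tries)          /* no point sleeping after the last try */
            sleep_ms(delay_ms);
    }
    return 0;
}

/* Stand-in for a real connect(): fails while *remaining_failures > 0. */
int fake_attempt(void *ctx)
{
    int *remaining_failures = ctx;

    if (*remaining_failures > 0) {
        (*remaining_failures)--;
        return -1;
    }
    return 0;
}
```

Calling connect_with_retry(fake_attempt, &failures, 3, 500) mirrors the numbers in the mail. A return value of 0 is the point where the client would fall back to local compilation.<br>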

>>> ><br>

>>> > The failure scenario is:<br>

>>> > one distccd and two distcc users, both compiling with<br>

>>> > DISTCC_HOSTS=distccd/1,cpp,lzo. Both users have a lot of big<br>

>>> > objects, so the cluster is overloaded by a factor of 2.<br>

>>> > It should still be OK for a third and a fourth user to join the cluster.<br>

>>> ><br>

>>> > An easy reproducer is to set up one distcc and set DISTCC_HOSTS=distccd/20.<br>

>>> > This is a broken configuration, but it simulates an overload factor of 20,<br>

>>> > as if 20 developers were using the cluster at the same time.<br>

>>> > Please remember that these are exceptional situations, but a developer<br>

>>> > can start a compilation with -j 1000 from his laptop and the cluster will<br>

>>> > time out; falling back to 1000 jobs on the laptop will then end with the<br>

>>> > out-of-memory killer :D<br>

>>> > These are exceptional situations, but the cluster should handle them somehow.<br>

>>> ><br>

>>> > In the attachment, next to some pump changes, you can find a change<br>

>>> > that moves connection setup to the very beginning: when distcc<br>

>>> > picks a host, the remote connection is made as well. If this fails, distcc<br>

>>> > follows the default behaviour: it sleeps for one second and picks a host<br>

>>> > again. But this requires an additional administrative change on the distccd<br>

>>> > machine:<br>

>>> > iptables -I INPUT -p tcp --dport 3632 -m connlimit --connlimit-above &lt;NUMBER OF DISTCCD&gt; --connlimit-mask 0 -j REJECT --reject-with tcp-reset<br>

>>> > which accepts only as many connections as there are executors.<br>

>>> ><br>

>>> > So far so good!<br>

>>> > Note: the patch is made on top of arankine_distcc_issue16-r335, since<br>

>>> > his pump changes make pump mode work in my environment.<br>

>>> > But I also tested the distccd allocation on the latest official distcc release.<br>

>>> ><br>

>>> > let me know what you think!<br>

>>> ><br>

>>> > with best regards<br>

>>> > Lukasz<br>

>>> ><br>

>>> ><br>

>>> ><br>

>>> > Łukasz Tasz<br>

>>> ><br>

>>> ><br>

>>> > 2014-10-24 2:42 GMT+02:00 Martin Pool <<a href="mailto:mbp@sourcefrog.net">mbp@sourcefrog.net</a>>:<br>

>>> >> It seems like if there's nowhere to execute the job, we want the client<br>

>>> >> program to just pause, before using too many resources, until it gets<br>

>>> >> unqueued by a server ready to do the job. (Or, by a local slot being<br>

>>> >> available.)<br>

>>> >><br>

>>> >><br>

>>> >> On Thu Oct 16 2014 at 2:43:35 AM Łukasz Tasz <<a href="mailto:lukasz@tasz.eu">lukasz@tasz.eu</a>> wrote:<br>

>>> >>><br>

>>> >>> Hi Martin,<br>

>>> >>><br>

>>> >>> Let's assume that you can trigger more compilation tasks than you have<br>

>>> >>> executors.<br>

>>> >>> In this scenario the cluster is saturated.<br>

>>> >>> When such a compilation is triggered by two developers, or by two CI<br>

>>> >>> (e.g. Jenkins) jobs, the cluster is saturated twice over...<br>

>>> >>><br>

>>> >>> The default behaviour is to lock a slot locally and try to connect three<br>

>>> >>> times; if that fails, fall back to local compilation. If fallback is disabled,<br>

>>> >>> CI gets a failed build (fallback is not an option, since the local machine<br>

>>> >>> cannot handle -j $(distcc -j)).<br>

>>> >>><br>

>>> >>> Consider this scenario: I have 1000 objects and 500 executors.<br>

>>> >>> - a clean build on one machine takes<br>

>>> >>>   1000 * 20 sec (one obj) = 20000 sec / 16 processors = 1250 sec,<br>

>>> >>> - on the cluster: (1000/500) * 20 sec = 40 sec<br>
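The estimate above is total compile time divided by the number of parallel slots. A small sketch of that arithmetic follows; build_seconds() is a hypothetical helper, and the model is idealised (it ignores preprocessing, network transfer and linking). With the figures from the mail, 20000 sec of work on 16 processors comes to 1250 sec, against 40 sec on 500 executors, roughly a 31x difference.<br>

```c
#include <assert.h>

/* Idealised wall-clock estimate in seconds: total compile work divided
 * evenly across `slots` parallel executors. */
double build_seconds(int objects, double secs_per_object, int slots)
{
    assert(objects >= 0 && slots > 0);
    return objects * secs_per_object / slots;
}
```

For example, build_seconds(1000, 20.0, 16) models the single 16-processor machine and build_seconds(1000, 20.0, 500) models the cluster.<br>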

>>> >>><br>

>>> >>> Saturating the cluster was impossible without pump mode, but now, with pump<br>

>>> >>> mode and after the "warm up" effect, pump can dispatch many tasks, and I have<br>

>>> >>> hit a situation where a saturated cluster destroys almost every compilation.<br>

>>> >>><br>

>>> >>> My expectation is that the cluster won't reject my connection, or that a<br>

>>> >>> rejection will be handled, either by the client or by the server.<br>

>>> >>><br>

>>> >>> by server:<br>

>>> >>> - accept every connection,<br>

>>> >>> - fork child if not accepted by child,<br>

>>> >>> - in case of pump prepare local dir structure, receive headers<br>

>>> >>> - --critical section starts here-- multi value semaphore with value maxchild<br>

>>> >>>   - execute job<br>

>>> >>> - release semaphore<br>

>>> >>><br>

>>> >>><br>

>>> >>> Also, what you suggested may be an even better solution, since the client will<br>

>>> >>> pick the first available executor instead of entering a queue, so distcc<br>

>>> >>> could already make the connection in the function dcc_lock_one()<br>
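A rough sketch of "pick the first available executor", reduced to the selection step only. struct host_slot and reserve_first_free() are hypothetical and do not reflect distcc's real data structures or dcc_lock_one(); in the proposal above the reservation would also open the remote connection, which is omitted here.<br>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical view of one build host; not distcc's real data structure. */
struct host_slot {
    const char *name;
    int max_jobs;   /* the /N limit from DISTCC_HOSTS */
    int busy_jobs;  /* jobs currently reserved on this host */
};

/* Reserve a slot on the first host with spare capacity.
 * Returns the index of that host, or -1 if every host is full, in which
 * case the caller would sleep and try again. */
int reserve_first_free(struct host_slot *hosts, size_t n)
{
    size_t i;

    assert(hosts != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (hosts[i].busy_jobs < hosts[i].max_jobs) {
            hosts[i].busy_jobs++;
            return (int) i;
        }
    }
    return -1;
}
```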

>>> >>><br>

>>> >>> I already tried putting DISTCC_DIR on a shared NFS mount, but when<br>

>>> >>> you trigger so many jobs this becomes a bottleneck... Not to<br>

>>> >>> mention locking on NFS, and the scenario where somebody takes<br>

>>> >>> a lock on NFS and the machine crashes: that cannot work by<br>

>>> >>> design :)<br>

>>> >>><br>

>>> >>> I know that this scenario does not happen very often, and that it comes<br>

>>> >>> more or less in peaks, but we should be happy when the distcc cluster<br>

>>> >>> is saturated, and this case should be handled.<br>

>>> >>><br>

>>> >>> hope it's more clear now!<br>

>>> >>> br<br>

>>> >>> LT<br>

>>> >>><br>

>>> >>><br>

>>> >>> Łukasz Tasz<br>

>>> >>><br>

>>> >>><br>

>>> >>> 2014-10-16 1:39 GMT+02:00 Martin Pool <<a href="mailto:mbp@sourcefrog.net">mbp@sourcefrog.net</a>>:<br>

>>> >>> > Can you try to explain more clearly what difference in queueing behavior<br>

>>> >>> > you expect from this change?<br>

>>> >>> ><br>

>>> >>> > I think probably the main change that's needed is for the client to ask all<br>

>>> >>> > masters if they have space, to avoid needing to effectively poll by<br>

>>> >>> > retrying, or getting stuck waiting for a particular server.<br>

>>> >>> ><br>

>>> >>> > On Wed, Oct 15, 2014 at 12:53 PM, Łukasz Tasz <<a href="mailto:lukasz@tasz.eu">lukasz@tasz.eu</a>> wrote:<br>

>>> >>> >><br>

>>> >>> >> Hi Guys,<br>

>>> >>> >><br>

>>> >>> >> please correct me if I'm wrong:<br>

>>> >>> >> - currently distcc tries to connect to the server 3 times, with a small delay,<br>

>>> >>> >> - the server forks x children and all of them try to accept incoming<br>

>>> >>> >> connections.<br>

>>> >>> >> If the server runs out of children (all of them are busy), the client will<br>

>>> >>> >> fall back, and will not try this machine again for the next 60 sec.<br>

>>> >>> >><br>

>>> >>> >> What do you think about redesigning distcc so that the master server<br>

>>> >>> >> always accepts incoming connections and forks a child, but only x of<br>

>>> >>> >> them at a time are able to enter the compilation<br>

>>> >>> >> task (dcc_spawn_child)? (Maybe preforking could still be used?)<br>

>>> >>> >><br>

>>> >>> >> This may create a kind of queue. The client can always decide on its own<br>

>>> >>> >> whether it can wait some time, up to a maximum of DISTCC_IO_TIMEOUT. It is<br>

>>> >>> >> still faster to wait, since on the cluster side it is probably just a peak<br>

>>> >>> >> of saturation, than to fall back to the local machine.<br>

>>> >>> >><br>

>>> >>> >> Currently I'm facing a situation where many jobs fall back, and the<br>

>>> >>> >> local machine is being killed by make's -j value calculated for distccd...<br>

>>> >>> >><br>

>>> >>> >> Another trick may be to pick a different machine if the current one is busy,<br>

>>> >>> >> but this may be much more complex in my opinion.<br>

>>> >>> >><br>

>>> >>> >> what do you think?<br>

>>> >>> >> regards<br>

>>> >>> >> Łukasz Tasz<br>

>>> >>> >> __<br>

>>> >>> >> distcc mailing list            <a href="http://distcc.samba.org/">http://distcc.samba.org/</a><br>

>>> >>> >> To unsubscribe or change options:<br>

>>> >>> >> <a href="https://lists.samba.org/mailman/listinfo/distcc">https://lists.samba.org/mailman/listinfo/distcc</a><br>

>>> >>> ><br>

>>> >>> ><br>

>>> >>> ><br>

>>> >>> ><br>

>>> >>> > --<br>

>>> >>> > Martin<br>

</p>