[distcc] small redesign...

Łukasz Tasz lukasz at tasz.eu
Sat Nov 1 09:08:47 MDT 2014


Sure, I just made a quick fix to test my test case and immediately shared it
with you. I will try to send a more polished fix :)
Regards
lt
On 1 Nov 2014 09:06, "Fergus Henderson" <fergus at google.com> wrote:

> Well, perhaps it would be a good idea to add a distccd flag or environment
> variable to control the queue length rather than hard-coding 10 or 256?
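> For example, something along these lines (just a sketch; the
> DISTCC_LISTEN_BACKLOG variable name is made up for illustration):
>
>     /* sketch: make the listen() backlog configurable instead of hard-coded */
>     #include <stdlib.h>
>
>     static int dcc_listen_backlog(void) {
>         const char *s = getenv("DISTCC_LISTEN_BACKLOG");  /* hypothetical */
>         int n = s ? atoi(s) : 0;
>         return n > 0 ? n : 10;      /* fall back to the current default */
>     }
>
>     /* then in src/srvnet.c:
>            if (listen(fd, dcc_listen_backlog())) { ... }   */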
> On 31 Oct 2014 11:37, "Łukasz Tasz" <lukasz at tasz.eu> wrote:
>
>> Hi Guys,
>>
>> I'm very happy: the reasons for my failures have been identified.
>> The issue is in:
>> --- src/srvnet.c        (revision 177)
>> +++ src/srvnet.c        (working copy)
>> @@ -99,7 +99,7 @@
>>      rs_log_info("listening on %s", sa_buf ? sa_buf : "UNKNOWN");
>>      free(sa_buf);
>>
>> -    if (listen(fd, 10)) {
>> +    if (listen(fd, 256)) {
>>          rs_log_error("listen failed: %s", strerror(errno));
>>          close(fd);
>>          return EXIT_BIND_FAILED;
>> Index: src/io.c
>>
>> The queue for new connections was limited to 10, which is why many
>> connections are reset when the cluster is overloaded. The aim is to wait
>> even 5 minutes for cluster availability, and only then compile locally.
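>>
>> As a client-side illustration of that aim (my own sketch, not the actual
>> distcc retry code): keep retrying the connection for up to 5 minutes
>> before giving up and compiling locally:
>>
>>     /* sketch: retry a host for up to 5 minutes, then let the caller
>>        fall back to a local compile */
>>     #include <sys/socket.h>
>>     #include <time.h>
>>     #include <unistd.h>
>>
>>     int connect_with_patience(const struct sockaddr *sa, socklen_t len) {
>>         time_t deadline = time(NULL) + 5 * 60;  /* wait up to 5 minutes */
>>         while (time(NULL) < deadline) {
>>             int fd = socket(AF_INET, SOCK_STREAM, 0);
>>             if (fd >= 0 && connect(fd, sa, len) == 0)
>>                 return fd;            /* got a cluster slot */
>>             if (fd >= 0)
>>                 close(fd);
>>             sleep(1);                 /* short pause, then retry */
>>         }
>>         return -1;                    /* caller compiles locally */
>>     }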
>>
>> @Jarek, thanks for the support!
>>
>> Let's discuss whether we should fix it or not.
>>
>> regards
>> Lukasz
>>
>>
>> Łukasz Tasz
>>
>>
>> 2014-10-24 10:27 GMT+02:00 Łukasz Tasz <lukasz at tasz.eu>:
>> > Hi Martin
>> >
>> > Here is what I have noticed.
>> > The client tries to connect to distccd 3 times, with 500ms delays in
>> > between. The Linux kernel by default accepts 128 queued connections.
>> > If the client creates a connection, even if no executors are available,
>> > the connection is accepted and queued by the kernel running distccd.
>> > This leads to a situation where the client thinks distccd is reserved,
>> > but in fact the connection is still waiting to be accepted by the
>> > distccd server. I suspect the client then starts communicating too
>> > early, distccd never receives the DIST token, and both sides wait:
>> > communication is broken, and then the timeouts apply - the client has a
>> > default, but the server has none.
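>> >
>> > A minimal standalone illustration of the kernel behaviour (my own
>> > sketch, not distcc code): connect() succeeds as soon as the kernel
>> > queues the connection, even though the server never calls accept():
>> >
>> >     /* backlog_demo.c - connect() succeeds without any accept() */
>> >     #include <arpa/inet.h>
>> >     #include <netinet/in.h>
>> >     #include <stdio.h>
>> >     #include <string.h>
>> >     #include <sys/socket.h>
>> >
>> >     int main(void) {
>> >         struct sockaddr_in a;
>> >         memset(&a, 0, sizeof(a));
>> >         a.sin_family = AF_INET;
>> >         a.sin_port = htons(3632);                  /* distccd port */
>> >         a.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
>> >
>> >         int srv = socket(AF_INET, SOCK_STREAM, 0);
>> >         bind(srv, (struct sockaddr *)&a, sizeof(a));
>> >         listen(srv, 10);            /* the hard-coded backlog */
>> >         /* note: no accept() at all */
>> >
>> >         int cli = socket(AF_INET, SOCK_STREAM, 0);
>> >         if (connect(cli, (struct sockaddr *)&a, sizeof(a)) == 0)
>> >             printf("connected - but only queued by the kernel\n");
>> >         return 0;
>> >     }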
>> >
>> > The failure scenario is: one distccd and two distcc users, both of whom
>> > try to compile with DISTCC_HOSTS=distccd/1,cpp,lzo. Both users have a
>> > lot of big objects, so the cluster is overloaded by a factor of 2.
>> > It should still be OK for a third and a fourth user to join the cluster.
>> >
>> > An easy reproducer is to set up one distccd and set
>> > DISTCC_HOSTS=distccd/20. This is a broken configuration, but it
>> > simulates an overload factor of 20 - 20 developers using the cluster
>> > at the same time.
>> > Please remember that these are exceptional situations, but a developer
>> > can start a compilation with -j 1000 from his laptop; the cluster will
>> > time out, and then receiving 1000 jobs back on the laptop will end with
>> > the out-of-memory killer :D
>> > These are exceptional situations, and the cluster should handle them
>> > somehow.
>> >
>> > In the attachment, next to some pump changes, you can find a change
>> > which moves connection establishment to the very beginning: when distcc
>> > is picking a host, the remote connection is made as well. If this
>> > fails, distcc follows the default behaviour - it sleeps for one second
>> > and picks a host again. But this requires an additional administrative
>> > change on the distccd machine:
>> >
>> >     iptables -I INPUT -p tcp --dport 3632 \
>> >         -m connlimit --connlimit-above <NUMBER OF DISTCCD> \
>> >         --connlimit-mask 0 -j REJECT --reject-with tcp-reset
>> >
>> > which accepts only as many connections as there are executors.
>> >
>> > So far so good!
>> > One remark: the patch is done on top of arankine_distcc_issue16-r335,
>> > since his pump changes make pump mode work in my environment. But I
>> > also tested the distccd allocation on the latest official distcc
>> > release.
>> >
>> > let me know what you think!
>> >
>> > with best regards
>> > Lukasz
>> >
>> >
>> >
>> > Łukasz Tasz
>> >
>> >
>> > 2014-10-24 2:42 GMT+02:00 Martin Pool <mbp at sourcefrog.net>:
>> >> It seems like if there's nowhere to execute the job, we want the client
>> >> program to just pause, before using too many resources, until it gets
>> >> unqueued by a server ready to do the job. (Or, by a local slot being
>> >> available.)
>> >>
>> >>
>> >> On Thu Oct 16 2014 at 2:43:35 AM Łukasz Tasz <lukasz at tasz.eu> wrote:
>> >>>
>> >>> Hi Martin,
>> >>>
>> >>> Let's assume that you can trigger more compilation tasks than you have
>> >>> executors.
>> >>> In this scenario you are facing a situation where the cluster is
>> >>> saturated. When such a compilation is triggered by two developers, or
>> >>> two CI (e.g. Jenkins) jobs, the cluster is saturated twice over...
>> >>>
>> >>> The default behaviour is to lock a slot locally and try to connect
>> >>> three times; if that fails, fall back, and if fallback is disabled the
>> >>> CI gets a failed build (fallback is not an option anyway, since the
>> >>> local machine cannot handle -j $(distcc -j)).
>> >>>
>> >>> Consider a scenario: I have 1000 objects and 500 executors.
>> >>> - a clean build on one machine takes
>> >>>   1000 * 20 sec (one obj) = 20000 sec / 16 processors = 1250 sec,
>> >>> - on the cluster: (1000/500) * 20 sec = 40 sec
>> >>>
>> >>> Saturating the cluster was impossible without pump mode, but now with
>> >>> pump mode, after the "warm up" effect, pump can dispatch many tasks,
>> >>> and I have faced situations where a saturated cluster destroys almost
>> >>> every compilation.
>> >>>
>> >>> My expectation is that the cluster won't reject my connection, or that
>> >>> a rejection will be handled, either by the client or by the server.
>> >>>
>> >>> By the server:
>> >>> - accept every connection,
>> >>> - fork a child if the connection is not accepted by an existing child,
>> >>> - in the case of pump mode, prepare the local dir structure and
>> >>>   receive the headers,
>> >>> - --critical section starts here-- a multi-value semaphore with value
>> >>>   maxchild (see the sketch after this list),
>> >>>   - execute the job,
>> >>> - release the semaphore.
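>> >>>
>> >>> A minimal sketch of that critical section (my assumption of how it
>> >>> could look, not actual distcc code), using a POSIX named semaphore
>> >>> initialised to maxchild so that at most maxchild children execute
>> >>> jobs at once:
>> >>>
>> >>>     /* gate job execution on a semaphore with value maxchild */
>> >>>     #include <fcntl.h>
>> >>>     #include <semaphore.h>
>> >>>     #include <stdio.h>
>> >>>
>> >>>     int run_job_gated(int maxchild) {
>> >>>         /* every forked child opens the same named semaphore
>> >>>            (hypothetical name); the first open creates it with
>> >>>            initial value maxchild */
>> >>>         sem_t *gate = sem_open("/distccd_jobs", O_CREAT, 0600, maxchild);
>> >>>         if (gate == SEM_FAILED) {
>> >>>             perror("sem_open");
>> >>>             return -1;
>> >>>         }
>> >>>         sem_wait(gate);  /* blocks until one of maxchild slots is free */
>> >>>         /* --critical section: execute the compilation job here-- */
>> >>>         sem_post(gate);  /* release the slot */
>> >>>         sem_close(gate);
>> >>>         return 0;
>> >>>     }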
>> >>>
>> >>>
>> >>> Also, what you suggested may be an even better solution, since the
>> >>> client would pick the first available executor instead of entering a
>> >>> queue; distcc could make the connection already in the function
>> >>> dcc_lock_one().
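>> >>>
>> >>> Roughly like this (only a sketch of the idea; try_connect() is a
>> >>> hypothetical helper, and this is not the real dcc_lock_one()):
>> >>>
>> >>>     /* during host selection, only pick a host whose distccd
>> >>>        actually accepts our TCP connection right now */
>> >>>     int pick_connected_host(struct dcc_hostdef *hosts, int nhosts,
>> >>>                             struct dcc_hostdef **picked, int *out_fd) {
>> >>>         for (;;) {
>> >>>             for (int i = 0; i < nhosts; i++) {
>> >>>                 int fd = try_connect(&hosts[i]);  /* assumed helper */
>> >>>                 if (fd >= 0) {
>> >>>                     *picked = &hosts[i];
>> >>>                     *out_fd = fd;    /* reuse this fd for the job */
>> >>>                     return 0;
>> >>>                 }
>> >>>             }
>> >>>             sleep(1);  /* all busy (e.g. REJECTed by iptables): retry */
>> >>>         }
>> >>>     }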
>> >>>
>> >>> I already tried to set DISTCC_DIR on a common NFS share, but when you
>> >>> are triggering so many jobs, this became a bottleneck... not to
>> >>> mention locking on NFS, or the scenario where somebody takes a lock on
>> >>> NFS and the machine crashes - that will not work, by design :)
>> >>>
>> >>> I know that this scenario does not happen very often, and it is more
>> >>> or less a load spike, but we should be happy that the distcc cluster
>> >>> is saturated, and this case should be handled.
>> >>>
>> >>> I hope it's clearer now!
>> >>> br
>> >>> LT
>> >>>
>> >>> Łukasz Tasz
>> >>>
>> >>>
>> >>> 2014-10-16 1:39 GMT+02:00 Martin Pool <mbp at sourcefrog.net>:
>> >>> > Can you try to explain more clearly what difference in queueing
>> >>> > behavior you expect from this change?
>> >>> >
>> >>> > I think probably the main change that's needed is for the client to
>> >>> > ask all masters if they have space, to avoid needing to effectively
>> >>> > poll by retrying, or getting stuck waiting for a particular server.
>> >>> >
>> >>> > On Wed, Oct 15, 2014 at 12:53 PM, Łukasz Tasz <lukasz at tasz.eu> wrote:
>> >>> >>
>> >>> >> Hi Guys,
>> >>> >>
>> >>> >> please correct me if I'm wrong:
>> >>> >> - currently distcc tries to connect to the server 3 times, with a
>> >>> >>   small delay,
>> >>> >> - the server forks x children, and all of them try to accept the
>> >>> >>   incoming connection.
>> >>> >> If the server runs out of children (all of them are busy), the
>> >>> >> client will fall back, and within the next 60 sec it will not try
>> >>> >> this machine.
>> >>> >>
>> >>> >> What do you think about redesigning distcc so that the master
>> >>> >> server always accepts the incoming connection and forks a child,
>> >>> >> but at any one time only x of them can enter the compilation task
>> >>> >> (dcc_spawn_child)? (Maybe preforking could still be used?)
>> >>> >>
>> >>> >> This would create a kind of queue. The client can always decide on
>> >>> >> its own whether it can wait some time (the maximum being
>> >>> >> DISTCC_IO_TIMEOUT), but it's still faster to wait - on the cluster
>> >>> >> side it's probably just a spike of saturation - than to fall back
>> >>> >> to the local machine.
>> >>> >>
>> >>> >> Currently I'm facing a situation where many jobs fall back, and
>> >>> >> the local machine is being killed by make's -j calculated for
>> >>> >> distccd...
>> >>> >>
>> >>> >> Another trick may be to pick a different machine if the current
>> >>> >> one is busy, but this may be much more complex in my opinion.
>> >>> >>
>> >>> >> what do you think?
>> >>> >> regards
>> >>> >> Łukasz Tasz
>> >>> >
>> >>> >
>> >>> >
>> >>> >
>> >>> > --
>> >>> > Martin