[distcc] small redesign...

Łukasz Tasz lukasz at tasz.eu
Fri Oct 24 02:27:20 MDT 2014


Hi Martin

Here is what I have noticed.
The client tries to connect to distccd three times, with a 500 ms
delay between attempts.
By default the Linux kernel accepts and queues up to 128 pending
connections (the default listen backlog, net.core.somaxconn).
When the client connects, the kernel on the distccd machine accepts
and queues the connection even if no executors are available.
This leads to a situation where the client thinks a distccd slot is
reserved, while in fact the connection is still waiting to be
accepted by the distccd server.
I suspect the client then starts the protocol too early, distccd
never receives the DIST token, and both sides wait: communication is
broken, and then the timeouts apply; the client has a default
timeout, but the server has none.
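A quick way to observe this on the distccd machine (assuming the
standard iproute2 ss tool and the default port 3632): for a listening
socket, Recv-Q is the number of connections waiting in the kernel
accept queue and Send-Q is the backlog limit.

    ss -ltn 'sport = :3632'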

The failure scenario is:
one distccd and two distcc users, both trying to compile with
DISTCC_HOSTS=distccd/1,cpp,lzo. Both users have a lot of big
objects, so the cluster is overloaded by a factor of two.
It should still be OK for a third and a fourth user to join the cluster.

An easy reproducer is to set up one distccd and set
DISTCC_HOSTS=distccd/20. This is a broken configuration, but it
simulates an overload factor of 20, as if 20 developers were using
the cluster at the same time; for example:
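A minimal reproducer along those lines (hypothetical host name
"distccd", pump mode as in the scenario above):

    export DISTCC_HOSTS=distccd/20,cpp,lzo  # one daemon, oversubscribed 20x
    pump make -j20 CC="distcc gcc"          # dispatch more jobs than executors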
Please remember that these are exceptional situations, but a
developer can start a compilation with -j 1000 from his laptop, the
cluster will time out, and then receiving 1000 jobs back on the
laptop will end with the out-of-memory killer :D
These are exceptional situations, and the cluster should somehow
handle them.

In the attachment, next to some pump changes, you can find a change
that moves connection establishment to the very beginning: when
distcc picks a host, the remote connection is made as well. If this
fails, distcc follows the default behaviour, sleeps for one second,
and picks a host again. But this requires an additional
administrative change on the distccd machine:
    iptables -I INPUT -p tcp --dport 3632 \
        -m connlimit --connlimit-above <NUMBER OF DISTCCD> \
        --connlimit-mask 0 -j REJECT --reject-with tcp-reset

so that only as many connections are accepted as there are executors.
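For example (hypothetical numbers, pairing the daemon's job count
with the connection limit):

    distccd --daemon --jobs 8
    iptables -I INPUT -p tcp --dport 3632 \
        -m connlimit --connlimit-above 8 \
        --connlimit-mask 0 -j REJECT --reject-with tcp-reset

A rejected client then gets a TCP reset immediately and can pick
another host instead of sitting in the kernel's accept queue.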

So far so good!
One remark: the patch is done on top of arankine_distcc_issue16-r335,
since his pump changes make pump mode work in my environment.
But I also tested the distccd allocation change on the latest
official distcc release.

let me know what you think!

with best regards
Lukasz

2014-10-24 2:42 GMT+02:00 Martin Pool <mbp at sourcefrog.net>:
> It seems like if there's nowhere to execute the job, we want the client
> program to just pause, before using too many resources, until it gets
> unqueued by a server ready to do the job. (Or, by a local slot being
> available.)
>
>
> On Thu Oct 16 2014 at 2:43:35 AM Łukasz Tasz <lukasz at tasz.eu> wrote:
>>
>> Hi Martin,
>>
>> Let's assume that you can trigger more compilation tasks than you
>> have executors.
>> In this scenario you are facing a saturated cluster.
>> When such a compilation is triggered by two developers, or by two CI
>> jobs (e.g. Jenkins), the cluster is saturated twice over...
>>
>> The default behaviour is to lock a slot locally and try to connect
>> three times; if that fails, fall back, and if fallback is disabled,
>> CI gets a failed build (fallback is not an option, since the local
>> machine cannot handle -j $(distcc -j)).
>>
>> Consider a scenario where I have 1000 objects and 500 executors:
>> - a clean build on one machine takes
>>   1000 * 20 sec (per object) = 20000 sec / 16 processors = 1250 sec,
>> - on the cluster it takes (1000/500) * 20 sec = 40 sec.
>>
>> Saturating the cluster was impossible without pump mode, but now,
>> after the "warm-up" effect, pump mode can dispatch many tasks, and I
>> have faced situations where a saturated cluster destroys almost every
>> compilation.
>>
>> My expectation is that the cluster won't reject my connection, or
>> that a reject will be handled, either by the client or by the server.
>>
>> By the server:
>> - accept every connection,
>> - fork a child if the connection is not accepted by a child,
>> - in pump mode, prepare the local dir structure and receive headers,
>> - --critical section starts here-- a counting semaphore with value
>> maxchild:
>>   - execute the job,
>> - release the semaphore.
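>>
>> A minimal sketch of that critical section (hypothetical, in shell: a
>> FIFO pre-filled with maxchild tokens acts as the counting semaphore,
>> and run_job stands for the real compilation):
>>
>>     mkfifo /tmp/distccd-slots
>>     exec 3<>/tmp/distccd-slots                      # keep the FIFO open
>>     for i in $(seq "$MAXCHILD"); do echo >&3; done  # add maxchild tokens
>>
>>     read -u 3 _   # acquire: blocks until a slot token is available
>>     run_job       # execute the job
>>     echo >&3      # release: return the token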
>>
>>
>> Also, what you suggested may be an even better solution, since the
>> client would pick the first available executor instead of entering a
>> queue; distcc could make the connection already in dcc_lock_one().
>>
>> I already tried to set DISTCC_DIR on a common NFS share, but when you
>> trigger that many jobs, this becomes a bottleneck... not to mention
>> locking on NFS, or the scenario where somebody takes a lock on NFS
>> and the machine crashes; by design that will not work :)
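>>
>> (For reference, that experiment just pointed the per-user state
>> directory at the share, e.g.:
>>
>>     export DISTCC_DIR=/mnt/nfs/distcc-state  # hypothetical shared path
>>
>> distcc keeps its host lock and backoff files there, so every client
>> then contends on NFS locking.)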
>>
>> I know this scenario does not happen very often, and the load comes
>> more or less in peaks, but we should be happy that the distcc cluster
>> is saturated, and this case should be handled.
>>
>> I hope it's clearer now!
>> br
>> LT
>>
>>
>> Łukasz Tasz
>>
>>
>> 2014-10-16 1:39 GMT+02:00 Martin Pool <mbp at sourcefrog.net>:
>> > Can you try to explain more clearly what difference in queueing behavior
>> > you
>> > expect from this change?
>> >
>> > I think probably the main change that's needed is for the client to ask
>> > all
>> > masters if they have space, to avoid needing to effectively poll by
>> > retrying, or getting stuck waiting for a particular server.
>> >
>> > On Wed, Oct 15, 2014 at 12:53 PM, Łukasz Tasz <lukasz at tasz.eu> wrote:
>> >>
>> >> Hi Guys,
>> >>
>> >> Please correct me if I'm wrong:
>> >> - currently distcc tries to connect to the server 3 times, with a
>> >> small delay in between,
>> >> - the server forks x children, and all of them try to accept
>> >> incoming connections.
>> >> If the server runs out of children (all of them are busy), the
>> >> client falls back, and for the next 60 seconds it will not try this
>> >> machine.
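>> >>
>> >> You can watch the retries and the 60-second backoff from the client
>> >> side with distcc's standard logging variables, e.g.:
>> >>
>> >>     DISTCC_VERBOSE=1 DISTCC_LOG=/tmp/distcc.log distcc gcc -c foo.c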
>> >>
>> >> What do you think about redesigning distcc so that the master
>> >> server always accepts an incoming connection and forks a child, but
>> >> at the same time only x of them can enter the compilation task
>> >> (dcc_spawn_child)? (Maybe preforking could still be used?)
>> >>
>> >> This would create a kind of queue; the client can always decide on
>> >> its own whether it can wait some time (at most DISTCC_IO_TIMEOUT),
>> >> but it is still faster to wait, since on the cluster side it is
>> >> probably just a peak of saturation, than to fall back to the local
>> >> machine.
>> >>
>> >> Currently I'm facing a situation where many jobs fall back, and the
>> >> local machine is being killed by make's -j that was calculated for
>> >> distccd...
>> >>
>> >> Another trick might be to pick a different machine if the current
>> >> one is busy, but in my opinion that may be much more complex.
>> >>
>> >> what do you think?
>> >> regards
>> >> Łukasz Tasz
>> >
>> > --
>> > Martin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tasz_lukasz.patch
Type: text/x-patch
Size: 14174 bytes
Desc: not available
URL: <http://lists.samba.org/pipermail/distcc/attachments/20141024/b45c4e28/attachment-0001.bin>

