[distcc] Working on several distcc enhancements (take 2)
Laurent Calburtin
laurent.calburtin at free.fr
Sun Nov 27 18:15:05 GMT 2005
Hi,
We just started using distcc at our company (a dozen developers on
the same project), and I'm pleased to see the distcc project is active!
I read your enhancement proposals with attention, especially points 4
and 5, and I would like to share some ideas with you.
On Friday 18 November 2005 23.39, Daniel Kegel wrote:
> 4. When a distccd server is full up on active jobs, and other nearby
> servers are not, it's a shame that clients which connect to the
> wrong server have to wait
...
> 5. If Alice has already compiled everything on client A, and Bob starts a job
> to compile the same everything on client B, it's a shame that Bob has to wait;
> perhaps distccd (or a load balancer!) should (carefully) cache results.
I'm very interested in combining ccache and distcc. I think that
would be a huge performance improvement in our situation, where many
developers compile roughly the same set of files each day.
But installing ccache on each distccd host means as many separate
caches.
Since filling a cache has a price (compiling!), I would prefer to
fill as few caches as possible with the same content. I know that
some people use a file server to share the cache, but that means
network traffic, which we may be able to avoid.
As far as I know, ccache uses an md4 hash of the pre-processor
output as a short signature of the file to compile. Why not send
only this hash to hosts instead of the full pre-processor output?
If the host has the result in its cache, we win; otherwise the client can
decide to try another host, or to continue with this host by sending
it the pre-processor output to compile and store in its cache for
the next developer who comes along.
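
The hash-first exchange described above can be sketched as follows. This is
only an illustration under stated assumptions: the Host class, its methods and
the build() function are hypothetical names, not distcc or ccache APIs; the
"compiler" is a stub; and SHA-256 stands in for md4 because md4 is not
available in every Python hashlib build.

```python
import hashlib


def signature(preprocessed: bytes) -> str:
    # Stand-in for the ccache-style signature. SHA-256 replaces md4 here
    # only for portability of the example.
    return hashlib.sha256(preprocessed).hexdigest()


class Host:
    """Hypothetical distccd host with a local result cache."""

    def __init__(self, name: str):
        self.name = name
        self.cache = {}    # signature -> object code
        self.compiles = 0  # how many real compilations this host ran

    def lookup(self, sig: str):
        # Only the short signature travels over the network for this call.
        return self.cache.get(sig)

    def compile_and_store(self, preprocessed: bytes) -> bytes:
        # Stub compiler: a real host would run the compiler here.
        self.compiles += 1
        obj = b"OBJ:" + preprocessed[::-1]
        self.cache[signature(preprocessed)] = obj
        return obj


def build(preprocessed: bytes, hosts):
    """Return (object code, host name, 'hit' or 'miss')."""
    sig = signature(preprocessed)
    # First pass: ask each host for the signature only.
    for host in hosts:
        obj = host.lookup(sig)
        if obj is not None:
            return obj, host.name, "hit"
    # Nobody has it cached: send the full pre-processor output to one host,
    # which compiles it and keeps the result for the next developer.
    host = hosts[0]
    return host.compile_and_store(preprocessed), host.name, "miss"
```

In this sketch the second developer who builds the same pre-processed file
gets a "hit" and no host compiles it again. Asking the hosts one by one is
exactly the overhead discussed next.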
An objection to this scenario could be the network overhead needed to
ask every host whether it has the cached output. That brings me to
another area where distcc may need improvement: host selection.
For now, host selection, as far as I know, doesn't use any server-side
status information (such as current load or number of pending
connections), and since every client uses the same algorithm to select
a host from its hosts list, chances are that distcc clients will all
tend to connect to the same servers.
Imagine a farm of 10 servers and 15 developers distributing 5 files to
compile. 15 distcc clients will try to connect to the first server
while 5 servers remain idle (if I'm wrong about this scenario,
please tell me). I saw a patch submission that, as I understood it, tends
to eliminate this problem by randomizing the host selection. That may
solve the problem in this case.
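
A toy simulation of the two policies, to make the difference concrete. It is
a deliberately simplified model and not distcc's real scheduler: it ignores
per-host job limits and any fallback to the next host in the list, so the
counts it produces apply to this model only. The function names and the
server names are made up for the example.

```python
import random
from collections import Counter


def first_in_list(hosts, rng):
    # Every client runs the same deterministic rule on the same list.
    return hosts[0]


def randomized(hosts, rng):
    # Each client picks a host uniformly at random.
    return rng.choice(hosts)


def simulate(pick, n_hosts=10, n_clients=15, seed=0):
    """Return a Counter mapping host name -> number of clients that chose it."""
    rng = random.Random(seed)
    hosts = [f"server{i}" for i in range(n_hosts)]
    return Counter(pick(hosts, rng) for _ in range(n_clients))


if __name__ == "__main__":
    for policy in (first_in_list, randomized):
        load = simulate(policy)
        idle = 10 - len(load)
        print(policy.__name__, dict(load), "idle servers:", idle)
```

With the deterministic rule all 15 clients pile onto one server; with random
selection the load spreads out, although some servers can still end up idle
or doubly loaded because no server-side status is consulted.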
Ultimately we may want a way for the client to select the "best" host
based on different criteria:
- does it already have the output in cache?
- does it have a slot available?
- is it the most powerful?
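
One possible way to combine these three criteria is a simple lexicographic
ranking, sketched below. The HostStatus fields and the priority order (cache
hit first, then free slot, then power) are assumptions made for the example,
as is the idea that a host can hand back a cached result without needing a
free compile slot.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class HostStatus:
    name: str
    has_cached_output: bool  # does it already have the output in cache?
    free_slots: int          # does it have a slot available?
    power: float             # hypothetical relative speed, higher is faster


def best_host(statuses: Iterable[HostStatus]) -> Optional[HostStatus]:
    # Assumption: a cached result can be served without a compile slot,
    # so a host is usable if it has either the result or a free slot.
    usable = [s for s in statuses if s.has_cached_output or s.free_slots > 0]
    if not usable:
        return None
    # Tuples compare element by element: cache hit beats free slot,
    # free slot beats raw power.
    return max(usable, key=lambda s: (s.has_cached_output, s.free_slots > 0, s.power))
```

For example, a slow host that already holds the object file wins over a fast
idle host, because no compilation is needed at all.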
So why not just shout out what we need by broadcasting (or
multicasting) the md4 hash code in a single UDP packet? Available
hosts would reply by describing their status (availability, cached
output available, power ratio, ...) so that the distcc client could
choose the best one.
UDP is not reliable, but reliability is not mandatory in this case
since we use it only as a way to improve host selection. And
distcc clients would wait for answers only for a very limited time
(which is an additional way to select the most responsive server).
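
A minimal sketch of the client side of such a query: one UDP datagram goes
out, and replies are collected only until a short deadline. Everything about
the wire format is an assumption made for illustration: the JSON encoding,
the field names, the broadcast address and UDP port 3633 are all
hypothetical, and no such protocol exists in distcc today.

```python
import json
import socket
import time


def encode_query(sig: str) -> bytes:
    # Hypothetical query format: just the signature of the file to compile.
    return json.dumps({"sig": sig}).encode()


def decode_reply(data: bytes):
    """Parse a hypothetical status reply; return None if it is malformed."""
    try:
        msg = json.loads(data.decode())
        return {
            "host": str(msg["host"]),
            "cached": bool(msg["cached"]),
            "free_slots": int(msg["free_slots"]),
        }
    except (ValueError, KeyError, TypeError):
        return None


def query_hosts(sig, addr=("255.255.255.255", 3633), wait=0.05):
    """Broadcast one query and return the replies received within `wait` seconds."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    replies = []
    try:
        sock.sendto(encode_query(sig), addr)
        deadline = time.monotonic() + wait
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            sock.settimeout(remaining)
            try:
                data, _ = sock.recvfrom(1500)
            except socket.timeout:
                break  # deadline reached; slow hosts are simply ignored
            reply = decode_reply(data)
            if reply is not None:
                replies.append(reply)
    except OSError:
        # Best effort only: a lost or failed query just means no hints.
        pass
    finally:
        sock.close()
    return replies
```

Replies arrive in the order the hosts answered, so the first entries come
from the most responsive servers. An empty list simply means the client
falls back to its ordinary hosts list.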
The work needed for all that would be to merge some ccache code
into distcc so that distcc and distccd can exchange only the hash
code instead of the whole pre-processed file.
The host selection protocol can be kept totally separate and should have
minimal impact on the existing distcc source code.
Additionally, if distccd hosts can reply to broadcasts, one may
want to take this opportunity to implement automatic detection of
available distccd servers. But I personally think zeroconf would be
better suited for this, and I saw a patch submission for it.
Thank you for reading!
Laurent