[distcc] homogeneous environments

Fergus Henderson fergus.henderson at gmail.com
Wed Apr 29 17:19:48 GMT 2009


On Tue, Apr 28, 2009 at 4:28 PM, Robert W. Anderson <
anderson110 at poptop.llnl.gov> wrote:

>
> I have an environment where we have many nodes potentially available for
> compilation, and all of them see the same file spaces via NFS.  We are
> seeing decent performance out of distcc 3.1 using pump mode, but from
> reading the docs there may be big performance gains left to wring out in
> this special(?) case.
>
> If I understand correctly, distcc's pump mode finds a set of header files
> necessary to send along with the source file to enable compilation on a
> remote node.  In a homogeneous environment, it seems both steps here are
> unnecessary if the master and slave nodes are more or less indistinguishable
> in terms of compiler, sources, and headers.
>
> I think we could really achieve some screaming compile times (over
> thousands of source files) if these steps could be bypassed with the user's
> explicit acknowledgement that he is making assumptions about the homogeneity
> of his build server machines.
>
> How extensive would the modifications be to support such an optimization?
>  It was not clear to me after a few minutes of poking around in the source,
> and thought I'd seek an expert opinion first.


Typically NFS is a lot slower than local file access.
So it's not clear that this approach would actually improve overall
performance.

Distcc can work faster than NFS, because it sends all of the source files at
once, requiring only one round-trip between the client and the distcc server
for each compilation.  With NFS, you need a round-trip between the distcc
server and the NFS server for each header file that is included (directly or
indirectly) from the source file being compiled.
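
As a rough illustration of the round-trip argument above, here is a minimal
sketch that counts only round-trips and ignores bandwidth, CPU time, and
caching. The function names and the numbers in the usage lines are
hypothetical, chosen for illustration rather than measured:

```python
def distcc_round_trips(compilations: int) -> int:
    """One client <-> distcc server round-trip per compilation."""
    return compilations


def nfs_round_trips(compilations: int, headers_per_compilation: int) -> int:
    """One distcc server <-> NFS server round-trip per included header,
    assuming nothing is cached on the compile node (a worst case)."""
    return compilations * headers_per_compilation


def network_wait_seconds(round_trips: int, rtt_seconds: float) -> float:
    """Time spent purely waiting on round-trips at a fixed latency."""
    return round_trips * rtt_seconds


# Hypothetical build: 1000 compilations, 200 headers each, 0.5 ms latency.
distcc_wait = network_wait_seconds(distcc_round_trips(1000), 0.0005)
nfs_wait = network_wait_seconds(nfs_round_trips(1000, 200), 0.0005)
print(distcc_wait, nfs_wait)
```

In this model the gap grows linearly with the number of headers per
compilation, which is why header-heavy code is where the difference would
show up most.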

Of course with distcc, if your source files are on NFS, the client needs to
do the same round-trips to the NFS server to fetch the files.  But this is
not as bad as having the distcc servers do that: the distcc client need only
fetch each file once for the whole build, not once for each compilation in
which it is referenced, and after that the file will probably be cached.  In
addition, the client machine is more likely to have source files cached from
previous builds, since on the client you're probably compiling the same
sources that you compiled last time, whereas the distcc server machines are
serving lots of different users who may be compiling very different
programs.
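
To make the "once per build" versus "once per compilation" difference
concrete, here is a small sketch over a made-up include map.  The file names
are hypothetical, and the server-side count assumes the worst case where the
compile nodes have nothing cached:

```python
from typing import Dict, Set


def client_side_fetches(includes: Dict[str, Set[str]]) -> int:
    """Each distinct file is fetched from NFS once for the whole build."""
    files = set(includes)  # the source files themselves
    for headers in includes.values():
        files |= headers
    return len(files)


def server_side_fetches(includes: Dict[str, Set[str]]) -> int:
    """Worst case: every compilation fetches its source and all its headers."""
    return sum(1 + len(headers) for headers in includes.values())


# Hypothetical build: three sources sharing a few headers.
build = {
    "a.c": {"common.h", "a.h"},
    "b.c": {"common.h", "b.h"},
    "c.c": {"common.h", "a.h", "b.h"},
}
print(client_side_fetches(build))  # 6 distinct files
print(server_side_fetches(build))  # 10 fetches (3 + 3 + 4)
```

The more widely a header is shared, the larger the gap between the two
counts.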

Another issue with this approach is security.  Currently distcc servers
normally run as user "distcc", which may not have access to the user's NFS
files, so this approach would not work if the source files are not
world-readable.  Of course it would be possible to address this issue by
having the distcc server authenticate the user, and then access the user's
files on NFS as that user, but that would require additional authentication,
which would have a performance impact.  For example, one way to do it would
be to use distcc's ssh mode, but that mode has a major performance impact.
(The recently posted patches for GSSAPI support have less performance
impact, but the impact is still significant.)

For the approach that you are considering, you may not need to use distcc at
all; a simple script using ssh may be sufficient, though the overheads of
ssh may be prohibitive (ssh connection sharing may help with that, although
that has security concerns of its own).

If you do want to modify distcc, I'd guess that the modifications needed
would be moderate in scope.
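
A minimal sketch of the ssh-script idea, assuming the build directory is
visible at the same NFS path on every node and the compiler is installed
identically everywhere.  The helper name, host name, and paths are
hypothetical; ControlMaster, ControlPath, and ControlPersist are the OpenSSH
client options for connection sharing:

```python
import shlex
from typing import List, Sequence


def remote_compile_argv(
    host: str,
    workdir: str,
    compile_cmd: Sequence[str],
    control_path: str = "~/.ssh/cm-%r@%h:%p",
) -> List[str]:
    """Build an ssh command line that runs compile_cmd in workdir on host.

    Connection sharing lets later compilations reuse one authenticated
    connection instead of paying the ssh handshake every time.
    """
    # Quote for the remote shell; workdir must be the same path on both ends.
    remote = "cd {} && {}".format(shlex.quote(workdir), shlex.join(compile_cmd))
    return [
        "ssh",
        "-o", "ControlMaster=auto",
        "-o", "ControlPath=" + control_path,
        "-o", "ControlPersist=60",
        host,
        remote,
    ]


# Hypothetical node and NFS path.
argv = remote_compile_argv(
    "node01", "/nfs/proj/src", ["gcc", "-c", "foo.c", "-o", "foo.o"]
)
print(argv)
```

The resulting list could be passed to subprocess.run(argv, check=True), with
a wrapper picking a host per job.  This does nothing about scheduling or
load balancing, which distcc handles for you.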

Cheers,
  Fergus.

-- 
Fergus Henderson <fergus at google.com>