[distcc] garbage transferred instead of preprocessed source

Zdenek Behan rain at matfyz.cz
Thu Apr 13 16:51:00 GMT 2006

Martin Pool wrote:

> On 11/04/2006, at 7:27 PM, Zdenek Behan wrote:
>> Hi,
>> I encountered a very strange problem with distcc. Let me explain:
>> I have 2 machines (both gentoo). One is i686 (fast) and the other is
>> ppc (slow). I have a working (tested) ppc crosscompiler on the i686 and
>> a native compiler on the ppc [same versions - 3.4.6].
>> I emerged exactly the same version of distcc on both (tried  multiple 
>> versions), and ran them with:
>> (slow machine)
>> PATH="/usr/powerpc-unknown-linux-gnu/bin:/usr/powerpc-unknown-linux-gnu/gcc-bin/3.4.6/" /usr/bin/distccd -p 55555 -N 10 --allow --listen= --no-detach --user distcc --log-stderr
>> (fast machine)
>> PATH="/usr/powerpc-unknown-linux-gnu/bin:/usr/powerpc-unknown-linux-gnu/gcc-bin/3.4.6/" /usr/bin/distccd -p 55555 -N 10 --allow --listen= --no-detach --user distcc --log-stderr
>> I put both hosts (201, 15) into /etc/distcc/hosts.
>> Daemons work fine until i try to actually compile something.
>> -- 
>> #include <stdio.h>
>> int main( int argc, char ** argv )
>> {
>>         printf("Hello world!\n");
>>         return 1;
>> }
>> -- 
>> I created simple hello.c to demonstrate. The command used is:
>> distcc powerpc-unknown-linux-gnu-gcc -c -o hello.o hello.c
>> Now i have 4 variants of using distcc: Fast to Fast (localhost), Fast
>> to Slow, Slow to Fast and Slow to Slow.
>> When doing any of the pointless variants (F->S, F->F, S->S), distccd
>> creates /tmp/distccd_key.i on the local machine containing
>> preprocessed source and within a fraction of a second, it's done.
>> Verbose distccd output says something like:
> When you say "/tmp/distccd_key.i " I presume the "key" is actually  
> some random hex characters?

Naturally :)

>> distccd[12392] (dcc_check_client) connection from
>> distccd[12392] compile from hello.c to hello.o
>> distccd[12392] (dcc_r_file_timed) 16695 bytes received in  0.001372s, 
>> rate 11883kB/s
>> distccd[12392] (dcc_collect_child) cc times: user 0.170000s, system  
>> 0.040000s, 501 minflt, 1030 majflt
>> distccd[12392] powerpc-unknown-linux-gnu-gcc hello.c on localhost  
>> completed ok
>> distccd[12392] job complete
>> In the last variant (Slow -> Fast), it creates /tmp/distccd_key.i as
>> well, however, what it contains can hardly be compared to
>> preprocessed source. It's basically a binary file containing a
>> random dump of some disk data. I have a copy of such a file in case
>> anyone wants to see it, but there's not much to see, really.
>> Naturally this fails with a megabytes-long error log that goes as follows:
>> /tmp/distccd_237f6a6d.i:122: error: stray '\242' in program
>> /tmp/distccd_237f6a6d.i:122: error: stray '\160' in program
>> /tmp/distccd_237f6a6d.i:122: error: stray '\195' in program
>> /tmp/distccd_237f6a6d.i:122: error: stray '\242' in program
>> ...
>> Output looks like this:
>> distccd[11797] (dcc_check_client) connection from
>> distccd[11797] compile from hello.c to hello.o
>> distccd[11797] (dcc_r_file_timed) 16695 bytes received in  0.002057s, 
>> rate 7926kB/s
>> distccd[11797] (dcc_collect_child) cc times: user 0.425935s, system  
>> 0.989849s, 889 minflt, 0 majflt
>> distccd[11797] powerpc-unknown-linux-gnu-gcc hello.c on localhost  
>> failed
>> distccd[11797] job complete
>> Notice that the file size is actually the same. It's the content
>> that is scrambled, for a reason completely unknown to me. Neither side
>> crashes; each only reports the huge error log and then goes on.
>> Just for the record, distcc is built on both systems natively with  
>> native compiler (same version - 3.4.6), glibc versions are not the  
>> same, but i can hardly imagine that being a problem.
>> Can anyone help me, or at least point me to where i should be  
>> looking for the problem? This seems to be purely distcc issue, as  it 
>> never gets to actually compiling anything, besides, i believe my  
>> crosscompiler setup is correct.
>> My first guess was an endianness swap (ppc is big endian), but since
>> there is some totally out of place text mixed up with garbage binary
>> data in the temporary file, i think that's not the explanation. So now
>> i'm left completely clueless, and any help will be
>> appreciated.
> I suspect you have a kernel bug on the ppc machine which is making it  
> transmit the wrong data across the network.  To check it, please run  
> on the ppc host
>   tcpdump -w distcc.pcap 'tcp port 2622'
> and compile a file.  Then stop tcpdump and post the capture file to  
> me, or have a look at it in ethereal if you like.  I suspect we will  
> see garbage in the DOTI field because sendfile isn't working  
> properly.  What kernel are you running there?  Do you have a known  
> good one you could try?
You were absolutely right there. I checked with ethereal, and the data
is obviously being transmitted badly. I'm not sure whether it's a kernel
bug or not, but I suspect so, since I have already replaced everything
else, with no improvement. I even discovered one more application
misbehaving in a similar way: python's bzip2 library. Every file
compressed with it ends up with random contents of the disk instead of
the data. I also rebuilt the whole system, and that still didn't fix it,
so it's not just some dynamic linking issue. That suggests it may be a
problem with certain types of file descriptors in the kernel, but who
knows what sort of obscure problem it really is.
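
For anyone who wants to repeat this check without clicking through
ethereal, a rough Python sketch along the following lines could pull the
DOTI payload out of a captured client-to-server byte stream and test
whether it still looks like text. It rests on several assumptions: that
the request uses the distcc protocol version 1 layout (each token is a
4-character name followed by 8 hex digits, in the order DIST, ARGC, one
ARGV per argument, then DOTI), that the raw stream has already been
extracted from the capture, and that a 1% threshold is a sensible
heuristic. The function names are made up for illustration and are not
part of distcc.

```python
def read_token(stream, expected):
    """Read one 12-byte token: a 4-character name plus 8 hex digits."""
    header = stream.read(12)
    if len(header) != 12:
        raise ValueError("truncated token, wanted %s" % expected)
    name = header[:4].decode("ascii", "replace")
    if name != expected:
        raise ValueError("expected %s, got %r" % (expected, name))
    # The numeric field is hexadecimal text, e.g. b"00004137".
    return int(header[4:].decode("ascii"), 16)


def extract_doti(stream):
    """Return the preprocessed-source bytes from a distcc request stream."""
    read_token(stream, "DIST")           # protocol version, ignored here
    argc = read_token(stream, "ARGC")
    for _ in range(argc):
        arg_len = read_token(stream, "ARGV")
        stream.read(arg_len)             # skip the argument text
    doti_len = read_token(stream, "DOTI")
    body = stream.read(doti_len)
    if len(body) != doti_len:
        raise ValueError("DOTI body shorter than announced length")
    return body


def looks_like_text(data, threshold=0.01):
    """Heuristic: preprocessed C has no NUL bytes and few bytes >= 0x80."""
    if not data:
        return True
    suspicious = sum(1 for b in data if b == 0 or b >= 0x80)
    return suspicious / len(data) < threshold
```

Calling extract_doti() on a file opened in binary mode and passing the
result to looks_like_text() should return False for a stream like the
scrambled one described above, where the announced length is right but
the body is full of stray high bytes.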

The machine is a network disk storage box to which I'm trying to port
some reasonable linux distribution for development directly on the
target platform. It has ppcboot, a supplied kernel (2.4.20) and a
busybox based initrd in flash already. Unfortunately it's a very slow
machine, which is why I was installing distcc on it in the first place,
to build the system and a new kernel. Generally crosscompiling those on
a faster machine is a pain: some makefiles simply do if crosscompiling;
then die; fi, or in the worse cases, silently use the native host
compiler, much unlike distcc, which can easily be limited to the right
toolchain by the PATH variable and generally works without problems.
So, distcc failing was really a painful blow. :)

However, I discovered a workaround for the problem. After digging
through the distcc source a bit, I found some comments about ssh and
local connections being handled differently. I did not examine the
problem much further; instead I just tunnelled one local port to the
target distcc on the fast machine, set the remote host to localhost,
and it just worked (TM). I already rebuilt the system, and the problem
did not go away, but it does not bother me that much anymore. I'd still
be interested in what could be the reason for this bug, however, because
I always thought all remote network connections were treated equally,
and in this case connecting to the machine's own IP address (not
127.0.0.1, but the outer interface) works, while connecting elsewhere
fails. Also, all other system services (sshd) work perfectly, and didn't
crash even a single time, so it's not exactly a general network problem.
I also have issues replacing the (possibly buggy) kernel in flash, and
I'm stuck with using the old one with my new root image for now, so
knowing what could potentially go wrong in the source would be very
valuable info. Hopefully even for someone else who might bump into the
same problem later. :)
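
One way to narrow down the sendfile suspicion raised earlier in the
thread might be to compare what sendfile delivers with what a plain
read() returns for the same file. The sketch below is only an
illustration under stated assumptions: it needs a Python with
os.sendfile (3.3 or later), it uses a local socketpair rather than the
real network interface, and the function name is hypothetical. Since
the corruption described here only shows up on connections to other
hosts, a local run may well pass on an affected machine; for a real
test the socketpair would have to be replaced by a TCP connection to a
remote listener.

```python
import os
import socket


def sendfile_matches_read(path, chunk=4096):
    """Send a file through sendfile() and compare with a plain read()."""
    left, right = socket.socketpair()
    try:
        with open(path, "rb") as f:
            expected = f.read()          # reference copy via ordinary read
            received = bytearray()
            offset = 0
            while offset < len(expected):
                # Push at most one chunk, then drain it, so the socket
                # buffer never fills up and blocks this single thread.
                sent = os.sendfile(left.fileno(), f.fileno(), offset, chunk)
                if sent == 0:
                    break
                offset += sent
                while len(received) < offset:
                    part = right.recv(offset - len(received))
                    if not part:
                        raise IOError("socket closed before all data arrived")
                    received += part
    finally:
        left.close()
        right.close()
    return bytes(received) == expected
```

A result of False would point at the sendfile path itself, while True
from a local run says little about the network driver.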


