[distcc] More debugging info for FIN_WAIT1 bug with RH 6

Hien D. Ngo hien at moses.xp.com
Thu Sep 5 03:58:00 GMT 2002


I found the snippet of the distcc log for a FIN_WAIT1 connection.  The child
process that is spawned exits with an error, so distcc tries to compile locally.
Both the remote and local compiles exit with the same error code.  The file being
compiled actually fails with a real compile error in the log output (usually
foo.cpp is trying to #include a header that doesn't exist, or some such error).

Hien

---- Original Message ----
From:		Martin Pool
Date:		Wed 9/4/02 20:35
To:		Hien D. Ngo
Cc:		distcc at lists.samba.org
Subject:	Re: FIN_WAIT1 bug with RH 6 (Re: [distcc] distcc 0.9 released)

On  4 Sep 2002, "Hien D. Ngo" <hien at moses.xp.com> wrote:
> 
> distcc continues to run on my RH 6 test boxes, but now leaves a ton
> of FIN_WAIT1 processes around (284 total at last count.)  My RH
> 7.2/7.3 boxes don't exhibit this problem and are still running
> without problems as of this writing.

I'm happy to hear about the 7.x machines working.

> =======
> distcc
> =======
> ngoh at build03.foo.com $ netstat -to | grep 3568
> tcp        0     69 build03.foo.com:3568 build04.foo.com:4200 FIN_WAIT1 off (0.00/0/0)
> ngoh at build03.foo.com $ lsof -i:3568

(Let me step through it to be clear in my own mind.)

This is a client; it has a socket open to the server, and it has
closed the local end and is waiting for a FIN from the server.  Also,
there are 69 bytes still buffered, waiting to be either ACKd by the
server, or retransmitted.

I am a little surprised that there is no timer running, because the
client ought to be retransmitting the queued data in an attempt to get
the server to ACK the last 69 bytes.
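
To make the reading of that netstat line concrete, here is a minimal sketch in
Python (not part of distcc; the names TcpRow, parse_netstat_line and
looks_stuck are made up for illustration).  It assumes the usual Linux
"netstat -to" column order of Proto, Recv-Q, Send-Q, Local Address, Foreign
Address, State, Timer, and flags the combination described above: FIN_WAIT1,
unacknowledged bytes in Send-Q, and no timer running.

```python
from typing import NamedTuple, Optional


class TcpRow(NamedTuple):
    """One TCP row of `netstat -to` output (hypothetical helper type)."""
    recv_q: int
    send_q: int
    local: str
    remote: str
    state: str
    timer: str


def parse_netstat_line(line: str) -> Optional[TcpRow]:
    """Parse one line; return None for headers and non-TCP rows."""
    fields = line.split()
    if len(fields) < 7 or not fields[0].startswith("tcp"):
        return None
    return TcpRow(int(fields[1]), int(fields[2]),
                  fields[3], fields[4], fields[5], fields[6])


def looks_stuck(row: TcpRow) -> bool:
    """Closed locally, data still queued, and no retransmit timer armed."""
    return row.state == "FIN_WAIT1" and row.send_q > 0 and row.timer == "off"


# The client-side line quoted above, joined onto one line.
sample = ("tcp        0     69 build03.foo.com:3568 "
          "build04.foo.com:4200 FIN_WAIT1 off (0.00/0/0)")
row = parse_netstat_line(sample)
print(row.send_q, row.state, row.timer, looks_stuck(row))
```

Feeding the whole netstat output through parse_netstat_line and counting the
rows where looks_stuck is true would give a running total of the leaked
connections.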

According to lsof, no program has the socket open, which would explain
why it's closed.  According to your log from the server, the server is
waiting to receive the compiler arguments, so the client should not
normally have exited at that point.  

So I wonder if the client either crashed, or exited abnormally?  It
would be interesting to either look for client-side core files (making
sure they're enabled), or look at the verbose client log to see why
the client went away, or failing that what it managed to do before it
left.
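
As a rough way to check whether core files are enabled, the sketch below reads
the soft RLIMIT_CORE limit from Python; a soft limit of 0 means the kernel will
not write a core file.  The function name core_dumps_enabled is made up for
illustration, and the limit is inherited per process, so this only reflects the
environment it is run from (the same shell or build script that launches
distcc).  In a shell, "ulimit -c" reports the same limit.

```python
import resource


def core_dumps_enabled() -> bool:
    """Return True if the soft core-file size limit is non-zero."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_CORE)
    # 0 disables core files; any other value (including "unlimited") allows them.
    return soft != 0


print("core files enabled:", core_dumps_enabled())
```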

> =======
> distccd
> =======
> ngoh at build04.foo.com $ netstat -to | grep 3568
> tcp        0      0 build04.foo.com:4200 build03.foo.com:3568 ESTABLISHED off (0.00/0/0)

It looks like everything is fine on the server side; it's trying to
read more data, and isn't getting any.

So overall I am inclined to suspect that there is a kernel bug
relating to FIN_WAIT1 on RH6.2, and also that something yet to be
determined is causing distcc to quit early.

-- 
Martin 
_______________________________________________
distcc mailing list
distcc at lists.samba.org
http://lists.samba.org/cgi-bin/mailman/listinfo/distcc

-------------- next part --------------
=======
distcc
=======
ngoh at build03.foo.com $ netstat -to | grep 4883
tcp        0     44 build03.foo.com:4883 build05.foo.com:4200 FIN_WAIT1 off (0.00/0/0)
ngoh at build03.foo.com $ lsof -i:4883

=======
distccd
=======
ngoh at build05.foo.com $ netstat -to | grep 4883
tcp        0      0 build05.foo.com:4200 build03.foo.com:4883 ESTABLISHED off (0.00/0/0)
ngoh at build05.foo.com $ lsof -i:4883
COMMAND   PID USER   FD   TYPE  DEVICE SIZE NODE NAME
distccd 26417 ngoh    5u  inet 1549188       TCP build05.foo.com:4200->build03.foo.com:4883 (ESTABLISHED)
ngoh at build05.foo.com $ strace -p26417
about to attach 6731
read(5,  <unfinished ...>

ngoh at build03.foo.com $ grep 23470 /tmp/distcc.log
distcc[23470] (dcc_scan_args) scanning arguments: g++ -fPIC -g -O -Wall -pipe -pthread -Wno-non-template-friend -Wwrite-strings -ffor-scope -I./shadow/linux -I../../corba_util/linux.bld -I. -I/scratch/users/ngoh/ver/hdr/shadow/linux -I/scratch/users/ngoh/ver/hdr -I/scratch/users/ngoh/ver/hdr/shadow/linux -I/scratch/users/ngoh/ver/hdr -DRW_NO_STL -ftemplate-depth-50 -D_POSIX_THREADS -D_POSIX_THREAD_SAFE_FUNCTIONS -D_REENTRANT -DACE_HAS_AIO_CALLS -DACE_HAS_EXCEPTIONS -I/usr/local/ACE-5.2 -I/usr/local/ACE-5.2/TAO -I/usr/local/ACE-5.2/TAO/tao -I/usr/local/ACE-5.2/TAO/tao/PortableServer -I/usr/local/ACE-5.2/TAO/orbsvcs/orbsvcs -I/scratch/users/ngoh/ver/hdr/portable/tao -I/scratch/users/ngoh/ver/hdr/portable/tao -DCORBA_IMPL_TAO -DRW_CENTURY_REQD -DRW_MULTI_THREAD -D_REENTRANT -I/usr/local/RogueWave-7.1.1 -DMY_RW_CTLIB_=/usr/local/RogueWave-7.1.1/lib/libsdb12d.so -DCOMPAT_LAYER_NO_MIN_MAX -c -o ../../corba_util/linux.bld/BlockingResultListener.o BlockingResultListener.cpp
distcc[23470] (dcc_scan_args) found object file "../../corba_util/linux.bld/BlockingResultListener.o"
distcc[23470] (dcc_scan_args) found input file "BlockingResultListener.cpp"
distcc[23470] compile from BlockingResultListener.cpp to ../../corba_util/linux.bld/BlockingResultListener.o
distcc[23470] (dcc_parse_hosts) found tcp token "build04.foo.com"
distcc[23470] (dcc_parse_hosts) found tcp token "build05.foo.com"
distcc[23470] (dcc_parse_hosts) found tcp token "rizzo.foo.com"
distcc[23470] (dcc_parse_hosts) found tcp token "kermit.foo.com"
distcc[23470] (dcc_try_lock_host) locked /tmp/distcc_00002493/lock_build04.foo.com_0000000
distcc[23470] (dcc_pick_buildhost) building on build04.foo.com
distcc[23470] (dcc_set_output) changed output from "../../corba_util/linux.bld/BlockingResultListener.o" to "/tmp/distcc_00002493/cppout_0000023470.i"
distcc[23470] (dcc_set_output) command after: g++ -fPIC -g -O -Wall -pipe -pthread -Wno-non-template-friend -Wwrite-strings -ffor-scope -I./shadow/linux -I../../corba_util/linux.bld -I. -I/scratch/users/ngoh/ver/hdr/shadow/linux -I/scratch/users/ngoh/ver/hdr -I/scratch/users/ngoh/ver/hdr/shadow/linux -I/scratch/users/ngoh/ver/hdr -DRW_NO_STL -ftemplate-depth-50 -D_POSIX_THREADS -D_POSIX_THREAD_SAFE_FUNCTIONS -D_REENTRANT -DACE_HAS_AIO_CALLS -DACE_HAS_EXCEPTIONS -I/usr/local/ACE-5.2 -I/usr/local/ACE-5.2/TAO -I/usr/local/ACE-5.2/TAO/tao -I/usr/local/ACE-5.2/TAO/tao/PortableServer -I/usr/local/ACE-5.2/TAO/orbsvcs/orbsvcs -I/scratch/users/ngoh/ver/hdr/portable/tao -I/scratch/users/ngoh/ver/hdr/portable/tao -DCORBA_IMPL_TAO -DRW_CENTURY_REQD -DRW_MULTI_THREAD -D_REENTRANT -I/usr/local/RogueWave-7.1.1 -DMY_RW_CTLIB_=/usr/local/RogueWave-7.1.1/lib/libsdb12d.so -DCOMPAT_LAYER_NO_MIN_MAX -E -o /tmp/distcc_00002493/cppout_0000023470.i BlockingResultListener.cpp
distcc[23470] (dcc_spawn_child) forking to execute g++ -fPIC -g -O -Wall -pipe -pthread -Wno-non-template-friend -Wwrite-strings -ffor-scope -I./shadow/linux -I../../corba_util/linux.bld -I. -I/scratch/users/ngoh/ver/hdr/shadow/linux -I/scratch/users/ngoh/ver/hdr -I/scratch/users/ngoh/ver/hdr/shadow/linux -I/scratch/users/ngoh/ver/hdr -DRW_NO_STL -ftemplate-depth-50 -D_POSIX_THREADS -D_POSIX_THREAD_SAFE_FUNCTIONS -D_REENTRANT -DACE_HAS_AIO_CALLS -DACE_HAS_EXCEPTIONS -I/usr/local/ACE-5.2 -I/usr/local/ACE-5.2/TAO -I/usr/local/ACE-5.2/TAO/tao -I/usr/local/ACE-5.2/TAO/tao/PortableServer -I/usr/local/ACE-5.2/TAO/orbsvcs/orbsvcs -I/scratch/users/ngoh/ver/hdr/portable/tao -I/scratch/users/ngoh/ver/hdr/portable/tao -DCORBA_IMPL_TAO -DRW_CENTURY_REQD -DRW_MULTI_THREAD -D_REENTRANT -I/usr/local/RogueWave-7.1.1 -DMY_RW_CTLIB_=/usr/local/RogueWave-7.1.1/lib/libsdb12d.so -DCOMPAT_LAYER_NO_MIN_MAX -E -o /tmp/distcc_00002493/cppout_0000023470.i BlockingResultListener.cpp
distcc[23470] (dcc_spawn_child) child started as pid23516
distcc[23470] exec on build04.foo.com: g++ -fPIC -g -O -Wall -pipe -pthread -Wno-non-template-friend -Wwrite-strings -ffor-scope -I./shadow/linux -I../../corba_util/linux.bld -I. -I/scratch/users/ngoh/ver/hdr/shadow/linux -I/scratch/users/ngoh/ver/hdr -I/scratch/users/ngoh/ver/hdr/shadow/linux -I/scratch/users/ngoh/ver/hdr -DRW_NO_STL -ftemplate-depth-50 -D_POSIX_THREADS -D_POSIX_THREAD_SAFE_FUNCTIONS -D_REENTRANT -DACE_HAS_AIO_CALLS -DACE_HAS_EXCEPTIONS -I/usr/local/ACE-5.2 -I/usr/local/ACE-5.2/TAO -I/usr/local/ACE-5.2/TAO/tao -I/usr/local/ACE-5.2/TAO/tao/PortableServer -I/usr/local/ACE-5.2/TAO/orbsvcs/orbsvcs -I/scratch/users/ngoh/ver/hdr/portable/tao -I/scratch/users/ngoh/ver/hdr/portable/tao -DCORBA_IMPL_TAO -DRW_CENTURY_REQD -DRW_MULTI_THREAD -D_REENTRANT -I/usr/local/RogueWave-7.1.1 -DMY_RW_CTLIB_=/usr/local/RogueWave-7.1.1/lib/libsdb12d.so -DCOMPAT_LAYER_NO_MIN_MAX -c -o ../../corba_util/linux.bld/BlockingResultListener.o BlockingResultListener.cpp
distcc[23470] (dcc_open_socket_out) client got connection to build04.foo.com port 4200 on fd6
distcc[23470] (dcc_collect_child) child 23516 terminated with status 0x100
distcc[23470] (dcc_report_rusage) cpp resource usage: 0.080000s user, 0.090000s system
distcc[23470] (dcc_critique_status) ERROR: cpp on build03.foo.com failed with exit code 1
distcc[23470] (dcc_build_somewhere) Notice: failed to distribute, running locally instead
distcc[23470] exec on localhost: g++ -fPIC -g -O -Wall -pipe -pthread -Wno-non-template-friend -Wwrite-strings -ffor-scope -I./shadow/linux -I../../corba_util/linux.bld -I. -I/scratch/users/ngoh/ver/hdr/shadow/linux -I/scratch/users/ngoh/ver/hdr -I/scratch/users/ngoh/ver/hdr/shadow/linux -I/scratch/users/ngoh/ver/hdr -DRW_NO_STL -ftemplate-depth-50 -D_POSIX_THREADS -D_POSIX_THREAD_SAFE_FUNCTIONS -D_REENTRANT -DACE_HAS_AIO_CALLS -DACE_HAS_EXCEPTIONS -I/usr/local/ACE-5.2 -I/usr/local/ACE-5.2/TAO -I/usr/local/ACE-5.2/TAO/tao -I/usr/local/ACE-5.2/TAO/tao/PortableServer -I/usr/local/ACE-5.2/TAO/orbsvcs/orbsvcs -I/scratch/users/ngoh/ver/hdr/portable/tao -I/scratch/users/ngoh/ver/hdr/portable/tao -DCORBA_IMPL_TAO -DRW_CENTURY_REQD -DRW_MULTI_THREAD -D_REENTRANT -I/usr/local/RogueWave-7.1.1 -DMY_RW_CTLIB_=/usr/local/RogueWave-7.1.1/lib/libsdb12d.so -DCOMPAT_LAYER_NO_MIN_MAX -c -o ../../corba_util/linux.bld/BlockingResultListener.o BlockingResultListener.cpp
distcc[23470] (dcc_spawn_child) forking to execute g++ -fPIC -g -O -Wall -pipe -pthread -Wno-non-template-friend -Wwrite-strings -ffor-scope -I./shadow/linux -I../../corba_util/linux.bld -I. -I/scratch/users/ngoh/ver/hdr/shadow/linux -I/scratch/users/ngoh/ver/hdr -I/scratch/users/ngoh/ver/hdr/shadow/linux -I/scratch/users/ngoh/ver/hdr -DRW_NO_STL -ftemplate-depth-50 -D_POSIX_THREADS -D_POSIX_THREAD_SAFE_FUNCTIONS -D_REENTRANT -DACE_HAS_AIO_CALLS -DACE_HAS_EXCEPTIONS -I/usr/local/ACE-5.2 -I/usr/local/ACE-5.2/TAO -I/usr/local/ACE-5.2/TAO/tao -I/usr/local/ACE-5.2/TAO/tao/PortableServer -I/usr/local/ACE-5.2/TAO/orbsvcs/orbsvcs -I/scratch/users/ngoh/ver/hdr/portable/tao -I/scratch/users/ngoh/ver/hdr/portable/tao -DCORBA_IMPL_TAO -DRW_CENTURY_REQD -DRW_MULTI_THREAD -D_REENTRANT -I/usr/local/RogueWave-7.1.1 -DMY_RW_CTLIB_=/usr/local/RogueWave-7.1.1/lib/libsdb12d.so -DCOMPAT_LAYER_NO_MIN_MAX -c -o ../../corba_util/linux.bld/BlockingResultListener.o BlockingResultListener.cpp
distcc[23470] (dcc_spawn_child) child started as pid23531
distcc[23470] (dcc_collect_child) child 23531 terminated with status 0x100
distcc[23470] (dcc_report_rusage) g++ resource usage: 0.990000s user, 0.210000s system
distcc[23470] (dcc_critique_status) ERROR: compile on build03.foo.com failed with exit code 1
distcc[23470] (dcc_exit) Notice: exit: code 1; self: 0.010000 user 0.010000 sys; children: 1.100000 user 0.420000 sys
distcc[23469] (dcc_spawn_child) child started as pid23470
distcc[23469] (dcc_collect_child) child 23470 terminated with status 0
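
A note on reading the "terminated with status 0x100" lines above: assuming the
value logged is the raw wait status returned by the wait() family of calls
(which matches the "failed with exit code 1" lines that follow each one), the
exit code sits in the high byte.  A minimal Python sketch of the decoding,
using the standard os.W* helpers (describe_wait_status is a made-up name):

```python
import os


def describe_wait_status(status: int) -> str:
    """Turn a raw wait status into a short human-readable description."""
    if os.WIFEXITED(status):
        return "exited with code %d" % os.WEXITSTATUS(status)
    if os.WIFSIGNALED(status):
        return "killed by signal %d" % os.WTERMSIG(status)
    return "other status 0x%x" % status


# Status values as they appear in the log above.
print(describe_wait_status(0x100))
print(describe_wait_status(0))
```

A child that had crashed would show up as "killed by signal N" rather than an
exit code, which is not what either child in this log did.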

