FIN_WAIT1 bug with RH 6 (Re: [distcc] distcc 0.9 released)

Martin Pool mbp at samba.org
Thu Sep 5 02:35:01 GMT 2002


On  4 Sep 2002, "Hien D. Ngo" <hien at moses.xp.com> wrote:
> 
> distcc continues to run on my RH 6 test boxes, but now leaves a ton
> of FIN_WAIT1 processes around (284 total at last count.)  My RH
> 7.2/7.3 boxes don't exhibit this problem and are still running
> without problems as of this writing.

I'm happy to hear about the 7.x machines working.

> =======
> distcc
> =======
> ngoh at build03.foo.com $ netstat -to | grep 3568
> tcp        0     69 build03.foo.com:3568 build04.foo.com:4200 FIN_WAIT1 off (0.00/0/0)
> ngoh at build03.foo.com $ lsof -i:3568

(Let me step through it to be clear in my own mind.)

This is a client; it has a socket open to the server, and it has
closed the local end and is waiting for the server to ACK its FIN.
Also, there are 69 bytes still buffered, waiting to be either ACKed by
the server, or retransmitted.

I am a little surprised that there is no timer running, because the
client ought to be retransmitting the queued data in an attempt to get
the server to ACK the last 69 bytes.
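
To track how many sockets are in this state, and how much data each still
has queued, the netstat output can be summarized mechanically.  The following
is only a rough sketch: it assumes the column layout shown in the quoted
`netstat -to` output (Proto, Recv-Q, Send-Q, local address, remote address,
state), and the function names are made up for illustration.

```python
from collections import Counter


def parse_netstat(text):
    """Yield (send_q, local, remote, state) for each TCP line.

    Assumes the column order seen in the quoted `netstat -to` output:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State [Timer ...]
    """
    for line in text.splitlines():
        fields = line.split()
        # Skip headers, blank lines and non-TCP sockets.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        yield int(fields[2]), fields[3], fields[4], fields[5]


def summarize(text):
    """Return (count of sockets per state, FIN_WAIT1 rows with unsent data)."""
    rows = list(parse_netstat(text))
    counts = Counter(state for _, _, _, state in rows)
    stuck = [row for row in rows if row[3] == "FIN_WAIT1" and row[0] > 0]
    return counts, stuck


# The client-side line quoted above, used as sample input.
SAMPLE = (
    "tcp        0     69 build03.foo.com:3568 "
    "build04.foo.com:4200 FIN_WAIT1 off (0.00/0/0)"
)

if __name__ == "__main__":
    counts, stuck = summarize(SAMPLE)
    print(counts)
    print(stuck)
```

Feeding it the full `netstat -to` output at intervals would show whether the
FIN_WAIT1 count keeps growing and whether the queued bytes ever drain.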

According to lsof, no program has the socket open, which would explain
why it's closed.  According to your log from the server, the server is
waiting to receive the compiler arguments, so the client should not
normally have exited at that point.  

So I wonder if the client either crashed, or exited abnormally?  It
would be interesting to either look for client-side core files (making
sure they're enabled), or look at the verbose client log to see why
the client went away, or failing that what it managed to do before it
left.
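
One way to check for an abnormal exit is to run the failing client command
under a small wrapper that enables core files and reports how the process
ended.  This is a minimal sketch, not part of distcc; `run_with_cores` is a
hypothetical helper, and it assumes a Unix system where the hard core-size
limit is nonzero if a core file is actually wanted.

```python
import resource
import signal
import subprocess
import sys


def run_with_cores(argv):
    """Run argv with core files enabled; return (returncode, description).

    The soft core-size limit is raised to the hard limit in this process,
    and the child inherits it.
    """
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))

    proc = subprocess.run(argv)
    if proc.returncode < 0:
        # A negative return code means the child was killed by a signal.
        name = signal.Signals(-proc.returncode).name
        return proc.returncode, "killed by " + name
    return proc.returncode, "exited with status %d" % proc.returncode


if __name__ == "__main__":
    # Pass the client command line to investigate as arguments.
    code, description = run_with_cores(sys.argv[1:])
    print(description)
```

A "killed by SIGSEGV" or similar result would point at a crash in the client;
a plain nonzero exit status would suggest looking at the verbose log instead.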

> =======
> distccd
> =======
> ngoh at build04.foo.com $ netstat -to | grep 3568
> tcp        0      0 build04.foo.com:4200 build03.foo.com:3568 ESTABLISHED off (0.00/0/0)

It looks like everything is fine on the server side; it's trying to
read more data, and isn't getting any.

So overall I am inclined to suspect that there is a kernel bug
relating to FIN_WAIT1 on RH6.2, and also that something yet to be
determined is causing distcc to quit early.

-- 
Martin 