[distcc] distcc hanging after 123 days of system uptime?
dank at kegel.com
Wed Oct 26 22:11:01 GMT 2005
I'm seeing a funny problem. (This is with distcc 2.18.3.)
In fact, I saw this several months ago, and rebooting
all the servers fixed it then, but this time I looked
at it in a bit more detail.
On every other build of a big app, the client hangs on some random file.
netstat on the client shows tens of kilobytes stuck in the send queue,
while the server thinks the connection has already finished.
server# netstat -t | grep 3632
(no output)
client$ netstat -t | grep 3632
tcp 0 12169 client:37867 server:3632 ESTABLISHED
The client, 14 minutes later, times out with
distcc (dcc_pump_sendfile) ERROR: sendfile failed: Connection timed out
Perhaps this timeout should be set lower?
The five times I've seen this happen just now, it was always with the third host in my host list.
The OS on that server has been up for 123 days, and distccd has been up for
There seemed to be quite a few stuck connections on this one server:
# netstat -t | grep ESTAB
tcp 57133 0 foo:distcc client1:34248 ESTABLISHED
tcp 57101 0 foo:distcc client1:34244 ESTABLISHED
tcp 57147 0 foo:distcc client1:33302 ESTABLISHED
tcp 57143 0 foo:distcc client1:33301 ESTABLISHED
tcp 0 0 foo:distcc client2:47195 ESTABLISHED
tcp 0 0 foo:distcc client2:47187 ESTABLISHED
tcp 0 59368 foo:distcc client2:47184 ESTABLISHED
tcp 0 59368 foo:distcc client2:47124 ESTABLISHED
I cleared them out with
# service distcc restart
but that didn't affect the problem.
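For anyone who wants to watch for this on their own servers, here is a minimal sketch that picks out ESTABLISHED connections with bytes sitting in the receive or send queue, as in the listing above. It assumes the six-column `netstat -t` layout shown in this post (proto, Recv-Q, Send-Q, local, remote, state). The function name `find_stuck` and the `min_bytes` threshold are made up for illustration. A nonzero queue at one instant is normal on a busy link, so this only flags candidates to re-check a minute later.

```python
# Sample input copied from the netstat listing in this post.
SAMPLE = """\
tcp 57133 0 foo:distcc client1:34248 ESTABLISHED
tcp 57101 0 foo:distcc client1:34244 ESTABLISHED
tcp 57147 0 foo:distcc client1:33302 ESTABLISHED
tcp 57143 0 foo:distcc client1:33301 ESTABLISHED
tcp 0 0 foo:distcc client2:47195 ESTABLISHED
tcp 0 0 foo:distcc client2:47187 ESTABLISHED
tcp 0 59368 foo:distcc client2:47184 ESTABLISHED
tcp 0 59368 foo:distcc client2:47124 ESTABLISHED
"""


def find_stuck(netstat_output, min_bytes=1):
    """Return (recv_q, send_q, local, remote) for ESTABLISHED TCP
    connections whose receive or send queue holds at least min_bytes.

    Hypothetical helper; assumes the column order
    proto, Recv-Q, Send-Q, local address, remote address, state.
    """
    stuck = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and anything that is not an established TCP line.
        if len(fields) != 6 or fields[0] != "tcp" or fields[5] != "ESTABLISHED":
            continue
        try:
            recv_q, send_q = int(fields[1]), int(fields[2])
        except ValueError:
            continue
        if recv_q >= min_bytes or send_q >= min_bytes:
            stuck.append((recv_q, send_q, fields[3], fields[4]))
    return stuck


if __name__ == "__main__":
    for recv_q, send_q, local, remote in find_stuck(SAMPLE):
        print(f"{local} <-> {remote}: recv_q={recv_q} send_q={send_q}")
```

In practice the input would come from running `netstat -t` on the server and passing its text to `find_stuck`; a connection that shows the same queued byte count on two runs a few minutes apart is the interesting case.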
Rebooting that node solved the problem... which then moved
to the *next* node in my list that had been up for 123 days!
Rebooting them all seems to have made the problem go away for now.
I hate to admit it, but the server is running kernel 2.2.19-smp.
Maybe it's a kernel bug...
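One rough sanity check on the kernel-bug hunch is to see whether 123 days sits near a round number of timer ticks. The sketch below is only arithmetic: it assumes a tick rate of 100 per second (HZ=100, the usual value on x86 kernels of that era, but an assumption here since the post does not state the architecture), and the function names are made up for illustration. It does not identify any particular bug; it just shows how close the uptime is to power-of-two tick counts.

```python
SECONDS_PER_DAY = 86400
HZ = 100  # assumed timer tick rate; not stated in the post


def days_until_tick_count(bits, hz=HZ):
    """Days of uptime needed for a tick counter to reach 2**bits."""
    return (2 ** bits) / hz / SECONDS_PER_DAY


def uptime_ticks(days, hz=HZ):
    """Number of timer ticks elapsed after the given days of uptime."""
    return days * SECONDS_PER_DAY * hz


if __name__ == "__main__":
    print("ticks after 123 days:", uptime_ticks(123))
    for bits in (30, 31, 32):
        print(f"2**{bits} ticks reached after "
              f"{days_until_tick_count(bits):.2f} days")
```

Under that assumption, 123 days is a little over a day short of 2**30 ticks, which is suggestive but proves nothing; with a different tick rate the numbers shift accordingly.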