update on hung rsyncs
Eric Whiting
ewhiting at amis.com
Thu May 30 15:29:02 EST 2002
Some new data on my rsync hangs:
I run about 1500 rsync sessions over ssh daily. In the last 8 days that
adds up to about 12k rsync sessions. Of those 12k sessions, 10 are right
now sitting in a hung state. The rsync process on the destination has
exited, but both rsync processes on the source are still
running/waiting/hung. I use a timeout of 3600 seconds, but it doesn't
seem to fire for this failure mode.
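As a stopgap for hangs that the rsync timeout does not catch, an outer wall-clock deadline could be enforced by whatever launches the sessions. The following is only a minimal Python sketch under that assumption. The name run_with_deadline and the deadline value in the usage note are hypothetical and are not part of rsync or of the setup described here.

```python
import os
import signal
import subprocess


def run_with_deadline(cmd, deadline_seconds):
    """Run cmd and kill its whole process group if it outlives the deadline.

    Returns the exit status, or None if the deadline expired first.
    """
    # start_new_session=True makes the child a process group leader, so any
    # helper it spawns (for example the ssh transport) shares its group id.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=deadline_seconds)
    except subprocess.TimeoutExpired:
        try:
            # The group id equals the child's pid because it leads the group.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group vanished between the timeout and the kill
        proc.wait()  # reap the child so no zombie is left behind
        return None
```

A call might look like run_with_deadline(["rsync", "-a", "--timeout=3600", "-e", "ssh", "src/", "host:dest/"], 4 * 3600), where the paths, host, and the four-hour deadline are placeholders. Killing the whole group rather than one pid matters here because two rsync processes plus ssh are left behind on the source.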
It is interesting to note that the 10 rsyncs that are hung are all
solaris 8 destinations, yet about 1/2 of my solaris destinations work
just fine.
Source is linux 2.4.18 rsync 2.5.5 + generator/timeout patch
(http://lists.samba.org/pipermail/rsync/2002-April/006976.html)
All Suns are running Solaris 8 with OpenSSH_2.9p2.
Suns that seem to fail run rsync 2.5.5 (no generator patch) and write to
an NFS destination.
Suns that work run rsync 2.5.3 and write to a local file destination.
Is the NFS difference a clue? It seems like some delay/glitch/issue with
NFS on the destination might be causing occasional/random troubles for
my rsync processes. It seems this NFS factor is something that people
are bringing up more and more lately. Ideas? I'll try 2.5.5 with the
generator patch on the destinations.
SendQ and RecvQ are 0 on the source sockets.
strace shows the parent rsync process on source is stuck in this endless
loop:
gettimeofday({1022796482, 605543}, NULL) = 0
wait4(8783, 0xbffffc48, WNOHANG, NULL) = 0
gettimeofday({1022796482, 605602}, NULL) = 0
gettimeofday({1022796482, 605626}, NULL) = 0
select(0, NULL, NULL, NULL, {0, 20000}) = 0 (Timeout)
gettimeofday({1022796482, 625224}, NULL) = 0
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
gettimeofday({1022796482, 635262}, NULL) = 0
wait4(8783, 0xbffffc48, WNOHANG, NULL) = 0
gettimeofday({1022796482, 635316}, NULL) = 0
gettimeofday({1022796482, 635350}, NULL) = 0
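The trace above has the shape of a non-blocking wait on the child followed by a short sleep, repeated. A minimal Python sketch of that pattern follows, for anyone who wants to reproduce it. The function name wait_for_child_polling is made up for illustration, and this is not rsync's actual source. Note that a loop of this shape has no exit other than the child terminating, which might be why the timeout never fires here.

```python
import os
import select


def wait_for_child_polling(pid, poll_interval=0.02):
    """Poll for a child's exit without blocking, sleeping between checks.

    Returns (status, polls), where polls counts the checks that found the
    child still running. Never returns if the child never exits.
    """
    polls = 0
    while True:
        # Counterpart of wait4(pid, ..., WNOHANG): yields (0, 0) while the
        # child is still running, or (pid, status) once it has exited.
        done_pid, status = os.waitpid(pid, os.WNOHANG)
        if done_pid == pid:
            return status, polls
        polls += 1
        # Counterpart of select(0, NULL, NULL, NULL, {0, 20000}): with no
        # descriptors to watch this is just a 20 ms sleep (Unix only).
        select.select([], [], [], poll_interval)
```

If the child is blocked indefinitely, as pid 8783 appears to be, the parent keeps polling every few tens of milliseconds and never gives up.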
strace on the child rsync process triggers both to exit:
windriver:/home/tisadmin/bin # strace -p 8783
select(7, [3 4], [], NULL, NULL) = 1 (in [4])
read(4, "", 16384) = 0
close(4) = 0
select(7, [3], [3], NULL, NULL) = 1 (out [3])
write(3, "\200\300G\223K\355\2322#\220~Y5\0\210x\206~1e\240M\250"..., 32) = 32
select(7, [3], [], NULL, NULL) = 1 (in [3])
read(3, "\226f\301\271\220\200\t6\\\177\"%\3477\336^\255\n\255I"..., 8192) = 96
brk(0x809d000) = 0x809d000
close(6) = 0
select(7, [3], [3], NULL, NULL) = 1 (out [3])
write(3, "\204\355\2058\330\360\242<\313\233QMc\311\307?\322\351"..., 32) = 32
ioctl(0, TCGETS, 0xbffffa18) = -1 EINVAL (Invalid argument)
fcntl64(0, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl64(0, F_SETFL, O_RDWR) = 0
ioctl(1, TCGETS, 0xbffffa18) = -1 EINVAL (Invalid argument)
fcntl64(1, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl64(1, F_SETFL, O_RDWR) = 0
ioctl(2, TCGETS, 0xbffffa18) = -1 ENOTTY (Inappropriate ioctl for device)
fcntl64(2, F_GETFL) = 0x8801 (flags O_WRONLY|O_NONBLOCK|O_LARGEFILE)
fcntl64(2, F_SETFL, O_WRONLY|O_LARGEFILE) = 0
gettimeofday({1022796581, 661010}, NULL) = 0
shutdown(3, 2 /* send and receive */) = 0
close(3) = 0
_exit(0) = ?