update on hung rsyncs

Eric Whiting ewhiting at amis.com
Thu May 30 15:29:02 EST 2002


Some new data on my rsync hangs:

I run about 1500 rsync sessions over ssh daily. In the last 8 days that 
adds up to about 12k rsync sessions. Of those 12k sessions, 10 are right 
now sitting in a hung state. The rsync process on the destination has 
exited, but both rsync processes on the source are still 
running/waiting/hung. I use a timeout of 3600 seconds, but it doesn't seem
to fire for this failure mode.
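
Since the built-in timeout does not fire here, one possible stopgap is an
external wall-clock watchdog around each session. The following is only a
sketch under assumptions: the function name run_with_deadline, the deadline
value, and the paths in the usage comment are hypothetical, and it relies on
POSIX process groups so that the parent rsync, the child rsync, and ssh are
all killed together.

```python
import os
import signal
import subprocess


def run_with_deadline(cmd, deadline_s):
    """Run cmd in its own process group; kill the whole group if it
    is still running after deadline_s seconds.

    Returns (returncode, timed_out). Hypothetical helper, POSIX only.
    """
    # start_new_session=True makes the child a session and group leader,
    # so its pid doubles as the process group id.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=deadline_s), False
    except subprocess.TimeoutExpired:
        # Kill every process in the group, not just the direct child.
        os.killpg(proc.pid, signal.SIGKILL)
        return proc.wait(), True


# Hypothetical usage (paths and host are placeholders):
# run_with_deadline(
#     ["rsync", "-a", "--timeout=3600", "-e", "ssh", "src/", "host:dest/"],
#     4 * 3600,
# )
```

The deadline should be comfortably longer than the slowest legitimate
transfer, because the watchdog cannot tell a hung session from a slow one.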

Interestingly, all 10 hung rsyncs have Solaris 8 destinations, yet about
half of my Solaris destinations work just fine.

Source is Linux 2.4.18, rsync 2.5.5 + generator/timeout patch
(http://lists.samba.org/pipermail/rsync/2002-April/006976.html)

All Suns are running Solaris 8 with OpenSSH_2.9p2.
Suns that seem to fail run rsync 2.5.5 (no generator patch) and write to an
NFS destination.
Suns that work run rsync 2.5.3 and write to a local file destination.

Is the NFS difference a clue? It seems like some delay/glitch/issue with 
NFS on the destination might be causing occasional/random troubles for 
my rsync processes. It seems this NFS factor is something that people 
are bringing up more and more lately. Ideas? I'll try 2.5.5 with the 
generator patch on the destinations.

Send-Q and Recv-Q are 0 on the source sockets.
strace shows the parent rsync process on source is stuck in this endless 
loop:

gettimeofday({1022796482, 605543}, NULL) = 0
wait4(8783, 0xbffffc48, WNOHANG, NULL)  = 0
gettimeofday({1022796482, 605602}, NULL) = 0
gettimeofday({1022796482, 605626}, NULL) = 0
select(0, NULL, NULL, NULL, {0, 20000}) = 0 (Timeout)
gettimeofday({1022796482, 625224}, NULL) = 0
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
gettimeofday({1022796482, 635262}, NULL) = 0
wait4(8783, 0xbffffc48, WNOHANG, NULL)  = 0
gettimeofday({1022796482, 635316}, NULL) = 0
gettimeofday({1022796482, 635350}, NULL) = 0
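
The trace above shows a non-blocking wait4() on pid 8783 alternating with
short select() sleeps. The sketch below reproduces that polling pattern in
Python to make it easier to reason about; it is not rsync's source code. The
name poll_child and the give_up_after_s parameter are hypothetical, and the
give-up deadline is an addition that the traced loop does not appear to
have. It assumes a POSIX system, where select() with empty sets acts as a
sleep.

```python
import os
import select
import time


def poll_child(pid, interval_s=0.02, give_up_after_s=None):
    """Poll a child with WNOHANG, sleeping via select() between polls.

    Returns the raw wait status once the child exits, or None if
    give_up_after_s elapses first. Hypothetical helper, POSIX only.
    """
    start = time.monotonic()
    while True:
        # Non-blocking reap attempt, like wait4(pid, ..., WNOHANG).
        done_pid, status = os.waitpid(pid, os.WNOHANG)
        if done_pid == pid:
            return status
        # Without a bound like this, the loop spins for as long as the
        # child stays alive, which matches the hang described above.
        if (give_up_after_s is not None
                and time.monotonic() - start > give_up_after_s):
            return None
        # Sleep without watching any descriptors, like select(0, ...).
        select.select([], [], [], interval_s)
```

A loop of this shape only ends when the child exits, so if the child is
blocked indefinitely the parent polls forever.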

Attaching strace to the child rsync process causes both processes to exit:

windriver:/home/tisadmin/bin # strace -p 8783
select(7, [3 4], [], NULL, NULL)        = 1 (in [4])
read(4, "", 16384)                      = 0
close(4)                                = 0
select(7, [3], [3], NULL, NULL)         = 1 (out [3])
write(3, "\200\300G\223K\355\2322#\220~Y5\0\210x\206~1e\240M\250"..., 32) = 32
select(7, [3], [], NULL, NULL)          = 1 (in [3])
read(3, "\226f\301\271\220\200\t6\\\177\"%\3477\336^\255\n\255I"..., 8192) = 96
brk(0x809d000)                          = 0x809d000
close(6)                                = 0
select(7, [3], [3], NULL, NULL)         = 1 (out [3])
write(3, "\204\355\2058\330\360\242<\313\233QMc\311\307?\322\351"..., 32) = 32
ioctl(0, TCGETS, 0xbffffa18)            = -1 EINVAL (Invalid argument)
fcntl64(0, F_GETFL)                     = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl64(0, F_SETFL, O_RDWR)             = 0
ioctl(1, TCGETS, 0xbffffa18)            = -1 EINVAL (Invalid argument)
fcntl64(1, F_GETFL)                     = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl64(1, F_SETFL, O_RDWR)             = 0
ioctl(2, TCGETS, 0xbffffa18)            = -1 ENOTTY (Inappropriate ioctl for device)
fcntl64(2, F_GETFL)                     = 0x8801 (flags O_WRONLY|O_NONBLOCK|O_LARGEFILE)
fcntl64(2, F_SETFL, O_WRONLY|O_LARGEFILE) = 0
gettimeofday({1022796581, 661010}, NULL) = 0
shutdown(3, 2 /* send and receive */)   = 0
close(3)                                = 0
_exit(0)                                = ?






