rsync 3.0.9 hangs when syncing from NFSv3 share - possible to retry after timeout?

Andrew Martin amartin at xes-inc.com
Fri Sep 6 21:55:58 CEST 2013


Hello,

I'm using rsync 3.0.9 to back up several NFS shares from a fileserver, mounted over NFSv3, to a local RAID on a backup server. Both servers are running Ubuntu 12.04 server LTS. The fileserver's filesystem is ext4. The NFS shares are mounted on the backup server as follows:
fileserver:/mnt/storage/share1 /mnt/share1 type nfs (ro,tcp,bg,soft,intr,addr=192.168.1.1)
fileserver:/mnt/storage/share2 /mnt/share2 type nfs (ro,tcp,bg,soft,intr,addr=192.168.1.1)
fileserver:/mnt/storage/share3 /mnt/share3 type nfs (ro,tcp,bg,soft,intr,addr=192.168.1.1)

These shares contain a large number of files, including SVN checkouts, extracted kernel trees, etc. I've run into a problem where rsync appears to hang or block indefinitely when backing up one particular share, share3; occasionally it happens with one of the other shares instead. A cron job starts backing up share3 nightly at 20:15. When the problem does not occur, the backup typically finishes around 20:45. When it does occur, rsync blocks indefinitely. I run rsync under the "timeout" command so that it is killed if it has not finished by about 9:00 the next day (764 minutes after 20:15 is 08:59):
timeout -k 30s 764m rsync -av --modify-window=2 --link-dest=/mnt/backups/share3/2013-09-04 --exclude .svn/ /mnt/share3/ /mnt/backups/share3/2013-09-05
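The exit code is 137, which I believe is 128 plus 9 (SIGKILL): rsync did not exit on the initial TERM signal from timeout, so the KILL requested by "-k 30s" was sent 30 seconds later.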
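
For reference, a minimal sketch of how the exit status of a "timeout" run can be read. It assumes GNU timeout's documented convention (124 when the command timed out) and the usual shell convention of 128 plus the signal number for a process ended by a signal. The function name describe_timeout_status is hypothetical, not part of any tool.

```shell
#!/bin/sh
# Hypothetical helper: turn an exit status from "timeout ... cmd" into a short description.
describe_timeout_status() {
    status=$1
    if [ "$status" -eq 124 ]; then
        # GNU timeout reports 124 when the time limit was reached
        echo "timed out"
    elif [ "$status" -gt 128 ]; then
        # Shell convention: 128 + N means the process was ended by signal N
        echo "killed by signal $((status - 128))"
    else
        echo "exited with status $status"
    fi
}
```

Calling describe_timeout_status 137 reports signal 9, i.e. SIGKILL, which matches what I am seeing.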

Here are timeout and its child rsync processes. As you can see, 1915 is in uninterruptible sleep (state D), but I believe that is normal:
root      1914  0.0  0.0  10148   492 ?        S    Sep05   0:00 timeout -k 30s 764m rsync -av --modify-window=2 --link-dest=/mnt/backups/share3/2013-09-04 --exclude .svn/ /mnt/share3/ /mnt/backups/share3/2013-09-05
root      1915  0.0  0.3  81240 27784 ?        D    Sep05   0:20 rsync -av --modify-window=2 --link-dest=/mnt/backups/share3/2013-09-04 --exclude .svn/ /mnt/share3/ /mnt/backups/share3/2013-09-05
root      1916  0.0  0.2 120028 19032 ?        S    Sep05   0:22 rsync -av --modify-window=2 --link-dest=/mnt/backups/share3/2013-09-04 --exclude .svn/ /mnt/share3/ /mnt/backups/share3/2013-09-05
root      1917  0.0  0.3 138272 26612 ?        S    Sep05   0:07 rsync -av --modify-window=2 --link-dest=/mnt/backups/share3/2013-09-04 --exclude .svn/ /mnt/share3/ /mnt/backups/share3/2013-09-05

Running strace on the processes shows that the processes are not actively doing anything:
# strace -p 1914
Process 1914 attached - interrupt to quit
wait4(1915,

# strace -p 1915
Process 1915 attached - interrupt to quit

# strace -p 1916
Process 1916 attached - interrupt to quit
select(4, [3], [], NULL, {10, 731653}^C <unfinished ...>
Process 1916 detached

# strace -p 1917
Process 1917 attached - interrupt to quit
select(1, [0], [], NULL, {27, 691627}^C <unfinished ...>
Process 1917 detached
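
Since strace prints nothing for 1915, that process is presumably blocked inside a system call in the kernel. A possible way to see where is sketched below. It assumes the procps "ps" shipped with Linux (for the wchan column) and a kernel that exposes /proc/PID/stack, which normally needs root. The helper name filter_dstate is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: keep the header row plus rows whose STAT column
# (2nd field) starts with D, i.e. processes in uninterruptible sleep.
filter_dstate() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Example use (left commented out so that sourcing this file runs nothing):
#   ps -eo pid,stat,wchan:32,args | filter_dstate
#   cat /proc/1915/stack    # kernel stack of the blocked process, if available
```

If the wchan column or the stack shows NFS or RPC functions, that would point at the mount rather than at rsync itself.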

Based on the output in my rsync log file, I can see the last directory that it copied a file from. I ran "time find /path/to/that/dir -type f" on that directory and some other directories on share3 and all of them returned quickly; I was not able to make "find" block. The rsync cron jobs for share1 and share2 typically complete successfully, and those shares are mounted over NFS from the same fileserver with the same mount options.

I do not see anything obviously related in dmesg on either the backup server or the fileserver. Does anyone have an idea of what is causing rsync to hang, or whether there is a way to have it retry or skip a file if there is a problem rather than blocking forever? The --timeout option seems like it will abort the entire sync, but I would like to just skip over the bad section and continue with the rest of the backup. Is this possible?
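
To make the "retry" idea concrete, here is a rough sketch of the kind of wrapper I have in mind. It reruns the whole command rather than skipping a bad section, and it is only an illustration: the function name retry_on_timeout and the choice of three attempts are hypothetical. Status 124 is GNU timeout's "timed out" status, 137 is 128 plus SIGKILL, and 30 is the status rsync documents for "timeout in data send/receive", which would only appear if --timeout were added.

```shell
#!/bin/sh
# Hypothetical helper: rerun a command while it ends with a timeout-like status.
# Usage: retry_on_timeout MAX_ATTEMPTS command [args...]
retry_on_timeout() {
    max_attempts=$1
    shift
    attempt=1
    while :; do
        if "$@"; then
            return 0
        else
            status=$?
        fi
        case $status in
            124|137|30) ;;           # timed out or killed: worth another attempt
            *) return "$status" ;;   # any other failure: give up immediately
        esac
        if [ "$attempt" -ge "$max_attempts" ]; then
            return "$status"         # out of attempts, report the last status
        fi
        attempt=$((attempt + 1))
    done
}

# Example use (commented out; the 2h limit is an assumption, not a tested value):
#   retry_on_timeout 3 timeout -k 30s 2h rsync -av --modify-window=2 \
#       --link-dest=/mnt/backups/share3/2013-09-04 --exclude .svn/ \
#       /mnt/share3/ /mnt/backups/share3/2013-09-05
```

A rerun into the same destination directory should skip files that were already copied, so a retry would mostly redo the directory scan. It does not help if the process stays stuck in uninterruptible sleep and cannot be killed.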

Thanks,

Andrew

