Reliability and robustness problems

John rsync at computerdatasafe.com.au
Mon Jun 7 23:37:32 GMT 2004


I am trying to use rsync to back up from a site we will call "office" to 
another we will call "home."

Both sites have DSL accounts provided by Arachnet.

At present the files being backed up don't all need to be backed up, but 
OTOH we wish to back up lots more files that aren't being backed up now.

First, we create a local backup on our office machine which happens to 
be called "mail." We have this directory structure:
drwxr-xr-x   20 root         4096 May 17 23:06 20040517-1500-mon
drwxr-xr-x   20 root         4096 May 18 23:06 20040518-1500-tue
drwxr-xr-x   20 root         4096 May 19 23:09 20040519-1500-wed
drwxr-xr-x   20 root         4096 May 20 23:09 20040520-1500-thu
drwxr-xr-x   20 root         4096 May 21 23:09 20040521-1500-fri
drwxr-xr-x   20 root         4096 May 22 23:10 20040522-1500-sat
drwxr-xr-x   20 root         4096 May 23 23:09 20040523-1500-sun
drwxr-xr-x   20 root         4096 May 24 23:10 20040524-1500-mon
drwxr-xr-x   20 root         4096 May 25 23:10 20040525-1500-tue
drwxr-xr-x   20 root         4096 May 26 23:10 20040526-1500-wed
drwxr-xr-x   20 root         4096 May 27 23:10 20040527-1500-thu
drwxr-xr-x   20 root         4096 May 28 23:11 20040528-1500-fri
drwxr-xr-x   20 root         4096 May 29 23:11 20040529-1500-sat
drwxr-xr-x   20 root         4096 May 30 23:10 20040530-1500-sun
drwxr-xr-x   20 root         4096 May 31 23:11 20040531-1500-mon
drwxr-xr-x    3 root         4096 Jun  1 14:10 20040601-0603-tue
drwxr-xr-x    3 root         4096 Jun  1 23:07 20040601-1500-tue
drwxr-xr-x    3 root         4096 Jun  2 07:42 20040601-2323-tue
drwxr-xr-x    3 root         4096 Jun  2 23:07 20040602-1500-wed
drwxr-xr-x    3 root         4096 Jun  3 14:04 20040603-0555-thu
drwxr-xr-x    3 root         4096 Jun  3 23:06 20040603-1500-thu
drwxr-xr-x    3 root         4096 Jun  4 23:07 20040604-1500-fri
drwxr-xr-x    3 root         4096 Jun  5 23:08 20040605-1500-sat
drwxr-xr-x    3 root         4096 Jun  7 14:19 20040607-0610-mon
drwxr-xr-x    3 root         4096 Jun  8 05:01 20040607-2054-mon
drwxr-xr-x    3 root         4096 Jun  8 05:35 20040607-2128-mon
drwxr-xr-x   20 root         4096 Jun  1 14:06 latest

The timestamps in the directory names are UTC times.

We maintain the contents of latest thus:
+ rsync --recursive --links --hard-links --perms --owner --group 
--devices --times --sparse --one-file-system --rsh=/usr/bin/ssh --delete 
--delete-excluded --delete-after --max-delete=80 --relative --stats 
--numeric-ids --exclude-from=/etc/local/backup/system-backup.excludes 
/boot/ / /home/ /var/ /var/local/backups/office//latest

and create the backup-du-jour:
+ cp -rl /var/local/backups/office//latest 
/var/local/backups/office//20040607-2128-mon

That part works well, and the rsync part generally takes about seven 
minutes.
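
The snapshot step could be wrapped roughly as follows. This is only a 
sketch: the function name snapshot_latest and its single argument are made 
up for illustration, and it assumes GNU cp (for -l) and a date that 
accepts -u.

```shell
# snapshot_latest BACKUP_ROOT
# Hypothetical helper: hard-link BACKUP_ROOT/latest into a new directory
# named like the listing above (UTC date, time, weekday) and print that name.
snapshot_latest() {
    snap_root=$1
    # LC_ALL=C keeps the weekday abbreviation in English, e.g. "Mon"
    snap_stamp=$(LC_ALL=C date -u +%Y%m%d-%H%M-%a | tr '[:upper:]' '[:lower:]')
    # Refuse to reuse a name: cp would otherwise copy *into* the existing directory
    if [ -e "$snap_root/$snap_stamp" ]; then
        echo "snapshot $snap_stamp already exists" >&2
        return 1
    fi
    # -r recurses, -l creates hard links instead of copying file data
    cp -rl "$snap_root/latest" "$snap_root/$snap_stamp" || return 1
    printf '%s\n' "$snap_stamp"
}
```

Called as "snapshot_latest /var/local/backups/office", it would print the 
name of the directory it created.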

To copy office to home we try this:
+ rsync --recursive --links --hard-links --perms --owner --group 
--devices --times --sparse --one-file-system --rsh=/usr/bin/ssh --delete 
--delete-excluded --delete-after --max-delete=80 --relative --stats 
--numeric-ids /var/local/backups 192.168.0.1:/var/local/backups/

Prior to the run that is now in progress, we used home's external host 
name, which we'll say is "home.arach.net.au" as that's the default way 
Arachnet assign hostnames. All the problems we've had so far occurred 
with that hostname. I've since created a VPN between the two sites (for 
other reasons) using OpenVPN, and the current run goes over it.

I'm hoping that OpenVPN will provide a more robust recovery from network 
problems.

Problems we've had include
1. ADSL connexion at one end or the other dropping for a while. rsync 
doesn't notice and mostly hangs. I have seen rsync at home still 
running but with no relevant files open.

2. rsync uses an enormous amount of virtual memory, with the result that the 
Linux kernel lashes out at lots of processes, mostly innocent, until it 
lucks on rsync. This can cause rsync to terminate without a useful message.
2a. Sometimes the rsync that does this is at home.
I've alleviated this at office by allocating an unreasonable amount of 
swap: unreasonable because if it gets used, performance will be truly 
dreadful.

3. rsync does not detect when its partner has vanished. I don't 
understand why this should be so: it seems to me that, at office, it 
should be able to detect by the fact {r,s}sh has terminated or by 
timeout, and at home by timeout.

3a. I'd like to see rsync have the ability to retry in the case where it's 
initiated the transfer. It can take some time to collect together the 
information as to what needs to be done: if I retry in its wrapper script, 
then this has to be redone whereas, I surmise, rsync doing the retry 
would not need to.

4. I've already mentioned this, but as I've had no feedback I'll try again.
As you can see from the above, the source directories for the transfer 
from office to home are chock-full of hard links. As best I can tell, 
rsync is transferring each copy fresh instead of recognising the hard 
link before the transfer and getting the destination rsync to make a new 
hard link. It is so that it _can_ do this that I present the backup 
directory as a whole and not the individual day's backup. That, and I 
have hopes that today's unfinished work will be done tomorrow.
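
One rough way to check whether hard links survived the trip is to compare, 
on each side, how many file names a tree has with how many distinct inodes 
sit behind them. The helper names below are hypothetical, and the sketch 
assumes the whole tree lives on one filesystem (inode numbers are only 
unique within one) and that no file name contains a newline.

```shell
# count_names DIR: number of regular-file names under DIR
count_names() {
    find "$1" -type f | wc -l | tr -d ' '
}

# count_inodes DIR: number of distinct inodes behind those names.
# Hard-linked copies share an inode, so they are counted once here.
count_inodes() {
    find "$1" -type f -exec ls -i {} + | awk '{ print $1 }' | sort -u | wc -l | tr -d ' '
}
```

Run against /var/local/backups at both ends: if the inode count at home is 
about the same as at office, the links were preserved; if it is close to 
the name count, each copy was probably stored separately.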

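For problems 1, 3 and 3a, a wrapper along these lines is one possible 
stopgap. It is a sketch, not a fix: the retry function is hypothetical, and 
every retry starts rsync from scratch, so the file list is rebuilt each 
time, which is exactly the cost 3a complains about. rsync's --timeout 
option makes it exit after a period with no I/O, which gives the wrapper 
something to react to.

```shell
# retry MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: run COMMAND until it succeeds or MAX_TRIES is reached.
# Returns 0 on success, otherwise the exit status of the last attempt.
retry() {
    retry_max=$1
    retry_delay=$2
    shift 2
    retry_try=1
    while :; do
        "$@"
        retry_status=$?
        [ "$retry_status" -eq 0 ] && return 0
        if [ "$retry_try" -ge "$retry_max" ]; then
            echo "giving up after $retry_try attempts (last status $retry_status)" >&2
            return "$retry_status"
        fi
        echo "attempt $retry_try failed (status $retry_status); retrying in ${retry_delay}s" >&2
        retry_try=$((retry_try + 1))
        sleep "$retry_delay"
    done
}

# Possible use (the 600 second timeout and 5 tries are arbitrary choices):
# retry 5 60 rsync --timeout=600 --partial ... /var/local/backups 192.168.0.1:/var/local/backups/
```
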

This approach seems so far to be problematic, and I am wondering whether 
I should instead be doing one of these:
A. Create a filesystem image with
   dd if=/dev/zero of=backup .... # of suitable size
   mke2fs backup
then mount -o loop, and put my backups inside that, and then use rsync 
to sync that offsite.
Presumably this will use much less virtual memory. The question is how 
quickly it would sync the two images. I imagine my problem with hard 
links will vanish.

B.  Create a filesystem image as above
 Use jigdo to keep the images in sync.

C. Use md5sum and some home-grown scripts to decide what to transfer.

I'm not keen on C. as basically it's implementing what I think rsync 
should be doing.
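
For what it's worth, the core of C. might look something like the sketch 
below. The function names are invented for illustration, it assumes GNU 
md5sum and file names without newlines or backslashes, and it only finds 
new or changed files: deletions, permissions, ownership and hard links 
would all need separate handling, which is why it amounts to redoing 
rsync's job.

```shell
# make_manifest DIR: print "md5sum  ./relative/path" lines in C-locale order
make_manifest() {
    ( cd "$1" && find . -type f -exec md5sum {} + | LC_ALL=C sort )
}

# changed_files OLD_MANIFEST NEW_MANIFEST: print paths that are new in
# NEW_MANIFEST or whose checksum differs from OLD_MANIFEST.
changed_files() {
    # comm -13 keeps only lines unique to the second (new) manifest;
    # sed then strips the leading checksum and its two-space separator.
    LC_ALL=C comm -13 "$1" "$2" | sed 's/^[0-9a-f]*  //'
}
```

The idea would be to keep yesterday's manifest at office, build today's, 
and hand the output of changed_files to whatever does the transfer.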

btw the latest directory contains 1.5 Gbytes of data. The system is 
still calculating that today's backup contains 1.5 Gbytes, so it seems 
the startup costs are considerable.




