Reliability and robustness problems
John
rsync at computerdatasafe.com.au
Mon Jun 7 23:37:32 GMT 2004
I am trying to use rsync to backup from a site we will call "office" and
another we will call "home."
Both sites have DSL accounts provided by Arachnet.
At present the files being backed up don't all need to be backed up, but
OTOH we wish to back up lots more files that aren't being backed up now.
First, we create a local backup on our office machine which happens to
be called "mail." We have this directory structure:
drwxr-xr-x 20 root 4096 May 17 23:06 20040517-1500-mon
drwxr-xr-x 20 root 4096 May 18 23:06 20040518-1500-tue
drwxr-xr-x 20 root 4096 May 19 23:09 20040519-1500-wed
drwxr-xr-x 20 root 4096 May 20 23:09 20040520-1500-thu
drwxr-xr-x 20 root 4096 May 21 23:09 20040521-1500-fri
drwxr-xr-x 20 root 4096 May 22 23:10 20040522-1500-sat
drwxr-xr-x 20 root 4096 May 23 23:09 20040523-1500-sun
drwxr-xr-x 20 root 4096 May 24 23:10 20040524-1500-mon
drwxr-xr-x 20 root 4096 May 25 23:10 20040525-1500-tue
drwxr-xr-x 20 root 4096 May 26 23:10 20040526-1500-wed
drwxr-xr-x 20 root 4096 May 27 23:10 20040527-1500-thu
drwxr-xr-x 20 root 4096 May 28 23:11 20040528-1500-fri
drwxr-xr-x 20 root 4096 May 29 23:11 20040529-1500-sat
drwxr-xr-x 20 root 4096 May 30 23:10 20040530-1500-sun
drwxr-xr-x 20 root 4096 May 31 23:11 20040531-1500-mon
drwxr-xr-x 3 root 4096 Jun 1 14:10 20040601-0603-tue
drwxr-xr-x 3 root 4096 Jun 1 23:07 20040601-1500-tue
drwxr-xr-x 3 root 4096 Jun 2 07:42 20040601-2323-tue
drwxr-xr-x 3 root 4096 Jun 2 23:07 20040602-1500-wed
drwxr-xr-x 3 root 4096 Jun 3 14:04 20040603-0555-thu
drwxr-xr-x 3 root 4096 Jun 3 23:06 20040603-1500-thu
drwxr-xr-x 3 root 4096 Jun 4 23:07 20040604-1500-fri
drwxr-xr-x 3 root 4096 Jun 5 23:08 20040605-1500-sat
drwxr-xr-x 3 root 4096 Jun 7 14:19 20040607-0610-mon
drwxr-xr-x 3 root 4096 Jun 8 05:01 20040607-2054-mon
drwxr-xr-x 3 root 4096 Jun 8 05:35 20040607-2128-mon
drwxr-xr-x 20 root 4096 Jun 1 14:06 latest
The timestamps in the directory names are UTC times.
We maintain the contents of latest thus:
+ rsync --recursive --links --hard-links --perms --owner --group
--devices --times --sparse --one-file-system --rsh=/usr/bin/ssh --delete
--delete-excluded --delete-after --max-delete=80 --relative --stats
--numeric-ids --exclude-from=/etc/local/backup/system-backup.excludes
/boot/ / /home/ /var/ /var/local/backups/office//latest
and create the backup-du-jour:
+ cp -rl /var/local/backups/office//latest
/var/local/backups/office//20040607-2128-mon
That part works well, and the rsync part generally takes about seven
minutes.
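The reason the cp -rl step is cheap is that it makes hard links rather
than copies, so each day's tree shares inodes with latest. A toy
demonstration (paths made up for illustration):

```shell
# Toy demonstration (illustrative paths) of why cp -rl snapshots are
# cheap: the "copy" is a tree of hard links sharing the same inodes.
mkdir -p latest
echo data > latest/file
cp -rl latest 20040607-2128-mon
links=$(stat -c %h 20040607-2128-mon/file)
echo "$links"     # 2: both names point at the one inode
```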
To copy office to home we try this:
+ rsync --recursive --links --hard-links --perms --owner --group
--devices --times --sparse --one-file-system --rsh=/usr/bin/ssh --delete
--delete-excluded --delete-after --max-delete=80 --relative --stats
--numeric-ids /var/local/backups 192.168.0.1:/var/local/backups/
Prior to the run that is now in progress, we used home's external host
name. I've created a VPN between the two sites (for other reasons) using
OpenVPN; all the problems we've had so far occurred using the hostname
"home.arach.net.au", as that's the default way Arachnet assigns
hostnames.
I'm hoping that OpenVPN will provide a more robust recovery from network
problems.
Problems we've had include
1. ADSL connexion at one end or the other dropping for a while. rsync
doesn't notice and mostly hangs. I have seen rsync at home still
running but with no relevant files open.
2. rsync uses an enormous amount of virtual memory, with the result that
the Linux kernel's out-of-memory killer lashes out at lots of processes,
mostly innocent, until it happens upon rsync. This can cause rsync to
terminate without a useful message.
2a. Sometimes the rsync that does this is at home.
I've alleviated this at office by allocating an unreasonable amount of
swap: unreasonable because if it gets used, performance will be truly
dreadful.
3. rsync does not detect when its partner has vanished. I don't
understand why this should be so: it seems to me that, at office, it
should be able to detect this from the fact that {r,s}sh has terminated,
or by timeout, and at home by timeout.
3a. I'd like to see rsync have the ability to retry in the case where it
initiated the transfer. It can take some time to collect the
information about what needs to be done: if I retry in its wrapper
script, that work has to be redone whereas, I surmise, rsync doing the
retry itself would not need to.
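Two sketches of what I have in mind, under my own assumptions: rsync's
stock --timeout option should at least turn a vanished peer into a fatal
error rather than a hang (point 3.), and a bounded retry loop in the
wrapper approximates 3a., though it redoes the expensive file-list scan
on every attempt. MAX_TRIES and RETRY_SLEEP are my own knobs, not rsync
options:

```shell
#!/bin/sh
# Sketch only: bounded retry wrapper for an unreliable link.
MAX_TRIES=${MAX_TRIES:-5}
RETRY_SLEEP=${RETRY_SLEEP:-60}

retry() {
    tries=0
    until "$@"; do
        tries=$((tries + 1))
        if [ "$tries" -ge "$MAX_TRIES" ]; then
            echo "retry: giving up after $tries attempts" >&2
            return 1
        fi
        sleep "$RETRY_SLEEP"
    done
    return 0
}

# Usage would be the real transfer, with --timeout (seconds of I/O
# inactivity before rsync gives up) so a dead peer is noticed:
#   retry rsync --timeout=600 --recursive --hard-links ... \
#       /var/local/backups 192.168.0.1:/var/local/backups/
```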
4. I've already mentioned this, but as I've had no feedback I'll try again.
As you can see from the above, the source directories for the transfer
from office to home are chock-full of hard links. As best I can tell,
rsync is transferring each copy fresh instead of recognising the hard
link before the transfer and getting the destination rsync to make a new
hard link. It is so that it _can_ do this that I present the backup
directory as a whole and not the individual day's backup. That, and I
have hopes that today's unfinished work will be done tomorrow.
This approach seems so far to be problematic, and I am wondering whether
I should instead be doing one of these:
A. Create a filesystem image with
dd if=/dev/zero of=backup .... # of suitable size
mke2fs backup
then mount -o loop, and put my backups inside that, and then use rsync
to sync that offsite.
Presumably this will use much less virtual memory. The question is how
quickly it would sync the two images. I imagine my problem with hard
links will vanish.
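Roughly, the setup I have in mind for A. (sizes and paths illustrative;
the real image would be a few Gbytes, and the mke2fs/mount steps need
root):

```shell
# Sketch of option A's image creation with a toy size.
dd if=/dev/zero of=backup.img bs=1M count=16 2>/dev/null
# Then, as root (not run here):
#   mke2fs -F backup.img                  # -F: operate on a plain file
#   mkdir -p /mnt/backup && mount -o loop backup.img /mnt/backup
# ...put the backups inside /mnt/backup, umount, and rsync the image.
ls -l backup.img
```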
B. Create a filesystem image as above
Use jigdo to keep the images in sync.
C. Use md5sum and some home-grown scripts to decide what to transfer.
I'm not keen on C. as basically it's implementing what I think rsync
should be doing.
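For completeness, the shape C. would take (toy data, made-up names):
checksum both trees, then treat files whose sums disagree as the
transfer list.

```shell
# Toy sketch of option C: compare md5sum manifests from each side.
mkdir -p side_a side_b
echo one > side_a/f1; echo one > side_b/f1   # identical on both sides
echo two > side_a/f2; echo TWO > side_b/f2   # differs: needs sending
( cd side_a && md5sum f1 f2 | sort ) > a.sums
( cd side_b && md5sum f1 f2 | sort ) > b.sums
# Lines unique to a.sums are files changed relative to side_b.
changed=$(comm -23 a.sums b.sums | awk '{ print $2 }')
echo "$changed"
```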
btw the latest directory contains 1.5 Gbytes of data. The system is
still calculating that today's backup contains 1.5 Gbytes, so it seems
the startup costs are considerable.