Is -R --link-dest really hard to use, or is it me?

Matt McCutchen matt at mattmccutchen.net
Thu Jan 29 05:23:31 GMT 2009


On Mon, 2009-01-26 at 03:48 -0500, foner-rsync at media.mit.edu wrote: 
> but eventually I'm
> going to want to migrate this ext3 to ext4, and the problem will
> recur at that point.

Incidentally, are you sure about that?  I thought one could just mount
an ext3 filesystem as type ext4 and it would be automatically converted
in place.

> > However, the more recently added --copy-dest and --link-dest:
> >
> > [ . . . ]
> >
> > have the IMHO more useful interpretation that the basis dir is to be
> > used as an optimization (of network traffic and/or destination disk
> > usage), without affecting either the itemization or the final contents
> > of the destination.  I entered an enhancement request for this to be
> > supported properly:
> >
> > https://bugzilla.samba.org/show_bug.cgi?id=5645
> 
> I see where you're going with that; I assume that such an enhancement
> would, as fallout, cause itemization of created hardlinks when using
> a --dest arg.

Yes. 

> ...though on the other hand, would this dramatically clutter up the
> output of a "normal" --link-dest where, typically, one is looking to
> see which -new- files got transferred as opposed to seeing the
> creation of a zillion files that were in the basis dirs?  (Since
> you seem to advocate two different options, I guess that would allow
> users to decide either way.)

Yes.  Users could choose itemization with respect to the destination or
with respect to the basis dir, including "deletions".

> > Right.  To recap the problem: In order to transfer both b/2/ and c/2/ to
> > the proper places under dst/ in a single run, you needed to include the
> > "b/2/" and "c/2/" path information in the file list by using -R.  But
> > consequently, rsync is going to look for b/2/foo and c/2/foo under
> > whatever --link-dest dir you specify, and there's no directory on the
> > destination side that contains files at those paths (yet).
> 
> So you're saying that there appears to be no way to tell rsync what I
> want to do in this case---I haven't missed something, and it's either
> a limitation or a design goal that it works this way.  Correct?
> [Err, except that perhaps you have a solution below; it's just that
> -R is pretty much useless with any of the --*-dests.]

It is by design that rsync looks for an alternate basis by following the
source file's full file-list path under each --*-dest dir.  What makes
it hard to specify an appropriate --*-dest dir is not -R per se but the
presence of path components in the file list that are not matched above
your existing files on the destination side.  This most commonly happens
with -R, but it can still happen without -R (but only with a single
unmatched path component):

rsync -a --link-dest=/backups/host1 /sources/host2 /sources/host3 /backups/

Conversely, some uses of -R do not present a problem:

cd /sources/host2 && rsync --link-dest=/backups/host1 -R usr/bin/rsnapshot etc/rsnapshot.conf /backups/host2
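
To make the lookup rule concrete, here is a small shell sketch that only
joins a --link-dest dir with a file-list path, which is where rsync will
look for the alternate basis file.  basis_path is a hypothetical helper
(not part of rsync) and etc/passwd is just an illustrative file name.

```shell
#!/bin/sh
# Sketch: where rsync looks for an alternate basis file, per the rule
# above: the --link-dest dir joined with the file-list path.
# basis_path is a hypothetical helper, not part of rsync; etc/passwd is
# an illustrative file name.
basis_path() {
    # $1 = --link-dest dir, $2 = path as it appears in the file list
    printf '%s/%s\n' "$1" "$2"
}

# First command: the file list carries the leading "host2/" component,
# so the lookup lands under /backups/host1/host2/, which does not exist.
basis_path /backups/host1 host2/etc/passwd

# Second command: -R yields paths relative to /sources/host2, so the
# lookup lands directly inside the host1 backup.
basis_path /backups/host1 usr/bin/rsnapshot
```

The first path has an unmatched "host2" component; the second does not,
which is why that use of -R is harmless.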

I hope the problem is perfectly clear now.  Whenever it does occur, you
can prepare a --link-dest directory on the destination side with
symlinks arranged so that rsync will find your files where it expects
them.  In your original example, you could make the following directory
structure:

/home/blah/H/
    linkdest/
        b/
            2 -> /home/blah/H/dst/a/1
        c/
            2 -> /home/blah/H/dst/a/1

and then pass --link-dest=/home/blah/H/linkdest .
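
A minimal shell sketch of building that tree follows.  As an assumption
for safe experimenting, the root H defaults to a scratch directory from
mktemp; set H=/home/blah/H to match the layout above.  The mkdir of
dst/a/1 is only there so the symlinks have something to point at.

```shell
#!/bin/sh
# Sketch: build the symlink tree described above.
# H is an assumption: it defaults to a throwaway directory; set
# H=/home/blah/H to reproduce the paths from the example.
H="${H:-$(mktemp -d)}"

# Stand-in for the data that already exists on the destination side.
mkdir -p "$H/dst/a/1"

# One directory per unmatched leading path component, each holding a
# symlink named after the next component and pointing at existing data.
for top in b c; do
    mkdir -p "$H/linkdest/$top"
    ln -s "$H/dst/a/1" "$H/linkdest/$top/2"
done

# Show the resulting links.
ls -l "$H/linkdest/b" "$H/linkdest/c"
```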

This will work, but the second solution may be better.


On that topic, I'm not sure whether I'm missing your point or you're
missing mine, so I'm afraid responding point by point may lead to more
confusion; instead, let me restate my proposal more clearly in context.

Consider the simple scenario of a sequence of backups taken of a single
host over time, let's say by rsnapshot (just because I'm more familiar
with it; dirvish probably works the same way).  When rsnapshot makes
each backup, it passes a --link-dest option for the previous backup so
that unchanged files are linked from the previous backup.  This means
the hard links in the destination have a very special form: to one path
in one backup from /the same/ path in the previous backup.  Thus, each
time rsnapshot-copy migrates a backup, it can catch all those hard links
by passing --link-dest to the previous backup.

Completely separate is the issue of a hard-linked file on the source
host, such as /usr/bin/c++ a.k.a. /usr/bin/g++ .  This hard link will be
preserved within each backup if and only if rsnapshot uses -H.
Likewise, when rsnapshot-copy migrates each backup, the hard link will
be preserved if and only if rsnapshot-copy uses -H.
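
If you are unsure whether a source host has such files at all, something
like the following sketch will list them.  The scratch tree and its file
names are illustrative only; the one real ingredient is find's
"-links +1" test, which matches files with more than one hard link.

```shell
#!/bin/sh
# Sketch: list files in a tree that have more than one hard link, which
# are the ones -H would be needed to preserve.
# The scratch directory and file names below are illustrative only.
tree="$(mktemp -d)"
mkdir -p "$tree/usr/bin"
echo 'compiler' > "$tree/usr/bin/g++"
ln "$tree/usr/bin/g++" "$tree/usr/bin/c++"   # second name, same inode
echo 'unrelated' > "$tree/usr/bin/rsync"     # only one link

# -links +1 matches files whose link count is greater than one.
multi="$(find "$tree" -type f -links +1 | sort)"
printf '%s\n' "$multi"
```

Note that this is only meaningful on the source host.  Inside a backup
store built with --link-dest, nearly every file has a link count above
one because of the links to the same path in other backups.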

In every case, to correctly migrate a sequence of backups produced by
rsnapshot with certain options, rsnapshot-copy should use the same
options.  A convenient way to think about this is that rsnapshot-copy is
doing the exact same thing as rsnapshot except that its source each time
is a backup from the original sequence instead of the original host.
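
As a rough sketch of that idea (not rsnapshot-copy's actual code), the
loop below walks a dated sequence oldest first and prints one rsync
command per backup, each with --link-dest pointing at the copy of the
previous backup.  OLD, NEW and the three dates are hypothetical, and the
commands are only echoed so you can inspect them first.

```shell
#!/bin/sh
# Sketch: replay a dated sequence of backups, oldest first, passing
# --link-dest to the already-migrated copy of the previous backup.
# OLD, NEW and the date names are hypothetical; rsync is only echoed.
# Add -H to the options if the original backups were made with it.
OLD=/old/backups
NEW=/new/backups

plan() {
    prev=
    for date in 20080101 20080102 20080103; do
        opts=
        # The first backup has no predecessor, so it gets no --link-dest.
        if [ -n "$prev" ]; then
            opts=" --link-dest=$NEW/$prev"
        fi
        echo "rsync -a$opts $OLD/$date/ $NEW/$date/"
        prev=$date
    done
}

plan
```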

In the simple scenario, there was a one-dimensional sequence of backups
indexed by date.  I'm viewing your scenario as the same, but with a
two-dimensional array of backups indexed by host and date.  If you have
hosts A through F and dates 20080101 through 20080110, then I'm
proposing 60 rsync runs, one for each of the 60
host{A..F}/200801{01..10} dirs.

Just as in the one-dimensional case, you need -H if and only if you care
about intra-backup hard links, like /usr/bin/c++ and /usr/bin/g++.  But
now inter-backup hard links can go along either the host dimension or
the date dimension, so you need multiple --link-dest options to catch
all of them.  When you migrate one backup, to catch the links made by
dirvish, pass --link-dest to the same host on the previous date.  To
catch those made by your runs of faster-dupemerge on 6 hosts x 2 dates
at a time, use --link-dest to each backup on the previous date and to
each already-copied backup on the same date.

That's a lot of --link-dest options, but it should catch all of the hard
links between the same path in different backups.  Just as in the
one-dimensional case, -H will catch all hard links between different
paths in the same backup, like /usr/bin/c++ and /usr/bin/g++ , if you
care about that.  Let's call this solution 1.
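
Here is a sketch of what solution 1 could look like as a script.  It
assumes a host/date directory layout under hypothetical OLD and NEW
roots and hosts named hostA through hostF, and it only prints the rsync
commands.  For each backup it adds --link-dest for every backup on the
previous date and for every backup on the same date that has already
been copied.

```shell
#!/bin/sh
# Sketch of solution 1: one rsync per host/date, with --link-dest to
# every backup on the previous date and to every backup on the same
# date that has already been copied.
# OLD, NEW, the host names and the directory layout are hypothetical;
# the rsync commands are printed, not run.  Add -H if you care about
# hard links within a backup.
OLD=/old/backups
NEW=/new/backups
HOSTS="hostA hostB hostC hostD hostE hostF"

plan() {
    prev=
    day=1
    while [ "$day" -le 10 ]; do
        date=$(printf '200801%02d' "$day")
        copied=
        for host in $HOSTS; do
            opts=
            if [ -n "$prev" ]; then
                # Every backup on the previous date.
                for h in $HOSTS; do
                    opts="$opts --link-dest=$NEW/$h/$prev"
                done
            fi
            # Every backup on this date that has already been copied.
            for h in $copied; do
                opts="$opts --link-dest=$NEW/$h/$date"
            done
            echo "rsync -a$opts $OLD/$host/$date/ $NEW/$host/$date/"
            copied="$copied $host"
        done
        prev=$date
        day=$((day + 1))
    done
}

plan
```

This prints 60 commands, and the busiest ones carry 11 --link-dest
options (6 for the previous date plus 5 for the current one).  As far as
I know rsync caps the number of basis dirs per run (MAX_BASIS_DIRS, 20
in rsync.h), so check that limit before scaling this to more hosts.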

Since (IIUC) faster-dupemerge finds all hard links in the directories
you pass without regard to whether the path is the same, it may have
made some hard links that this approach won't catch, such as the
movement of a file from one path on one host to a different path on a
different host.  Running "rsync -H" simultaneously on all backups for
the same date via some symlink trickery is actually closer to what
faster-dupemerge did and will catch more of these "edge case" hard
links, with memory usage comparable to that of the faster-dupemerge
runs; call this solution 2.

If you can afford the memory for solution 2, great.  Otherwise, I
suggest solution 1.  If you use -H, rsync will just need enough memory
for one backup.  If not, it will be completely incremental, regardless
of the number of --link-dest options.  (That's not really true: rsync
always accumulates a list of the directories in the source, but that's
an artifact of the implementation rather than an essential feature of
how you're handling hard links.)

Whew.  I hope that explanation made sense.  Please tell me if something
is still unclear or if there's an aspect of your scenario that I'm
missing.  Now, there are a few points I'll answer individually:

> Can I rely on rsync -not- doing a complete directory scan of the
> --link-dest's?  E.g., if hostC/20080102 never mentions dir a/b/c,
> rsync won't bother investigating a/b/c on any of the link-dest's?

Yes.

> [Unfortunately, the pivoting strategy I was thinking of, and which
> rsnapshot-copy implements, still wastes a lot of time redundantly
> rescanning the former target when it becomes the new --link-dest

Not really "rescanning", but reaching in to access individual paths
corresponding to source files.

> I thought about symlinks but didn't want rsync to copy those as well,
> though if I'd been smarter about it I'd have realized that I could
> trivially delete them from the destination when everything finished
> (and I may still try it, or something like it, if the solution you
> advanced above doesn't work).

Umm, no, rsync would copy the symlinks and /not/ their targets, which is
a bigger problem.  You want rsync to follow the symlinks.
--copy-dirlinks would follow all symlinks to directories, including any
within your backups, which is bad; I assume you want to preserve
symlinks within backups as part of the backup data.  Instead, there's a
nasty trick you can use to follow just the symlinks in your
custom-crafted tree: use -R and pass each symlink as a separate source
argument with a trailing slash.  See also this thread:

http://lists.samba.org/archive/rsync/2006-February/014838.html
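
A sketch of how the argument list for that trick might be assembled is
below.  The scratch tree, the hostA/hostB names and the destination
/new/backups/20080102/ are all hypothetical, and the final rsync command
is only printed.  The idea is that each crafted symlink becomes its own
source argument with a trailing slash, while symlinks deeper inside the
backups are never named on the command line and so are copied as
symlinks.

```shell
#!/bin/sh
# Sketch: pass each symlink in a hand-built tree as its own source
# argument with a trailing slash, together with -R.
# The scratch layout and /new/backups/20080102/ are hypothetical; the
# rsync command is printed, not run.
farm="$(mktemp -d)"
mkdir -p "$farm/real/hostA" "$farm/real/hostB" "$farm/links"
ln -s "$farm/real/hostA" "$farm/links/hostA"
ln -s "$farm/real/hostB" "$farm/links/hostB"

cd "$farm/links" || exit 1
args=
for entry in *; do
    # Only the crafted symlinks get the trailing-slash treatment.
    if [ -L "$entry" ]; then
        args="$args $entry/"
    fi
done
args="${args# }"    # drop the leading space

echo "rsync -aH -R $args /new/backups/20080102/"
```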

> [...]  if the source had a/1, b/2, and c/3 all hardlinked
> together, rsync appeared to read -all 3- files to compute their
> checksums [...]

If you want to pursue this, please start a separate thread.

-- 
Matt


