Is -R --link-dest really hard to use, or is it me?

Matt McCutchen matt at mattmccutchen.net
Sun Jan 25 06:02:15 GMT 2009


I regret the slow response.  I was interested in your problem, but I
knew it would take me a while to respond thoughtfully, so I put the
message aside and didn't get back to it until now.  I hope this is still
useful.

On Sun, 2009-01-11 at 23:24 -0500, foner-rsync at media.mit.edu wrote:
> I've got a problem for which the combination of -R and --link-dest
> doesn't seem to be quite enough---and I may have discovered a few
> small bugs as well; test cases are below.
> 
> [And if someone has a scheme for doing this that doesn't involve rsync
> at all, but works okay, I'm all ears as well---I'm not the first with
> this problem.]
> 
> Here's my problem:  I unfortunately need to move a large dirvish
> vault.  This is a directory tree consisting of -many- hardlinked
> files, which means that moving it in pieces will copy many times more
> data than is actually there, but trying to move the entire thing in
> one shot consumes more RAM than is available.  [rsync on the toplevel
> dir blew up almost immediately, as I expected.  cp -a was consuming at
> least 130meg per snapshot and therefore looked likely to consume at
> least 10G of RAM to finish; it's actually possible for other reasons
> it might have been closer to 20G.  It thus got slower and slower as
> it became more and more page-bound and I eventually got tired of it
> thrashing itself to death; ETA might have been a few weeks at that
> rate.  I can't just move the underlying blocks (e.g., copy the
> partition as a partition) because the whole reason I'm moving this
> filesystem in the first place is because it has errors that fsck is
> having trouble fixing---bug or bad hardware isn't established yet.
> And I don't know if dump/restore works well on ext3 filesystems, is
> well-tested these days, will work for ext4 when I finally migrate to
> that, or produces good data if the filesystem I'm starting with has
> errors that fsck complains about (or if it, too, will consume enormous
> amounts of RAM, but I'm assuming it's not trying to cache every inode
> it dumps, so maybe that might work if I trusted it---opinions
> anyone?)]
> 
> So---rsync to the rescue, except not.  A normal dirvish backup just
> uses --link-dest against the previous host/date combo, and works fine.
> I could copy the entire set of snapshots to a new filesystem the same
> way, EXCEPT for a problem:  I took pains to hardlink files -across-
> hosts' backups that were also the same, so I didn't have a zillion
> copies of the same files that are all shared by most releases and any
> linux anyway.  E.g., in this sort of arrangement:
>   hostA/20080101
>   hostB/20080101
>   ...
>   hostF/20080101
>   ...
>   hostA/20080102
>   hostB/20080102
>   ...
>   hostF/20080102
>   ...
> 
> dirvish (well, rsync) itself hardlinked files between hostA/20080101
> and hostA/20080102 on successive runs, and then -I- ran a tool
> (faster-dupemerge) that hardlinked identical files between
> hostA/20080101 and hostB/20080101 (etc).  Once this is done across the
> very first set of dumps (e.g., 20080101 in this example), then even
> though rsync is doing --link-dest only from hostA to hostA on
> successive runs, everything stays hardlinked together across hosts
> because the same inode is being reused everywhere.  (I also run
> faster-dupemerge across all hosts for the most-recent pair of backups
> to catch files that have been -copied or moved-, either from one dir
> to another on the same host, or across hosts.  Works great.)
> 
> Unfortunately, I can't get rsync to do the right thing when I'm trying
> to copy this structure.  What I'd -like- to do is to take all of
> hostA..hostF---for a single date---and copy them all at once, using
> --link-dest to point back at the previous date's set of hosts as the
> basis.  But because of the way the directories are structured, I need
> to use -R so I get the same structure recreated, and that seems to
> break --link-dest, unless there's some syntax issue in what I'm doing.

> Small test case:
> 
> Imagine that "src" is my original filesystem, and "dst" is where I'm
> trying to move things.  (Here, they share a superior directory, but of
> course in real life they're different filesystems.)  "foo" is my test
> file; there are multiple copies of it in src that are all hardlinked
> together.  I've already done the push of the first vault's contents
> from src to dst, so --link-dest has something to work with; note that
> the inode numbers for foo in src and dst are different (since, again,
> in real life, they're on different filesystems), but that all copies
> of foo in either src or dst (so far) share the same inode.  The A, B,
> and C directories correspond to individual hosts.
> 
> 18:45:42 ~/H$ find . -name "foo" -ls
>  84420    4 -rw-r--r--   2 blah    blah           4 Jan 11 18:43 ./src/a/1/foo
>  84420    4 -rw-r--r--   2 blah    blah           4 Jan 11 18:43 ./src/a/2/foo
>  84426    4 -rw-r--r--   1 blah    blah           4 Jan 11 18:43 ./dst/a/1/foo
> 18:45:46 ~/H$ ~/rsync-3.0.5/rsync -aviH --link-dest=../1 src/a/2/ dst/a/2/
> sending incremental file list
> created directory dst/a/2
> cd..t...... ./
> 
> sent 61 bytes  received 15 bytes  152.00 bytes/sec
> total size is 4  speedup is 0.05
> 18:46:11 ~/H$ find . -name "foo" -ls
>  84420    4 -rw-r--r--   2 blah    blah           4 Jan 11 18:43 ./src/a/1/foo
>  84420    4 -rw-r--r--   2 blah    blah           4 Jan 11 18:43 ./src/a/2/foo
>  84426    4 -rw-r--r--   2 blah    blah           4 Jan 11 18:43 ./dst/a/1/foo
>  84426    4 -rw-r--r--   2 blah    blah           4 Jan 11 18:43 ./dst/a/2/foo
> 
> Okay, so the above shows that --link-dest without -R appears to work, BUT---
> how come there was no actual output from rsync when it created dst/a/2/foo?
> Correct side-effect (foo created, with correct inode), but incorrect output.

The lack of output here is by design.  That's not to say that I think
the design is a good one.

IIUC, the current interpretation of a basis dir (--*-dest) is a source
of files that "might as well be" in the corresponding places in the
destination directory.  When rsync copies a source file that has no
counterpart in the destination and selects an "alternate basis" file
from a --*-dest dir, it itemizes as if the alternate basis file were
actually present in the destination.  In your example, rsync considered
the hard-linking from dst/a/1/foo to dst/a/2/foo to not be a change, so
you would have to pass -ii to see it.

This interpretation makes complete sense with --compare-dest, which was
the first --*-dest option to be added, in 1998:

http://gitweb.samba.org/?p=rsync.git;a=commit;h=375a4556c7a1ffb9a4e7117f33fc42ed2bc4c026

However, the more recently added --copy-dest and --link-dest:

http://gitweb.samba.org/?p=rsync.git;a=commit;h=1de3e99bc5781a119c3c7a4aa176eb77a7039714
http://gitweb.samba.org/?p=rsync.git;a=commit;h=59c95e4243749273fe91f8197a39f89e4d905cb8

have the IMHO more useful interpretation that the basis dir is to be
used as an optimization (of network traffic and/or destination disk
usage), without affecting either the itemization or the final contents
of the destination.  I entered an enhancement request for this to be
supported properly:

https://bugzilla.samba.org/show_bug.cgi?id=5645

> So the story thus far:
> 
> 18:46:16 ~/H$ find . -ls
>  84408    4 drwxr-xr-x   4 blah    blah        4096 Jan 11 18:44 .
>  84410    4 drwxr-xr-x   5 blah    blah        4096 Jan 11 18:42 ./src
>  84411    4 drwxr-xr-x   4 blah    blah        4096 Jan 11 18:42 ./src/a
>  84412    4 drwxr-xr-x   2 blah    blah        4096 Jan 11 18:43 ./src/a/1
>  84420    4 -rw-r--r--   2 blah    blah           4 Jan 11 18:43 ./src/a/1/foo
>  84417    4 drwxr-xr-x   2 blah    blah        4096 Jan 11 18:45 ./src/a/2
>  84420    4 -rw-r--r--   2 blah    blah           4 Jan 11 18:43 ./src/a/2/foo
>  84413    4 drwxr-xr-x   4 blah    blah        4096 Jan 11 18:42 ./src/b
>  84414    4 drwxr-xr-x   2 blah    blah        4096 Jan 11 18:42 ./src/b/1
>  84418    4 drwxr-xr-x   2 blah    blah        4096 Jan 11 18:42 ./src/b/2
>  84415    4 drwxr-xr-x   4 blah    blah        4096 Jan 11 18:42 ./src/c
>  84416    4 drwxr-xr-x   2 blah    blah        4096 Jan 11 18:42 ./src/c/1
>  84419    4 drwxr-xr-x   2 blah    blah        4096 Jan 11 18:42 ./src/c/2
>  84421    4 drwxr-xr-x   5 blah    blah        4096 Jan 11 18:44 ./dst
>  84422    4 drwxr-xr-x   4 blah    blah        4096 Jan 11 18:46 ./dst/a
>  84425    4 drwxr-xr-x   2 blah    blah        4096 Jan 11 18:43 ./dst/a/1
>  84426    4 -rw-r--r--   2 blah    blah           4 Jan 11 18:43 ./dst/a/1/foo
>  84427    4 drwxr-xr-x   2 blah    blah        4096 Jan 11 18:45 ./dst/a/2
>  84426    4 -rw-r--r--   2 blah    blah           4 Jan 11 18:43 ./dst/a/2/foo
>  84423    4 drwxr-xr-x   2 blah    blah        4096 Jan 11 18:44 ./dst/b
>  84424    4 drwxr-xr-x   2 blah    blah        4096 Jan 11 18:44 ./dst/c
> 
> Created some more hardlinks to get the ball rolling on B & C.
> 
> 19:02:46 ~/H/src$ ln ../dst/a/1/foo ../dst/b/1/foo
> 19:03:06 ~/H/src$ ln ../dst/a/1/foo ../dst/c/1/foo
> 19:03:10 ~/H/src$ find .. -name "foo" -ls
>  84420    4 -rw-r--r--   4 blah    blah           4 Jan 11 18:43 ../src/a/1/foo
>  84420    4 -rw-r--r--   4 blah    blah           4 Jan 11 18:43 ../src/a/2/foo
>  84420    4 -rw-r--r--   4 blah    blah           4 Jan 11 18:43 ../src/b/1/foo
>  84420    4 -rw-r--r--   4 blah    blah           4 Jan 11 18:43 ../src/c/1/foo
>  84426    4 -rw-r--r--   4 blah    blah           4 Jan 11 18:43 ../dst/a/1/foo
>  84426    4 -rw-r--r--   4 blah    blah           4 Jan 11 18:43 ../dst/a/2/foo
>  84426    4 -rw-r--r--   4 blah    blah           4 Jan 11 18:43 ../dst/b/1/foo
>  84426    4 -rw-r--r--   4 blah    blah           4 Jan 11 18:43 ../dst/c/1/foo
> 
> 19:04:06 ~/H/src$ ln b/1/foo b/2/foo
> 19:04:26 ~/H/src$ ln c/1/foo c/2/foo
> 19:04:29 ~/H/src$ find .. -name "foo" -ls
>  84420    4 -rw-r--r--   6 blah    blah           4 Jan 11 18:43 ../src/a/1/foo
>  84420    4 -rw-r--r--   6 blah    blah           4 Jan 11 18:43 ../src/a/2/foo
>  84420    4 -rw-r--r--   6 blah    blah           4 Jan 11 18:43 ../src/b/1/foo
>  84420    4 -rw-r--r--   6 blah    blah           4 Jan 11 18:43 ../src/b/2/foo
>  84420    4 -rw-r--r--   6 blah    blah           4 Jan 11 18:43 ../src/c/1/foo
>  84420    4 -rw-r--r--   6 blah    blah           4 Jan 11 18:43 ../src/c/2/foo
>  84426    4 -rw-r--r--   4 blah    blah           4 Jan 11 18:43 ../dst/a/1/foo
>  84426    4 -rw-r--r--   4 blah    blah           4 Jan 11 18:43 ../dst/a/2/foo
>  84426    4 -rw-r--r--   4 blah    blah           4 Jan 11 18:43 ../dst/b/1/foo
>  84426    4 -rw-r--r--   4 blah    blah           4 Jan 11 18:43 ../dst/c/1/foo
> 
> 19:05:28 ~/H/src$ ~/rsync-3.0.5/rsync -n -aviH --link-dest=/home/blah/H/dst/a/1 b/2/ c/2/ dst/
> sending incremental file list
> .d..t...... ./
> 
> sent 82 bytes  received 15 bytes  194.00 bytes/sec
> total size is 8  speedup is 0.08 (DRY RUN)
> 
> Can't really tell here what's going on 'cause of -n and output issue,
> I think.  But without -R, probably wrong thing anyway.

Without -R, the sources b/2/ and c/2/ are both placed at the root of the
file list.  The first source takes priority, so b/2/foo would be mapped
to dst/foo by means of a hard link from /home/blah/H/dst/a/1/foo .  As
in the previous example, rsync doesn't consider this worthy of output.

> 19:05:42 ~/H/src$ ~/rsync-3.0.5/rsync -n -R -aviH --link-dest=/home/blah/H/dst/a/1 b/2/ c/2/ dst/
> sending incremental file list
> cd+++++++++ b/
> cd+++++++++ b/2/
> cd+++++++++ c/
> cd+++++++++ c/2/
> >f+++++++++ c/2/foo
> hf+++++++++ b/2/foo => c/2/foo
> 
> sent 173 bytes  received 46 bytes  438.00 bytes/sec
> total size is 8  speedup is 0.04 (DRY RUN)
> 
> Added -R.  Hm.  Note that it created a new c/2/foo and then hardlinked
> b/2/foo to it.  Why didn't it hardlink c/2/foo to c/1/foo?  Well,
> probably 'cause my link-dest is bogus---I'm trying to say, "The parent
> of the dirs I'm specifying" but I think that's getting tangled up
> because b/2 and c/2 are relative to src, --link-dest is relative to
> dst (but I'm forcing a rooted path 'cause "relative to dst" is just
> too confusing here), but neither b nor c is under a.  But just
> specifying "...H/dst" isn't right, either, 'cause --link-dest
> doesn't match.  (Tried it, didn't work.)

Right.  To recap the problem: In order to transfer both b/2/ and c/2/ to
the proper places under dst/ in a single run, you needed to include the
"b/2/" and "c/2/" path information in the file list by using -R.  But
consequently, rsync is going to look for b/2/foo and c/2/foo under
whatever --link-dest dir you specify, and there's no directory on the
destination side that contains files at those paths (yet).

> So let's use multiple --link-dest's:
> 
> 21:16:08 ~/H/src$ ~/rsync-3.0.5/rsync -n -R -aviH --link-dest=/home/blah/H/dst/b/1 --link-dest=/home/blah/H/dst/c/1 b/2/ c/2/ ../dst/
> sending incremental file list
> .d..t...... b/2/
> .d..t...... c/2/
> >f+++++++++ c/2/foo
> hf+++++++++ b/2/foo => c/2/foo
> 
> sent 167 bytes  received 40 bytes  414.00 bytes/sec
> total size is 8  speedup is 0.04 (DRY RUN)
> 
> Still no dice.

Same problem: rsync is looking for /home/blah/H/dst/b/1/b/2/foo .
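
The lookup can be pictured as plain path concatenation: each --link-dest dir is joined with each path as it appears in the file list.  The sketch below only prints those joined paths for the command above; the function name basis_candidates is made up for illustration, and this is a model of the behaviour, not rsync's code.

```shell
# Print every basis path implied by the two --link-dest dirs and the two
# file-list paths that -R produces for the sources b/2/ and c/2/.
basis_candidates() {
    for basis in /home/blah/H/dst/b/1 /home/blah/H/dst/c/1; do
        for path in b/2/foo c/2/foo; do
            printf '%s/%s\n' "$basis" "$path"
        done
    done
}

basis_candidates
```

None of the four printed paths exists in the dst tree shown earlier, so rsync has nothing to link against and falls back to transferring the file.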

[Skipping to the next example...]

> Dropping -R of course puts the output in the wrong place (and what
> happened to "c/2", anyway?):
> 
> 21:18:05 ~/H/src$ ~/rsync-3.0.5/rsync -n -aviH --link-dest=/home/blah/H/dst/b/1 --link-dest=/home/blah/H/dst/c/1 b/2 c/2 ../dst/
> sending incremental file list
> cd+++++++++ 2/
> >f+++++++++ 2/foo
> 
> sent 89 bytes  received 19 bytes  216.00 bytes/sec
> total size is 8  speedup is 0.07 (DRY RUN)

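Without -R, file-list path information starts after the last slash in the
source argument.  That's why you're getting "2/foo".  Both "b/2" and "c/2"
map to the same "2/" entry, and the first source takes priority, so c/2
never shows up separately.  If the sources were "b/2/" and "c/2/" with
trailing slashes, you would get just "foo".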
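
As a rough illustration of that rule, here is a simplified shell model of the prefix a directory source gets in the file list when -R is not used.  The function flist_prefix is hypothetical and only mimics the rule described above; it is not how rsync is implemented.

```shell
# Simplified model (not rsync's code) of the file-list prefix that a
# directory source argument gets when -R is not used.
flist_prefix() {
    case "$1" in
        */) printf '\n' ;;                 # trailing slash: contents only
        *)  printf '%s/\n' "${1##*/}" ;;   # otherwise: last path component
    esac
}

for src in b/2 c/2 b/2/ c/2/; do
    printf '%-5s -> %sfoo\n' "$src" "$(flist_prefix "$src")"
done
```

In this model b/2 and c/2 both come out as 2/foo, and b/2/ and c/2/ both come out as foo, so in either case the two sources collide on the same file-list path.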

> I've tried adding slashes at various ends and other permutations, but
> nothing works.  I can't seem to get -R and --link-dest to play nice.
> 
> [Oh, and, btw, using ~ in --link-dest seems to confuse it; I had to
> drop back to /home/blah instead.  Eh?]

Tilde expansion is the shell's job.  The shell checks for a tilde only
at the start of each argument, since you don't want it to expand tildes
in the middle of a path like "my-file.txt~2008-01-25~"; and it isn't
smart enough to realize that the tilde in "--link-dest=~/test/something"
*is* the start of a path.  If you want tilde expansion, pass the value as
a separate argument: "--link-dest ~/test/something".
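
This is easy to check without involving rsync at all.  A small sketch, using printf to show what the shell actually hands to the command (the variable names are arbitrary, and the $HOME form is an alternative I'm adding, assuming HOME is set):

```shell
# The tilde is not at the start of the word, so the shell leaves it alone.
stuck=$(printf '%s' --link-dest=~/test/something)

# Here the path is its own word, so the shell expands the leading tilde.
split=$(printf '%s' ~/test/something)

# Alternative that keeps the "=" form: let the shell expand $HOME instead.
home_form="--link-dest=$HOME/test/something"

printf '%s\n' "$stuck" "--link-dest $split" "$home_form"
```

The first line printed still contains a literal "~", which rsync would then treat as an ordinary directory name.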

> Operationally, this means that I have to copy every single host -and-
> date -separately-, hence multiplying the number of directory scans by
> the number of hosts -and- breaking all those carefully-created hardlinks
> among the hosts.

I think using a separate rsync run for each hostX/DATE dir is the way to
go since it's easy to specify an appropriate --link-dest dir, or more
than one.  With this approach, you don't need -H unless you want to
preserve hard links among a single host's files on a single day.

In recent months, several rsnapshot users have posted about migration
problems similar to yours but one-dimensional (dates only), and I wrote
a script called "rsnapshot-copy" to automate the process of copying the
dates one by one, each time with --link-dest to the previous date:

http://rsnapshot.cvs.sourceforge.net/viewvc/rsnapshot/rsnapshot/utils/rsnapshot-copy?view=markup

You may wish to read the thread from which rsnapshot-copy originated for
more insights:

http://sourceforge.net/mailarchive/forum.php?thread_name=47FBD95C.2080906%40cfa.harvard.edu&forum_name=rsnapshot-discuss

You could use it as a starting point for a more sophisticated script for
your scenario.  Just loop through every (host, date) pair and run rsync,
passing as many --link-dest options as you need to help rsync discover
all the inter-host and inter-date links.

My inclination would be to make the dates the outer loop and the hosts
the inner loop since you have only six hosts but presumably many more
dates.  Then, for each hostX/DATE dir, I would --link-dest to each
host's most recent existing dir on the destination.  E.g., if you go
through the hosts in alphabetical order, hostC/20080102 would
--link-dest to hostA/20080102 and hostB/20080102 (already copied for
this date) as well as hostC/20080101, hostD/20080101, hostE/20080101,
and hostF/20080101 (hosts not yet copied for 20080102).  You could try
this and adjust as you see fit.

Rsync does a linear search through the basis dirs, so you should put the
most likely ones first, e.g., hostC/20080101 in my example.  In deciding
how many dirs to pass, consider the benefit of the extra dirs versus the
time that rsync wastes checking those dirs for files that are genuinely
new in the current hostX/DATE dir.
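
The loop described above might look roughly like the following.  This is a sketch under assumptions, not a tested migration script: SRC, DST, HOSTS and DATES are placeholders, the function name plan_copies is made up, and by default it only prints the commands it would run.

```shell
# Sketch only: prints the rsync commands instead of running them.
# SRC, DST, HOSTS and DATES are hypothetical placeholders.
SRC=/mnt/old/vault
DST=/mnt/new/vault
HOSTS="hostA hostB hostC hostD hostE hostF"
DATES="20080101 20080102"     # must be listed oldest first
RUN=echo                      # set RUN= (empty) to actually execute

plan_copies() {
    prev=""
    for date in $DATES; do
        for host in $HOSTS; do
            # Collect the --link-dest options in the positional parameters.
            set --
            # Most likely basis first: the same host on the previous date.
            if [ -n "$prev" ]; then
                set -- "$@" "--link-dest=$DST/$host/$prev"
            fi
            before=yes
            for other in $HOSTS; do
                if [ "$other" = "$host" ]; then
                    before=no
                elif [ "$before" = yes ]; then
                    # Host already copied for this date.
                    set -- "$@" "--link-dest=$DST/$other/$date"
                elif [ -n "$prev" ]; then
                    # Host not yet copied for this date: use its previous one.
                    set -- "$@" "--link-dest=$DST/$other/$prev"
                fi
            done
            $RUN mkdir -p "$DST/$host"
            $RUN rsync -a "$@" "$SRC/$host/$date/" "$DST/$host/$date/"
        done
        prev=$date
    done
}

plan_copies
```

Absolute --link-dest paths are used so there is no question of what they are relative to.  -H is left out on purpose, per the note above about only needing it for links within a single hostX/DATE dir.  Trimming the list of basis dirs is a matter of editing the inner loop.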

> I -think- I might be able to finesse this by actually physically
> rearranging the directories on the source---risky given that fsck
> is complaining about it, but maybe...  the idea would be to invert
> the organization so that every host is under a date (e.g., instead
> of hostA/date1, hostA/date2, etc, I make it date1/hostA, date1/hostB),
> and then I can specify a SINGLE dir (namely "date1") and not use -R.
> [I can't just specify a single HOST in the current arrangement because
> there are far more dates than hosts and that causes a huge directory
> scan that runs rsync out of memory.]

That would work.  To avoid physically rearranging the source, you could
create a structure of symlinks on another filesystem and point rsync to
that.  The downside is that you have to use -H to catch cross-host hard
links.

-- 
Matt


