Is -R --link-dest really hard to use, or is it me?

foner-rsync at media.mit.edu foner-rsync at media.mit.edu
Mon Jan 12 04:24:46 GMT 2009


I've got a problem for which the combination of -R and --link-dest
doesn't seem to be quite enough---and I may have discovered a few
small bugs as well; test cases are below.

[And if someone has a scheme for doing this that doesn't involve rsync
at all, but works okay, I'm all ears as well---I'm not the first with
this problem.]

Here's my problem:  I unfortunately need to move a large dirvish
vault.  This is a directory tree consisting of -many- hardlinked
files, which means that moving it in pieces will copy many times more
data than is actually there, but trying to move the entire thing in
one shot consumes more RAM than is available.  [rsync on the toplevel
dir blew up almost immediately, as I expected.  cp -a was consuming at
least 130meg per snapshot and therefore looked likely to consume at
least 10G of RAM to finish; it's actually possible for other reasons
it might have been closer to 20G.  It thus got slower and slower as
it became more and more page-bound and I eventually got tired of it
thrashing itself to death; ETA might have been a few weeks at that
rate.  I can't just move the underlying blocks (e.g., copy the
partition as a partition) because the whole reason I'm moving this
filesystem in the first place is because it has errors that fsck is
having trouble fixing---bug or bad hardware isn't established yet.
And I don't know if dump/restore works well on ext3 filesystems, is
well-tested these days, will work for ext4 when I finally migrate to
that, or produces good data if the filesystem I'm starting with has
errors that fsck complains about (or if it, too, will consume enormous
amounts of RAM, but I'm assuming it's not trying to cache every inode
it dumps, so maybe that might work if I trusted it---opinions
anyone?)]

So---rsync to the rescue, except not.  A normal dirvish backup just
uses --link-dest against the previous host/date combo, and works fine.
I could copy the entire set of snapshots to a new filesystem the same
way, EXCEPT for a problem:  I took pains to hardlink files -across-
hosts' backups that were also the same, so I didn't have a zillion
copies of the same files that are shared by most releases of any
Linux anyway.  E.g., in this sort of arrangement:
  hostA/20080101
  hostB/20080101
  ...
  hostF/20080101
  ...
  hostA/20080102
  hostB/20080102
  ...
  hostF/20080102
  ...

dirvish (well, rsync) itself hardlinked files between hostA/20080101
and hostA/20080102 on successive runs, and then -I- ran a tool
(faster-dupemerge) that hardlinked identical files between
hostA/20080101 and hostB/20080101 (etc).  Once this is done across the
very first set of dumps (e.g., 20080101 in this example), then even
though rsync is doing --link-dest only from hostA to hostA on
successive runs, everything stays hardlinked together across hosts
because the same inode is being reused everywhere.  (I also run
faster-dupemerge across all hosts for the most-recent pair of backups
to catch files that have been -copied or moved-, either from one dir
to another on the same host, or across hosts.  Works great.)

Unfortunately, I can't get rsync to do the right thing when I'm trying
to copy this structure.  What I'd -like- to do is to take all of
hostA..hostF---for a single date---and copy them all at once, using
--link-dest to point back at the previous date's set of hosts as the
basis.  But because of the way the directories are structured, I need
to use -R so I get the same structure recreated, and that seems to
break --link-dest, unless there's some syntax issue in what I'm doing.

Small test case:

Imagine that "src" is my original filesystem, and "dst" is where I'm
trying to move things.  (Here, they share a superior directory, but of
course in real life they're different filesystems.)  "foo" is my test
file; there are multiple copies of it in src that are all hardlinked
together.  I've already done the push of the first vault's contents
from src to dst, so --link-dest has something to work with; note that
the inode numbers for foo in src and dst are different (since, again,
in real life, they're on different filesystems), but that all copies
of foo in either src or dst (so far) share the same inode.  The a, b,
and c directories correspond to individual hosts; 1 and 2 stand in for
successive dates.
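
For anyone who wants to reproduce the starting point, here is a minimal
sketch of the layout described above.  It assumes a scratch directory
from mktemp rather than the ~/H used in the transcripts, and it uses
cp -p as a stand-in for the first push into dst; both are assumptions
made for illustration.

```shell
# Build a scratch copy of the test layout (hypothetical temp dir, not ~/H).
H=$(mktemp -d)
mkdir -p "$H"/src/a/1 "$H"/src/a/2 "$H"/src/b/1 "$H"/src/b/2 "$H"/src/c/1 "$H"/src/c/2
mkdir -p "$H"/dst/a/1 "$H"/dst/b "$H"/dst/c

# One 4-byte file in src, hardlinked into the second date for host a.
printf 'foo\n' > "$H"/src/a/1/foo
ln "$H"/src/a/1/foo "$H"/src/a/2/foo

# Stand-in for the already-pushed first vault: an independent copy in dst.
cp -p "$H"/src/a/1/foo "$H"/dst/a/1/foo

# -ef is true when both names refer to the same device and inode.
[ "$H"/src/a/1/foo -ef "$H"/src/a/2/foo ] && echo "src copies share an inode"
[ "$H"/src/a/1/foo -ef "$H"/dst/a/1/foo ] || echo "dst copy has its own inode"
```

The -ef test is a quick way to check for shared inodes without reading
the numbers out of find -ls by eye.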

18:45:42 ~/H$ find . -name "foo" -ls
 84420    4 -rw-r--r--   2 blah    blah           4 Jan 11 18:43 ./src/a/1/foo
 84420    4 -rw-r--r--   2 blah    blah           4 Jan 11 18:43 ./src/a/2/foo
 84426    4 -rw-r--r--   1 blah    blah           4 Jan 11 18:43 ./dst/a/1/foo
18:45:46 ~/H$ ~/rsync-3.0.5/rsync -aviH --link-dest=../1 src/a/2/ dst/a/2/
sending incremental file list
created directory dst/a/2
cd..t...... ./

sent 61 bytes  received 15 bytes  152.00 bytes/sec
total size is 4  speedup is 0.05
18:46:11 ~/H$ find . -name "foo" -ls
 84420    4 -rw-r--r--   2 blah    blah           4 Jan 11 18:43 ./src/a/1/foo
 84420    4 -rw-r--r--   2 blah    blah           4 Jan 11 18:43 ./src/a/2/foo
 84426    4 -rw-r--r--   2 blah    blah           4 Jan 11 18:43 ./dst/a/1/foo
 84426    4 -rw-r--r--   2 blah    blah           4 Jan 11 18:43 ./dst/a/2/foo

Okay, so the above shows that --link-dest without -R appears to work, BUT---
how come there was no actual output from rsync when it created dst/a/2/foo?
Correct side-effect (foo created, with correct inode), but incorrect output.

So the story thus far:

18:46:16 ~/H$ find . -ls
 84408    4 drwxr-xr-x   4 blah    blah        4096 Jan 11 18:44 .
 84410    4 drwxr-xr-x   5 blah    blah        4096 Jan 11 18:42 ./src
 84411    4 drwxr-xr-x   4 blah    blah        4096 Jan 11 18:42 ./src/a
 84412    4 drwxr-xr-x   2 blah    blah        4096 Jan 11 18:43 ./src/a/1
 84420    4 -rw-r--r--   2 blah    blah           4 Jan 11 18:43 ./src/a/1/foo
 84417    4 drwxr-xr-x   2 blah    blah        4096 Jan 11 18:45 ./src/a/2
 84420    4 -rw-r--r--   2 blah    blah           4 Jan 11 18:43 ./src/a/2/foo
 84413    4 drwxr-xr-x   4 blah    blah        4096 Jan 11 18:42 ./src/b
 84414    4 drwxr-xr-x   2 blah    blah        4096 Jan 11 18:42 ./src/b/1
 84418    4 drwxr-xr-x   2 blah    blah        4096 Jan 11 18:42 ./src/b/2
 84415    4 drwxr-xr-x   4 blah    blah        4096 Jan 11 18:42 ./src/c
 84416    4 drwxr-xr-x   2 blah    blah        4096 Jan 11 18:42 ./src/c/1
 84419    4 drwxr-xr-x   2 blah    blah        4096 Jan 11 18:42 ./src/c/2
 84421    4 drwxr-xr-x   5 blah    blah        4096 Jan 11 18:44 ./dst
 84422    4 drwxr-xr-x   4 blah    blah        4096 Jan 11 18:46 ./dst/a
 84425    4 drwxr-xr-x   2 blah    blah        4096 Jan 11 18:43 ./dst/a/1
 84426    4 -rw-r--r--   2 blah    blah           4 Jan 11 18:43 ./dst/a/1/foo
 84427    4 drwxr-xr-x   2 blah    blah        4096 Jan 11 18:45 ./dst/a/2
 84426    4 -rw-r--r--   2 blah    blah           4 Jan 11 18:43 ./dst/a/2/foo
 84423    4 drwxr-xr-x   2 blah    blah        4096 Jan 11 18:44 ./dst/b
 84424    4 drwxr-xr-x   2 blah    blah        4096 Jan 11 18:44 ./dst/c

Created some more hardlinks to get the ball rolling on B & C.

19:02:46 ~/H/src$ ln ../dst/a/1/foo ../dst/b/1/foo
19:03:06 ~/H/src$ ln ../dst/a/1/foo ../dst/c/1/foo
19:03:10 ~/H/src$ find .. -name "foo" -ls
 84420    4 -rw-r--r--   4 blah    blah           4 Jan 11 18:43 ../src/a/1/foo
 84420    4 -rw-r--r--   4 blah    blah           4 Jan 11 18:43 ../src/a/2/foo
 84420    4 -rw-r--r--   4 blah    blah           4 Jan 11 18:43 ../src/b/1/foo
 84420    4 -rw-r--r--   4 blah    blah           4 Jan 11 18:43 ../src/c/1/foo
 84426    4 -rw-r--r--   4 blah    blah           4 Jan 11 18:43 ../dst/a/1/foo
 84426    4 -rw-r--r--   4 blah    blah           4 Jan 11 18:43 ../dst/a/2/foo
 84426    4 -rw-r--r--   4 blah    blah           4 Jan 11 18:43 ../dst/b/1/foo
 84426    4 -rw-r--r--   4 blah    blah           4 Jan 11 18:43 ../dst/c/1/foo

19:04:06 ~/H/src$ ln b/1/foo b/2/foo
19:04:26 ~/H/src$ ln c/1/foo c/2/foo
19:04:29 ~/H/src$ find .. -name "foo" -ls
 84420    4 -rw-r--r--   6 blah    blah           4 Jan 11 18:43 ../src/a/1/foo
 84420    4 -rw-r--r--   6 blah    blah           4 Jan 11 18:43 ../src/a/2/foo
 84420    4 -rw-r--r--   6 blah    blah           4 Jan 11 18:43 ../src/b/1/foo
 84420    4 -rw-r--r--   6 blah    blah           4 Jan 11 18:43 ../src/b/2/foo
 84420    4 -rw-r--r--   6 blah    blah           4 Jan 11 18:43 ../src/c/1/foo
 84420    4 -rw-r--r--   6 blah    blah           4 Jan 11 18:43 ../src/c/2/foo
 84426    4 -rw-r--r--   4 blah    blah           4 Jan 11 18:43 ../dst/a/1/foo
 84426    4 -rw-r--r--   4 blah    blah           4 Jan 11 18:43 ../dst/a/2/foo
 84426    4 -rw-r--r--   4 blah    blah           4 Jan 11 18:43 ../dst/b/1/foo
 84426    4 -rw-r--r--   4 blah    blah           4 Jan 11 18:43 ../dst/c/1/foo

19:05:28 ~/H/src$ ~/rsync-3.0.5/rsync -n -aviH --link-dest=/home/blah/H/dst/a/1 b/2/ c/2/ dst/
sending incremental file list
.d..t...... ./

sent 82 bytes  received 15 bytes  194.00 bytes/sec
total size is 8  speedup is 0.08 (DRY RUN)

Can't really tell here what's going on 'cause of -n and output issue,
I think.  But without -R, probably wrong thing anyway.

19:05:42 ~/H/src$ ~/rsync-3.0.5/rsync -n -R -aviH --link-dest=/home/blah/H/dst/a/1 b/2/ c/2/ dst/
sending incremental file list
cd+++++++++ b/
cd+++++++++ b/2/
cd+++++++++ c/
cd+++++++++ c/2/
>f+++++++++ c/2/foo
hf+++++++++ b/2/foo => c/2/foo

sent 173 bytes  received 46 bytes  438.00 bytes/sec
total size is 8  speedup is 0.04 (DRY RUN)

Added -R.  Hm.  Note that it created a new c/2/foo and then hardlinked
b/2/foo to it.  Why didn't it hardlink c/2/foo to c/1/foo?  Well,
probably 'cause my link-dest is bogus---I'm trying to say, "The parent
of the dirs I'm specifying" but I think that's getting tangled up
because b/2 and c/2 are relative to src, --link-dest is relative to
dst (but I'm forcing a rooted path 'cause "relative to dst" is just
too confusing here), but neither b nor c is under a.  But just
specifying "...H/dst" isn't right, either, 'cause --link-dest
doesn't match.  (Tried it, didn't work.)

So let's use multiple --link-dest's:

21:16:08 ~/H/src$ ~/rsync-3.0.5/rsync -n -R -aviH --link-dest=/home/blah/H/dst/b/1 --link-dest=/home/blah/H/dst/c/1 b/2/ c/2/ ../dst/
sending incremental file list
.d..t...... b/2/
.d..t...... c/2/
>f+++++++++ c/2/foo
hf+++++++++ b/2/foo => c/2/foo

sent 167 bytes  received 40 bytes  414.00 bytes/sec
total size is 8  speedup is 0.04 (DRY RUN)

Still no dice.

21:17:14 ~/H/src$ ~/rsync-3.0.5/rsync -R -aviH --link-dest=/home/blah/H/dst/b/1 --link-dest=/home/blah/H/dst/c/1 b/2/ c/2/ ../dst/
sending incremental file list
.d..t...... b/2/
.d..t...... c/2/
>f+++++++++ c/2/foo
hf+++++++++ b/2/foo => c/2/foo

sent 211 bytes  received 56 bytes  534.00 bytes/sec
total size is 8  speedup is 0.03
21:17:45 ~/H/src$ find .. -name "foo" -ls
 84420    4 -rw-r--r--   6 blah    blah           4 Jan 11 18:43 ../src/a/1/foo
 84420    4 -rw-r--r--   6 blah    blah           4 Jan 11 18:43 ../src/a/2/foo
 84420    4 -rw-r--r--   6 blah    blah           4 Jan 11 18:43 ../src/b/1/foo
 84420    4 -rw-r--r--   6 blah    blah           4 Jan 11 18:43 ../src/b/2/foo
 84420    4 -rw-r--r--   6 blah    blah           4 Jan 11 18:43 ../src/c/1/foo
 84420    4 -rw-r--r--   6 blah    blah           4 Jan 11 18:43 ../src/c/2/foo
 84426    4 -rw-r--r--   4 blah    blah           4 Jan 11 18:43 ../dst/a/1/foo
 84426    4 -rw-r--r--   4 blah    blah           4 Jan 11 18:43 ../dst/a/2/foo
 84426    4 -rw-r--r--   4 blah    blah           4 Jan 11 18:43 ../dst/b/1/foo
 84434    4 -rw-r--r--   2 blah    blah           4 Jan 11 18:43 ../dst/b/2/foo
 84426    4 -rw-r--r--   4 blah    blah           4 Jan 11 18:43 ../dst/c/1/foo
 84434    4 -rw-r--r--   2 blah    blah           4 Jan 11 18:43 ../dst/c/2/foo

Yup, wrong inodes on the copy.

21:17:48 ~/H/src$ rm ../dst/b/2/foo ../dst/c/2/foo
21:18:04 ~/H/src$ find .. -name "foo" -ls
 84420    4 -rw-r--r--   6 blah    blah           4 Jan 11 18:43 ../src/a/1/foo
 84420    4 -rw-r--r--   6 blah    blah           4 Jan 11 18:43 ../src/a/2/foo
 84420    4 -rw-r--r--   6 blah    blah           4 Jan 11 18:43 ../src/b/1/foo
 84420    4 -rw-r--r--   6 blah    blah           4 Jan 11 18:43 ../src/b/2/foo
 84420    4 -rw-r--r--   6 blah    blah           4 Jan 11 18:43 ../src/c/1/foo
 84420    4 -rw-r--r--   6 blah    blah           4 Jan 11 18:43 ../src/c/2/foo
 84426    4 -rw-r--r--   4 blah    blah           4 Jan 11 18:43 ../dst/a/1/foo
 84426    4 -rw-r--r--   4 blah    blah           4 Jan 11 18:43 ../dst/a/2/foo
 84426    4 -rw-r--r--   4 blah    blah           4 Jan 11 18:43 ../dst/b/1/foo
 84426    4 -rw-r--r--   4 blah    blah           4 Jan 11 18:43 ../dst/c/1/foo

Dropping -R of course puts the output in the wrong place (and what
happened to "c/2", anyway?):

21:18:05 ~/H/src$ ~/rsync-3.0.5/rsync -n -aviH --link-dest=/home/blah/H/dst/b/1 --link-dest=/home/blah/H/dst/c/1 b/2 c/2 ../dst/
sending incremental file list
cd+++++++++ 2/
>f+++++++++ 2/foo

sent 89 bytes  received 19 bytes  216.00 bytes/sec
total size is 8  speedup is 0.07 (DRY RUN)

I've tried adding slashes at various ends and other permutations, but
nothing works.  I can't seem to get -R and --link-dest to play nice.
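
One way to see why none of these attempts can match is to spell out the
paths involved.  The sketch below assumes that with -R rsync looks for a
basis file at the --link-dest directory joined with the full relative
name from the transfer (b/2/foo, c/2/foo).  The scratch directory and
the "basis" directory of symlinks are hypothetical names, and the
symlink idea is an untested possibility, not a confirmed fix.

```shell
# Scratch layout mirroring the dst side of the test case (hypothetical temp dir).
T=$(mktemp -d)
mkdir -p "$T"/dst/b/1 "$T"/dst/c/1
printf 'foo\n' > "$T"/dst/b/1/foo
ln "$T"/dst/b/1/foo "$T"/dst/c/1/foo

# With -R the transfer-relative names are b/2/foo and c/2/foo, so each
# --link-dest tried above would need to contain that same relative path.
for linkdest in "$T"/dst/a/1 "$T"/dst/b/1 "$T"/dst/c/1 "$T"/dst; do
    for rel in b/2/foo c/2/foo; do
        [ -e "$linkdest/$rel" ] || echo "no candidate: $linkdest/$rel"
    done
done

# Hypothetical workaround: a "basis" directory whose b/2 and c/2 are
# symlinks to the previous date, so basis/b/2/foo resolves to dst/b/1/foo.
mkdir -p "$T"/basis/b "$T"/basis/c
ln -s "$T"/dst/b/1 "$T"/basis/b/2
ln -s "$T"/dst/c/1 "$T"/basis/c/2
[ "$T"/basis/b/2/foo -ef "$T"/dst/b/1/foo ] && echo "basis/b/2/foo is dst/b/1/foo"
```

Whether rsync will actually hardlink through such symlinked directories
when given --link-dest pointing at the basis directory is something to
check with -n and find -ls before relying on it.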

[Oh, and, btw, using ~ in --link-dest seems to confuse it; I had to
drop back to /home/blah instead.  Eh?]
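
A likely explanation for the ~ behavior, assuming a Bourne-style shell
such as bash or dash: the shell expands ~ only at the start of a word
(or in a real variable assignment), so in --link-dest=~/... the tilde is
passed through literally and rsync, which as far as I know does no tilde
expansion of its own, sees a directory named "~".  The paths below are
just the ones from the test case.

```shell
# The ~ is in the middle of the word, so the shell leaves it alone.
literal=$(printf '%s' --link-dest=~/H/dst/a/1)

# Spelling it with $HOME lets the shell substitute the real path.
expanded=$(printf '%s' --link-dest="$HOME"/H/dst/a/1)

printf '%s\n' "$literal" "$expanded"
```

Using "$HOME" (or an absolute path, as in the transcripts) avoids the
problem.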

Operationally, this means that I have to copy every single host -and-
date -separately-, hence multiplying the number of directory scans by
the number of hosts -and- breaking all those carefully-created hardlinks
among the hosts.  I -think- that (as when I created the vault) hardlinking
everybody in the very first vault and then going forward one host at a
time, one date at a time, might work, BUT it misses movement/copy
across the hosts (though I'm not sure if rsync could do this anyway---
I haven't worked up a test case yet; shouldn't it just do the right
thing if it's -H mode and a file moves across directories but it still
knows everyone's inode?  I'm not sure.)

I -think- I might be able to finesse this by actually physically
rearranging the directories on the source---risky given that fsck
is complaining about it, but maybe...  the idea would be to invert
the organization so that every host is under a date (e.g., instead
of hostA/date1, hostA/date2, etc, I make it date1/hostA, date1/hostB),
and then I can specify a SINGLE dir (namely "date1") and not use -R.
[I can't just specify a single HOST in the current arrangement because
there are far more dates than hosts and that causes a huge directory
scan that runs rsync out of memory.]
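
The inversion could be sketched roughly as follows.  This runs against a
scratch directory with made-up host and date names, and "by-date" is a
hypothetical name for the new top level; the point is only that mv
within one filesystem is a rename, so no data is copied and existing
hardlinks survive.

```shell
# Scratch demonstration (hypothetical temp dir and host/date names).
V=$(mktemp -d)
mkdir -p "$V"/hostA/date1 "$V"/hostB/date1
printf 'x\n' > "$V"/hostA/date1/foo
ln "$V"/hostA/date1/foo "$V"/hostB/date1/foo

# Invert host/date into date/host.  The glob is expanded once, before
# by-date exists, so only the original host/date directories are visited.
for path in "$V"/*/*; do
    host=$(basename "$(dirname "$path")")
    date=$(basename "$path")
    mkdir -p "$V/by-date/$date"
    mv "$path" "$V/by-date/$date/$host"
done

[ "$V"/by-date/date1/hostA/foo -ef "$V"/by-date/date1/hostB/foo ] && echo "still hardlinked"
```

On a filesystem that fsck is already unhappy with, even renames are a
risk, so this is something to try on a scratch tree first.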

The other tragic thing about this whole shebang is that, even though
I'm doing successive runs pivoting around a date (e.g., each run uses
the immediately preceding dir as its link-dest), I can't save the
directory scan involved---even though the rsync that's quitting knows
-exactly- what the rsync that's starting wants to know about that
tree.  If only there was a way to share state between them...  (No,
it doesn't appear to fit in the Linux filesystem cache, alas.)

