--hard-links performance

foner-rsync at media.mit.edu
Wed Jul 11 21:47:00 GMT 2007


    Date: Wed, 11 Jul 2007 01:26:18 -0400
    From: "George Georgalis" <george at galis.org>

    the program is http://www.ka9q.net/code/dupmerge/
    there are 200 lines of well commented C; however
    there may be a bug which allocates too much memory
    (one block per file); so my application runs out. :\
    If you (anyone) can work it out and/or bring it into
    rsync as a new feature, that would be great. Please
    keep the author and myself in the loop!

Do a search for "faster-dupemerge"; you'll find mentions of it in the
dirvish archives, where I describe how I routinely use it to hardlink
together filesystems in the half-terabyte-and-above range without
problems on machines that are fairly low-end these days (a gig of RAM,
a gig or so of swap, very little of which actually gets used by the
merge).  Dirvish uses -H in rsync to do most of the heavy lifting, but
large movements of files from one directory to another between backups
won't be caught by rsync*.  So I follow dirvish runs with a run of
faster-dupemerge across the last two snapshots and across every
machine being backed up (i.e., one single run that includes two
snapshots per backed-up machine); that not only catches file movements
within a single machine, but also links together backup files -across-
machines, which is quite useful when you have several machines which
share a lot of similar files (e.g., the files in the distribution
you're running), or if a file moves from one machine to another, and
so on; it also saves considerable space on the backup host.  [You can
also trade a little of the space savings for speed: since the return
on hardlinking zillions of
small files is relatively low compared to a few large ones, you can
also specify "only handle files above 100K" or whatever (or anything
else you'd like as an argument to "find") and thus considerably speed
up the run while not losing much in the way of space savings; I
believe I gave some typical figures in one of my posts to the dirvish
lists.  Also, since faster-dupemerge starts off by sorting the results
of the "find" by size, you can manually abort it at any point and it
will have merged the largest files first.]
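
For concreteness, here's a rough Python sketch of the merge idea just
described.  It is emphatically not faster-dupemerge itself (the real
tool is far more careful about byte comparisons and metadata, and its
size cutoff comes from whatever you pass to "find"); the snapshot
paths and the 100K constant below are placeholders.  The shape is:
collect regular files above the threshold, sort them largest-first,
and hardlink byte-identical copies that live on the same filesystem:

    #!/usr/bin/env python3
    # Illustrative sketch only: replace duplicate files in one or more
    # snapshot trees with hard links.  The paths and size threshold are
    # placeholders; a real merger hashes/compares far more frugally and
    # checks ownership, permissions, and timestamps before linking.
    import filecmp, hashlib, os, stat, sys

    MIN_SIZE = 100 * 1024      # skip small files; little space to gain

    def digest(path, bufsize=1 << 20):
        h = hashlib.sha256()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(bufsize), b''):
                h.update(chunk)
        return h.hexdigest()

    def merge(roots):
        files = []
        for root in roots:
            for dirpath, _, names in os.walk(root):
                for name in names:
                    path = os.path.join(dirpath, name)
                    st = os.lstat(path)
                    if stat.S_ISREG(st.st_mode) and st.st_size >= MIN_SIZE:
                        files.append((st.st_size, st.st_dev, st.st_ino, path))
        # Largest files first, so an aborted run still gets the big wins.
        files.sort(key=lambda t: t[0], reverse=True)
        keep = {}  # (device, size, digest) -> path we link everything to
        for size, dev, ino, path in files:
            key = (dev, size, digest(path))
            kept = keep.setdefault(key, path)
            if kept == path or os.lstat(kept).st_ino == ino:
                continue              # first of its kind, or already linked
            if not filecmp.cmp(kept, path, shallow=False):
                continue              # digest collision; leave it alone
            # Swap the duplicate for a hard link to the kept copy.
            tmp = path + '.dupemerge-tmp'
            os.link(kept, tmp)
            os.replace(tmp, path)

    if __name__ == '__main__':
        if not sys.argv[1:]:
            sys.exit('usage: merge-sketch.py SNAPSHOT-DIR [SNAPSHOT-DIR ...]')
        merge(sys.argv[1:])

Because the candidate list is sorted largest-first, killing this
sketch partway through still leaves the biggest duplicates merged,
just as with the real tool.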

http://www.furryterror.org/~zblaxell/dupemerge/dupemerge.html is the
canonical download site, and mentions various other approaches and
their problems.  (Note that workloads such as mine will also require
at least a gig of space in some temporary directory that's used by the
sort program; fortunately, you can specify on the command line where
that temp directory will be, and it's less than 0.2% of the total
storage of the filesystem being handled.)

* [This is because, as far as I know, even fuzzy-match only looks in
the current directory, unless later versions can be told to look
elsewhere as well and I've somehow missed that.  If I -have- missed
that, it'd be a nice addition to be able to specify extra directories
(and/or trees) in which fuzzy-match should look, although in the limit
that might require a great deal of temporary space and run slowly.]

