--hard-links performance

George Georgalis george at galis.org
Sat Jul 14 18:47:24 GMT 2007


On Wed, Jul 11, 2007 at 05:47:00PM -0400, foner-rsync at media.mit.edu wrote:
>    Date: Wed, 11 Jul 2007 01:26:18 -0400
>    From: "George Georgalis" <george at galis.org>
>
>    the program is http://www.ka9q.net/code/dupmerge/
>there are 200 lines of well-commented C; however,
>there may be a bug that allocates too much memory
>(one block per file), so my application runs out
>of memory. :\
>    If you (anyone) can work it out and/or bring it into
>    rsync as a new feature, that would be great. Please
>    keep the author and myself in the loop!
>
>Do a search for "faster-dupemerge"; you'll find mentions of it in the
>dirvish archives, where I describe how I routinely use it to hardlink
>together filesystems in the half-terabyte-and-above range without
>problems on machines that are fairly low-end these days (a gig of RAM,
>a gig or so of swap, very little of which actually gets used by the
>merge).  Dirvish uses -H in rsync to do most of the heavy lifting, but
>large movements of files from one directory to another between backups
>won't be caught by rsync*.  So I follow dirvish runs with a run of
>faster-dupemerge across the last two snapshots and across every
>machine being backed up (e.g., one single run that includes two
>snapshots per backed-up machine); that not only catches file movements
>within a single machine, but also links together backup files -across-
>machines, which is quite useful when you have several machines which
>share a lot of similar files (e.g., the files in the distribution
>you're running), or if a file moves from one machine to another, etc,
>and saves considerable space on the backup host.  [You can also trade
>off speed for space, e.g., since the return on hardlinking zillions of
>small files is relatively low compared to a few large ones, you can
>also specify "only handle files above 100K" or whatever (or anything
>else you'd like as an argument to "find") and thus considerably speed
>up the run while not losing much in the way of space savings; I
>believe I gave some typical figures in one of my posts to the dirvish
>lists.  Also, since faster-dupemerge starts off by sorting the results
>of the "find" by size, you can manually abort it at any point and it
>will have merged the largest files first.]
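
For concreteness, here is a rough sketch of that strategy in plain
shell. This is not faster-dupemerge itself: the snapshot paths, the
100k threshold, and the compare-against-previous-file logic are
illustrative only (GNU find/sort and a single filesystem assumed).

    # Find regular files over 100K, sort by size with the largest
    # first (so aborting early still merges the big files first),
    # then hardlink byte-identical neighbors.  A real tool also
    # hashes contents; comparing only against the previous file can
    # miss duplicates when several same-size files differ.
    find /backup/snap.old /backup/snap.new -type f -size +100k \
        -printf '%s\t%p\n' | sort -rn |
    while IFS="$(printf '\t')" read -r size file; do
        if [ "$size" = "$prev_size" ] && cmp -s "$prev_file" "$file"
        then
            ln -f "$prev_file" "$file"  # replace dup with a hardlink
        else
            prev_size=$size prev_file=$file
        fi
    done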
>
>http://www.furryterror.org/~zblaxell/dupemerge/dupemerge.html is the
>canonical download site, and mentions various other approaches and
>their problems.  (Note that workloads such as mine will also require
>at least a gig of space in some temporary directory that's used by the
>sort program; fortunately, you can specify on the command line where
>that temp directory will be, and it's less than 0.2% of the total
>storage of the filesystem being handled.)
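
On the temp-space point: GNU sort accepts -T to choose where those
spill files go, so a run like the sketch above can be pointed at
whatever scratch area has room (the path here is only an example):

    sort -rn -T /var/tmp/dupemerge-scratch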
>
>* [Fuzzy-match, I believe, only looks in the current directory,
>unless later versions can be told to look elsewhere and I've somehow
>missed that.  If I -have- missed it, it would be a nice addition to
>be able to specify extra directories (and/or trees) in which
>fuzzy-match should look, although in the limit that might require a
>great deal of temporary space and run slowly.]


Thanks for the notes. I keep ./0 ./1 ./2 ./3, which are incomplete,
sub-day, daily, and weekly hardlink snapshots, with a system to
move/purge the timestamped directories between them. I'm planning
to run *some*sort*of*dupmerge* individually on ./1 ./2 ./3 each
time they get updated, roughly as in the sketch below. This is to
address multiple users downloading the same source, etc., i.e.
files that are not necessarily in adjacent snapshots but whose
space can still be recovered by hardlinking across the weekly
snapshots.
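
Something like this per-level pass is what I have in mind, where
merge-level is a hypothetical stand-in for whichever dupmerge
variant I settle on:

    # Re-merge each completed snapshot level after it is rotated;
    # ./0 is skipped because it may still be incomplete.
    for level in 1 2 3; do
        merge-level "./$level"
    done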

I'm working on a feature to preserve the status of the newer ctime
while linking to the older mtime: http://metrg.net/pub/script/dupmerge.sh
I'm revisiting the system because a recursive owner/mode change
caused a 15 GB hit. Maybe I'll use, or borrow from, faster-dupemerge.
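
Roughly, the idea is the following (a sketch only, not the actual
dupmerge.sh; GNU stat is assumed, and $old and $new stand for two
byte-identical copies, $new carrying the fresher owner/mode):

    # Keep the older inode (older mtime) but carry the newer copy's
    # owner and mode onto it, so a recursive chown/chmod doesn't
    # cost a full extra copy at the next merge.
    if cmp -s "$old" "$new"; then
        owner=$(stat -c '%u:%g' "$new")   # numeric uid:gid
        mode=$(stat -c '%a' "$new")       # octal permissions
        ln -f "$old" "$new"    # both names now share the old inode
        chown "$owner" "$old"  # re-apply the newer status ...
        chmod "$mode" "$old"   # ... to the shared inode
    fi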

Is there a way to make rsync apply the newer status to the older
inode when only the status has changed?

Regards,
// George


-- 
George Georgalis, information systems scientist <IXOYE><

