TODO hardlink performance optimizations

John Van Essen vanes002 at umn.edu
Sun Jan 4 12:35:03 GMT 2004


Lester,

You articulated your situation clearly enough for me.  Thanks.

I'll address your issue of rsync exhausting all of the RAM and swap
when it runs locally to sync /vol/N to /vol/N_mirror.

If you haven't read jw schultz's "How Rsync Works" page, here's the link:

  http://www.pegasys.ws/how-rsync-works.html

The sender, receiver, and generator each have a full copy of the file
list (each file's entry uses 100 bytes on average).

Additionally, the --hard-links option creates yet *another* full copy of
the file list in the receiver, so that's even more memory consumed.

So you are in a world o' hurt rsyncing an entire /vol/N internally
with --hard-links, since there will be FOUR copies of the file list.
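
To put rough, purely illustrative numbers on it (using the ~100 bytes
per entry figure from above):

  1,000,000 files x 100 bytes/entry x 4 copies = ~400 MB

and a snapshot tree like yours can easily hold several million entries
once you multiply hosts by days of history, which is how the memory use
climbs into the gigabytes.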

I'd suggest breaking the /vol/N rsync up into separate rsyncs, one for
each of the maxdepth-1 hierarchies.  If I understand your situation
correctly, all hard link groups are self-contained within each of those
hierarchies, so you will be OK.
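
A minimal sketch of what that could look like (assuming sh or bash, and
using /vol/6 and its mirror purely as an example -- adjust the paths and
the rsync options to match whatever you run now):

  #!/bin/sh
  # One rsync per top-level hierarchy, so each invocation only holds the
  # file list (plus the extra --hard-links copy) for that one subtree.
  SRC=/vol/6/system_backups             # example source volume
  DST=/vol/6_mirror/system_backups      # example mirror volume

  for dir in "$SRC"/*/ ; do
      name=`basename "$dir"`
      rsync -aH --delete "$SRC/$name/" "$DST/$name/"
  done

Since each hard link group lives entirely inside one of those top-level
directories, -H still reproduces the links correctly on the mirror side,
but the per-run file list stays much smaller.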

I've modified hlink.c to use a list of file struct pointers instead of
copies of the actual file structs themselves, so that will save memory.
I'll submit that patch for review in a day or two after I've tested it.
-- 
        John Van Essen  Univ of MN Alumnus  <vanes002 at umn.edu>



On Sat, 3 Jan 2004, Lester Hightower <hightowe-rsync-list at 10east.com> wrote:
> Hello,
> 
> I read with interest the mailing list thread found here:
> 
>         http://marc.10east.com/?t=107160967400007&r=1&w=2
> 
> We have a "situation" with rsync and --hard-links that was the reason for
> my search in MARC's rsync list archive that turned up the thread shown
> above.  After reading through that thread, and other information on this
> topic, I believe that sharing our situation with you will in itself prove
> to be a good contribution to rsync (which is an excellent tool, BTW).
> 
> So, here goes:
> 
> We have a process on a backup server (I called it "s" below), that each
> night rsyncs a full copy of /, /var, and /usr from a great number of
> systems.  As a rule we put /, /var, and /usr on separate partitions, but
> that detail is not important.  What is important is to understand exactly
> how we do these nightly, full system backups.  First, let me start by
> showing you what a small set of the system_backups hierarchy looks like:
> 
> root at s:/vol/6/system_backups# find . -type d -maxdepth 1
> .
> ./client1
> ./docs1.colo1
> ./docs2.colo1
> ./ipfw-internal.colo1
> ./ipfw1
> ./ipfw2
> ./docsdev1
> 
> root at s:/vol/6/system_backups# find . -type d -maxdepth 2|head -25|egrep -v '^\./[^/]+$'|sort
> .
> ./client1/20031223
> ./client1/20031224
> ./client1/20031225
> ./client1/20031226
> ./client1/20031227
> ./client1/20031229
> ./client1/20040102
> ./client1/current
> ./docs1.colo1/20031219
> ./docs1.colo1/20031223
> ./docs1.colo1/20031224
> ./docs1.colo1/20031225
> ./docs1.colo1/20031226
> ./docs1.colo1/20031227
> ./docs1.colo1/20031229
> ./docs1.colo1/20040102
> ./docs1.colo1/current
> ./docs1.colo1/image-20031218
> ./docs2.colo1/20031218
> ./docs2.colo1/20031219
> ./docs2.colo1/current
> 
> OK, that gives you an idea of how the hierarchy looks.  Here is the critical
> part, though.  The logic that creates these each night looks like this:
> 
> TODAY=`date +%Y%m%d`                # i.e. YYYYMMDD for today
> for HOST in <hosts>; do
>   cp -al $HOST/current $HOST/$TODAY
>   # ...now rsync remote $HOST into my local $HOST/current...
> done
> 
> For those not familiar with the -l option to cp:
> 
> root at s:/vol/6/system_backups# man cp|grep -B1 -A1 'hard links instead'
>        -l, --link
>               Make hard links instead of copies  of  non-directo-
>               ries.
> 
> What we end up with is a tree that is _very_ fast to rsync each night,
> with revision history going back indefinitely, at the disk usage cost of
> only files that change (rare) and the directories (about 8MB per machine).
> Note, however, that the _vast_ majority of file entries on these file
> systems (system_backups) are hard links.  Many inodes will have 20, 30, or
> more filename entries pointing at them (depending strictly on how much
> history we choose to keep).
> 
> Keeping all that in mind, now understand that server "s" has /vol/(0..14)
> installed in its disk subsystem, and (the important part) each of those
> volumes has a slow mirror -- one rsync per day.  We do not keep those
> mirrors mounted, but you could think of /vol/0 having a /vol/0_mirror
> partner that is rsynced once every twenty-four hours.
> 
> All of this works absolutely perfectly, with one exception: the daily
> rsync of /vol/N to /vol/N_mirror for volumes that hold system_backups, and
> the reason appears to be the --hard-links flag.  Rsync, which is running
> completely locally for /vol/N to /vol/N_mirror work, exhausts all of the
> RAM and swap allocated to it in this machine (3GB), sends the machine into
> a maddening swap spiral, etc.  The issue only exists for /vol/N vols where
> we have "system_backups" stored.
> 
> I wanted to share this circumstance with you because my reading of the
> discussion on this topic, though encouraging, left me with the impression
> that some might not be thinking about situations like this one, where it
> is perfectly normal and desired to have many hard links to one inode, and
> hundreds of thousands of hard links in one file system.
> 
> To give you an idea of the type of information one can glean from such a
> backup process, here are a couple of examples.  Keep in mind that files
> with link-count of 1 changed on the date indicated by the directory:
> 
> root at s:/vol/6/system_backups/client1# find 20040102 -links 1 -type f|head -2
> 20040102/root/.bash_history
> 20040102/tmp/.803.e4a1
> 
> root at s:/vol/6/system_backups/client1# diff 20040102/root/.bash_history current/root/.bash_history
> 1d0
> < lynx http://localhost:1081 --source | grep Rebuilding | head -1 | cut 10-
> 500a500
>> ssh ljacobs at supermag
> 
> root at s:/vol/6/system_backups/client1# find 20040102 -links 1 -type f|cut -d/ -f1,2,3,4|sort |uniq -c
>       1 20040102/SYMLINKS
>       1 20040102/root/.bash_history
>       1 20040102/tmp/.803.e4a1
>       1 20040102/usr/local/BMS
>      54 20040102/usr/local/WWW
>      17 20040102/usr/local/etc
>       1 20040102/usr/sbin/symlinks
>      42 20040102/vol/1/bmshome
>       1 20040102/vol/2/webalizer_working
>      12 20040102/vol/3/home
> 
> You'll notice that the hard link counts in this file system are not very
> high yet (only 8), but it is _very_ intensive to have rsync try to sync
> /vol/6/system_backups/client1 to /vol/6_mirror/system_backups/client1 with
> the --hard-links flag set:
> 
> root at s:/vol/6/system_backups/client1# find 20040102 ! -links 1 -type f -printf '%n\t%i\t%s\t%d\t%h/%f\n'|head -50|tail -5
> 8       11323   10108   2       20040102/bin/mknod
> 8       11324   25108   2       20040102/bin/more
> 8       11325   60912   2       20040102/bin/mount
> 8       11326   10556   2       20040102/bin/mt-GNU
> 8       11327   33848   2       20040102/bin/mv
> 
> 
> If there is anything that I did not articulate clearly, if you have any
> followup questions, if you would like us to test some code for you guys,
> or if there is anything else that you feel that I can do to help, please
> do not hesitate to ask.
> 
> Sincerely,
> 
> --
> Lester Hightower
> 10East Corp.
> 
> 
> p.s.  10East created and now supports the MARC system (marc.10east.com) in
> various ways, including hosting it, though it is primarily administered by
> Mr. Hank Leininger, a good friend and former employee.  I didn't see any
> mention of MARC in the rsync web-site.  Please feel free to use it.



