--hard-links performance

Wed Jul 11 05:26:18 GMT 2007

On Fri, Jun 22, 2007 at 03:33:31PM -0400, George Georgalis wrote:
>On Tue, Jun 05, 2007 at 11:11:27AM -0700, Chuck Wolber wrote:
>>On Tue, 5 Jun 2007, Paul Slootman wrote:
>>
>>> > In any case, what's the general consensus behind using the 
>>> > --hard-links option on large (100GB and above) images? Does it still 
>>> > use a ton of memory? Or has that situation been alleviated?
>>> 
>>> The size of the filesystem isn't relevant, the number of hard-linked 
>>> files is. It still uses a certain amount of memory for each hard-linked 
>>> file, but the situation is a lot better than with earlier rsync 
>>> versions. (As always, make sure you use the newest version.)
>>
>>In our case, we store images as hardlinks and would like an easy way to 
>>migrate images from one backup server to another. We currently do it with 
>>a script that does a combination of rsync'ing and cp -al. Our layout is 
>>similar to:
>>
>>image_dir
>>|-- img1
>>|-- img2 (~99% hardlinked to img1)
>>|-- img3 (~99% hardlinked to img2)
>>    .
>>    .
>>    .
>>`-- imgN (~99% hardlinked to img(N-1))
>>
>>
>>Each image in image_dir is hundreds of thousands of files. It seems to me 
>>that even a small amount of memory for each hardlinked file is going to 
>>clobber even the most stout of machines (at least by 2007 standards) if I 
>>tried a wholesale rsync of image_dir using --hard-links. No?
>>
>>If so, then is a "hard link rich environment" an assumption that can be 
>>used to make an optimization of some sort?
>
>I had a C program which would scan directory points and, on some
>criteria (I forget exactly; size and mtime?), decide to unlink
>one file and link the name to the other. I could look for it,
>but no guarantees I'll find it, or soon... It was designed for
>identical files with different names.
>
>You could transfer with tar, then minimize with the program. Of
>course everyone on this list would prefer to use rsync; maybe the
>algorithm could be integrated? :) Maybe I can find the code.
>It was written by a very senior individual...

The program is http://www.ka9q.net/code/dupmerge/
There are 200 lines of well-commented C; however,
there may be a bug: it seems to allocate too much
memory (one block per file), so my application runs
out. :\ If you (anyone) can work it out and/or bring
it into rsync as a new feature, that would be great.
Please keep the author and me in the loop!
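
For anyone who wants to experiment with the idea before digging into
the C, here is a minimal Python sketch of the general technique:
find identical files and replace the duplicates with hard links.
It is NOT dupmerge's code or its criteria (which I don't remember
exactly). As assumptions of mine, it groups files by size and then
compares SHA-256 digests, and the names merge_duplicates,
file_digest and the ".dupmerge-tmp" suffix are made up for
illustration. It ignores differences in owner, mode and mtime, so
treat it as a starting point only.

```python
import hashlib
import os
import stat
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def merge_duplicates(root):
    """Hard-link identical regular files under root.

    Returns the number of names that were re-pointed at another inode.
    Assumption: "identical" means same size and same SHA-256 digest.
    """
    # Pass 1: group by size, collecting every name of each inode.
    # Names that already share an inode are never hashed twice.
    by_size = defaultdict(lambda: defaultdict(list))
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat: do not follow symlinks
            if stat.S_ISREG(st.st_mode) and st.st_size > 0:
                by_size[st.st_size][(st.st_dev, st.st_ino)].append(path)

    relinked = 0
    for inodes in by_size.values():
        if len(inodes) < 2:
            continue  # unique size (or already one inode): nothing to merge
        # Pass 2: hash one name per inode; the first inode seen with a
        # given digest is kept, later ones are linked to it.
        keeper = {}  # (device, digest) -> path of the inode to keep
        for (dev, _ino), paths in inodes.items():
            key = (dev, file_digest(paths[0]))  # hard links cannot cross devices
            if key not in keeper:
                keeper[key] = paths[0]
                continue
            for path in paths:
                # Link under a temporary name, then rename over the
                # duplicate, so the name never disappears midway.
                tmp = path + ".dupmerge-tmp"  # hypothetical suffix
                os.link(keeper[key], tmp)
                os.replace(tmp, path)
                relinked += 1
    return relinked
```

Calling merge_duplicates("/path/to/image_dir") on a copy made with
tar would then collapse identical files back into hard links. Note
that this sketch also keeps one table entry per inode in memory, so
on trees with hundreds of thousands of files per image it has the
same kind of per-file cost being discussed in this thread.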

// George

-- 
George Georgalis, information systems scientist <IXOYE><