rsyncing many files and hard links: optimisation suggestions?

Judith Retief JudithR at inet.co.za
Fri Sep 29 10:10:51 GMT 2006


I suspect the standard optimisation - breaking up the rsync into smaller
batches - is not going to work for us. This is our situation:

We rsync two directories in /spool to a backup. They are large: almost 2 million
files in the first dir, with about 4 million hard links in the second one linking
to them. The files don't change much: a few thousand are added daily, with
about 3 hard links created for each. Another few hundred might be deleted or
changed. We use
	rsync -a -H --delete 
This worked very well at first, but since reaching 1 million files performance
has dropped dramatically.

Knowing about the memory problems when rsyncing lots of files, my first option
was to break the rsync down into batches. I don't think this will work, though:

- Firstly: rsync only uses 30% of the memory and no swap is used, so memory
isn't the issue.
- Secondly: I think the hard links won't be created correctly. This is why:

If I have hard links like so:
	/spool/foo/real-file
	/spool/bar/bar1/real-file -> /spool/foo/real-file
	/spool/bar/bar2/real-file -> /spool/foo/real-file

then
	rsync -a -H user@host:/spool/foo /spool
	rsync -a -H user@host:/spool/bar /spool

will result in _two_ copies of real-file on the client. And if the 'bar'
rsync is split into two rsync batches:
	rsync -a -H user@host:/spool/bar/bar1 /spool/bar
	rsync -a -H user@host:/spool/bar/bar2 /spool/bar

I'm going to have three copies of real-file, rather than one copy and two
hard links, won't I?
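
One way to check this empirically on a small test tree is to compare inodes
after each variant of the transfer. The rsync man page notes that -H only
detects hard links between files inside the same transfer set, so separate
runs would be expected to produce separate copies. The sketch below is not
from the original post: same_file and multi_link_count are hypothetical helper
names, the demo uses a scratch directory rather than /spool, and cp stands in
for what a split batch is expected to produce. It assumes a shell whose test
builtin supports -ef (bash and dash do) and a find with the POSIX -links
primary.

```shell
#!/bin/sh
# Hypothetical helpers for checking whether hard links survived a transfer.

# same_file A B: succeeds if A and B refer to the same inode on the same device.
same_file() {
    [ "$1" -ef "$2" ]
}

# multi_link_count DIR: number of regular-file paths under DIR whose
# link count is greater than 1.
multi_link_count() {
    find "$1" -type f -links +1 | wc -l | tr -d ' '
}

# Demo: rebuild the foo/bar layout from the example in a scratch directory.
demo=$(mktemp -d)
mkdir -p "$demo/foo" "$demo/bar/bar1" "$demo/bar/bar2"
echo data > "$demo/foo/real-file"
# A true hard link, as a single -H transfer should produce.
ln "$demo/foo/real-file" "$demo/bar/bar1/real-file"
# An independent copy, standing in for the expected result of a separate run.
cp "$demo/foo/real-file" "$demo/bar/bar2/real-file"

if same_file "$demo/foo/real-file" "$demo/bar/bar1/real-file"; then
    echo "bar1: hard link preserved"
fi
if ! same_file "$demo/foo/real-file" "$demo/bar/bar2/real-file"; then
    echo "bar2: separate copy"
fi
echo "multiply-linked paths: $(multi_link_count "$demo")"
```

Pointing same_file at a pair of paths on the backup after a real batch run
would show directly whether the link was kept. The scratch directory in $demo
is left in place and can be removed afterwards.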

When I run strace on rsync on the client, it's almost invariably busy
lstat'ing files on the local drive. I guess this is the receiver building up
its file list? And when the file list contains lots of hard links, does it
then have to sort all the files in one huge list?

If the problem is the actual disk access, then I can't think of anything to
do. If it is the sorting, then cutting down the batch sizes should help, at
the expense of having copies of some files rather than hard links. 
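
To tell the two cases apart, it may help to time a bare metadata walk of the
destination, which stats every entry but does no list sorting or link
matching. This is a minimal sketch, assuming a POSIX shell and find; stat_walk
is a hypothetical helper name, and the demo runs on a scratch directory
instead of /spool. If the walk alone accounts for most of the rsync run time,
disk access is the likelier bottleneck. Attaching strace -c -p <pid> to the
receiver is another way to get a per-syscall time summary.

```shell
#!/bin/sh
# stat_walk DIR: visit every entry under DIR and print how many were seen.
# The "-links +0" test matches everything (each entry has at least one link)
# but forces find to stat each entry, approximating the lstat traffic
# observed under strace.
stat_walk() {
    find "$1" -links +0 | wc -l | tr -d ' '
}

# Demo tree: three files in foo, each hard-linked once into bar.
scratch=$(mktemp -d)
mkdir -p "$scratch/foo" "$scratch/bar"
for i in 1 2 3; do
    echo "$i" > "$scratch/foo/file$i"
    ln "$scratch/foo/file$i" "$scratch/bar/file$i"
done

# Time the walk on its own; on the real tree, pass the backup path instead.
start=$(date +%s)
entries=$(stat_walk "$scratch")
end=$(date +%s)
echo "visited $entries entries in $((end - start))s"
```

A second walk over the same tree will usually be faster because the metadata
is cached, so the first (cold) run is the one to compare against rsync.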

Or am I missing a major point here?


