TODO hardlink performance optimizations

John Van Essen vanes002 at umn.edu
Mon Jan 5 11:34:04 GMT 2004


On Sun, 4 Jan 2004 16:23:17 -0800, jw schultz <jw at pegasys.ws> wrote:
> On Sun, Jan 04, 2004 at 05:30:03AM -0800, jw schultz wrote:
>> On Sun, Jan 04, 2004 at 06:35:03AM -0600, John Van Essen wrote:
>> > I've modified hlink.c to use a list of file struct pointers instead of
>> > copies of the actual file structs themselves, so that will save memory.
>> > I'll submit that patch for review in a day or two after I've tested it.
>> 
>> I've just done the same.  It reduces the memory requirements
>> of the hlink list to 1/18th.  It is also somewhat faster to
>> build that way because we don't have to walk the list.
>> 
>> If we built the hlink_list one element at a time the way we
>> do the file_list only putting those files that we might link
>> in it it would be smaller but building it would be slower.
> 
> I'm noodling on the idea of purging the hlink_list of files
> that don't have links.

I'm all for it.

> Start by getting rid of the !IS_REG files because we don't
> link those anyway.

Right.

> After the hlink_list is sorted hardlinked files are adjacent
> to one another in the list so a single pass walk could do an
> implace purge (collapse) of non-linked files.
> 
> At one pointer per file space savings wouldn't be that large.
> But for every regular file we to a binary search of the
> hlink_list.  100,000 binary searches of a 200 member list
> would be a lot faster than 100,000 searches of a 105,000
> member list.

Indeed.

> If a flag were added to the file_struct the purge pass could
> set it if the file had any links so the 100,000 binary
> searches of a 105,000 member list could be reduced to 200
> binary searches of a 200 member list.

That's what I was thinking when I had proposed that the sender
include a flag bit that indicates that a regular file has more
than one node, and is thus hardlinked.  But your idea works
without needing that.  I like it.

>  We do init_hard_links
> prior to forking so the flag wouldn't suffer COW.
> 
> My current version of the patch does get rid of the
> non-regular files but i don't do the non-linked purge.
> It is just an idea at this point.  I'm not quite motivated
> enough to do it yet.
> 
> Hmm, 2.6.1 is shaping up to be a performance release.

Yes.  Maybe it can be released before another 11 months go by?  :)

Remember that there is also the issue of hardlinked files being
transferred unnecessarily when the first file in the alphabetically
sorted hardlink group does not exist, but a later one does that
can be used as the 'master'.

I've thought about this some more...

We need to make rsync flexible enough to choose a different file
as a master if the first one in the sorted list does not exist.

Specifically, it needs to choose the first file in a group that it
discovers _does_ exist on the receiver (not just the first file in
the sorted list).  That file becomes the master for the hardlinks
(and might have to be transferred if it doesn't "match").

If it gets to processing the final file in the group, and still
none of them has been found to exist on the receiver, then that
final file becomes the master file and must be transferred.

So we'd need these things:

- a flag passed to check_hard_link() indicating whether or not the
  file being checked already exists on the receiver (it doesn't
  have to match, neccessarily) to avoid unnecessary testing within
  this routine.

- a method to indicate that a particular file is the 'master' for
  a hardlink group.  If a master has been found, the remaining files
  can be skipped (not transferred).

- a method to indicate that a file has already been processed by
  check_hard_link (so we can determine when the last remaining file
  is being checked).

We could use extra bits in the flag word in file_struct to indicate
'master' and 'already checked', but recent revelations about CoW
optimization make me reluctant to do that.  Another possibility is
to use a byte array corresponding to the index array.

Further comments, thoughts, and suggestions (by anyone) are invited.
-- 
        John Van Essen  Univ of MN Alumnus  <vanes002 at umn.edu>



More information about the rsync mailing list