TODO hardlink performance optimizations
jw schultz
jw at pegasys.ws
Mon Jan 5 21:12:16 GMT 2004
On Mon, Jan 05, 2004 at 05:34:04AM -0600, John Van Essen wrote:
> On Sun, 4 Jan 2004 16:23:17 -0800, jw schultz <jw at pegasys.ws> wrote:
> > On Sun, Jan 04, 2004 at 05:30:03AM -0800, jw schultz wrote:
> >
> > I'm noodling on the idea of purging the hlink_list of files
> > that don't have links.
>
> I'm all for it.
>
> > Start by getting rid of the !IS_REG files because we don't
> > link those anyway.
>
> Right.
>
> > After the hlink_list is sorted hardlinked files are adjacent
> > to one another in the list so a single pass walk could do an
> > implace purge (collapse) of non-linked files.
> >
> > At one pointer per file space savings wouldn't be that large.
> > But for every regular file we to a binary search of the
> > hlink_list. 100,000 binary searches of a 200 member list
> > would be a lot faster than 100,000 searches of a 105,000
> > member list.
>
> Indeed.
>
> > If a flag were added to the file_struct the purge pass could
> > set it if the file had any links so the 100,000 binary
> > searches of a 105,000 member list could be reduced to 200
> > binary searches of a 200 member list.
>
> That's what I was thinking when I had proposed that the sender
> include a flag bit that indicates that a regular file has more
> than one node, and is thus hardlinked. But your idea works
> without needing that. I like it.
>
> > We do init_hard_links
> > prior to forking so the flag wouldn't suffer COW.
> >
> > My current version of the patch does get rid of the
> > non-regular files but i don't do the non-linked purge.
> > It is just an idea at this point. I'm not quite motivated
> > enough to do it yet.
> >
> > Hmm, 2.6.1 is shaping up to be a performance release.
>
> Yes. Maybe it can be released before another 11 months go by? :)
>
> Remember that there is also the issue of hardlinked files being
> transferred unnecessarily when the first file in the alphabetically
> sorted hardlink group does not exist, but a later one does that
> can be used as the 'master'.
>
> I've thought about this some more...
>
> We need to make rsync flexible enough to choose a different file
> as a master if the first one in the sorted list does not exist.
>
> Specifically, it needs to choose the first file in a group that it
> discovers _does_ exist on the receiver (not just the first file in
> the sorted list). That file becomes the master for the hardlinks
> (and might have to be transferred if it doesn't "match").
>
> If it gets to processing the final file in the group, and still
> none of them has been found to exist on the receiver, then that
> final file becomes the master file and must be transferred.
>
> So we'd need these things:
>
> - a flag passed to check_hard_link() indicating whether or not the
> file being checked already exists on the receiver (it doesn't
> have to match, neccessarily) to avoid unnecessary testing within
> this routine.
>
> - a method to indicate that a particular file is the 'master' for
> a hardlink group. If a master has been found, the remaining files
> can be skipped (not transferred).
>
> - a method to indicate that a file has already been processed by
> check_hard_link (so we can determine when the last remaining file
> is being checked).
>
> We could use extra bits in the flag word in file_struct to indicate
> 'master' and 'already checked', but recent revelations about CoW
> optimization make me reluctant to do that. Another possibility is
> to use a byte array corresponding to the index array.
While the idea of per file status flags is attractive for
more than just this lets see what can be done short of going
there.
You list three things but i think we don't need any of them.
Thanks to compare_dest (and link_dest) we already have the
groundwork laid for the basis file to have a different path
than the destination. Let me describe a few steps that get
us the same a different way.
- Extract the fnamecmp calculation to one separate function,
basis_path(), called in both generator.c and receiver.c.
It is a little surprising this has not already been done.
- init_hard_links() sets nlinks in file_struct.
- change the return value of check_hard_link() from int to
file_struct** so it returns the pointer to the first entry
in the hlink_list array that matches, or NULL if false.
- basis_path() now can check nlinks and if !0 iterate over
the paths provided by the list returned from
check_hard_link() looking for the first that exists in
dest or compare_dest.
An alternative would be to have check_hard_link() return
nlinks and set an optional pointer into the hlink_list but i
prefer putting the nlinks count in file_struct when we do
init_hard_links().
This way the first file of a link set will be the source,
destination, and link target but if it doesn't already exist
as a destination the other links will be checked for
existence for use as the basis. All subsequent processing
of the links remains unchanged.
--
________________________________________________________________
J.W. Schultz Pegasystems Technologies
email address: jw at pegasys.ws
Remember Cernan and Schmitt
More information about the rsync
mailing list