TODO hardlink performance optimizations

John Van Essen vanes002 at umn.edu
Wed Jan 7 07:04:34 GMT 2004


On Tue, 6 Jan 2004 15:11:54 -0800, jw schultz <jw at pegasys.ws> wrote:
> On Tue, Jan 06, 2004 at 02:04:19AM -0600, John Van Essen wrote:
>> On Mon, 5 Jan 2004, jw schultz <jw at pegasys.ws> wrote:
> [snip]
>> > union links {
>> >         struct idev {
>> >                 INO64_T inode;
>> >                 DEV64_T dev;
>> >         }
>> >         struct hlinks {
>> >                 int                count;
>> >                 file_struct        **link_list;
>> >         }
>> > }
>> > 
> [snip]
>> 
>> Now that you have opened the door for re-using the inode and
>> dev area in file_struct, I can make my initial suggestion that
>> requires two 'additional' pointers in file_struct.  I didn't
>> suggest it before because I didn't want to add to the struct.
>> 
>> I propose to reuse the DEV amd INODE areas to store two new
>> file_struct pointers (the 2nd part of the union).  These would
>> be set during the post-qsort phase during init_hard_links.
>> 
>> The first pointer links together the file_structs for each
>> hardlink group in a list in the same order as in the qsort
>> so they can be walked by your proposed link-dest method.
>> 
>> The second pointer points to the head (first file_struct of
>> a hardlink group).
>> 
>> For each file_struct that is modified in this way a flag bit
>> needs to be set indicating so.
> 
> No it doesn't.  After we walk the qsorted hlink_list all
> file_structs->links pointers either point to a hlinks or
> are NULL.  If there are no links file_structs->links is
> NULL.

If you qsort the entire list without first filtering out any
entries (e.g. !IS_REG), then yes, that would be true.

>> Then, if the file_struct address equals its head pointer, then
>> you are at a head of a hardlink list which can be processed
>> using the new method using link-dest that you outlined earlier.
> 
> So you are proposing a singly linked list:
>          struct hlinks {
>                  file_struct                *head;
>                  file_struct                *next;
>          }

Yes.
 
> That would work too.  With the pointers into the array
> you compare the first in the array with yourself instead of
> head.  I preferred knowing in advance how many links there
> were.

But there is no need to know exactly how many links, is there?

>> I'm still not clear on what exactly will happen during that new
>> processing method.  I assume that if the head file does not exist,
>> it will find any existant file and hardlink it to the head file,
>> and if not found, will transfer the head file.  Is that correct?
> 
> Not exactly.  Like my earlier post if the head (your term)
> doesn't exist (existence including compare_dest) we iterate
> over the link set and the first that does exist is used for
> fnamecmp.  But that is phase two or three.

Let me put it another way...  Using your proposed method, after
the 'head' file is processed, will it now exist on the receiver
so that its siblings can be hardlinked to it?

>> Processing of non-head files (file_struct address not equal to
>> head pointer) can be skipped since they are hardlinked later.
>> 
>> The final walk-through that creates the hardlinks for the
>> non-head files can walk the qsorted list and use the head
>> pointer as the target for the hardlinks of the non-nead files.
> 
> Actually, if we could do the hardlinks of non-head files as
> we encounter them while walking the file list and used your
> singly linked lists we could free the hlink_list after
> traversal.  The problem is that we would need to be sure the
> head had completed processing so the hardlink would have to
> be created in receiver and that gets ugly.

Right.  You've just explained the here-to-fore unknown reason
why that sibling hlinking was being done in a separate, final
phase.  If we keep it that way, I'd like to see a comment added
to explain what you just explained.

But here's an idea, still using my proposed head/next struct...

- Make *next a circular list (the last points to the first).
- Leave *head NULL initially.
- When a file of a hlink group is examined, and *head is NULL,
  then it is the first one of that group to be processed, and
  it becomes the head.  All the *head pointers are set to its
  file_struct address.
- Subsequent processing of siblings can immediately hardlink
  them to the head file.

The drawback is that this will invoke CopyOnWrite (as dicussed
earlier in this thread).  To avoid that, *head would have to
point outside to a group head pointer, which would then be set.
So you'd need an array of pointers of the same size as the
number of distinct hardlink groups.  Say!  We already have the
sorted hlink_list.  We could just point to the first element
of the group (setting it to NULL initially) after creating
the circularly linked list.

>  One possibility
> would be to keep the trailing walk-through but reduce the
> hlink_list to an array of heads.

Keep the array and use just the heads, as per above.

>> No extra memory required beyond that already required for the
>> qsorted pointer list.
>> 
>> No binary search required during the file processing phase.
>> 
>> The key is getting the link-dest method working for this.
> 
> Huh?  This doesn't have any adverse affect on link-dest.
> It only has a potential positive affect on
> (link|compare)_dest.

I was referring to using the link-dest routine(s) as per your
earlier suggestion (I thought there were some small changes
needed to make this work but maybe I misunderstood).
-- 
        John Van Essen  Univ of MN Alumnus  <vanes002 at umn.edu>



More information about the rsync mailing list