TODO hardlink performance optimizations

jw schultz jw at pegasys.ws
Wed Jan 7 08:03:13 GMT 2004


On Wed, Jan 07, 2004 at 01:04:34AM -0600, John Van Essen wrote:
> On Tue, 6 Jan 2004 15:11:54 -0800, jw schultz <jw at pegasys.ws> wrote:
> > On Tue, Jan 06, 2004 at 02:04:19AM -0600, John Van Essen wrote:
> >> On Mon, 5 Jan 2004, jw schultz <jw at pegasys.ws> wrote:
> > [snip]
> >> > union links {
> >> >         struct idev {
> >> >                 INO64_T inode;
> >> >                 DEV64_T dev;
> >> >         }
> >> >         struct hlinks {
> >> >                 int                count;
> >> >                 file_struct        **link_list;
> >> >         }
> >> > }
> >> > 
> > [snip]
> >> 
> >> Now that you have opened the door for re-using the inode and
> >> dev area in file_struct, I can make my initial suggestion that
> >> requires two 'additional' pointers in file_struct.  I didn't
> >> suggest it before because I didn't want to add to the struct.
> >> 
> >> I propose to reuse the DEV amd INODE areas to store two new
> >> file_struct pointers (the 2nd part of the union).  These would
> >> be set during the post-qsort phase during init_hard_links.
> >> 
> >> The first pointer links together the file_structs for each
> >> hardlink group in a list in the same order as in the qsort
> >> so they can be walked by your proposed link-dest method.
> >> 
> >> The second pointer points to the head (first file_struct of
> >> a hardlink group).
> >> 
> >> For each file_struct that is modified in this way a flag bit
> >> needs to be set indicating so.
> > 
> > No it doesn't.  After we walk the qsorted hlink_list all
> > file_structs->links pointers either point to a hlinks or
> > are NULL.  If there are no links file_structs->links is
> > NULL.
> 
> If you qsort the entire list without first filtering out any
> entries (e.g. !IS_REG), then yes, that would be true.

Either NULL it during the hlink_list walk or don't populate
file_struct->links while building the file list (saving a
malloc) for unlinkable files.

> >> Then, if the file_struct address equals its head pointer, then
> >> you are at a head of a hardlink list which can be processed
> >> using the new method using link-dest that you outlined earlier.
> > 
> > So you are proposing a singly linked list:
> >          struct hlinks {
> >                  file_struct                *head;
> >                  file_struct                *next;
> >          }
> 
> Yes.
>  
> > That would work too.  With the pointers into the array
> > you compare the first in the array with yourself instead of
> > head.  I preferred knowing in advance how many links there
> > were.
> 
> But there is no need to know exactly how many links, is there?

Not at this time.

> >> I'm still not clear on what exactly will happen during that new
> >> processing method.  I assume that if the head file does not exist,
> >> it will find any existant file and hardlink it to the head file,
> >> and if not found, will transfer the head file.  Is that correct?
> > 
> > Not exactly.  Like my earlier post if the head (your term)
> > doesn't exist (existence including compare_dest) we iterate
> > over the link set and the first that does exist is used for
> > fnamecmp.  But that is phase two or three.
> 
> Let me put it another way...  Using your proposed method, after
> the 'head' file is processed, will it now exist on the receiver
> so that its siblings can be hardlinked to it?

Yes.

> >> Processing of non-head files (file_struct address not equal to
> >> head pointer) can be skipped since they are hardlinked later.
> >> 
> >> The final walk-through that creates the hardlinks for the
> >> non-head files can walk the qsorted list and use the head
> >> pointer as the target for the hardlinks of the non-nead files.
> > 
> > Actually, if we could do the hardlinks of non-head files as
> > we encounter them while walking the file list and used your
> > singly linked lists we could free the hlink_list after
> > traversal.  The problem is that we would need to be sure the
> > head had completed processing so the hardlink would have to
> > be created in receiver and that gets ugly.
> 
> Right.  You've just explained the here-to-fore unknown reason
> why that sibling hlinking was being done in a separate, final
> phase.  If we keep it that way, I'd like to see a comment added
> to explain what you just explained.
> 
> But here's an idea, still using my proposed head/next struct...
> 
> - Make *next a circular list (the last points to the first).
> - Leave *head NULL initially.
> - When a file of a hlink group is examined, and *head is NULL,
>   then it is the first one of that group to be processed, and
>   it becomes the head.  All the *head pointers are set to its
>   file_struct address.
> - Subsequent processing of siblings can immediately hardlink
>   them to the head file.
> 
> The drawback is that this will invoke CopyOnWrite (as dicussed
> earlier in this thread).  To avoid that, *head would have to
> point outside to a group head pointer, which would then be set.
> So you'd need an array of pointers of the same size as the
> number of distinct hardlink groups.  Say!  We already have the
> sorted hlink_list.  We could just point to the first element
> of the group (setting it to NULL initially) after creating
> the circularly linked list.

The hardlinks have to be created in the receiver but the
logic of selecting which one is the target/head must happen
before the fork so the receiver can act on it.  This means
that the data related to these must become fixed before the
fork.  Doing that also avoids a COW.

The list** reuses the qsorted hlink_list.  The linked list
with head means we only need a list of heads.  Pick one or
the other and don't modify the data after the fork.

I'm inclined right now to go with the non-looping linked
list and a list of heads so that in the generator we

	if (file->links && file->links->head != file)
		return;
	/* test will be a little more complicated because
	 * file is actually a copy
	 */

Then do_hard_links() would

	for (i = 0; i < hlink_count; i++) {
		src = hlink_list[i];
		while (dest = hlink_list[i]->links->next) {
			do_a_link(src, dest);
		}
	}

-- 
________________________________________________________________
	J.W. Schultz            Pegasystems Technologies
	email address:		jw at pegasys.ws

		Remember Cernan and Schmitt


More information about the rsync mailing list