TODO hardlink performance optimizations

jw schultz jw at pegasys.ws
Tue Jan 6 00:18:30 GMT 2004


On Mon, Jan 05, 2004 at 02:46:52PM -0800, Wayne Davison wrote:
> On Mon, Jan 05, 2004 at 01:12:16PM -0800, jw schultz wrote:
> > - init_hard_links() sets nlinks in file_struct.
> 
> I assume you mean by counting adjacent runs of identical inodes after
> the sort is done (while eliminating single-item runs from the list at

Yes.  That is what i meant.

> the same time).  I'd like to avoid adding the nlinks variable to the
> per-file structure that would add more memory when --hard-link wasn't
> used, though.  One way to do this and to cut down on all the adjacent-
> entry comparisons would be to create a parallel array (after the sort)
> of counts back to the first entry.  This way, whenever we have a match,
> we know exactly how many items back the first item is in the array.  It
> would also let you scan to the end of each grouping by stopping when you
> get down to an item with a 0 offset (which would require an extra item
> at the end of the list for a terminating 0).

I don't like that idea at all.   It puts us back into the
business of doing a link lookup for every regular file.

> > - change the return value of check_hard_link() from int to
> >   file_struct** so it returns the pointer to the first entry
> >   in the hlink_list array that matches, or NULL if false.
> 
> Just to be sure we're on the same page:  we need three states:  we need
> to know "master linked" (with pointer to first item) "slave linked" (so
> we can skip it until the end-processing) and "not linked".  If the first
> two states return a pointer to the first item, we can differentiate them
> by checking if the returned value is "us" or not.
> 
> > - basis_path() now can check nlinks and if !0 iterate over the paths
> >   provided by the list returned from check_hard_link() looking for the
> >   first that exists in dest or compare_dest.
> 
> This seems like a good way to go to me.

Your comments have promted a bit more thought and i'm
thinking we might eliminate check_hard_link().  If we walk the
hlink list after sorting we have enough info to push the
data back up to the file_list and eliminate searching
altogether.

One way of doing that would be to have

struct hlinks {
	int		count;	/* Count of entries in links list */
	file_struct	**link_list; /* array of links */
}

Put a pointer to this in the file_struct, NULL means no links.

Put into perspective, this reduces hlink_list by 68bytes per
entry, eliminates non-linked entries from hlink_list, adds 4
bytes per file in file_struct and adds another 8 bytes per
link set.  In sum, the worst case (every file has two links
and no !IS_REG) this saves 60 bytes per file and eliminates
the binary searches.

If you really want to save some per-file space when not
preserving links (hand written): 

  struct file_struct {

-	INO64_T inode;
-	/** Device this file lives upon */
-	DEV64_T dev;
+	hlink_info *links

  }

union links {
	struct idev {
		INO64_T inode;
		DEV64_T dev;
	}
	struct hlinks {
		int		count;
		file_struct	**link_list;
	}
}

Or use casts with file_struct.links being void
so access would use link_set = (hlinks)(file->links).

Once we have identified the link sets we no longer have any
use for the device and inode numbers so we overwrite it with
struct hlinks.  We can free file->links if nlinks == 1.
Don't even populate file->links if we aren't linking or if
!IS_REG.  Also no point populating file->links if dev and
inode are NULL;

Freeing a 16byte structure interleaved with file_struct and
string space is unlikely to be much benefit so overwriting
the union may be better than a free+malloc pair.

This saves another 12bytes per file when no -H and when we
do -H it adds 4 bytes/file which should be offset by the
savings on non-linked files.


-- 
________________________________________________________________
	J.W. Schultz            Pegasystems Technologies
	email address:		jw at pegasys.ws

		Remember Cernan and Schmitt


More information about the rsync mailing list