TODO hardlink performance optimizations
jw schultz
jw at pegasys.ws
Tue Jan 6 00:18:30 GMT 2004
On Mon, Jan 05, 2004 at 02:46:52PM -0800, Wayne Davison wrote:
> On Mon, Jan 05, 2004 at 01:12:16PM -0800, jw schultz wrote:
> > - init_hard_links() sets nlinks in file_struct.
>
> I assume you mean by counting adjacent runs of identical inodes after
> the sort is done (while eliminating single-item runs from the list at
> the same time).

Yes. That is what I meant.

> I'd like to avoid adding the nlinks variable to the
> per-file structure, since it would add more memory when --hard-link isn't
> used, though. One way to do this and to cut down on all the adjacent-
> entry comparisons would be to create a parallel array (after the sort)
> of counts back to the first entry. This way, whenever we have a match,
> we know exactly how many items back the first item is in the array. It
> would also let you scan to the end of each grouping by stopping when you
> get down to an item with a 0 offset (which would require an extra item
> at the end of the list for a terminating 0).
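For reference, a minimal sketch of that offsets-array scheme
(hypothetical names; it assumes hlink_list[] has already been sorted
so that link sets form adjacent dev/inode runs):

	#include <stdlib.h>

	/* struct file_struct with dev/inode comes from rsync.h */
	static int *build_offsets(struct file_struct **hlink_list, int count)
	{
		/* one extra slot holds the terminating 0 */
		int *offsets = malloc((count + 1) * sizeof (int));
		int i;

		if (!offsets)
			return NULL;
		for (i = 0; i < count; i++) {
			if (i && hlink_list[i]->dev == hlink_list[i-1]->dev
			 && hlink_list[i]->inode == hlink_list[i-1]->inode)
				offsets[i] = offsets[i-1] + 1; /* distance back to run start */
			else
				offsets[i] = 0; /* first entry of a new run */
		}
		offsets[count] = 0; /* terminator: a forward scan stops here */
		return offsets;
	}

A match at index i would then find its first entry at
hlink_list[i - offsets[i]].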
I don't like that idea at all. It puts us back into the
business of doing a link lookup for every regular file.
> > - change the return value of check_hard_link() from int to
> > file_struct** so it returns the pointer to the first entry
> > in the hlink_list array that matches, or NULL if there is no match.
>
> Just to be sure we're on the same page: we need three states:
> "master linked" (with a pointer to the first item), "slave linked" (so
> we can skip it until the end-processing), and "not linked". If the first
> two states return a pointer to the first item, we can differentiate them
> by checking if the returned value is "us" or not.
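Something like this, I take it (sketch only; it assumes the new
check_hard_link() returns a pointer to the first entry of the matching
run in hlink_list, or NULL when the file is in no link set):

	static void classify(struct file_struct *file)
	{
		struct file_struct **first = check_hard_link(file);

		if (first == NULL) {
			/* not linked: handle as an ordinary file */
		} else if (*first == file) {
			/* "master linked": first of its set, do the real transfer */
		} else {
			/* "slave linked": skip until the end-processing pass */
		}
	}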
>
> > - basis_path() now can check nlinks and if !0 iterate over the paths
> > provided by the list returned from check_hard_link() looking for the
> > first that exists in dest or compare_dest.
>
> This seems like a good way to go to me.
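Roughly like this (a hypothetical sketch; f_name() and do_lstat() stand
in for whatever helpers basis_path() really uses, and checking
compare_dest would work the same way with the directory prefixed):

	static char *first_existing_link(struct file_struct **set, int nlinks)
	{
		int i;

		for (i = 0; i < nlinks; i++) {
			char *fname = f_name(set[i]);
			STRUCT_STAT st;

			if (do_lstat(fname, &st) == 0)
				return fname; /* first existing member wins */
		}
		return NULL; /* nothing exists yet: create the file normally */
	}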
Your comments have prompted a bit more thought, and I'm
thinking we might eliminate check_hard_link() entirely. If we
walk the hlink list after sorting, we have enough info to push
the data back up to the file_list and eliminate searching
altogether.
One way of doing that would be to have

	struct hlinks {
		int count;                      /* count of entries in link_list */
		struct file_struct **link_list; /* array of links in the set */
	};

Put a pointer to this in the file_struct; NULL means no links.
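The walk itself might look like this (hypothetical names; error
checks omitted; single-member runs are left with links == NULL):

	#include <stdlib.h>

	static void attach_link_sets(struct file_struct **sorted, int count)
	{
		int i = 0;

		while (i < count) {
			int j = i + 1;

			while (j < count && sorted[j]->dev == sorted[i]->dev
			    && sorted[j]->inode == sorted[i]->inode)
				j++;
			if (j - i > 1) { /* only real link sets get a struct */
				struct hlinks *set = malloc(sizeof *set);
				int k;

				set->count = j - i;
				set->link_list = malloc((j - i) * sizeof (struct file_struct *));
				for (k = i; k < j; k++) {
					set->link_list[k - i] = sorted[k];
					sorted[k]->links = set;
				}
			}
			i = j;
		}
	}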
Put into perspective, this reduces hlink_list by 68 bytes per
entry, eliminates non-linked entries from hlink_list, adds 4
bytes per file in file_struct, and adds another 8 bytes per
link set. In sum, in the worst case (every file has exactly
two links and none are !IS_REG) this saves 60 bytes per file
and eliminates the binary searches.
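Spelling out that worst case: dropping the 68-byte hlink_list entry
saves 68 bytes per file, the new pointer costs 4, and the 8-byte
struct hlinks is shared between the two members of each set (4 bytes
each), so 68 - 4 - 4 = 60 bytes saved per file.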
If you really want to save some per-file space when not
preserving links (hand-written diff):

	struct file_struct {
	-	INO64_T inode;
	-	/** Device this file lives upon */
	-	DEV64_T dev;
	+	union links *links;
	};
	union links {
		struct idev {
			INO64_T inode;
			DEV64_T dev;
		} idev;
		struct hlinks {
			int count;
			struct file_struct **link_list;
		} hlinks;
	};

Or leave file_struct.links a void pointer and use casts, so
access would look like link_set = (struct hlinks *)file->links;
Once we have identified the link sets we no longer have any
use for the device and inode numbers, so we overwrite that
space with struct hlinks. We can free file->links if nlinks
== 1. Don't even populate file->links if we aren't linking or
if the file is !IS_REG, and there is also no point populating
file->links if dev and inode are both zero.
Freeing a 16-byte structure interleaved with file_struct and
string space is unlikely to be of much benefit, so overwriting
the union in place may be better than a free+malloc pair.
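A sketch of the overwrite (hypothetical names; struct hlinks at 8
bytes fits inside the 16-byte idev half, using the union layout
above):

	#include <stdlib.h>

	static void promote_to_link_set(struct file_struct *file,
					int count, struct file_struct **list)
	{
		if (count == 1) {	/* not really linked: drop it */
			free(file->links);
			file->links = NULL;
			return;
		}
		/* dev/inode are dead data by now; reuse their space */
		file->links->hlinks.count = count;
		file->links->hlinks.link_list = list;
	}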
This saves another 12 bytes per file when -H isn't used (the
16-byte dev/inode pair goes away, replaced by a 4-byte
pointer), and when we do use -H it adds 4 bytes per file,
which should be offset by the savings on non-linked files.
--
________________________________________________________________
J.W. Schultz Pegasystems Technologies
email address: jw at pegasys.ws
Remember Cernan and Schmitt