TODO hardlink performance optimizations

John Van Essen vanes002 at umn.edu
Tue Jan 6 08:04:19 GMT 2004


On Mon, 5 Jan 2004, jw schultz <jw at pegasys.ws> wrote:
> On Mon, Jan 05, 2004 at 02:46:52PM -0800, Wayne Davison wrote:
[ snip ]
>> This seems like a good way to go to me.
> 
> Your comments have promted a bit more thought and i'm
> thinking we might eliminate check_hard_link().  If we walk the
> hlink list after sorting we have enough info to push the
> data back up to the file_list and eliminate searching
> altogether.

Right.

[ snip ]
> If you really want to save some per-file space when not
> preserving links (hand written):
> 
>   struct file_struct {
> 
> -        INO64_T inode;
> -        /** Device this file lives upon */
> -        DEV64_T dev;
> +        hlink_info *links
> 
>   }
> 
> union links {
>         struct idev {
>                 INO64_T inode;
>                 DEV64_T dev;
>         }
>         struct hlinks {
>                 int                count;
>                 file_struct        **link_list;
>         }
> }
> 
> Or use casts with file_struct.links being void
> so access would use link_set = (hlinks)(file->links).
> 
> Once we have identified the link sets we no longer have any
> use for the device and inode numbers so we overwrite it with

This is a _great_ observation!!

> struct hlinks.  We can free file->links if nlinks == 1.
> Don't even populate file->links if we aren't linking or if
> !IS_REG.  Also no point populating file->links if dev and
> inode are NULL;
> 
> Freeing a 16byte structure interleaved with file_struct and
> string space is unlikely to be much benefit so overwriting
> the union may be better than a free+malloc pair.

Agreed.  Union is good.

> This saves another 12bytes per file when no -H and when we
> do -H it adds 4 bytes/file which should be offset by the
> savings on non-linked files.

Now that you have opened the door for re-using the inode and
dev area in file_struct, I can make my initial suggestion that
requires two 'additional' pointers in file_struct.  I didn't
suggest it before because I didn't want to add to the struct.

I propose to reuse the DEV amd INODE areas to store two new
file_struct pointers (the 2nd part of the union).  These would
be set during the post-qsort phase during init_hard_links.

The first pointer links together the file_structs for each
hardlink group in a list in the same order as in the qsort
so they can be walked by your proposed link-dest method.

The second pointer points to the head (first file_struct of
a hardlink group).

For each file_struct that is modified in this way a flag bit
needs to be set indicating so.

Then, if the file_struct address equals its head pointer, then
you are at a head of a hardlink list which can be processed
using the new method using link-dest that you outlined earlier.

I'm still not clear on what exactly will happen during that new
processing method.  I assume that if the head file does not exist,
it will find any existant file and hardlink it to the head file,
and if not found, will transfer the head file.  Is that correct?

Processing of non-head files (file_struct address not equal to
head pointer) can be skipped since they are hardlinked later.

The final walk-through that creates the hardlinks for the
non-head files can walk the qsorted list and use the head
pointer as the target for the hardlinks of the non-nead files.

No extra memory required beyond that already required for the
qsorted pointer list.

No binary search required during the file processing phase.

The key is getting the link-dest method working for this.
-- 
        John Van Essen  Univ of MN Alumnus  <vanes002 at umn.edu>



More information about the rsync mailing list