How hard links are processed (was Test case for hard link failure)

Thu Nov 27 21:22:06 EST 2003

On Tue, 25 Nov 2003, Pete Wenzel <pmwenzel at yahoo.com> wrote:
[ ... ]
> The above is BAD (nonoptimal) behavior; the entire file is transferred,
> even though it could simply have been linked.  It seems that "a" is
> transferred before it is determined that a suitable equivalent (linked)
> file "b" already exists.
> 
> I suspect that this has to do with handling the file list in a sorted
> order; when the missing filename is encountered first, it is transferred
> in full.  Not being familiar with the rsync protocol or source code, I
> can't say whether this should be fixed on the client or server side.

I ran into this exact same situation a while back, and took a good
long look at the code in hlink.c to figure out what the problem was.

Here is my take at an explanation of how rsync processes hardlinks.
This could be added as a section to the (newly-restored) rsync doc
file at http://www.pegasys.ws/how-rsync-works.html

Additional comments follow...

---------------------  start of documentation  -----------------------

When a -H or --hard-links option is used to preserve hard links on the
destination (receiver), the following steps to construct the hardlink
list (named "hlink_list") happen after the file list is received:

- An array to hold pointers to all the file list entries is allocated.
- It is initialized to each file list entry in list-order.
- It is sorted (using qsort) with a 3-part comparison test:
    1 - device number
    2 - inode number
    3 - file name (full path)

This list of pointers now has hardlinked files grouped together and
sorted alphabetically.

As rsync processes each file in the main loop, each file's hardlink
status is determined using the hardlink list:

- A binary search for the file is done using the 3-part comparison test.
- The immediately preceding entry in the hardlink list is examined:
  - If it has the same dev/inode as the current entry, then the current
    file of interest is to be hardlinked to at least one other file and
    nothing is done at this time (a message is printed in verbose mode).
  - Otherwise, it is either a non-hardlinked file, or it is the first
    file in a hardlink group.  Either way it is treated as a normal file
    and if it is not present, it is fetched from the sender.

After rsync has processed all the files, an additional post-processing
step is performed.  For each group of hardlinked files, the first file
is left alone, and for each of the remaining files, any existing file
is unlinked, and a hardlink is created to its immediate predecessor.

---------------------  end of documentation  -------------------------

This explains the situation where if A and B are hardlinked, and B is
present but A is not, A will get transferred, B will get *deleted*, and
B is then hardlinked to A.  That's exactly what I see happening.

If A is present, and B is not, then no transfer is needed and only the
hardlink is done.

There is no easy solution.  You can't resort the hardlink groups to put
an existing file at the top because that messes up the binary search.

The only way I thought of to solve this is to pre-process the hardlink
list and for each group of same dev/inode files, if the first file does
not exist, examine the remaining entries to see if any of them exist.
If one is found, then hardlink the first file to the existing file.

Now, if the first file matches its source file when rsync processes it,
then it won't get needlessly fetched.

I decided not to tackle this, since it requires more in-depth knowledge
of rsync to utilize its various support routines, and I just don't feel
that comfortable with it.  So my rsyncs take an extra 20 minutes each
night needlessly transferring hundreds of megabytes of log files.  :-/

If anyone else cares to take on this project, be my guest!
-- 
        John Van Essen  Univ of MN Alumnus  <vanes002 at umn.edu>