Time rsYnc Machine (tym)

Thu Aug 9 14:04:56 MDT 2012

I may be mistaken, but I heard at one time that rsync was noticeably slower
when asked to preserve hard links.  I'm guessing this is a matter of CPU
requirements rather than I/O requirements, but that's just a guess.

You can go a long way toward detecting hardlinks by just watching for
st_nlink > 1 from fstat().

Another way, one that only bothers with hardlinks that are part of a
transfer, uses bloom filters.  You stash a inode#,device# pair in the
filter.

A bloom filter is a probabilistic set datastructure.   You can't iterate
over its elements (it doesn't store them all), but you can add elements and
test candidate elements for membership.  The chief advantage is that you
can have storage requirements as low as one bit per element "stored".

When you do a set membership test, a bloom filter can give one of two
answers: "This element is definitely not in the set" or "This element is
almost definitely in the set".

Detecting hardlinks is a good use of a bloom filter, because all you really
want to do is cut down on the number of things you need to store in an
array/list.  It's OK to have a few things in the list that aren't really
hardlinks - the goal is chiefly to dramatically cut down on the storage
requirements, and avoid a large sort() or O(file_count) hash table.  That's
where the "filter" part of the name comes from - it's filtering the count
of the objects of interest down to a more manageable size.

On Tue, Aug 7, 2012 at 9:07 PM, Linda Walsh <rsync at tlinx.org> wrote:

>
>
> Dan Stromberg wrote:
>
>> FWIW, it might be nice to add a hardlink detecting bloom filter to rsync
>> at some point.  This makes the process of detecting hardlinks less
>> expensive.  Another way to narrow down the field is to just look at
>> st_nlink.
>>
> ----
>         What's a bloom filter? and how / why would it make things
> less expensive?  I don't understand why it is expensive now?  You have
> to visit all files -- likely a previsit to get a size estimate -- reading
> all the inodes at that point,
>
> Then have a hash 'ino2names' for each inode to point to an array name of
> files
> found in the tree with the same inode
>
> %ino2names
> $ino2names=>[array of paths relative to root of tree being examined]
>
> Since the size of the transfer is known after the initial scan -- all the
> inode
> inode->path mapping would be knowable as well, at that point.  No extra
> expense involved.
>
> Of course given the error I reported, it seems rsync has gotten broken,
> recently
> with hard links -- they aren't that difficult.  It is presumed, that links
> 'out of tree'
> are ignored @ source and target -- meaning target files end up with same
> internal
> linkages as on source, and any external links would be broken.
>
> I still have no clue what a bloom filter is?? ;-)  Cluesticks anyone?
>
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.samba.org/pipermail/rsync/attachments/20120809/c2125b68/attachment.html>