[Bug 10263] New: Extend Behaviour of the --fuzzy Parameter

samba-bugs at samba.org samba-bugs at samba.org
Tue Nov 12 07:29:04 MST 2013


https://bugzilla.samba.org/show_bug.cgi?id=10263

           Summary: Extend Behaviour of the --fuzzy Parameter
           Product: rsync
           Version: 3.1.0
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P5
         Component: core
        AssignedTo: wayned at samba.org
        ReportedBy: me at haravikk.com
         QAContact: rsync-qa at samba.org


The --fuzzy parameter can be great for speeding up transfers that involve
renamed files that haven't changed location. However, it isn't as effective if,
for example, a directory was renamed, as the entire contents of that directory
will end up being copied in full anyway.


What I'd like to propose is an extension of the --fuzzy parameter to also
consider directories in the path, possibly with an optional depth to limit how
far back it will go.

For example, say I have the following directory with three files:
/foo/bar/A
/foo/bar/B
/foo/bar/C

But in the source this is changed to:
/foo/barred/A
/foo/barred/B
/foo/barred/C

When /foo/barred/A is being considered, rsync will first need to check for an
existing directory of the same name, and won't find one (causing it to create
one instead). This means that fuzzy matching has no existing directory to
compare within, so it should instead look one level up (within /foo) and
quickly check the directories there for one with the same (or similar)
creation date. If such a directory is found (/foo/bar) then rsync can look
inside it for matches for existing files, possibly using linking for speed.
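
As a rough illustration of the sibling lookup described above, here is a
minimal Python sketch. It is not rsync code: the function name
candidate_dirs, the tolerance parameter and the use of the modification time
(st_mtime) in place of a creation date are all assumptions made for the
example, since a creation date is not available on every platform.

```python
import os


def candidate_dirs(parent, wanted_time, tolerance=2.0, exclude=()):
    """Return subdirectories of `parent` whose timestamp is within
    `tolerance` seconds of `wanted_time`, closest match first.

    Hypothetical helper: st_mtime stands in for the "creation date"
    mentioned in the proposal.
    """
    found = []
    with os.scandir(parent) as entries:
        for entry in entries:
            # Skip names we were told to ignore, plus anything that is
            # not a real directory (symlinks are not followed).
            if entry.name in exclude or not entry.is_dir(follow_symlinks=False):
                continue
            delta = abs(entry.stat(follow_symlinks=False).st_mtime - wanted_time)
            if delta <= tolerance:
                found.append((delta, entry.path))
    return [path for _, path in sorted(found)]
```

With the example above, calling candidate_dirs("/foo", t), where t is the
timestamp the source reports for "barred", would be expected to return
/foo/bar if its timestamp is close enough. An empty list means there is
nothing to fuzzy match against and the files are copied as they are today.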

Linking is obviously preferred, but may depend a lot on whether rsync can
detect in advance if the fuzzily matched directory is going to be deleted, as
this would mean that linking or moving would be okay to do.


Anyway, this could help to solve or at least limit a common pitfall with
synchronisation that arises when a directory is renamed or moved.

I recommend the use of an optional parameter, e.g. --fuzzy-depth, to limit how
many levels of a path rsync will search for fuzzy matches, though if the
behaviour is sensible enough then it may not be necessary. For example, let's
extend the directory above:
/foo/bar/example/folder/A

Which is later renamed to:
/foo/barred/example/folder/A

When considering file A, rsync won't find a direct match or anywhere to look
for fuzzy matches, so it will look inside /foo/barred/example for a fuzzy match
for "folder", again finding nothing. It then tries /foo/barred for a match to
"example" but again fails (no directory). Finally it tries inside /foo for a
directory with a similar creation date to "barred", and will find /foo/bar.
Inside this it will find "example", then "folder", then finally a match for
file "A".
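
The walk described above could be sketched in Python roughly as follows. This
is only an illustration of the proposal, not rsync's implementation: the names
fuzzy_basis and pick_sibling are hypothetical, max_depth plays the role of the
suggested --fuzzy-depth value, and how a sibling directory is chosen (for
instance by similar creation date) is left to the caller-supplied callback.

```python
import os


def fuzzy_basis(dest_root, rel_path, pick_sibling, max_depth=None):
    """Return the path of an existing file to use as a fuzzy basis for
    `rel_path`, or None if no candidate is found.

    `pick_sibling(parent, missing_name)` is a hypothetical callback that
    returns the name of a directory in `parent` judged similar to
    `missing_name`, or None.
    """
    parts = rel_path.split("/")
    dirs, filename = parts[:-1], parts[-1]

    # Walk up from the file's directory until an existing directory is found.
    for missing in range(len(dirs) + 1):
        existing = os.path.join(dest_root, *dirs[:len(dirs) - missing])
        if os.path.isdir(existing):
            break
    else:
        return None  # not even dest_root exists

    if missing == 0:
        return None  # directory exists, so ordinary --fuzzy already applies
    if max_depth is not None and missing > max_depth:
        return None  # would have to walk up further than allowed

    renamed = dirs[len(dirs) - missing]     # first path component not found
    rest = dirs[len(dirs) - missing + 1:]   # components below it
    sibling = pick_sibling(existing, renamed)
    if sibling is None:
        return None

    # Descend through the matched sibling using the unchanged names.
    candidate = os.path.join(existing, sibling, *rest, filename)
    return candidate if os.path.isfile(candidate) else None
```

For foo/barred/example/folder/A the loop walks up three levels before it
reaches the existing directory foo, so a depth limit of 2 would give up
before finding foo/bar, while a limit of 3 or no limit would find
foo/bar/example/folder/A.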

The issue is how likely it is that a search of /foo will produce multiple fuzzy
matches for "barred", but I think the likelihood is low enough that it
shouldn't add too much overhead, even with extremely complex hierarchies. The
end result should be an improvement in transfer speed for directories that were
renamed. It won't help much for directories that were moved by any significant
amount, but I think renaming is more common.
