[Bug 10581] New: --fuzzy-delay and --fuzzy-limit for fuzzy match tuning

samba-bugs at samba.org samba-bugs at samba.org
Thu May 1 07:57:10 MDT 2014


https://bugzilla.samba.org/show_bug.cgi?id=10581

           Summary: --fuzzy-delay and --fuzzy-limit for fuzzy match tuning
           Product: rsync
           Version: 3.1.0
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P5
         Component: core
        AssignedTo: wayned at samba.org
        ReportedBy: samba at haravikk.com
         QAContact: rsync-qa at samba.org


It seems that when backing up folders with a very large number of files,
--fuzzy behaves in a sub-optimal fashion, forcing rsync to build a file list
for the entire folder if even a single new (sender only) file is encountered,
which can completely halt a transfer until all of the folder's contents are
known.

To give you a better idea: I have a backup command that I run, and one of the
items included in the backup is a huge OS X sparse bundle disk image composed
of some 32,000+ bands, all stored within a single folder inside the image
bundle.

With --fuzzy disabled, rsync very quickly identifies files that are new or
changed and starts sending them in a reasonable amount of time (given how many
there are to check). However, with --fuzzy enabled, there is a huge
(hours-long) delay before a single file gets transferred.

Now, I assume this is because rsync is waiting for the destination file-list to
be completed so it can perform fuzzy matching for similar files. However, with
such large folders this can result in an incredible delay for little gain. Such
large folders aren't uncommon for modern disk image formats and also for
well-used mail folders, as just two examples. While currently I just run with
--fuzzy disabled, I would rather keep it enabled for other folders where it
can improve matching against relocated files.


So I'm proposing two new --fuzzy related options as follows:

--fuzzy-limit sets a limit on the size of a folder where fuzzy matching is
performed. By setting this to, say, 500, fuzzy matching can be temporarily
disabled for any particularly large folders where the benefits will be far
outweighed by the delays. This is the simpler of the two to implement, I think.
Giving the value as normal will set a limit on folder size at both ends, while
setting a value with a plus (e.g. --fuzzy-limit +500) will only test the
sender, and a minus will test only the destination.
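
To make the intended semantics concrete, here is a minimal Python sketch of
how the value might be interpreted. It is only an illustration of the proposal:
rsync itself is written in C, and FuzzyLimit, parse_fuzzy_limit and
fuzzy_allowed are hypothetical names, not anything in rsync. It also assumes
that "folder size" means the number of entries, and that a folder holding
exactly the limit is still matched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FuzzyLimit:
    """Hypothetical parsed form of a --fuzzy-limit value."""
    max_entries: int
    check_sender: bool
    check_receiver: bool


def parse_fuzzy_limit(value: str) -> FuzzyLimit:
    """Interpret '500', '+500' or '-500' as described in the proposal."""
    if value.startswith("+"):
        # Plus prefix: only the sender's folder size is tested.
        return FuzzyLimit(int(value[1:]), True, False)
    if value.startswith("-"):
        # Minus prefix: only the destination's folder size is tested.
        return FuzzyLimit(int(value[1:]), False, True)
    # Plain number: the limit applies at both ends.
    return FuzzyLimit(int(value), True, True)


def fuzzy_allowed(limit: FuzzyLimit, sender_count: int, receiver_count: int) -> bool:
    """Return False when a tested side has more entries than the limit."""
    if limit.check_sender and sender_count > limit.max_entries:
        return False
    if limit.check_receiver and receiver_count > limit.max_entries:
        return False
    return True
```

With --fuzzy-limit -500, a folder holding 100 files on the sender and 32,000
on the destination would skip fuzzy matching under this sketch, while
--fuzzy-limit +500 would still allow it.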

--fuzzy-delay changes the behaviour of --fuzzy such that any fuzzy matching
is deferred until the file-list for the folder is complete. In the meantime,
updates and deletion checks* continue normally; once the file-list is
complete, any pending fuzzy matches are performed and the remaining
updates/deletions continue. *In the case of --delete-during this may result
in even more missed potential matches than normal, which is why --fuzzy-delay
may not be suitable as default behaviour.
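
The ordering I have in mind can be sketched roughly as follows, again in
Python purely for illustration. plan_transfers is a hypothetical name, the
membership test stands in for a cheap per-file lookup on the destination, and
difflib's name similarity is a stand-in for whatever scoring rsync actually
uses to pick a fuzzy basis file. Deletions are left out of the sketch.

```python
import difflib


def plan_transfers(sender_names, dest_names):
    """Return (file, basis) pairs in the order work would be started.

    Files that already exist on the destination are handled immediately;
    sender-only files are queued and fuzzy-matched only at the end, once
    the destination listing for the folder is known in full.
    """
    dest = set(dest_names)
    plan = []
    pending = []  # sender-only files waiting for the deferred fuzzy pass

    for name in sender_names:
        if name in dest:
            # Normal update: the same-named destination file is the basis.
            plan.append((name, name))
        else:
            # Do not stall here; defer the fuzzy search.
            pending.append(name)

    # The folder's file-list is complete, so run the deferred matches.
    candidates = sorted(dest)
    for name in pending:
        close = difflib.get_close_matches(name, candidates, n=1, cutoff=0.6)
        plan.append((name, close[0] if close else None))

    return plan
```

A sender-only file with no sufficiently similar name on the destination gets
a basis of None here, meaning it would simply be sent whole.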


Either of these features should help to greatly optimise the performance of
--fuzzy, so that particularly large directories don't result in a
significant drop in performance with fuzzy matching enabled, especially when
there is a difference in speed between devices (e.g. a faster sender and a
slower destination such as a NAS or shared remote host).

-- 
Configure bugmail: https://bugzilla.samba.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the QA contact for the bug.

