--fuzzy search over to-be-deleted files to catch moved files and directories

Matt McCutchen matt at mattmccutchen.net
Fri Dec 4 15:07:08 MST 2009


On Tue, 2009-11-24 at 12:33 +0100, H. Langos wrote:
> Ok, I see. Does "--fuzzy" check if the file size is in the same order of 
> magnitude (or at most one order up/down)? 
> Expensive fuzzy string matching on filenames can probably safely be skipped 
> if abs(round(log10(ssize))-round(log10(dsize))) > 1

The current implementation has no such check.  That might be a good
idea.
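
A minimal sketch of the proposed pre-filter, written in Python rather than
rsync's C, just to show how cheap the test is compared with a string match.
The function names are made up for illustration and nothing like this exists
in the current --fuzzy code. Treating an empty file as magnitude 0 is also an
assumption, since log10(0) is undefined.

```python
import math


def size_magnitude(size: int) -> int:
    """Rounded base-10 order of magnitude of a file size in bytes."""
    # Assumption: empty files are treated as magnitude 0 to avoid log10(0).
    return round(math.log10(size)) if size > 0 else 0


def worth_fuzzy_matching(ssize: int, dsize: int) -> bool:
    """Return False when the sizes differ by more than one order of magnitude.

    This is the inverse of the skip condition quoted above:
    abs(round(log10(ssize)) - round(log10(dsize))) > 1
    """
    return abs(size_magnitude(ssize) - size_magnitude(dsize)) <= 1


if __name__ == "__main__":
    print(worth_fuzzy_matching(1_000, 5_000))      # magnitudes 3 and 4
    print(worth_fuzzy_matching(1_000, 1_000_000))  # magnitudes 3 and 6
```

Only candidates that pass this test would go on to the more expensive
filename comparison.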

> > > It sounds like it would be a good idea to (have the option to) include 
> > > the delete candidates directory .~tmp~ (or whatever else "--detect-renamed" 
> > > uses) in the --fuzzy search.
> > 
> > I'm not clear on what you're proposing here.  Could you provide an
> > example?
> 
> [...]

I see, you're suggesting a --fuzzy search over all files in the
entire destination that are going to be deleted.  (Your original remark
confused me because there's no "delete candidates directory": files to
be deleted are only hard-linked into ".~tmp~" when --detect-renamed
finds potential rename targets for them.)  This large name-similarity
search may be too slow, as I originally said.
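
A rough illustration of where the cost comes from: each new source file is
compared against every to-be-deleted name, so the work grows with the product
of the two counts. This is only a sketch. It uses Python's
difflib.SequenceMatcher as a stand-in for rsync's own name-distance function,
and the function name and the 0.6 threshold are assumptions.

```python
import posixpath
from difflib import SequenceMatcher


def best_deleted_match(new_path, deleted_paths, min_ratio=0.6):
    """Pick the to-be-deleted path whose basename best resembles new_path's.

    Returns None when no candidate scores above min_ratio (hypothetical
    threshold). One call costs one string comparison per deleted file.
    """
    new_name = posixpath.basename(new_path)
    best_path, best_ratio = None, min_ratio
    for candidate in deleted_paths:
        name = posixpath.basename(candidate)
        ratio = SequenceMatcher(None, new_name, name).ratio()
        if ratio > best_ratio:
            best_path, best_ratio = candidate, ratio
    return best_path


if __name__ == "__main__":
    deleted = ["old/report-2009-11.tar", "tmp/notes.txt"]
    print(best_deleted_match("new/report-2009-12.tar", deleted))
```

A cheap filter applied before the string comparison, such as the size check
discussed earlier in this message, would cut down the number of comparisons.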

> > That would overlap even more with the current --fuzzy functionality.
> > There may be a better way to factor things.
> 
> Right. There are a lot of options that change the way rsync looks for
> quick-check or basefile candidates and due to the organic growth of features
> their behavior is not always as the users expect.
> 
> Maybe it is time to think about a more consistent way to control the 
> search for a basefile and quick-check candidate.
> 
> My first idea would be to add a more explicit form of control, e.g. lists 
> of key-value pairs that say _what_ aspect of a file you want to match 
> and _how well_ it needs to match to pass the quick-check or to be 
> used as a base for the delta transfer.
> Existing options can easily be translated into that explicit form so that
> internally there would only be one control logic.
> 
> Here are some examples of the current options translated into that new
> schema (I hope I got them right, but keep in mind that this is just a 
> sketch):
> 
> default behaviour of rsync is something like this:
> 
>  --quick path=same,filename=same,size=same,mtime=same
>  --delta path=same,filename=same
[...]
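
To make the quoted proposal concrete, here is a sketch of how such a
key=value list might be parsed and applied. It is not rsync code: the
--quick and --delta options are only the proposal's, the function names are
hypothetical, and "same" is the only match quality handled because it is the
only one shown in the quoted examples.

```python
def parse_match_spec(spec):
    """Parse 'path=same,size=same' into {'path': 'same', 'size': 'same'}."""
    criteria = {}
    for pair in spec.split(","):
        key, sep, value = pair.partition("=")
        key, value = key.strip(), value.strip()
        if not sep or not key or not value:
            raise ValueError(f"malformed criterion: {pair!r}")
        criteria[key] = value
    return criteria


def satisfies(criteria, src, dst):
    """Check two files' attribute dicts against the parsed criteria."""
    for aspect, quality in criteria.items():
        if quality != "same":
            # Other qualities (e.g. a fuzzy match) are left undefined here.
            raise NotImplementedError(f"unsupported match quality: {quality!r}")
        if src[aspect] != dst[aspect]:
            return False
    return True


# The two defaults given in the quoted proposal.
QUICK_DEFAULT = parse_match_spec("path=same,filename=same,size=same,mtime=same")
DELTA_DEFAULT = parse_match_spec("path=same,filename=same")

if __name__ == "__main__":
    src = {"path": "docs", "filename": "a.txt", "size": 10, "mtime": 100}
    dst = dict(src, mtime=200)  # same file, different modification time
    print(satisfies(QUICK_DEFAULT, src, dst))  # quick-check fails on mtime
    print(satisfies(DELTA_DEFAULT, src, dst))  # still usable as a basis file
```

With a single table of criteria like this, each existing option would amount
to a different preset of the same control logic.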

Interesting idea.  This functionality could probably subsume the
--*-dest options too and address the enhancement requests I entered for
basis dirs:

https://bugzilla.samba.org/show_bug.cgi?id=5644
https://bugzilla.samba.org/show_bug.cgi?id=5645
https://bugzilla.samba.org/show_bug.cgi?id=5646

It just needs to be planned carefully and to interact properly with
itemization, whatever that is determined to mean.  Please feel free to
enter an enhancement request, but understand that Wayne may not wish to
implement it in the main version of rsync.  The reality is that every
new feature adds to the maintenance burden.

-- 
Matt


