--fuzzy search over to-be-deleted files to catch moved files and directories

H. Langos henrik-rsync at prak.org
Tue Nov 24 04:33:10 MST 2009


On Sat, Nov 21, 2009 at 09:08:05AM -0500, Matt McCutchen wrote:
> On Fri, 2009-11-13 at 18:58 +0100, H. Langos wrote:
> > > > > [--detect-renamed] doesn't calculate name similarity like --fuzzy because that would
> > > > > be prohibitively expensive in the current implementation.
> > > > Only files of the same size should be
> > > > candidates to start with, right?
> > > 
> > > No, the name similarity calculation I'm talking about is the fallback to
> > > select a similar basis file when no available destination file passes
> > > the quick check, so it does not require a size match.
> > 
> > Hmm, ok so fuzzy also finds files that are slightly different and have their
> > name slightly changed.
> 
> There's no "slightly" on "different" there.  Assuming --fuzzy doesn't
> find a quick-check match (and it probably won't because --detect-renamed
> has already searched the whole destination with the same criteria), the
> choice of basis files is based exclusively on name similarity.

Ok, I see. Does "--fuzzy" check if the file size is in the same order of 
magnitude (or at most one order up/down)? 
Expensive fuzzy string matching on filenames can probably safely be skipped 
if abs(round(log10(ssize))-round(log10(dsize))) > 1
(where ssize and dsize are the sizes of the source and destination file).
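
A minimal sketch of that size pre-filter in Python. The function name is
made up for illustration and the handling of empty files is an assumption
(log10 is undefined for 0); this is not how rsync itself decides anything.

```python
import math


def same_size_ballpark(ssize: int, dsize: int) -> bool:
    """Return True if the two sizes are at most one (rounded) order of
    magnitude apart, i.e. the fuzzy name comparison is worth running."""
    if ssize <= 0 or dsize <= 0:
        # Assumption: an empty file is only comparable to another empty file.
        return ssize == dsize
    distance = abs(round(math.log10(ssize)) - round(math.log10(dsize)))
    return distance <= 1
```

A caller would run the expensive name comparison only for candidates where
this returns True.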

> > This sounds like it would be a good idea to (have the option to) include 
> > the delete candidates directory .~tmp~ (or whatever else "--detect-renamed" 
> > uses) included in the --fuzzy search.
> 
> I'm not clear on what you're proposing here.  Could you provide an
> example?

Ok, I start with this situation, where src and dst are in sync:
src/new/foo.jpg
src/new/bar.jpg
src/2009/

dst/new/foo.jpg
dst/new/bar.jpg
dst/2009/

Then I run my picture import script. The files are renamed, moved to
different directories and some bytes have been added. I end up with 
something like this:

src/new/
src/2009/2009-11-23-foo.jpg
src/2009/2009-11-23-bar.jpg

dst/new/foo.jpg
dst/new/bar.jpg
dst/2009/

If I run rsync in this situation, then dst/new/foo.jpg and dst/new/bar.jpg will 
end up on the destination's to-be-deleted list and --fuzzy would find nothing
in dst/2009/ that it could use as base for the "new" files 
src/2009/2009-11-23-foo.jpg and src/2009/2009-11-23-bar.jpg.

What I propose is that a new option, let's call it "--fuzzy-detect-renamed",
should not only look in the same directory but also in the to-be-deleted
list that "--detect-renamed" uses as temporary asylum for deletion/renaming
candidates.

Since foo.jpg and bar.jpg are on that to-be-deleted list, my expectation of 
that new behavior is that foo.jpg would be taken as a base file for
2009-11-23-foo.jpg and bar.jpg would be taken as a base file for
2009-11-23-bar.jpg.
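
The intended selection can be sketched in Python as follows. Everything
here is hypothetical: the function name and arguments are invented for
illustration, and difflib's similarity ratio merely stands in for whatever
name-similarity score rsync's --fuzzy really computes.

```python
import difflib
import posixpath


def pick_fuzzy_basis(new_path, same_dir_paths, to_be_deleted_paths, cutoff=0.4):
    """Return the candidate whose basename is most similar to that of
    new_path, or None if no candidate scores above the cutoff."""
    target = posixpath.basename(new_path)
    # Proposed change: search the to-be-deleted list in addition to the
    # files already present in the destination directory.
    candidates = list(same_dir_paths) + list(to_be_deleted_paths)
    best, best_score = None, cutoff
    for cand in candidates:
        name = posixpath.basename(cand)
        score = difflib.SequenceMatcher(None, target, name).ratio()
        if score > best_score:
            best, best_score = cand, score
    return best


deleted = ["new/foo.jpg", "new/bar.jpg"]
print(pick_fuzzy_basis("2009/2009-11-23-foo.jpg", [], deleted))
print(pick_fuzzy_basis("2009/2009-11-23-bar.jpg", [], deleted))
```

With the example above, dst/2009/ contributes no candidates, so only the
entries from the to-be-deleted list can be picked as base files.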

> > In fact I do just those things with a script when
> > importing pictures from any of my cameras into the photo archive. I 
> > rename them as shown above and then I move them to a directory structure
> > made of <year>/<month>/<day>/ . I don't change the exif tags yet, which 
> > I wanted to add in the future. 
> > But that would make the  size+mtime/checksum test fail. Using "--fuzzy" 
> > would help, but only if I'd do an rsync between the moving operation 
> > and the tag changing operation.
> > 
> > No matter which operation I'd do first, but doing both together would 
> > mean completely new transfer to my backup location. :-/
> 
> Right.  Note that if you did an rsync between the moving and the tag
> changing, you wouldn't need --fuzzy on the second rsync because the
> files would already be in the right places.

Right.

> Efficiently handling simultaneous renames and data changes is very hard
> for a stateless tool like rsync.  If I understand correctly that you're
> moving files without changing their basenames, it would work in this
> case to extend --detect-renamed to look for an exact basename match if
> there is no quick-check match.

I do change the basename too (e.g. I rename "img_1023.jpg" to 
"2009-10-18_img_1023.jpg"), but in a way that fuzzy matching should
be able to catch.

> That would overlap even more with the current --fuzzy functionality.
> There may be a better way to factor things.

Right. There are a lot of options that change the way rsync looks for
quick-check or basefile candidates and due to the organic growth of features
their behavior is not always as the users expect.

Maybe it is time to think about a more consistent way to control the 
search for a basefile and quick-check candidate.

My first idea would be to add a more explicit form of control, e.g. lists 
of key-value pairs that say _what_ aspect of a file you want to match 
and _how well_ it needs to match for passing the quick-check or 
for usage as a base for the delta transfer.
Existing options can easily be translated into that explicit form so that
internally there would only be one control logic.

Here are some examples of the current options translated into that new
schema (I hope I got them right, but keep in mind that this is just a 
sketch):

default behaviour of rsync is something like this:

 --quick path=same,filename=same,size=same,mtime=same
 --delta path=same,filename=same


when given the "--checksum" option it is:

 --quick path=same,filename=same,checksum=same
 --delta path=same,filename=same


with the current "--fuzzy" option it is

 --quick path=same,filename=same,size=same,mtime=same
 --delta path=same,filename=same
 --delta path=same,filename=fuzzy


with the current "--detect-renamed" option it is

 --quick path=same,filename=same,size=same,mtime=same
 --delta path=same,filename=same
 --delta path=deleted,size=same,mtime=same
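
The "one control logic" idea could look roughly like this in Python: the
existing options are expanded into the rule lists shown above. The names
are hypothetical, and how the options combine with each other is an
assumption of this sketch, not something rsync defines.

```python
# Rules for a plain rsync run, as listed under "default behaviour" above.
DEFAULT_RULES = [
    ("quick", {"path": "same", "filename": "same", "size": "same", "mtime": "same"}),
    ("delta", {"path": "same", "filename": "same"}),
]


def expand_options(options):
    """Translate summary options into an explicit list of (kind, criteria)."""
    rules = [(kind, dict(criteria)) for kind, criteria in DEFAULT_RULES]
    if "--checksum" in options:
        # The size+mtime quick check is replaced by a checksum comparison.
        rules[0] = ("quick", {"path": "same", "filename": "same", "checksum": "same"})
    if "--fuzzy" in options:
        rules.append(("delta", {"path": "same", "filename": "fuzzy"}))
    if "--detect-renamed" in options:
        rules.append(("delta", {"path": "deleted", "size": "same", "mtime": "same"}))
    return rules


for kind, criteria in expand_options(["--fuzzy", "--detect-renamed"]):
    print("--" + kind, ",".join(k + "=" + v for k, v in criteria.items()))
```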


this is more easily extendible as new aspects can be added without 
changing current behaviour and new "qualities" of matching can be 
added to express stuff like

explicit source files (regardless of the src filename):

 --delta path=some/arbitrary/path/,filename=foo.img


a "pool directory" of source files:

 --quick path=my/pool/path/,filename=same,mtime=same,size=same
 --delta path=my/pool/path/,filename=fuzzy,size=fuzzy


only use files as base if they are smaller:

 --delta path=same,filename=same,size=smaller

you could even express when to skip the delta comparisons completely.
e.g. if the destination file was created before the source file (a 
situation that you encounter when syncing a directory with rotating 
log files and a rotation has taken place at /src)

  --whole path=same,filename=same,ctime=older
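
Parsing such a rule argument is cheap. A minimal Python sketch, assuming
that aspects are separated by commas and that paths therefore contain
neither "," nor "=" (a real implementation would need some escaping); the
function name is made up:

```python
def parse_rule(spec):
    """Parse 'path=same,filename=fuzzy' into a dict of aspect -> quality."""
    rule = {}
    for item in spec.split(","):
        key, sep, value = (part.strip() for part in item.partition("="))
        if not sep or not key or not value:
            raise ValueError("malformed criterion: %r" % item)
        if key in rule:
            raise ValueError("aspect given twice: %r" % key)
        rule[key] = value
    return rule


print(parse_rule("path=same,filename=same,ctime=older"))
print(parse_rule("path=my/pool/path/,filename=fuzzy,size=fuzzy"))
```

Unknown aspects or qualities are not rejected here; validating them would
be the job of the one control logic that evaluates the rules.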


sure, this schema is more verbose than the current set of options, but people
would use it in scripts rather than on the command line, and there you want
your commands to be as verbose and explicit as possible. after all, you'll
want somebody else to understand your scripts without reading all the
commands' man pages, and you'll want the behavior to stay constant even when
the next major version of rsync changes the behavior of one of the summary
options.


cheers
-henrik


