--fuzzy search over to-be-deleted files to catch moved files and directories

H. Langos henrik-rsync at prak.org
Fri Nov 13 10:58:17 MST 2009


Hi Matt,

Thank you very much for answering those questions and helping me to
understand rsync better!

On Thu, Nov 12, 2009 at 11:20:19PM -0500, Matt McCutchen wrote:
> Attempting to address each of your questions, here and then in your
> other message...
> 
> On Wed, 2009-11-11 at 12:17 +0100, H. Langos wrote: 
> > > It will find moved files that match exactly
> > > according to the "quick check" in effect (size + mtime or checksum). 
> > 
> > That is basename+size+mtime  or basename+checksum, right?
> 
> No, a basename match is not a requirement (hence the ability to detect
> renames), but it is a tie-breaker. 

Ahh, ok, so here size+mtime or checksum select the basis file.

And if that selection fails then "--fuzzy" search is applied but looks 
only in the /dst directory for a suitable candidate.

(Or is the temporal order reversed?)
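
To make the question concrete, here is how I picture the first step as a small Python sketch. It is only my reading of the explanation above, not rsync's code: the function name and the (path, size, mtime) tuples are made up for illustration, and the point where --fuzzy takes over is my assumption.

```python
import os

def quick_check_basis(src, dest_files):
    """Return the path of a destination file that passes the size+mtime
    quick check against src, or None if there is none.

    src and each dest_files entry are (path, size, mtime) tuples
    (a hypothetical layout, not rsync's file list)."""
    src_path, src_size, src_mtime = src
    matches = [p for (p, size, mtime) in dest_files
               if size == src_size and mtime == src_mtime]
    if not matches:
        return None  # assumption: this is where the --fuzzy search would take over
    # The basename is not required to match; it only breaks ties.
    src_name = os.path.basename(src_path)
    for p in matches:
        if os.path.basename(p) == src_name:
            return p
    return matches[0]
```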

> > How does "--detect-renamed" interact with "--fuzzy" and "--delete-after"? 
> 
> --detect-renamed and --fuzzy are two different means of finding basis
> files that overlap in some cases but do not really interact.
> --detect-renamed considers the whole destination using the quick check,
> while --fuzzy considers only the same destination subdir using
> size+mtime or otherwise name similarity.
> 
> --delete-before and --delete-during may reduce the effectiveness of
> --fuzzy, as stated in the man page description of --fuzzy, but they do
> not affect --detect-renamed since --detect-renamed actually works during
> the delete pass.
...
> > > It doesn't calculate name similarity like --fuzzy because that would
> > > be prohibitively expensive in the current implementation.
> > Only files of the same size should be
> > candidates to start with, right?
> 
> No, the name similarity calculation I'm talking about is the fallback to
> select a similar basis file when no available destination file passes
> the quick check, so it does not require a size match.

Hmm, ok so fuzzy also finds files that are slightly different and have their
name slightly changed.

This sounds like it would be a good idea to (have the option to) include
the delete candidates directory .~tmp~ (or whatever else "--detect-renamed"
uses) in the --fuzzy search.
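
As a rough illustration of the idea, a Python sketch under my own assumptions: fuzzy_basis, its parameters and the include_deleted switch are hypothetical names, and difflib is only a stand-in for whatever name similarity measure rsync really uses.

```python
import difflib
import os

def fuzzy_basis(src_name, same_dir_names, delete_candidates=(), include_deleted=False):
    """Pick the most similarly named candidate, or None if there are none.

    same_dir_names: names in the same destination directory (what --fuzzy
    looks at today, as I understand it).
    delete_candidates: paths of files about to be deleted (the proposed
    extra pool); only used when include_deleted is True."""
    pool = list(same_dir_names)
    if include_deleted:
        pool.extend(delete_candidates)
    if not pool:
        return None

    def similarity(path):
        # difflib is a stand-in here, not rsync's own similarity measure.
        return difflib.SequenceMatcher(
            None, src_name.lower(), os.path.basename(path).lower()).ratio()

    return max(pool, key=similarity)
```

With the extra pool switched off, a file that was both moved and renamed has no candidate in its new directory at all; with it switched on, the old copy waiting for deletion becomes a possible basis file.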

The real world applications are obvious. Apart from software packages
as described in https://bugzilla.samba.org/show_bug.cgi?id=3392#c7
(thanks for the link!), which is a special case, using rsync-friendly
gzip/zlib compression, there is the large area of media files.

Example:
For my photo collections it would speed things up in the case where I
move pictures to a different directory, rename them from DSC_01234.JPG to
20091113-174354_dsc01234.jpg (extracted timestamp from exif data) and 
add author, license and some keywords to the exif tags.

This is not theory. In fact I do just those things with a script when
importing pictures from any of my cameras into the photo archive. I 
rename them as shown above and then I move them to a directory structure
made of <year>/<month>/<day>/ . I don't change the exif tags yet, but
I want to add that in the future.
But that would make the size+mtime/checksum test fail. Using "--fuzzy"
would help, but only if I did an rsync between the moving operation
and the tag changing operation.

It doesn't matter which operation I do first, but doing both together
would mean a completely new transfer to my backup location. :-/
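
Roughly, the rename and move step described above could look like the following Python sketch. It is an illustration, not the actual import script: archive_path is a hypothetical name, and the timestamp is passed in as an argument rather than read from the exif data.

```python
import os
from datetime import datetime

def archive_path(original_name, taken):
    """Map a camera file name and its capture time to a relative path
    of the form <year>/<month>/<day>/<timestamp>_<lowercased name>."""
    stem, ext = os.path.splitext(original_name)
    new_name = "{}_{}{}".format(
        taken.strftime("%Y%m%d-%H%M%S"),
        stem.replace("_", "").lower(),  # DSC_01234 -> dsc01234
        ext.lower())                    # .JPG -> .jpg
    return "/".join([taken.strftime("%Y"), taken.strftime("%m"),
                     taken.strftime("%d"), new_name])

print(archive_path("DSC_01234.JPG", datetime(2009, 11, 13, 17, 43, 54)))
```

Both the directory and the basename change here, so neither a same-directory search nor an exact name match finds the old copy once the contents have been touched as well.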


Same thing goes for mp3 collections when you finally find the time 
to tag your new music and move it to the right directory in your 
collection.
 
> > Why would it be so expensive?
> 
> Wayne said so here:
> 
> https://bugzilla.samba.org/show_bug.cgi?id=3392#c11

Well, I think I'll have to wait then ... or refrain from doing move
and change operations at the same time. :-)

Thank you very much for your help!

cheers
-henrik
 

