--fuzzy search over to-be-deleted files to catch moved files and directories

H. Langos henrik-rsync at prak.org
Wed Nov 11 04:17:50 MST 2009


Hi Matt,

Thank you very much for the quick response!

On Tue, Nov 10, 2009 at 12:13:09PM -0500, Matt McCutchen wrote:
> On Tue, 2009-11-10 at 17:55 +0100, H. Langos wrote:
> > Now the question is, if fuzzy search could be extended to search for
> > moved files across directory borders. If for example I move pictures
> > around to reorganize them in a different directory hierarchy, or if I
> > move whole directories around as I reorganize part of my music files.
> 
> Consider the --detect-renamed option provided by the maintained patch
> "detect-renamed.diff".  

That sounds like just the thing.
I applied the patch from git://git.samba.org/rsync-patches.git
that is tagged v3.0.6 (d64936b9..) and it builds nicely with the Debian
lenny source package of 3.0.6. Now I'll have to see how it works.

> It will find moved files that match exactly
> according to the "quick check" in effect (size + mtime or checksum). 

That is basename+size+mtime or basename+checksum, right?

How does "--detect-renamed" interact with "--fuzzy" and "--delete-after"? 

> It doesn't calculate name similarity like --fuzzy because that would
> be prohibitively expensive in the current implementation.

Why would it be so expensive? Only files of the same size should be
candidates to start with, right? For small files (where most same-size
collisions will occur) the gain from fuzzy rename detection is probably
not worth it. Normal move detection, however, will be helpful when moving
e.g. a kernel source tree around. For big files there should be very few
candidates to do a fuzzy comparison against, so the cost should be rather
low for bigger files.
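
To make the idea concrete, here is a minimal Python sketch of it. It is
not rsync code and says nothing about how the detect-renamed patch works.
The function names (index_by_size, best_fuzzy_match), the 0.6 threshold,
and the use of difflib's similarity ratio as the name-similarity measure
are all assumptions made for illustration:

```python
import difflib
import posixpath
from collections import defaultdict


def index_by_size(deleted_files):
    """Group to-be-deleted files by size: {size: [path, ...]}.

    deleted_files is an iterable of (path, size) pairs (hypothetical input).
    """
    by_size = defaultdict(list)
    for path, size in deleted_files:
        by_size[size].append(path)
    return by_size


def best_fuzzy_match(new_path, new_size, deleted_by_size, min_ratio=0.6):
    """Return the same-size deleted file whose basename is most similar.

    Only files of exactly the same size are compared, so the number of
    name comparisons per new file is the size of one bucket, not the
    size of the whole deletion list. Returns None when no candidate's
    similarity is strictly above min_ratio.
    """
    new_name = posixpath.basename(new_path)
    best, best_ratio = None, min_ratio
    for candidate in deleted_by_size.get(new_size, []):
        ratio = difflib.SequenceMatcher(
            None, new_name, posixpath.basename(candidate)
        ).ratio()
        if ratio > best_ratio:
            best, best_ratio = candidate, ratio
    return best
```

With a deletion list holding old/pics/img_0001.jpg and old/music/track.mp3,
both 2048 bytes, a new 2048-byte file new/2009/img_0001_edit.jpg would be
compared against just those two names, and a new file of any other size
would not be compared against anything.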

The thing that I am worried about is the case of DVD backups.
Here you get a lot of big VTS_01_1.VOB files that all have the same
size: 1073739776 bytes. So basename and file size already match, and the
only thing that would prevent loss of data would be the mtime. Frankly, I
wouldn't want to trust the integrity of my DVD backups to a timestamp.
 
Is there a way to enforce checksum tests for moved files while keeping
size+mtime tests for files that didn't get moved/renamed?
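
What I have in mind is roughly the following policy, written as a small
Python sketch. This is only an illustration of the behaviour I am asking
about, not a description of what rsync or the patch does. The function
names are made up, and SHA-256 is just a stand-in for whatever checksum
rsync would use:

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Hash the whole file in chunks (SHA-256 is an arbitrary stand-in)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def quick_check(a, b):
    """Cheap test: same size and same mtime (whole seconds)."""
    sa, sb = os.stat(a), os.stat(b)
    return sa.st_size == sb.st_size and int(sa.st_mtime) == int(sb.st_mtime)


def is_same_content(src, dst, moved):
    """Hypothetical policy: trust size+mtime only for files that stayed put.

    For a file at an unchanged path the quick check is accepted as-is.
    For a rename candidate the sizes must match AND the full checksums
    must match; the mtime is not trusted at all.
    """
    if not moved:
        return quick_check(src, dst)
    if os.stat(src).st_size != os.stat(dst).st_size:
        return False
    return file_digest(src) == file_digest(dst)
```

In the DVD case, two different VTS_01_1.VOB files with identical size and
mtime would pass quick_check, but is_same_content(..., moved=True) would
still read both files and reject the match.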

cheers
-henrik


