--fuzzy enhancements: size match in all directories

Robert Siemer Robert.Siemer at backsla.sh
Mon Dec 21 00:18:21 MST 2009


Hello everyone,

Imagine you rsynced your mp3 archive. Later you do some cleanup: you
rename files, split the directory up into a hierarchy and move some
files around.

Data-wise you did nothing; metadata-wise you did a lot. --fuzzy comes
to mind for the next rsync. Unfortunately, fuzzy matching does not
look in other (sub-)directories, and it cares a little too much about
modification times for this case.

I was thinking about introducing a superset of the current fuzzy
matching (works initially like the original, but tries more base files
if nothing matched so far), and/or two new threshold values with e.g.
 --fuzzy-thresholds 1000:20000
where the numbers refer to the file size on the sender-side, the first
meaning “below this size, don’t even consider fuzzy matching” and the
second number meaning “above this size try harder to find a base file”.
This could default to --fuzzy-thresholds 0:<unlimited>, the old
behaviour.
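
To make the threshold semantics concrete, here is a rough sketch in
Python. It is not rsync code: --fuzzy-thresholds is only the option
proposed above, and the function names, the mode names and the
convention that an empty second number means <unlimited> are all
assumptions made for illustration.

```python
import math


def parse_fuzzy_thresholds(spec="0:"):
    """Parse a hypothetical 'LOW:HIGH' value for the proposed option.

    Assumption: an empty HIGH (as in "0:") stands for <unlimited>.
    """
    low_text, sep, high_text = spec.partition(":")
    if not sep:
        raise ValueError("expected LOW:HIGH, got %r" % spec)
    low = int(low_text) if low_text else 0
    high = int(high_text) if high_text else math.inf
    if low < 0 or high < low:
        raise ValueError("need 0 <= LOW <= HIGH, got %r" % spec)
    return low, high


def fuzzy_mode(sender_size, low, high):
    """Decide how hard to look for a base file for one sender-side file."""
    if sender_size < low:
        return "skip"        # below LOW: don't even consider fuzzy matching
    if sender_size > high:
        return "aggressive"  # above HIGH: try harder to find a base file
    return "normal"          # otherwise: the existing --fuzzy behaviour
```

With the default of "0:", every file size falls into the "normal"
case, which corresponds to the old behaviour.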

In case of the more aggressive search: when running out of base files
with the original algorithm, try _all_ files in the destination
hierarchy that have exactly the same size, possibly sorted by the
Levenshtein distance between the full paths (directory plus file name).
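
The aggressive search could look roughly like the following Python
sketch. Again this is only an illustration of the idea, not how rsync
does or would do it: the function names are made up, paths are compared
relative to the destination root, and ties are broken alphabetically,
which is an arbitrary choice.

```python
import os
import stat


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]


def same_size_candidates(dest_root, sender_path, sender_size):
    """List regular files under dest_root whose size equals sender_size.

    Returned paths are relative to dest_root, ordered by Levenshtein
    distance to sender_path (closest first).
    """
    matches = []
    for dirpath, _dirnames, filenames in os.walk(dest_root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                st = os.lstat(full)
            except OSError:
                continue  # file vanished or is unreadable; ignore it
            if stat.S_ISREG(st.st_mode) and st.st_size == sender_size:
                matches.append(os.path.relpath(full, dest_root))
    return sorted(matches,
                  key=lambda rel: (levenshtein(rel, sender_path), rel))
```

The first entries of the returned list would be tried as base files
first. Note that this walks the whole destination hierarchy for every
lookup; a real implementation would more likely build a size index
once.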

The idea is to catch simple copies and moves, while still keeping
unreasonable base files away. Especially with bigger files, the
likelihood of two different files having exactly the same size is
pretty small. The risk is: unnecessary checksum calculations with a
wrong base file. If you think that risk is too high, don’t use that
option...

Is there a good reason why this functionality is not in rsync yet?


Regards,
Robert


