Need for a partial checksums patch?

Simo Melenius simo.melenius at iki.fi
Wed Dec 28 05:04:13 MST 2011


Hi everyone!

I played around with rsync sources a little and wrote a small patch that
computes the checksums from parts of the files only. I'm just writing to
ask if the rsync developers would have any interest in the sort of
functionality described below. If you do, I'm willing to work with you to
produce a cleaned-up patch for git.

For background: This started as a way to scratch an itch I had with backing
up my media files. I had some problems, presumably with timestamps (causing
some files to be backed up again even though they had not changed), and thus
I initially learned about 'rsync -c' to only copy files whose contents had
actually changed. However, checksumming big files (even dozens of
gigabytes) takes time. Now, I observed that my files rarely change, and
when they do, only a little and only in some parts. Also, undetected
corruption is not an issue here: I can survive that by other means. Yet
using --size-only would have been too coarse: I wanted to peek into the
contents of the file a little. So, basically, I needed a quick way to
recognize or fingerprint a big blob of data with high probability and check
if it had been backed up already.

I experimented by adding a new option that causes file_checksum() to not
sweep the file linearly but with increasing intervals. As a first approach,
I just doubled the index 'i' in each iteration and added another md5_update
to be applied at location size-i-CSUM_CHUNK. Thus, the file is checksummed
sparsely but with increasing density towards the beginning and the end of
the file. This seems to work well enough for me and, best of all, it's
blazing fast while still giving me enough practical confidence. Further
details and changes to the implementation and the approach are likely to
emerge.
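
To illustrate the sampling pattern described above, here is a rough sketch
in Python rather than rsync's C. It is not the actual patch: the function
names sparse_offsets() and sparse_md5(), the chunk size of 64 bytes, the
starting value of 'i', and the choice to read the sampled chunks in sorted
order are all assumptions made for the example.

```python
import hashlib
import os

CSUM_CHUNK = 64  # bytes per sampled chunk; an assumed value for this sketch


def sparse_offsets(size, chunk=CSUM_CHUNK):
    """Return sorted chunk offsets, dense near both ends of the file.

    Hypothetical helper: the index i doubles on each iteration, and every
    i is mirrored from the end of the file at size - i - chunk.
    """
    if size <= chunk:
        return [0]  # small file: a single read covers everything
    offsets = {0, size - chunk}  # always sample the first and last chunk
    i = chunk
    while i + chunk <= size:
        offsets.add(i)                 # distance i from the start
        offsets.add(size - i - chunk)  # mirrored location near the end
        i *= 2                         # gaps grow towards the middle
    return sorted(offsets)


def sparse_md5(path, chunk=CSUM_CHUNK):
    """MD5 over the sampled chunks only (hypothetical, not rsync's API)."""
    size = os.path.getsize(path)
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for off in sparse_offsets(size, chunk):
            f.seek(off)
            digest.update(f.read(chunk))
    return digest.hexdigest()
```

With these assumptions, a 1024-byte file is sampled at offsets 0, 64, 128,
256, 448, 512, 704, 832, 896 and 960. A change that falls between the
sampled chunks goes unnoticed, which is the trade-off accepted above.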

But at this point the question is whether the rsync team would be interested
in a fuzzy-checksum feature like this at all. I'll keep a local fork anyway,
but that only benefits me.


Simo

-- 
() Today is the car of the cdr of your life.
/\ http://arc.pasp.de/