Caching {filePath,mtime64,checksum} values to speed up execution-time

Wayne Davison wayned at samba.org
Thu Mar 13 18:37:16 MDT 2014


On Tue, Mar 11, 2014 at 3:11 PM, Doug Robinson
<doug.robinson at wandisco.com> wrote:

> I was wondering what folks thought of a proposal to enhance rsync to be
> able to create and maintain a cache of {filePath, 64-bit mtime, checksum}
> beforehand on both source and target systems and then use that cache later
> on when asked to sync the two systems together?


See patches (in order of recommendation): db.diff, checksum-updating.diff,
checksum-xattr.diff.

I personally use db.diff in one situation at work combined with a sqlite DB
on the source and destination machines.  You just need to periodically weed
out any old inode values (via rsyncdb --clean /dirs) if things start to
bloat.  In the future I'd like to see the db.diff code included by default
as loadable libraries, which would allow someone to install plain rsync and
then add the sqlite-using and/or mysql-using modules only if they
want the extra functionality.  There is also a plan to eventually have the
db code map the inodes in the db to paths for things like rename
optimizations.

That said, all these patches currently do is cache checksums.  The db
patch's default strict checking only uses a cached inode's info if the
size+mtime+ctime all match what we knew about the file when it was cached
(which makes it pretty safe).  If you switch to a more lax algorithm (no
ctime) you need to be extra sure the files don't get updated in a way that
leaves the file matching the laxer inode info (e.g. only let rsync make
changes to the files and/or make sure that modify timestamps always
increase so that there is no chance of accidentally matching an older inode
record).
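
To make the strict vs. lax distinction concrete, here is a rough Python
sketch of the idea.  It is not the patch's actual code (db.diff is C and
stores its records in a real database); the names FileInfo, CachedEntry,
file_info and cache_is_valid are made up purely for illustration.

```python
import os
from typing import NamedTuple


class FileInfo(NamedTuple):
    """What stat() reports for a file right now."""
    size: int
    mtime_ns: int
    ctime_ns: int


class CachedEntry(NamedTuple):
    """What was recorded when the checksum was cached (hypothetical layout)."""
    info: FileInfo
    checksum: str


def file_info(path: str) -> FileInfo:
    """Collect the stat fields that the cache check compares."""
    st = os.stat(path)
    return FileInfo(st.st_size, st.st_mtime_ns, st.st_ctime_ns)


def cache_is_valid(entry: CachedEntry, now: FileInfo, strict: bool = True) -> bool:
    """Return True if the cached checksum may be trusted for this file."""
    # Size and mtime must match in both the strict and the lax mode.
    if entry.info.size != now.size or entry.info.mtime_ns != now.mtime_ns:
        return False
    # Strict mode also requires ctime to match; lax mode skips this test.
    if strict and entry.info.ctime_ns != now.ctime_ns:
        return False
    return True
```

A file that was rewritten to the same size and then had its mtime set back
would still show a new ctime, so the strict check rejects the stale
checksum while the lax check would accept it.  That is the case the
warning above is about.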

If you're wondering how an mtime-using algorithm helps your use case, keep
in mind that the mtimes don't need to match between hosts, just between
each host's files and its db cache (and any non-matching or missing ones
get (re)computed to the new checksum).
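
As a rough illustration of that per-host lookup, the following Python
sketch returns a cached checksum when the local stat info still matches
and recomputes it otherwise.  It is only a sketch under assumptions: the
cache here is an in-memory dict keyed by path rather than the patch's
inode-keyed database, MD5 via hashlib is a stand-in for whatever checksum
is in use, and the names Entry, compute_checksum and checksum_for are
hypothetical.

```python
import hashlib
import os
from typing import Dict, NamedTuple


class Entry(NamedTuple):
    """Stat info and checksum recorded on THIS host (hypothetical layout)."""
    size: int
    mtime_ns: int
    ctime_ns: int
    checksum: str


def compute_checksum(path: str) -> str:
    """Read the whole file and hash it (the expensive step being avoided)."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()


def checksum_for(path: str, cache: Dict[str, Entry]) -> str:
    """Use the cached checksum if the local stat info matches, else recompute."""
    st = os.stat(path)
    entry = cache.get(path)
    if entry is not None and (entry.size, entry.mtime_ns, entry.ctime_ns) == (
        st.st_size, st.st_mtime_ns, st.st_ctime_ns
    ):
        return entry.checksum
    # Missing or non-matching record: recompute and store the new values.
    checksum = compute_checksum(path)
    cache[path] = Entry(st.st_size, st.st_mtime_ns, st.st_ctime_ns, checksum)
    return checksum
```

Each host would run something like this against its own cache, and only
the resulting checksums are compared between the two sides, so the mtimes
on the source and destination never have to agree with each other.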

I'll also point out that if you want to use sqlite, I recommend you use the
very latest db.diff (from the git patches repo) since it has a change that
alleviates locking contention between the multiple rsync processes in a
single copy (you can't really share the db between simultaneous rsync
copies due to sqlite's poor multi-process locking -- use mysql for that).

The rsyncdb manpage has info on initializing the db, noting mounts,
maintenance, etc.

The other patches might also be useful to you, so feel free to check them
out:  https://git.samba.org/?p=rsync-patches.git

..wayne..