Caching {filePath,mtime64,checksum} values to speed up execution-time

Doug Robinson doug.robinson at wandisco.com
Wed Mar 12 07:41:52 MDT 2014


Kevin:

On Tue, Mar 11, 2014 at 6:18 PM, Kevin Korb <kmk at sanitarium.net> wrote:
>--checksum should not be used during normal rsync operations.  It is
>for special cases only.

I noted in a reply that we're in that "special case" arena.  The use
case is one where operations are replicated between systems and the
resulting data files are meant to be identical.  However, bad things
happen and then repairs are needed.  Since it is the operations that
are replicated and not the data, the time stamps do not match but the
byte-counts should.  Unfortunately, matching byte-counts do not prove
that the contents match, so the checksums must be computed.
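
To make the use case concrete, here is a minimal sketch of the
comparison a repair pass has to make.  It is not rsync's code: the
helper names (file_checksum, needs_repair) are made up for
illustration, and SHA-256 merely stands in for whatever checksum the
tool really uses.

```python
import hashlib
import os


def file_checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_repair(source_path, replica_path):
    """Return True when the replica's contents differ from the source.

    mtimes are deliberately ignored: both files were produced by
    replaying operations, so their time stamps are not expected to agree.
    """
    if os.path.getsize(source_path) != os.path.getsize(replica_path):
        return True  # cheap check: files of different size cannot match
    # Equal sizes prove nothing, so both files must be read in full.
    return file_checksum(source_path) != file_checksum(replica_path)
```

The second branch is the expensive one: every same-size pair costs a
full read of both files.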

There are a lot of products that are doing this type of replication
these days: Subversion's svnsync and (the company I work for)
WANdisco's SVN MultiSite products to mention just two.  There are a
lot more.

>Rsync can still have a lot of overhead getting the timestamps via
>stat() but that can't really be helped.

The overhead of getting the timestamps is tiny compared with the I/O
needed to compute the checksums.

>I don't really understand how file mtimes would be cached.  How would
>rsync know what mtimes don't match the cache without checking
>stat()ing the files and then the job is already done so the cache
>wouldn't accomplish anything.

The concept is that, barring file system corruption or intentional
covert tampering, the tuple {filePath, 64-bit mtime, checksum} is
an invariant.  Cache invalidation is simple: if the mtime is not
exactly the same as the cached value, re-compute the checksum.  The
stat() still happens, but it replaces reading the file rather than
adding to it.  Using such a cache means that the comparison costs
about as much as a "find" through the subtree of the file system,
rather than reading every file in that subtree AND doing the
CPU-intensive checksum computation.  It would be a huge win.
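
A rough sketch of what I have in mind, in Python rather than rsync's
C.  Everything here is an assumption for illustration: the
ChecksumCache class, the JSON cache file, the use of SHA-256 and the
"misses" counter are not anything rsync has today.

```python
import hashlib
import json
import os


def file_checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


class ChecksumCache:
    """Maps file path -> [mtime in nanoseconds, checksum]."""

    def __init__(self, cache_file):
        self.cache_file = cache_file
        self.entries = {}
        self.misses = 0  # how many times a file had to be re-read
        if os.path.exists(cache_file):
            with open(cache_file, "r", encoding="utf-8") as handle:
                self.entries = json.load(handle)

    def checksum(self, path):
        """Return the checksum, re-reading the file only if mtime changed."""
        mtime_ns = os.stat(path).st_mtime_ns  # one stat(), no file I/O
        entry = self.entries.get(path)
        if entry is not None and entry[0] == mtime_ns:
            return entry[1]  # cache hit: mtime is exactly the same
        self.misses += 1
        digest = file_checksum(path)  # cache miss: pay for the full read
        self.entries[path] = [mtime_ns, digest]
        return digest

    def save(self):
        """Write the cache back to disk for the next run."""
        with open(self.cache_file, "w", encoding="utf-8") as handle:
            json.dump(self.entries, handle)
```

Each side would build its cache ahead of time and call checksum()
for every path during the sync; only files whose mtime moved since
the last run get read again.  Deleting the cache file forces a full
re-computation.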

Thank you.

Doug

>On 03/11/2014 06:11 PM, Doug Robinson wrote:
>> Folks:
>>
>> When using rsync to copy huge amounts of data I've found that a
>> significant amount of time is spent computing the checksums.
>> Sometimes hours, ... sometimes days - it depends on the total
>> amount of data checked!  And after that sometimes it's only a few
>> files that need to be updated.
>>
>> I've pulled the latest git (rsync-3.1.1pre1) and didn't see
>> anything to address this (or I missed it?).
>>
>> I was wondering what folks thought of a proposal to enhance rsync
>> to be able to create and maintain a cache of {filePath, 64-bit
>> mtime, checksum} beforehand on both source and target systems and
>> then use that cache later on when asked to sync the two systems
>> together?  Then cache entry validation would be a quick stat64() to
>> make sure that the 64-bit mtime didn't change before sending the
>> checksum over the wire for comparison.
>>
>> Clearly the cache would need to be completely invalidated (or
>> re-created) if the file system became corrupt.  That could be
>> handled via an "rm -rf" of the cache.
>>
>> Thoughts?
>>
>> Thank you.
>>
>> Doug -- WANdisco // /Non-Stop Data/
>>
>> t. 925-396-1125 e. doug.robinson at wandisco.com
>> <mailto:doug.robinson at wandisco.com>
>- --
>~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,-*~
>        Kevin Korb                      Phone:    (407) 252-6853
>        Systems Administrator           Internet:
>        FutureQuest, Inc.               Kevin at FutureQuest.net  (work)
>        Orlando, Florida                kmk at sanitarium.net (personal)
>        Web page:                       http://www.sanitarium.net/
>        PGP public key available on web site.
>~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,-*~

-- 
Doug Robinson

WANdisco // *Non-Stop Data*

t. 925-396-1125
e. doug.robinson at wandisco.com
