silent data corruption with rsync

Leen Besselink leen at consolejunky.net
Tue Mar 11 12:05:32 MDT 2014


On Tue, Mar 11, 2014 at 11:52:51AM -0500, Karl O. Pinc wrote:
> On 03/11/2014 11:02:28 AM, Sig Pam wrote:
> > Hi everybody!
> > 
> > I'm currently working on a project which has to copy huge amounts of
> > data from one storage system to another. For reasons I can no longer
> > trace, there is a rumor that "rsync may silently corrupt data".
> > Personally, I don't believe that.
> > 
> > "They" explain it this way: "rsync does an in-stream data
> > deduplication. It creates a checksum for each data block to transfer,
> > and if a block with the same checksum has already been transferred
> > earlier, this old block will be re-used to save bandwidth. But, for
> > whatever reason, two different blocks can produce the same checksum
> > even if the source data is not the same, effectively corrupting the
> > data stream".
> 
> Well, yeah.  It works that way if you're transferring data over
> the network.
> 
> The question is: "how often will this problem exhibit itself?"
> The answer is: "Usually, never within the lifetime of the Universe."
> 
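
To put a rough number on "never within the lifetime of the Universe",
here is a back-of-the-envelope sketch. It treats the problem as a plain
birthday bound over random data, which is a simplification: real rsync
only compares blocks within one file at a time, and its block size
varies. The 1 PiB workload and the 128 KiB block size are assumptions I
picked for illustration, not anything from the original question.

```python
import math

def collision_probability(num_blocks: int, hash_bits: int) -> float:
    """Birthday-bound approximation: chance that any two of num_blocks
    random blocks share the same hash_bits-bit checksum."""
    pairs = num_blocks * (num_blocks - 1) / 2
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x
    return -math.expm1(-pairs / 2.0 ** hash_bits)

# Assumed workload: 1 PiB of data split into 128 KiB blocks (2**33 blocks)
blocks = 2**50 // 2**17

print("128-bit strong checksum:", collision_probability(blocks, 128))
print("32-bit weak checksum only:", collision_probability(blocks, 32))
```

With these assumed numbers the 128-bit case comes out on the order of
1e-19, while a 32-bit checksum on its own would collide almost
certainly. That is why a block only counts as a match when the strong
checksum agrees as well.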

If anyone wants a much longer description of how the rsync algorithm
works, there was a talk at the Ottawa Linux Symposium by Andrew Tridgell:

http://www.linuxsymposium.org/2000/rsync.php

I found a recording here:

http://ftp.gnumonks.org/pub/congress-talks/ols2000/high/cd2/2000-07-21_15-02-49_C_64.mp3

If you prefer reading, there is a transcript on SourceForge in LyX format:

http://olstrans.cvs.sourceforge.net/viewvc/olstrans/ols2000/transcripts/completed/OLS2000-rsync.lyx?view=markup
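
For those who would rather read code than a transcript, here is a
minimal sketch of the block-matching idea. It is not rsync's actual
code: the 4-byte block size, the function names and the operation
format are all made up for illustration, and the weak checksum is
recomputed at every offset instead of being rolled forward. The point
it tries to show is that blocks are matched against the old copy of the
file on the receiving side, and that a block is only reused when both
the weak and the strong checksum agree.

```python
import hashlib

BLOCK = 4  # unrealistically small block size, chosen so the example is readable

def weak_sum(block: bytes) -> int:
    """Adler-32-style sum; stands in for rsync's rolling checksum."""
    a = sum(block) & 0xFFFF
    b = sum((len(block) - i) * x for i, x in enumerate(block)) & 0xFFFF
    return (b << 16) | a

def strong_sum(block: bytes) -> bytes:
    """Strong checksum; MD5 here, older rsyncs used MD4."""
    return hashlib.md5(block).digest()

def signatures(old: bytes) -> dict:
    """Receiver side: map weak sum -> list of (strong sum, block index)."""
    table = {}
    for start in range(0, len(old) - BLOCK + 1, BLOCK):
        blk = old[start:start + BLOCK]
        table.setdefault(weak_sum(blk), []).append((strong_sum(blk), start // BLOCK))
    return table

def delta(new: bytes, table: dict) -> list:
    """Sender side: emit ('copy', block_index) or ('data', literal_bytes)."""
    ops, literal, i = [], bytearray(), 0
    while i < len(new):
        blk = new[i:i + BLOCK]
        match = None
        if len(blk) == BLOCK:
            # The weak sum is a cheap filter; the strong sum must also agree.
            for strong, idx in table.get(weak_sum(blk), []):
                if strong == strong_sum(blk):
                    match = idx
                    break
        if match is None:
            literal.append(new[i])  # no match at this offset: send the byte itself
            i += 1
        else:
            if literal:
                ops.append(("data", bytes(literal)))
                literal.clear()
            ops.append(("copy", match))
            i += BLOCK
    if literal:
        ops.append(("data", bytes(literal)))
    return ops

def patch(old: bytes, ops: list) -> bytes:
    """Receiver side: rebuild the new file from old blocks plus literal data."""
    out = bytearray()
    for kind, value in ops:
        out += old[value * BLOCK:(value + 1) * BLOCK] if kind == "copy" else value
    return bytes(out)

old = b"abcdefghijkl"
new = b"abcdXXefghijkl"
ops = delta(new, signatures(old))
rebuilt = patch(old, ops)
print(ops)
# Final safety net: compare a checksum over the whole rebuilt file.
print(hashlib.md5(rebuilt).digest() == hashlib.md5(new).digest())
```

The last line mirrors what the rsync man page describes: every
transferred file is verified with a whole-file checksum after it has
been reconstructed, so a bad block match would have to slip past that
check as well before it could corrupt anything silently.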

> You're a lot more likely to have data corruption due to a
> cosmic ray hitting your box.
> 
> There are some cases where the answer is: "Maybe more often."  The
> only time I can think of that you'd want to worry about is if you're
> researching MD5 checksum collisions and have a lot of data on disk
> that has collisions in the checksumming.  In other words, if you're
> actively trying to cause problems it might be an issue.
> 
> (The older rsyncs used MD4.)
> 
> If you're actually _copying_ data rather than backing it up then
> avoid the issue by not using rsync.  Otherwise the tradeoff
> is worth the risk.
> 
> Karl <kop at meme.com>
> Free Software:  "You don't pay back, you pay forward."
>                  -- Robert A. Heinlein

