silent data corruption with rsync
Leen Besselink
leen at consolejunky.net
Tue Mar 11 12:05:32 MDT 2014
On Tue, Mar 11, 2014 at 11:52:51AM -0500, Karl O. Pinc wrote:
> On 03/11/2014 11:02:28 AM, Sig Pam wrote:
> > Hi everybody!
> >
> > I'm currently working on a project which has to copy huge amounts of
> > data from one storage system to another. For reasons I can no longer
> > verify, there is a rumor that "rsync may silently corrupt data".
> > Personally, I don't believe that.
> >
> > "They" explain it this way: "rsync does an in-stream data
> > deduplication. It creates a checksum for each data block to transfer,
> > and if a block with the same checksum has already been transferred
> > earlier, this old block will be re-used to save bandwidth. But, for
> > whatever reason, two different blocks can produce the same checksum
> > even if the source data is not the same, effectively corrupting the
> > data stream".
>
> Well, yeah. It works that way if you're transferring data over
> the network.
>
> The question is: "how often will this problem exhibit itself?"
> The answer is: "Usually, never within the lifetime of the Universe."
>
If anyone wants a much longer description of how the rsync algorithm works,
there was a talk at the Ottawa Linux Symposium by Andrew Tridgell:
http://www.linuxsymposium.org/2000/rsync.php
I found a recording here:
http://ftp.gnumonks.org/pub/congress-talks/ols2000/high/cd2/2000-07-21_15-02-49_C_64.mp3
If you prefer reading, there is a transcript on SourceForge in LyX format:
http://olstrans.cvs.sourceforge.net/viewvc/olstrans/ols2000/transcripts/completed/OLS2000-rsync.lyx?view=markup
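
For the impatient, here is a rough sketch of the block-matching idea, not
rsync's actual code. Assumptions: zlib's Adler-32 stands in for rsync's
weak rolling checksum (and is recomputed per window here instead of being
rolled), MD5 is the strong checksum, the block size is absurdly small, and
the names signatures(), delta() and patch() are made up for illustration.

```python
import hashlib
import zlib

BLOCK = 4  # tiny block size, for illustration only


def signatures(old: bytes) -> dict:
    """Index the receiver's copy: weak checksum -> [(strong digest, block index)]."""
    table = {}
    for index in range(len(old) // BLOCK):
        block = old[index * BLOCK:(index + 1) * BLOCK]
        weak = zlib.adler32(block)            # cheap checksum (stand-in, not rolled here)
        strong = hashlib.md5(block).digest()  # expensive checksum, confirms a weak hit
        table.setdefault(weak, []).append((strong, index))
    return table


def delta(new: bytes, table: dict) -> list:
    """Encode the sender's file as ('copy', block_index) and ('data', bytes) ops."""
    ops, literal, pos = [], bytearray(), 0
    while pos < len(new):
        window = new[pos:pos + BLOCK]
        match = None
        if len(window) == BLOCK:
            for strong, index in table.get(zlib.adler32(window), []):
                # A block is reused only if BOTH checksums agree.
                if strong == hashlib.md5(window).digest():
                    match = index
                    break
        if match is None:
            literal.append(new[pos])  # no match: send this byte literally
            pos += 1
        else:
            if literal:
                ops.append(("data", bytes(literal)))
                literal = bytearray()
            ops.append(("copy", match))
            pos += BLOCK
    if literal:
        ops.append(("data", bytes(literal)))
    return ops


def patch(old: bytes, ops: list) -> bytes:
    """Rebuild the sender's file on the receiver from its old copy plus the ops."""
    out = bytearray()
    for kind, value in ops:
        if kind == "copy":
            out += old[value * BLOCK:(value + 1) * BLOCK]
        else:
            out += value
    return bytes(out)


old = b"the quick brown fox jumps"
new = b"the quick red fox jumps!"
ops = delta(new, signatures(old))
rebuilt = patch(old, ops)
```

The corruption people worry about would be a "copy" op chosen for a block
whose weak and strong checksums both match while the bytes differ.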
> You're a lot more likely to have data corruption due to a
> cosmic ray hitting your box.
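
To put a rough number on "never", here is a back-of-the-envelope union
bound. It assumes a uniformly random 128-bit digest compared in full and
non-adversarial data. The 1 PiB data size and 128 KiB block size are
made-up figures, and collision_bound() is a hypothetical helper. The weak
checksum is ignored, so this only bounds strong-checksum collisions.

```python
from fractions import Fraction


def collision_bound(num_blocks: int, digest_bits: int = 128) -> Fraction:
    """Upper bound on the chance that any two distinct blocks share a digest."""
    pairs = num_blocks * (num_blocks - 1) // 2  # number of block pairs
    return Fraction(pairs, 2 ** digest_bits)    # each pair collides with probability 2**-bits


# Illustrative assumption: 1 PiB of data split into 128 KiB blocks.
blocks = (2 ** 50) // (128 * 1024)  # 2**33 blocks
bound = collision_bound(blocks)
print(float(bound))  # slightly below 2**-63, i.e. on the order of 1e-19
```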
>
> There are some cases where the answer is: "Maybe more often." The only
> case I can think of where you'd want to worry is if you're researching
> MD5 checksum collisions and have a lot of colliding data on disk. In
> other words, if you're actively trying to cause problems it might be
> an issue.
>
> (The older rsyncs used MD4.)
>
> If you're actually _copying_ data rather than backing it up then
> avoid the issue by not using rsync. Otherwise the tradeoff
> is worth the risk.
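
Whichever tool does the copying, you can check the result independently
afterwards. A minimal sketch, assuming SHA-256 is acceptable and both
trees are readable from one machine. file_digest() and same_content() are
hypothetical helper names, not part of rsync. (As far as I know the rsync
man page also describes a whole-file checksum that is verified after each
file is transferred.)

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def same_content(source: Path, dest: Path) -> bool:
    """True if both files hash to the same SHA-256 digest."""
    return file_digest(source) == file_digest(dest)
```

Run same_content() over each source/destination pair after the copy; any
False result means that file should be copied again and rechecked.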
>
> Karl <kop at meme.com>
> Free Software: "You don't pay back, you pay forward."
> -- Robert A. Heinlein