silent data corruption with rsync

Karl O. Pinc kop at meme.com
Tue Mar 11 10:52:51 MDT 2014


On 03/11/2014 11:02:28 AM, Sig Pam wrote:
> Hi everybody!
> 
> I'm currently working on a project which has to copy huge amounts of
> data from one storage to another. For a reason I cannot validate any
> longer, there is a rumor that "rsync may silently corrupt data".
> Personally, I don't believe that.
> 
> "They" explain it this way: "rsync does an in-stream data
> deduplication. It creates a checksum for each data block to transfer,
> and if a block with the same checksum has already been transferred
> earlier, this old block will be re-used to save bandwidth. But, for
> any reason, two different blocks can produce the same checksum even
> if the source data is not the same, effectively corrupting the data
> stream".

Well, yeah.  It works that way if you're transferring data over
the network.

The question is: "how often will this problem exhibit itself?"
The answer is: "Usually, never within the lifetime of the Universe."

You're a lot more likely to have data corruption due to a 
cosmic ray hitting your box.
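
To put a rough number on "never", here is a minimal sketch of the
birthday-bound arithmetic.  It assumes the checksums behave like
uniformly random 128-bit values (the width of MD5 and MD4) and asks
how likely it is that any two blocks in a transfer share a checksum.
The block size and data volume are made-up figures for illustration,
and this is not a model of how rsync actually matches blocks.

```python
import math


def collision_probability(num_blocks: int, checksum_bits: int = 128) -> float:
    """Approximate chance that any two of num_blocks distinct blocks
    share a checksum, assuming uniformly random checksum values."""
    pairs = num_blocks * (num_blocks - 1) / 2
    # Birthday approximation: 1 - exp(-pairs / 2^bits).
    # expm1 keeps precision when the result is extremely small.
    return -math.expm1(-pairs / 2.0 ** checksum_bits)


# Hypothetical workload: 1 PiB of data in 1 KiB blocks = 2^40 blocks.
blocks = 2 ** 40
p = collision_probability(blocks)
print(f"{blocks} blocks: collision probability about {p:.2e}")
```

Under those assumptions the result is on the order of 1e-15 for the
whole petabyte, which is the sense in which accidental collisions
can be ignored.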

There are some cases where the answer is: "Maybe more often."  The
only time I can think of that you'd want to worry is if you're
researching MD5 checksum collisions and have a lot of data on disk
with colliding checksums.  In other words, if you're actively trying
to cause problems it might be an issue.

(The older rsyncs used MD4.)

If you're actually _copying_ data rather than backing it up, then
avoid the issue by not using rsync.  Otherwise the bandwidth savings
are worth the risk.
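
If you go the plain-copy route and want some reassurance afterwards,
one option is to compare a digest of the source and the destination.
This is only a sketch: the function names are hypothetical, SHA-256
is my choice rather than anything rsync uses, and it assumes both
paths are readable from the same machine.

```python
import hashlib
import shutil


def file_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def copy_and_verify(src: str, dst: str) -> bool:
    """Copy src to dst in full, then check that the digests match."""
    shutil.copyfile(src, dst)
    return file_digest(src) == file_digest(dst)
```

A False return means the copy should be redone.  Reading every file
twice costs I/O, so this fits one-off copies better than routine
backups.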

Karl <kop at meme.com>
Free Software:  "You don't pay back, you pay forward."
                 -- Robert A. Heinlein

