file corruption

f-rsync at media.mit.edu
Fri Mar 8 22:12:21 MST 2013


    > Date: Fri, 08 Mar 2013 22:26:24 -0500
    > From: Kevin Korb <kmk at sanitarium.net>

    > If it were me, based on my previous experience, I would shut down both
    > systems and run memtest86+ or "Windows Memory Diagnostics" on both
    > systems.  Make sure to enable the extended tests.  Let them run
    > overnight and see if they identify a problem.

...but note that "no errors" doesn't mean "RAM good."

In particular, I once had a motherboard that would corrupt certain bit
patterns in RAM only when CPU throttling was enabled and the CPU had
throttled down at the wrong moment.  The transfers were to/from
cryptographic filesystems, so problems at the drive or interface level
would have corrupted entire blocks, which wasn't happening.  I
discovered it after using dd and nc to transfer about 2TB from one
machine to another over the network at the block level: being
paranoid, I checksummed both ends afterwards and discovered they
didn't match.

Once I narrowed it down to a few problematic files, I did
  while [ 1 ]; do md5sum some-file; sleep 10; done
and watched the output.  If I was running something CPU-bound in
another window, every checksum matched.  If the machine was idle,
then some -didn't- match.  Whoops.  [The solution for that particular
machine was to disable CPU throttling.  Problem solved.  Presumably
there was some flaky timing when the speed of various buses changed.]
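
A minimal sketch of the same repeated-checksum idea, for anyone who
would rather have the machine do the watching.  The function name
count_distinct_sums, its arguments, and the defaults are made up for
illustration; only md5sum, sort, wc, and friends are assumed to be
installed.  It hashes one unchanging file several times and prints how
many distinct digests it saw, so anything other than 1 is suspicious.

```shell
#!/bin/sh
# count_distinct_sums FILE [ROUNDS] [DELAY_SECONDS]
# Hypothetical helper: hash FILE repeatedly and print the number of
# distinct MD5 digests seen.  An unchanging file on healthy hardware
# should always produce exactly one.
count_distinct_sums() {
    file=$1
    rounds=${2:-10}   # how many times to hash (assumed default)
    delay=${3:-10}    # seconds between hashes (assumed default)
    i=0
    while [ "$i" -lt "$rounds" ]; do
        # Print only the digest, not the file name
        md5sum < "$file" | cut -d' ' -f1
        i=$((i + 1))
        # Pause between rounds so an idle CPU has a chance to throttle down
        if [ "$i" -lt "$rounds" ]; then
            sleep "$delay"
        fi
    done | sort -u | wc -l | tr -d ' '
}
```

Something like `count_distinct_sums some-file 100 10` would then be
run once with the machine idle and once with a CPU-bound job going in
another window.  Note that a small file will usually be re-read from
the page cache rather than from disk, which is what you want if RAM is
the suspect.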

I'd say, if RAM testing turns up nothing, you should try shipping a
few terabytes of random bits to the far machine and use an nc tunnel
to redirect them back to the sending machine and compare what you get.
That may implicate the network hardware, the remote machine, whatever,
but it would take rsync itself out of the picture.  Or, if you don't
want to set up the reflected tunnel, then just take some disk that
isn't getting written to (e.g., a -dismounted- filesystem), checksum
it, and then dd | nc it to the remote machine and run that through
the same checksum (no need to write it to disk there).  If they match,
then flip the sender & receiver and try it again.
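
The one-way check might be sketched roughly as below.  The function
names, the port, and the host are placeholders, not anything from an
actual setup.  The nc syntax is an assumption too: `nc -l PORT` is the
OpenBSD-style form, traditional netcat wants `nc -l -p PORT`, and some
variants need an option such as -N or -q 0 on the sending side before
they will close the connection at end of input.

```shell
#!/bin/sh
# stream_sum SRC
# Read SRC (a file or an unmounted block device) with dd and print its
# MD5 digest.  Run this locally on the sending machine.
stream_sum() {
    dd if="$1" bs=1048576 2>/dev/null | md5sum | cut -d' ' -f1
}

# send_stream SRC HOST PORT
# Push the same bytes over the network to a listener on HOST:PORT.
send_stream() {
    dd if="$1" bs=1048576 2>/dev/null | nc "$2" "$3"
}

# recv_sum PORT
# Run on the far machine: hash whatever arrives, without ever writing
# it to disk there.  (OpenBSD-style listen syntax assumed.)
recv_sum() {
    nc -l "$1" | md5sum | cut -d' ' -f1
}

# Illustrative use, with placeholder device, host, and port:
#   far machine:   recv_sum 9000
#   near machine:  stream_sum /dev/sdX1
#                  send_stream /dev/sdX1 far-host 9000
# then compare the two digests by eye.
```

If the digests agree in one direction, swap the roles of the two
machines and repeat, as above.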

