rsync patch

Sun Oct 7 14:40:05 MDT 2012

I've made a small patch to rsync that adds three options that are
useful in data-recovery situations.  I don't know whether the
maintainer will want to add this to the official distribution, but he
is free to do so if he wishes.  At present, I don't have anywhere to
host the patch but I wanted to make it available so it may be tested
more thoroughly.

To be honest, I haven't actually tested the --fill-byte option just
yet as I only added it this morning for the sake of completeness.  I
had previously hard-coded 0xfe during my recent data-recovery
sessions.  I have several drives that were damaged by domestic
police/military "authorities" in Toronto, Canada who sought to destroy
evidence and/or to corrupt my own personal files just to cause
trouble, and I've still got three drives that need new logic boards
and EEPROM transplants, so I will be doing much more testing in the
coming months.

The main changes are to fileio.c in the map_file() function, which
I've restructured a little to allow for retrying errors, delaying
before retrying errors, and filling the read buffer with an arbitrary
byte to make the identification of unrecovered data easier.

The three options are:

--retry-errors=NUM, which retries the residue of a failed read NUM
times

--retry-delay=NUM, which sleeps after an error is encountered before
retries are attempted.  This helps some disks that go nuts when they
encounter certain kinds of media error.  For instance, a WD Elements
1023 (a 750 GB 2.5" USB drive) will sometimes lose its marbles on bad
tracks, seeking the heads all over the place very rapidly while
slowing reads to a crawl.  If there is a 10-second pause after an
error is reported, the quiescent drive parks its heads, after which
reading continues at a reasonable pace, further errors
notwithstanding.

--fill-byte=NUM, which fills the read buffer with your byte of choice.
This helps to identify the parts of files that remain unrecovered.  By
default, rsync uses 0x00 but this can be confused with valid data in
many file types.
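
To get a rough idea of how much of a recovered file is filler, one
can count the bytes equal to the fill value.  What follows is a
minimal shell sketch and not part of the patch: the function name
count_fill_bytes is made up for illustration, and it assumes the fill
byte is given in octal (376 is 0xfe).  Legitimate 0xfe bytes in the
data are counted too, so a nonzero result is only a hint.

```shell
# Hypothetical helper (not part of the patch): count the bytes in a file
# that are equal to the fill byte.
# $1 = file to inspect, $2 = fill byte in octal (default 376, i.e. 0xfe).
count_fill_bytes() {
    file=$1
    octal=${2:-376}
    # tr -cd deletes every byte except the fill byte; wc -c counts the rest.
    # LC_ALL=C makes tr treat its input as raw bytes.
    n=$(LC_ALL=C tr -cd "\\$octal" < "$file" | wc -c)
    # Arithmetic expansion strips the padding some wc implementations add.
    echo $((n))
}
```

A file that reports zero contains no fill bytes at all; anything else
needs a closer look with a hex viewer.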

Typical usage in a data-recovery scenario where the filesystem isn't
really screwed up would be something like:

mount -o  /dev/volume/bad /mnt/bad

rsync -ai --ignore-existing --retry-errors=1 --retry-delay=30 --fill-byte=0xfe /mnt/bad /mnt/recover 2>&1 | tee /mnt/recover/recovery.logN

I've written a crappy little script that scans the log files and
installs or removes missing files so successive runs do not attempt to
read files that are known to be bad.  It isn't ready for prime-time,
so I've not included it here but such a utility can be made in a few
minutes if you're familiar with any scripting language.  Adding
missing directories and/or soft links would also help.
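
As a rough illustration of the kind of script described above (this
is not the author's script), the following shell sketch pulls the
names of files that reported read errors out of the logs and leaves
an empty placeholder for each one under the destination, so that a
later run with --ignore-existing skips them.  It assumes the errors
show up in the log as lines of the form
rsync: read errors mapping "PATH": REASON, which should be checked
against your own logs.  The function names are made up.

```shell
# List the paths that rsync reported read errors for, one per line.
# ASSUMPTION: error lines look like
#   rsync: read errors mapping "PATH": REASON
# $@ = one or more log files.
list_bad_files() {
    sed -n 's/^rsync: read errors mapping "\(.*\)": .*$/\1/p' "$@" | sort -u
}

# Create an empty placeholder in the destination for each bad file so
# that --ignore-existing skips it on the next run.
# $1 = source prefix to strip, $2 = destination prefix, rest = log files.
mark_bad_files() {
    src=$1
    dest=$2
    shift 2
    list_bad_files "$@" | while IFS= read -r path; do
        rel=${path#"$src"/}
        # Skip anything that is not under the source prefix.
        [ "$rel" = "$path" ] && continue
        # Never clobber a file that is already in the destination.
        [ -e "$dest/$rel" ] && continue
        mkdir -p "$(dirname "$dest/$rel")" && : > "$dest/$rel"
    done
}
```

With the example command above, which copies /mnt/bad into
/mnt/recover/bad, a call might look like
mark_bad_files /mnt /mnt/recover /mnt/recover/recovery.log*
and the placeholders have to be removed again before those files are
retried.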

Note that if you wish to retain files with corrupt data, you must
include the --ignore-errors flag, otherwise any file that has at least
one unreadable segment will be deleted before it is put in place in
the destination directory.

It is reasonable to do a run with the options as shown in the above
example, and then to assemble a list of bad files to retry in
successive attempts.  The quality of a data-recovery undertaking will
depend quite a bit on the exact nature of the drive fault.

[rambling on a little further]

On more than one occasion, someone caused one of my drives to
power-down the spindle while leaving the heads out over the platter.
In such an instance, the surfaces of the heads and platters are
sufficiently flat to cause the two to stick together, frustrating
platter rotation.  You have to physically slide the heads off into the
park position.  If I'd had a laboratory at the time, I would have
attempted to use shims, but that risks the geometry of the
head-stack, and doesn't guarantee that the heads will come unstuck.  I
think there may be an atomic force at work that causes two atomically
flat surfaces to stick together, but all of my physics education is
hazy and from long ago.

Under Linux, there are sysfs options which can improve recovery time.
By default there is a 30-second command timeout for the SCSI
emulation.  Some drives go comatose in some error conditions and only
come back with a bus reset and an aborted command.  Command timeouts
are retried by the kernel six times, so without adjusting the SCSI
command timeout rsync (or anything in userspace) will be waiting three
minutes (6 x 30 seconds) after each error.  A timeout of five seconds
is long enough to exceed the USB bus reset cycle.  Additionally, it is
probably wise to set nr_requests to 1, read_ahead_kb to 0 and nomerges
to 1 for the affected drive.
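
The sysfs settings above could be applied with something like the
following shell sketch.  The function name tune_bad_drive and its
arguments are made up for illustration.  It assumes the usual sysfs
layout, with the command timeout in device/timeout and the other
values under queue/ (where the read-ahead setting is spelled
read_ahead_kb), and it needs root when pointed at the real /sys.

```shell
# Hypothetical helper: apply the settings discussed above to one drive.
# $1 = kernel device name (for example sdb)
# $2 = SCSI command timeout in seconds (default 5)
# $3 = sysfs mount point (default /sys; only overridden for testing)
tune_bad_drive() {
    dev=$1
    timeout=${2:-5}
    sys=${3:-/sys}
    base=$sys/block/$dev

    # SCSI command timeout, in seconds.
    echo "$timeout" > "$base/device/timeout" || return 1
    # Ask for one request at a time; the kernel may clamp this to its
    # own minimum.
    echo 1 > "$base/queue/nr_requests" || return 1
    # No read-ahead, so a bad sector only fails the read that needs it.
    echo 0 > "$base/queue/read_ahead_kb" || return 1
    # Value suggested in the text above.
    echo 1 > "$base/queue/nomerges" || return 1
}
```

For a drive that shows up as sdb this would be run as root as
tune_bad_drive sdb, or tune_bad_drive sdb 10 for a ten-second
timeout.  The settings do not survive a reboot or a replug of the
drive.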

Feedback on the patch is appreciated.

Sincerely,

Steve Thompson