[clug] data corruption riddle

Wed Jun 4 10:04:19 EST 2003

I have been looking at this for a while now and still cannot explain
what is going on.

I tar to tape.
I untar from tape.
tar finds no extract errors but some files are bad.

I tried the untar on different machines and operating systems so
I am rather sure this is not a problem with 'tar -x'.

And this is not all, the actual process is more involved.

machine e7 runs 'rsh e4 tar -c... -f -'
	stdout piped into 'buffer' (on e7)
		which writes to tape (on e7)
If I change the output of 'buffer' to go to a local disk then
I never get the corruption (well, I have failed to reproduce it so
far), but writing to tape has never been clean.
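
Since the disk target has been clean so far, one way to narrow this
down might be to keep a disk copy of exactly the bytes that are sent
to the tape, then read the tape back and compare the two. A minimal
sketch only: write_with_copy, verify_against_copy and /scratch/ref.tar
are made-up names, and dd stands in for 'buffer' here so the functions
can be tried on plain files. It also assumes the stream length is a
multiple of the 10240-byte block size, which should hold for tar's
default blocking.

```sh
# Hypothetical helpers, not part of the backup script.

# $1 = reference copy on disk, $2 = sink (file or device),
# remaining arguments = the command that produces the stream
write_with_copy() {
    ref=$1
    sink=$2
    shift 2
    # tee saves the same bytes that are passed on to the sink
    "$@" | tee "$ref" | dd of="$sink" bs=10240 2>/dev/null
}

# $1 = reference copy on disk, $2 = sink to read back
# returns 0 only if the sink holds exactly the reference bytes
verify_against_copy() {
    dd if="$2" bs=10240 2>/dev/null | cmp -s - "$1"
}

# Intended use (assumption: the tape is rewound between the steps):
#   write_with_copy /scratch/ref.tar $tape rsh e4 "$cmd"
#   $mt rewind
#   verify_against_copy /scratch/ref.tar $tape || echo "tape differs"
```

If the reference copy is good and the tape read-back differs, the
damage happens on the e7 side; if the reference copy is already bad,
it happened before the stream reached 'buffer'.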

How can what I do with the stream on e7 affect what tar packs on e4?
I mean, tar must have got bad files on e4, and built correct CRCs for
these, or else the extract would see bad CRCs. Can anyone see how
a corruption AFTER the 'tar -c' can lead to the final results (bad
files without any tape or tar extract errors)?
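
The assumption above, that tar would notice damaged file data, can be
tested directly. As far as I know the tar format only stores a checksum
over each 512-byte member header and nothing over the file contents, so
a byte damaged inside the data area may well extract silently. A small
sketch to check this with the local tar; the file names are made up and
it assumes 'payload' is the first (and only) member of the archive.

```sh
#!/bin/sh
# Sketch only: damage one byte of file data inside an archive and
# see whether 'tar -x' notices.

work=$(mktemp -d) || exit 1
cd "$work" || exit 1
mkdir src out

# 4096 bytes of known payload
dd if=/dev/zero bs=512 count=8 2>/dev/null | tr '\0' 'A' > src/payload

tar -cf good.tar -C src payload
cp good.tar bad.tar

# Assumption: the member header sits at offset 0..511 and the file
# data starts at offset 512, so offset 1000 is inside the data.
printf 'Z' | dd of=bad.tar bs=1 seek=1000 conv=notrunc 2>/dev/null

tar_status=0
tar -xf bad.tar -C out || tar_status=$?

if cmp -s src/payload out/payload; then same=yes; else same=no; fi
echo "tar exit status: $tar_status, extracted file identical: $same"
```

If this reports a zero exit status and a non-identical file, then a
clean extract says nothing about the file contents, and corruption
after 'tar -c' can no longer be ruled out.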

The only possibility is corruption between the disk and tar (on e4)
or between tar and disk (on e7, and on a few other linux and win32
machines where the tape extract had exactly the same bad data).

e4 has been up for 116 days now and has had no problems; its logs
are clean.

Tape errors would cause tar CRC errors, so I should catch these.
Network problems (rsh) would cause tar CRC errors too.

All I can think of is that there is a speed difference: the tape
is a DDS2 (slow) and writing to disk is much faster (even
with the slow 10Mb network). Maybe this stall triggers something?
Then again, reading files off e4 is always faster than the network,
so rsh spends much of its time waiting for stdout to get through
e4->e7, but by then tar has already done its job...

Any chance that tar does not put CRCs in the headers when output
is to stdout?

The nature of the corruption is also interesting, here is an example:
$ cmp -l new-file old-file
333569   2  77
333570 301 123
333571  27  75
333572  35 237
333573   0 144
333574   0 236
333575   0 356
333576   0  63
333577   0 341
333578   0 254
333579   0  46
333580   0 276
333581   0 124
333582   0 175
333583   0  66
333584   0 162
333585   0 237
333586   0 113
333587   0 172
333588   0 346
333589 100 323
333590 301 254
333591  27 372
333592  35  54
333593 130 125
333594 301 174
333595  27 302
333596 335 244
333597 130 212
333598 301 257
333599  27 171
333600 335 127
333601 140 377
333602 301 257
333603  27 371
333604 335 174
333605 140 302
333606 301 235
333607  27 364
333608 335 203
333609   0 264
333610   0 267
333611   0  56
333612   0  15
333613   0 157
333614   0 165
333615   0 162
333616   0 216
333617   0 241
333618   0 107
333619   0 251
333620   0 224
333621   0 326
333622   0  31
333623   0 121
333624   0 134
333625   0  60
333626   0 233
333627   0 365
333628   0 317
333629   0 151
333630   0  61
333631   0  10
333632   0   7
333633 100   5
333634 301 157
333635  27 125
333636  35 333
333637   0 243
333638   0 241
333639   0 270
333640   0 253

The corruption is often about this size, even in large files (these
are RPMs).

This pattern of many zeroes and repeated, similar non-zero runs
is common.
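
To compare such dumps across runs it may help to boil each one down to
a few numbers. A rough sketch; summarise_cmp is a made-up name, it reads
'cmp -l' output on stdin, and it assumes the damaged file was given
first on the cmp command line, so its bytes are in the second column.
The 512 and 4096 moduli are only guesses at interesting boundaries
(tar block, memory page).

```sh
# Hypothetical helper: summarise 'cmp -l' output read from stdin.
summarise_cmp() {
    awk '
        NR == 1 { first = $1 }
        {
            last = $1
            n++
            if ($2 + 0 == 0) zeroed++     # byte is zero in the first file
        }
        END {
            if (n == 0) { print "no differences"; exit }
            start = first - 1             # cmp counts bytes from 1
            printf "bytes=%d zeroed=%d first=%d last=%d mod512=%d mod4096=%d\n",
                   n, zeroed, start, last - 1, start % 512, start % 4096
        }'
}

# Example use:
#   cmp -l new-file old-file | summarise_cmp
```

For the listing as posted above, the differences cover 72 consecutive
bytes starting at 0-based offset 333568, which is 256 bytes into a
512-byte block.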

e4 is running 2.4.18-24.7.x; the extract machine does not matter much,
as I tried a wide selection.

For the interested, here is how I reproduce the problem. Running
	rpm --checksig *.rpm | grep NOT
gives results like:

gcc-2.96-112.7.2.i386.rpm: MD5 GPG NOT OK
ghostscript-6.51-16.3.i386.rpm: MD5 GPG NOT OK
glibc-common-2.2.4-32.i386.rpm: MD5 GPG NOT OK
glibc-devel-2.2.4-32.i386.rpm: MD5 GPG NOT OK
kde-i18n-Czech-2.2.2-2.noarch.rpm: MD5 GPG NOT OK
kde-i18n-Danish-2.2.2-2.noarch.rpm: MD5 GPG NOT OK
kernel-bigmem-2.4.20-13.7.i686.rpm: MD5 GPG NOT OK
mod_ssl-2.8.12-2.i386.rpm: MD5 GPG NOT OK
php-manual-4.1.2-7.2.6.i386.rpm: MD5 GPG NOT OK
samba-2.2.7-3.7.2.i386.rpm: MD5 GPG NOT OK

The originals are clean. There are 408 RPMs in this directory,
totalling 700MB. Different files fail in different runs.
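
The same check can be done without rpm, which would also cover files
that carry no signature. A minimal sketch using md5sum; make_manifest
and check_manifest are made-up helper names, and the manifest path
passed to check_manifest has to be absolute because the function
changes directory first.

```sh
# Hypothetical helpers, not part of the backup script.

# $1 = directory; prints one "md5  ./path" line per file, sorted by path
make_manifest() {
    ( cd "$1" && find . -type f -exec md5sum {} + | sort -k 2 )
}

# $1 = directory to check, $2 = absolute path of a saved manifest
# prints only the files that do not match; returns 0 if all match
check_manifest() {
    bad=$(cd "$1" && md5sum -c "$2" 2>/dev/null | grep -v ': OK$' || true)
    if [ -z "$bad" ]; then
        return 0
    fi
    echo "$bad"
    return 1
}

# Example use: build the manifest from the originals on e4, copy it
# over, then run check_manifest on the directory extracted from tape.
```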

 -----------------------------------------------------
#!/bin/sh

# run this on e7

me="`basename $0`"
tape=/dev/nstdds2
log="/var/log/backup/test1.log"
bsize=10240
buffers=400
mtx="mtx -f $tape"
mt="mt -f $tape"

$mtx load 1 || {
        echo "$me: load tape failed"
        exit 1
}

$mt rewind
$mt setblk $bsize
$mt drvbuffer 2

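cmd="tar -cl -f - --exclude=iso --exclude=old -C /data/RedHat ."
# the exit status tested below is that of 'buffer', the last command
# in the pipeline; a failure of rsh or of the remote tar is not seen
rsh e4 "$cmd" 2>>$log \
        | buffer -o $tape -s $bsize -b $buffers -p 90 -t >>$log 2>&1 ||
{
        echo "$me: writing to tape failed"
        exit 1
}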

$mt rewind
tar xvf $tape ./7.2/updates 2>>$log

$mt rewind
$mtx unload

# now xfer files to e4 and check...

exit 0

--
Eyal Lebedinsky (eyal at eyal.emu.id.au) <http://samba.org/eyal/>