Data corruption

Mon Aug 29 18:24:08 GMT 2005

We used rsync 2.6.3 on a couple of Solaris 8 machines to update an Oracle 
database from one machine to another. Here is the procedure I used:

The source database was up and running so this operation was similar to doing a 
hot backup. I queried the source database for a list of tablespace names, and 
for each tablespace, I queried the list of datafiles. I put the tablespace in 
hot backup mode, which means that no updates are written to the datafiles; they 
will all go to the redo logs. Then I rsync'ed each datafile in that tablespace 
and took the tablespace out of hot backup mode. Repeat for the next tablespace.
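
For anyone trying to reproduce this, the loop looked roughly like the sketch below. It is a dry run that only prints each step. The function name sync_tablespace and the tablespace name ARD are made up for illustration; the rsync options and the rmthost placeholder are the ones from my command line further down, and the ALTER TABLESPACE statements would be fed to sqlplus on the source side.

```shell
#!/bin/sh
# Dry-run sketch of the per-tablespace loop: nothing is executed, each
# step is only printed. sync_tablespace and ARD are hypothetical names.
# Usage: sync_tablespace TABLESPACE DATAFILE...
sync_tablespace() {
    ts=$1
    shift
    # Put the tablespace into hot backup mode on the source database.
    echo "sqlplus: ALTER TABLESPACE ${ts} BEGIN BACKUP;"
    # Pull every datafile of that tablespace from the source host.
    for df in "$@"; do
        echo "rsync -ptgoHS --stats --rsh=/usr/bin/rsh -B 8192 --no-whole-file --inplace rmthost:${df} ${df}"
    done
    # Take the tablespace out of hot backup mode again.
    echo "sqlplus: ALTER TABLESPACE ${ts} END BACKUP;"
}

sync_tablespace ARD /c03/oradata/can/ard04.dbf /c03/oradata/can/ard05.dbf
```

In the real run the datafile list came from querying the source database per tablespace rather than from arguments typed by hand.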

Early on in this process, I discovered I had a big performance problem and after 
some experimentation I learned some important things.

Mainly, rsync was apparently defaulting to whole-file mode, which is 
different from my past experience. Previously I had always supplied directories 
as the path to rsync, whereas this time I was doing individual files. I'm 
guessing that caused a different default behavior. After I started using 
--no-whole-file and --inplace, the situation improved. For files that had few 
differences, it was quite fast. However, for files that had lots of modified 
datablocks, it was still taking much longer than an rcp would. An rcp of a 4 GB 
datafile took about seven minutes, whereas rsync with about 10% modified data 
took about half an hour, as shown:

-- > Syncing Datafile: /c03/oradata/can/ard04.dbf @ Fri Aug 26 11:46:08 EDT 2005

Number of files: 1
Number of files transferred: 1
Total file size: 4294975488 bytes
Total transferred file size: 4294975488 bytes
Literal data: 403292160 bytes
Matched data: 3891683328 bytes
File list size: 72
Total bytes sent: 4194348
Total bytes received: 405243604

sent 4194348 bytes  received 405243604 bytes  239507.43 bytes/sec
total size is 4294975488  speedup is 10.49

-- > Syncing Datafile: /c03/oradata/can/ard05.dbf @ Fri Aug 26 12:14:37 EDT 2005
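
As a sanity check on those numbers, here is a small sketch that recomputes the derived figures from the --stats output above. The variable and function names are mine; the values are copied from the log, and the block size is the 8192 I passed with -B.

```shell
#!/bin/sh
# Figures copied from the --stats output for ard04.dbf above.
total=4294975488 literal=403292160 matched=3891683328
sent=4194348 received=405243604 rate=239507.43 bs=8192

# rsync_stats is a hypothetical helper name, not part of rsync.
rsync_stats() {
    awk -v t="$total" -v l="$literal" -v m="$matched" \
        -v s="$sent" -v r="$received" -v rate="$rate" -v bs="$bs" 'BEGIN {
        wire = s + r                      # bytes that crossed the network
        printf "literal blocks:  %.0f\n", l / bs
        printf "matched blocks:  %.0f\n", m / bs
        printf "literal share:   %.2f%%\n", 100 * l / t
        printf "speedup:         %.2f\n", t / wire
        printf "elapsed minutes: %.1f\n", wire / rate / 60
    }'
}

rsync_stats
```

The elapsed time can be cross-checked against the two "Syncing Datafile" timestamps, which are 28 minutes 29 seconds apart.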

Then when we started recovery on the destination database, Oracle complained 
about block zero being corrupted on six (out of more than 330) of the datafiles 
(one at a time). All of those were small, so I just used rcp to copy them (in 
hot backup mode). I started having misgivings then, but continued the process of 
recovering the database and finally got to applying the next-to-last redo log, 
at which point Oracle barfed on block corruption in one of our big datafiles.

All of the small datafiles that had block zero corrupted had a single block 
transferred via rsync. The process of opening a database and shutting it down 
will cause an update to block zero, and these datafiles are not really used 
during day-to-day operation, so it fits that rsync copied one block. In fact, 
there are a bunch of small datafiles similarly unused that had a single block 
transferred that Oracle did not complain about.
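
One way to narrow this down would be to compare an rsync'ed copy against an rcp'ed copy of the same datafile and list which 8192-byte blocks differ. The sketch below does that with cmp and awk; diff_blocks is a name I made up, and the comparison only means something if neither file is being written to and both copies were taken under comparable conditions.

```shell
#!/bin/sh
# diff_blocks FILE1 FILE2 [BLOCKSIZE]
# Print the zero-based number of every block in which the two files differ.
# Hypothetical helper; the block size defaults to 8192 to match -B 8192.
diff_blocks() {
    # cmp -l prints one line per differing byte, starting with its
    # 1-based offset, in increasing order.
    cmp -l "$1" "$2" 2>/dev/null |
        awk -v bs="${3:-8192}" '{
            b = int(($1 - 1) / bs)        # block that holds this byte
            if (NR == 1 || b != last) print b
            last = b
        }'
}
```

For example, `diff_blocks ard04.dbf.rsync ard04.dbf.rcp` (both file names are placeholders) would print 0 if only the header block differs. Bytes past the end of the shorter file are not reported.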

Here is the command line I used:

rsync -ptgoHS --stats --rsh=/usr/bin/rsh -B 8192 --no-whole-file --inplace \
rmthost:${df} ${df}

I probably shouldn't have used -H, and I saw a bug report about it, but can't 
believe it is related to my corruption problem. Is it possible -S is involved 
somehow?

The data corruption of course makes rsync useless to me for copying databases, 
and I'm wondering now if other things I use it for are susceptible to the same 
problem.

However, even if the corruption problem is fixed, the performance of rsync on 
large datafiles with more than a few percent of modified blocks may make it not 
worth using.

Any help is appreciated.

Linus