Data corruption
Linus Hicks
lihicks at gpi.com
Mon Aug 29 18:24:08 GMT 2005
We used rsync 2.6.3 on a couple of Solaris 8 machines to update an Oracle
database from one machine to another. Here is the procedure I used:
The source database was up and running, so this operation was similar to doing a
hot backup. I queried the source database for a list of tablespace names and,
for each tablespace, queried its list of datafiles. I put the tablespace in hot
backup mode, which means that no updates are written to the datafiles; they all
go to the redo logs. Then I rsync'ed each datafile in that tablespace and took
the tablespace out of hot backup mode, then repeated this for the next tablespace.
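The per-tablespace loop above can be sketched as a dry run in Python. This is a hypothetical helper, not the script actually used: the tablespace name `ARD` and its file list are made up (the paths are borrowed from the log below), and in practice they would come from querying `DBA_TABLESPACES` and `DBA_DATA_FILES`. It only builds, in order, the `ALTER TABLESPACE ... BEGIN BACKUP` / `END BACKUP` statements (run on the source database) and the rsync commands (run on the destination host, with the options from the command line given later). It executes nothing.

```python
import shlex

# Hypothetical input: tablespace name -> datafile paths, standing in for the
# results of querying DBA_TABLESPACES and DBA_DATA_FILES on the source database.
TABLESPACES = {
    "ARD": ["/c03/oradata/can/ard04.dbf", "/c03/oradata/can/ard05.dbf"],
}

# rsync options taken from the command line quoted in this post.
RSYNC_OPTS = ["-ptgoHS", "--stats", "--rsh=/usr/bin/rsh", "-B", "8192",
              "--no-whole-file", "--inplace"]


def plan_steps(tablespaces, remote_host="rmthost"):
    """Return the ordered SQL statements and shell commands for a dry run."""
    steps = []
    for ts, datafiles in tablespaces.items():
        steps.append(f"SQL> ALTER TABLESPACE {ts} BEGIN BACKUP;")
        for df in datafiles:
            # Pull the datafile from the source host onto the same local path.
            cmd = ["rsync", *RSYNC_OPTS, f"{remote_host}:{df}", df]
            steps.append("$ " + shlex.join(cmd))
        steps.append(f"SQL> ALTER TABLESPACE {ts} END BACKUP;")
    return steps


if __name__ == "__main__":
    for step in plan_steps(TABLESPACES):
        print(step)
```

Printing the plan before running anything makes it easy to confirm that every datafile is copied while its tablespace is still in backup mode.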
Early on in this process, I discovered I had a big performance problem, and after
some experimentation I learned some important things.
Mainly, rsync was apparently defaulting to whole-file mode, which is different
from my past experience. Previously I had always supplied directories as the
paths to rsync, whereas this time I was specifying individual files. I'm
guessing that caused the different default behavior. After I started using
--no-whole-file and --inplace, the situation improved. For files with few
differences, it was quite fast. However, for files with many modified
datablocks, it still took much longer than an rcp would. An rcp of a 4 GB
datafile took about seven minutes, whereas rsync with about 10% modified data
took about half an hour, as shown:
-- > Syncing Datafile: /c03/oradata/can/ard04.dbf @ Fri Aug 26 11:46:08 EDT 2005
Number of files: 1
Number of files transferred: 1
Total file size: 4294975488 bytes
Total transferred file size: 4294975488 bytes
Literal data: 403292160 bytes
Matched data: 3891683328 bytes
File list size: 72
Total bytes sent: 4194348
Total bytes received: 405243604
sent 4194348 bytes received 405243604 bytes 239507.43 bytes/sec
total size is 4294975488 speedup is 10.49
-- > Syncing Datafile: /c03/oradata/can/ard05.dbf @ Fri Aug 26 12:14:37 EDT 2005
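The --stats figures for ard04.dbf can be cross-checked with a little arithmetic. This Python sketch only re-derives numbers from the log above: literal plus matched data equals the file size, rsync's "speedup" is the file size divided by the bytes sent plus received, and the two timestamps bound the elapsed time. The seven-minute rcp time is the approximate figure quoted earlier.

```python
from datetime import datetime

# Figures copied from the --stats output for ard04.dbf.
TOTAL_SIZE = 4294975488
LITERAL = 403292160
MATCHED = 3891683328
SENT = 4194348
RECEIVED = 405243604
BLOCK = 8192  # the -B 8192 block size from the command line

# Start time of ard04.dbf and of the next datafile, from the log lines.
START = datetime(2005, 8, 26, 11, 46, 8)
NEXT = datetime(2005, 8, 26, 12, 14, 37)

RCP_SECONDS = 7 * 60  # "about seven minutes" for an rcp of a 4 GB datafile


def summarize():
    """Re-derive the headline numbers from the --stats output."""
    assert LITERAL + MATCHED == TOTAL_SIZE
    wire_bytes = SENT + RECEIVED
    # Includes per-file overhead, so it is an upper bound on the transfer time.
    elapsed = (NEXT - START).total_seconds()
    return {
        "literal_fraction": LITERAL / TOTAL_SIZE,
        "literal_blocks": LITERAL // BLOCK,
        "total_blocks": TOTAL_SIZE // BLOCK,
        "speedup": TOTAL_SIZE / wire_bytes,
        "elapsed_s": elapsed,
        "rsync_bytes_per_s": TOTAL_SIZE / elapsed,
        "rcp_bytes_per_s": TOTAL_SIZE / RCP_SECONDS,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:.4g}")
```

It works out to about 9.4% literal data (49,230 of 524,289 8 KB blocks), a speedup of 10.49, and 1,709 seconds between the two timestamps: roughly 2.5 MB/s of file content, versus roughly 10.2 MB/s for rcp. rsync put less than a tenth of the bytes on the wire but took about four times as long.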
Then, when we started recovery on the destination database, Oracle complained
about block zero being corrupted in six (out of more than 330) of the datafiles,
one at a time. All of those were small, so I just used rcp to copy them (in hot
backup mode). I started having misgivings then, but continued recovering the
database, and when I finally got to applying the next-to-last redo log, Oracle
barfed on block corruption in one of our big datafiles.
All of the small datafiles with a corrupted block zero had a single block
transferred via rsync. Opening and shutting down a database updates block zero,
and these datafiles are not really used during day-to-day operation, so it fits
that rsync copied one block. In fact, a number of similarly unused small
datafiles also had a single block transferred, and Oracle did not complain about
those.
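To see which blocks actually differ between two copies of a datafile (for example, a source datafile and its copy, with neither open), one can hash both files in fixed-size blocks and compare the digests. This is a hypothetical Python helper, not an rsync or Oracle tool; the 8192-byte default matches the -B 8192 in the command line below and is assumed to be the database block size. For a file whose only change was to block zero, it should report `[0]`.

```python
import hashlib
import io
from itertools import zip_longest


def block_digests(f, block_size=8192):
    """Yield an MD5 digest for each block_size chunk read from binary file f."""
    while True:
        chunk = f.read(block_size)
        if not chunk:
            return
        yield hashlib.md5(chunk).digest()


def changed_blocks(f_a, f_b, block_size=8192):
    """Return indices of blocks that differ, including blocks present in only one file."""
    pairs = zip_longest(block_digests(f_a, block_size),
                        block_digests(f_b, block_size))
    return [i for i, (a, b) in enumerate(pairs) if a != b]


if __name__ == "__main__":
    # In-memory stand-ins for two 3-block files; for real datafiles, pass
    # open(path, "rb") for each file instead.
    old = io.BytesIO(b"A" * 8192 * 3)
    new = io.BytesIO(b"B" * 8192 + b"A" * 8192 * 2)
    print(changed_blocks(old, new))  # [0]
```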
Here is the command line I used:
rsync -ptgoHS --stats --rsh=/usr/bin/rsh -B 8192 --no-whole-file --inplace \
rmthost:${df} ${df}
I probably shouldn't have used -H, and I saw a bug report about it, but I can't
believe it is related to my corruption problem. Is it possible that -S is
involved somehow?
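One way -S could plausibly matter in combination with --inplace, assuming the receiving rsync implements -S by seeking over runs of zero bytes instead of writing them: in a freshly created file the skipped range reads back as zeros, but with --inplace the destination already holds the old data, so any range that is all zeros in the new version keeps its old contents. The Python sketch below models only that idea; `write_skipping_zeros` is a hypothetical simplification, not rsync's actual code.

```python
import io
import os


def write_skipping_zeros(f, data):
    """Write data at f's current position, seeking over runs of zero bytes.

    Hypothetical, simplified model of a sparse write: zero runs are skipped
    with seek() rather than written. If data ends in zeros, the last zero
    byte is written explicitly so the file still reaches its full length.
    """
    i = 0
    while i < len(data):
        j = i
        if data[i] == 0:
            while j < len(data) and data[j] == 0:
                j += 1
            if j == len(data):
                # Trailing zeros: skip all but the last byte, then write it.
                f.seek(j - i - 1, os.SEEK_CUR)
                f.write(b"\x00")
            else:
                f.seek(j - i, os.SEEK_CUR)
        else:
            while j < len(data) and data[j] != 0:
                j += 1
            f.write(data[i:j])
        i = j


def demo():
    old = b"\xff" * 32                              # previous destination contents
    new = b"\x11" * 8 + b"\x00" * 16 + b"\x22" * 8  # new contents with a zero run

    fresh = io.BytesIO()        # new, empty file: skipped range reads back as zeros
    write_skipping_zeros(fresh, new)

    in_place = io.BytesIO(old)  # existing file updated in place
    write_skipping_zeros(in_place, new)

    return new, fresh.getvalue(), in_place.getvalue()


if __name__ == "__main__":
    new, fresh, in_place = demo()
    print(fresh == new)     # True
    print(in_place == new)  # False: bytes 8..23 still hold the old 0xff data
```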
The data corruption, of course, makes rsync useless to me for copying databases,
and now I'm wondering whether other things I use it for are susceptible to the
same problem.
However, even if the corruption problem is fixed, rsync's performance on large
datafiles with more than a few percent of modified blocks may make it not worth
using.
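Part of why a delta transfer with many changed blocks can lose to a plain copy: wherever the new data does not match a block of the old file, the sending side slides a window forward one byte at a time, updating a weak rolling checksum and looking it up in a table of the receiver's block checksums, and only a weak-checksum hit is verified with a strong checksum. The Python sketch below follows the algorithm as described in the rsync technical report (Tridgell and Mackerras). It uses MD5 for the strong checksum (rsync 2.6.x used MD4) and ignores a short final block in the old file, so it shows the shape of the work, not rsync's exact behavior or cost.

```python
import hashlib

M = 1 << 16


def weak_checksum(block):
    """Weak checksum: a = sum(x_i), b = sum((n - i) * x_i), both mod 2**16."""
    n = len(block)
    a = sum(block) % M
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b


def roll(a, b, out_byte, in_byte, n):
    """Slide an n-byte window forward one byte: drop out_byte, append in_byte."""
    a = (a - out_byte + in_byte) % M
    b = (b - n * out_byte + a) % M
    return a, b


def delta(basis, target, block_size):
    """Return (literal_bytes, matched_bytes) needed to rebuild target from basis."""
    # Receiver side: weak -> strong checksums of each full block of the old file.
    table = {}
    for off in range(0, len(basis) - block_size + 1, block_size):
        blk = basis[off:off + block_size]
        table.setdefault(weak_checksum(blk), set()).add(hashlib.md5(blk).digest())

    # Sender side: jump a whole block on a match, otherwise roll one byte.
    literal = matched = 0
    i = 0
    sums = None
    while i + block_size <= len(target):
        if sums is None:
            sums = weak_checksum(target[i:i + block_size])
        strong = table.get(sums)
        if strong and hashlib.md5(target[i:i + block_size]).digest() in strong:
            matched += block_size
            i += block_size
            sums = None  # recompute from scratch after jumping a block
        else:
            if i + block_size < len(target):
                sums = roll(*sums, target[i], target[i + block_size], block_size)
            literal += 1
            i += 1
    literal += len(target) - i  # a tail shorter than one block is sent literally
    return literal, matched


if __name__ == "__main__":
    # Deterministic, non-repeating 16 KB "old file" and a copy with 10 bytes changed.
    old = b"".join(hashlib.sha256(i.to_bytes(4, "big")).digest() for i in range(512))
    new = old[:8192] + b"X" * 10 + old[8202:]
    print(delta(old, new, 4096))  # (4096, 12288)
```

Each unmatched byte costs one roll and one table lookup, so with roughly 400 MB of literal data in ard04.dbf that is on the order of 400 million such steps on the sending side, on top of reading the full 4 GB file on both ends.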
Any help is appreciated.
Linus