rsync backup performance question

jw schultz jw at pegasys.ws
Sun Jun 22 22:42:08 EST 2003


On Sun, Jun 22, 2003 at 01:59:11PM +0200, Ron Arts wrote:
> jw schultz wrote:
> >On Sun, Jun 22, 2003 at 11:42:46AM +0200, Ron Arts wrote:
> >
> >>Dear all,
> >>
> >>I am implementing a backup system, where thousands of PostgreSQL
> >>databases (max 1 GB in size) on as many clients need to be backed
> >>up nightly across ISDN lines.
> >>
> >>Because of the limited bandwidth, rsync is the prime candidate of
> >>course.
> >
> >
> >Only if you are updating an existing file on the backup
> >server with sufficient commonality from one version to the
> >next.  pg_dump --format=t is good.  Avoid the built-in
> >compression in pg_dump as it defeats rsync.
> 
> Restore time is significant, so I think I need a straight mirror
> of the database files on the client. I think importing
> a multi-gigabyte SQL dump will take too long for us (one hour
> is the limit). I have not tried that yet with PostgreSQL, though.

Try doing a dump-restore test before you make the decision.
The dumps are a lot smaller and more compressible than the
database files, and you don't need to shut down the database
to do them.  A file-level backup, by contrast, requires the
database to be completely shut down.

> > gzip with the
> >rsyncable patch and bzip2 are OK if you must compress.
> >
> 
> So unpatched bzip2 is OK?  Nice to know.
> Maybe I can tar an LVM snapshot and bzip2 that
> before rsyncing.  Thanks for that one.

bzip2 should be OK because the encoding of each block is
independent of the other blocks.  Of course the block size
is a bit on the large side (a minimum of 100 KB), and that
will diminish the effectiveness of rsync.  I suspect that if
a single byte anywhere in a bzip2 block is changed, rsync may
fail to find any matching rsync blocks within it.  Given the
propensity of databases to change apparently random blocks,
bzip2 may be inappropriate.  Only testing and a detailed
understanding of bzip2 internals will tell.
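
A quick, untested experiment along these lines might give a feel
for it (the stand-in data and the single-byte change are arbitrary,
and prefix divergence is only a crude proxy, since rsync's rolling
checksum can also match blocks that have merely shifted position):

import bz2, random

random.seed(0)
data = bytearray(random.choice(b"abcdefgh") for _ in range(1024 * 1024))
before = bz2.compress(bytes(data))          # default level 9 = 900 KB blocks

data[500000] ^= 0x01                        # change a single byte mid-file
after = bz2.compress(bytes(data))

same_prefix = 0
for x, y in zip(before, after):
    if x != y:
        break
    same_prefix += 1

print("compressed: %d -> %d bytes, unchanged prefix: %d bytes"
      % (len(before), len(after), same_prefix))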

Be careful that the filesystem caches are flushed if you go
the LVM snapshot route, or you will wind up with inconsistent
tablespaces.
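
If you do go that route, the ordering might look roughly like this
(untested; the volume group, LV and mount point names are invented,
and the tar/bzip2 step is reduced to plain tar):

import subprocess

def run(*cmd):
    subprocess.check_call(list(cmd))

run("pg_ctl", "stop", "-D", "/var/lib/pgsql/data", "-m", "fast")   # quiesce tablespaces
run("sync")                                                        # flush filesystem caches
run("lvcreate", "--snapshot", "--size", "2G",
    "--name", "pgsnap", "/dev/vg0/pgdata")                         # invented VG/LV names
run("pg_ctl", "start", "-D", "/var/lib/pgsql/data")                # database down only briefly

run("mount", "/dev/vg0/pgsnap", "/mnt/pgsnap")
run("tar", "-cf", "/var/backups/pgdata.tar", "-C", "/mnt/pgsnap", ".")
run("umount", "/mnt/pgsnap")
run("lvremove", "-f", "/dev/vg0/pgsnap")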

> 
> >The other issue is individual file size.  Rsync versions
> >prior to what is in CVS start having some performance issues
> >with files larger than the 200-500MB range.  
> >
> 
> I'll keep that in mind.
> 
> >
> >>Potential problems I see are server load (I/O and CPU), and filesystem 
> >>limits.
> >
> >
> >Most of the load is on the sender.  Over ISDN, even with
> >rsync compressing the datastream, no single update should be a
> >CPU or I/O issue.  The issue is scheduling so you don't have too
> >many running simultaneously.
> >
> 
> As I understand the algorithm, the server creates a list of checksums
> (which is around 1% of the size of the original file), which is not really
> CPU intensive, sends that to the client, and then the client does a lot
> of work finding blocks that are the same as in the server's file.
> 
> So the server at least reads every file in the rsync tree completely,
> am I correct? In my case that means a lot of disk I/O,
> given the total size of all databases (multiple TBs).
> 
> Please correct me if I'm wrong.

You have a couple of points wrong.  The receiver generates
the block checksums.  If you are pushing, that is the backup
server; if you are pulling, it is the client.  In 2.5.6
and earlier the transmitted block checksums are 6 bytes per
block with a default block size of 700 bytes, so just under
1% of file size.  Unless you have a slow CPU, the block
checksum generation will be I/O bound.
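
For a 1 GB file with those defaults the arithmetic works out
roughly like this:

file_size  = 1 * 1024**3        # a 1 GB database file
block_size = 700                # default block size in rsync <= 2.5.6
per_block  = 6                  # transmitted checksum bytes per block

blocks = -(-file_size // block_size)          # ceiling division
checksum_bytes = blocks * per_block

print("blocks: %d, checksum data: %.1f MB (%.2f%% of the file)"
      % (blocks, checksum_bytes / 1024.0**2, 100.0 * checksum_bytes / file_size))
# roughly 1.5 million blocks, about 8.8 MB, i.e. ~0.86% of the file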

The only files that are opened are those whose metadata
indicate the contents have changed.  In those cases you do
have a lot of disk I/O.  For database backups that will
probably amount to every file.

The sender does only one read pass on each changed file.
The receiver does a read pass to generate the block checksums
and later re-reads the unchanged (possibly relocated) blocks
as it merges them with the changed data to write the new
file.  Several files may be in process at any given time, so
the cache capacity of the receiver has a significant impact
on performance.
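
In much-simplified terms (this is an illustration, not rsync's
actual code), the receiver's second pass amounts to something like:

BLOCK = 700   # pretend block size

def rebuild(basis_path, new_path, delta):
    """delta: list of ('copy', block_index) or ('data', literal_bytes) items."""
    with open(basis_path, "rb") as basis, open(new_path, "wb") as out:
        for kind, value in delta:
            if kind == "copy":
                basis.seek(value * BLOCK)      # re-read an unchanged block
                out.write(basis.read(BLOCK))
            else:
                out.write(value)               # literal data from the sender

# e.g. rebuild("old.db", "new.db",
#              [("copy", 0), ("data", b"changed bytes"), ("copy", 2)])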

> 
> >The easiest way to manage the scheduling is to have the
> >server pull.  If that isn't possible then you will need to
> >use an rsync wrapper that keeps the simultaneous runs within
> >limits or put a good deal of smarts into the clients.
> >
> 
> Yeah, pulling is out of the question, because the server can't
> activate the ISDN link. The clients' rsync start times will need
> to be hashed across the night.

I'd favour a wrapper over depending on hashing the start
times.  An alternate approach might be to have the clients
open the connection with port forwarding, write a queue file,
and wait for a completion indicator before closing the
connection.  The server could then pull, using the queue
files to identify waiting clients.  While a bit more
complicated, it avoids the temporal gaps caused by the
fallback-sleep-retry of the wrappers.
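
One possible, untested shape for that queue-file variant (the
directory layout, rsync module name and port handling are all
assumptions):

import os, subprocess
from concurrent.futures import ThreadPoolExecutor

QUEUE_DIR      = "/var/spool/rsync-queue"   # one queue file per waiting client (assumed)
MAX_CONCURRENT = 4                          # keep the server well away from thrashing

def pull(client):
    path = os.path.join(QUEUE_DIR, client)
    with open(path) as f:
        port = int(f.read().strip())        # port the client forwarded its rsyncd to
    subprocess.call(["rsync", "-a", "--partial",
                     "rsync://localhost:%d/pgbackup/" % port,
                     "/srv/backups/%s/" % client])
    os.unlink(path)                         # removal is the completion indicator

with ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
    for client in sorted(os.listdir(QUEUE_DIR)):
        pool.submit(pull, client)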

The last thing you want is to thrash the server or cause an
OOM condition.  If at all possible you will want to avoid
paging on the server; the instant you start thrashing,
filesystem cache performance will shrivel.


-- 
________________________________________________________________
	J.W. Schultz            Pegasystems Technologies
	email address:		jw at pegasys.ws

		Remember Cernan and Schmitt


