rsync backup performance question

Ron Arts raarts at netland.nl
Sun Jun 22 21:59:11 EST 2003


jw schultz wrote:
> On Sun, Jun 22, 2003 at 11:42:46AM +0200, Ron Arts wrote:
> 
>>Dear all,
>>
>>I am implementing a backup system, where thousands of PostgreSQL
>>databases (max 1 GB in size) on as many clients need to be backed
>>up nightly across ISDN lines.
>>
>>Because of the limited bandwidth, rsync is the prime candidate of
>>course.
> 
> 
> Only if you are updating an existing file on the backup
> server with sufficient commonality from one version to the
> next.  pg_dump --format=t would be good.  Avoid the built-in
> compression in pg_dump as it defeats rsync.  

Restore time is significant, so I think I need a straight mirror
of the database files on the client. I think importing
a multi-gigabyte SQL dump will take too long for us (one hour
is the limit). I have not tried that yet on PostgreSQL, though.

> gzip with the
> rsyncable patch and bzip2 are OK if you must compress.
> 

So unpatched bzip2 is OK? Nice to know.
Maybe I can tar an LVM snapshot and bzip2 it
before rsyncing. Thanks for that one.
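
To see why stream compression can hurt, here is a small Python sketch.
It is only an illustration under stated assumptions: the "dump" is
made-up text rows, zlib stands in for gzip-style compression (it says
nothing about bzip2 or the rsyncable patch), and the 1024-byte block
size and the helper names are hypothetical. It counts how many blocks
of the old file still occur somewhere in the new file, which is roughly
what rsync can reuse.

```python
import zlib

BLOCK = 1024  # illustrative block size for this demo only


def reusable_fraction(old: bytes, new: bytes) -> float:
    """Fraction of old's full blocks that occur somewhere in new."""
    blocks = [old[i:i + BLOCK] for i in range(0, len(old) - BLOCK + 1, BLOCK)]
    found = sum(1 for block in blocks if new.find(block) != -1)
    return found / len(blocks)


def demo():
    # Synthetic "dump": text rows with unique ids (made-up data).
    old = b"".join(b"%d,customer%d,%d\n" % (i, i % 97, (i * 7919) % 10007)
                   for i in range(8000))
    changed = bytearray(old)
    changed[50000:50016] = b"X" * 16  # one small, same-length change
    new = bytes(changed)

    plain = reusable_fraction(old, new)
    packed = reusable_fraction(zlib.compress(old), zlib.compress(new))
    return plain, packed


if __name__ == "__main__":
    plain, packed = demo()
    print("reusable blocks, uncompressed: %.1f%%" % (100 * plain))
    print("reusable blocks, zlib output:  %.1f%%" % (100 * packed))
```

Uncompressed, only the block containing the change is lost. In the
compressed output a smaller share of blocks is reusable, because the
compressed bytes after the change point generally no longer line up
with the old ones.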

> The other issue is individual file size.  Rsync versions
> prior to what is in CVS start having some performance issues
> with files larger than the 200-500MB range.  
> 

I'll keep that in mind.

> 
>>Potential problems I see are server load (I/O and CPU), and filesystem 
>>limits.
> 
> 
> Most of the load is on the sender.  Over ISDN, even with
> rsync compressing the datastream, no single update should be a CPU
> or I/O issue.  The issue is scheduling so you don't have too
> many running simultaneously.
> 

As I understand the algorithm, the server creates a list of checksums
(around 1% of the size of the original file), which is not really
CPU intensive, and sends it to the client; the client then does a lot
of work finding blocks that are the same as in the server's file.
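
Rough arithmetic for that "1%" figure, as a sketch only. The
assumptions are illustrative, not measured: a single 64 kbit/s ISDN
B channel, no protocol overhead, no compression, and 1 GB taken as
10**9 bytes. The function name is hypothetical.

```python
def transfer_seconds(n_bytes: int, kbit_per_s: float = 64.0) -> float:
    """Time to push n_bytes over a link of kbit_per_s (1 kbit = 1000 bits)."""
    return n_bytes * 8 / (kbit_per_s * 1000)


GB = 10**9  # decimal gigabyte, an assumption for this estimate

full_copy = transfer_seconds(GB)          # whole 1 GB file
checksum_list = transfer_seconds(GB // 100)  # a checksum list of ~1%

if __name__ == "__main__":
    print("full copy:     %.0f s (%.1f hours)" % (full_copy, full_copy / 3600))
    print("checksum list: %.0f s (%.1f minutes)"
          % (checksum_list, checksum_list / 60))
    # full copy:     125000 s (34.7 hours)
    # checksum list: 1250 s (20.8 minutes)
```

So under these assumptions a full copy of one 1 GB database does not
fit in a night, and even the checksum list alone takes a noticeable
share of the link time.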

So the server at least reads completely every file that is in the
rsync tree, am I correct? In my case that means a lot of disk I/O,
given the total size of all databases (multiple TB).

Please correct me if I'm wrong.
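
To make the description above concrete, here is a minimal Python
sketch of the exchange. It is a simplification, not rsync's actual
code: the 700-byte block size is illustrative, MD5 is swapped in as the
strong checksum, only full blocks are signed, and the function names
are hypothetical. The side holding the old file builds the checksum
list; the side holding the new file scans it with a rolling checksum.

```python
import hashlib

BLOCK = 700      # illustrative block size
MOD = 1 << 16


def weak(data):
    """Adler-style weak checksum: two 16-bit sums over one block."""
    a = sum(data) % MOD
    b = sum((len(data) - i) * x for i, x in enumerate(data)) % MOD
    return a, b


def roll(a, b, out_byte, in_byte, n):
    """Slide an n-byte window one byte to the right in O(1)."""
    a = (a - out_byte + in_byte) % MOD
    b = (b - n * out_byte + a) % MOD
    return a, b


def signature(old):
    """Holder of the old file: weak checksum -> [(strong digest, index)]."""
    sig = {}
    for idx in range(len(old) // BLOCK):
        block = old[idx * BLOCK:(idx + 1) * BLOCK]
        sig.setdefault(weak(block), []).append(
            (hashlib.md5(block).digest(), idx))
    return sig


def delta(new, sig):
    """Holder of the new file: emit ('copy', index) or ('data', bytes)."""
    tokens, literal = [], bytearray()
    pos, fresh = 0, True
    while pos + BLOCK <= len(new):
        if fresh:
            a, b = weak(new[pos:pos + BLOCK])
            fresh = False
        hit = None
        candidates = sig.get((a, b))
        if candidates:
            strong = hashlib.md5(new[pos:pos + BLOCK]).digest()
            hit = next((i for d, i in candidates if d == strong), None)
        if hit is not None:
            if literal:
                tokens.append(("data", bytes(literal)))
                literal.clear()
            tokens.append(("copy", hit))
            pos += BLOCK
            fresh = True
        else:
            literal.append(new[pos])
            if pos + BLOCK < len(new):
                a, b = roll(a, b, new[pos], new[pos + BLOCK], BLOCK)
            pos += 1
    literal.extend(new[pos:])
    if literal:
        tokens.append(("data", bytes(literal)))
    return tokens


def patch(old, tokens):
    """Holder of the old file: rebuild the new file from the tokens."""
    out = bytearray()
    for kind, val in tokens:
        out += old[val * BLOCK:(val + 1) * BLOCK] if kind == "copy" else val
    return bytes(out)
```

In this sketch both sides read their copy of the file once from start
to end, so the disk I/O concern holds for every file that is actually
compared. By default rsync skips files whose size and modification
time are unchanged, but live database files will rarely qualify.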

> The easiest way to manage the scheduling is to have the
> server pull.  If that isn't possible then you will need to
> use an rsync wrapper that keeps the simultaneous runs within
> limits or put a good deal of smarts into the clients.
> 

Yeah, pulling is out of the question, because the server can't
activate the ISDN link. The clients' rsync start times will need
to be hashed across the night.
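
A minimal sketch of what I mean by hashing the start times, in Python.
The client identifier, the 22:00 window start and the 8-hour window
length are all hypothetical values chosen for illustration; each client
would compute its own offset, so no coordination with the server is
needed.

```python
import hashlib


def start_offset_minutes(client_id: str, window_minutes: int = 480) -> int:
    """Map a client identifier to a stable minute offset inside the window."""
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % window_minutes


def start_time(client_id: str, window_start_hour: int = 22,
               window_minutes: int = 480) -> str:
    """Wall-clock start time (HH:MM) for this client's nightly rsync."""
    offset = start_offset_minutes(client_id, window_minutes)
    total = (window_start_hour * 60 + offset) % (24 * 60)
    return "%02d:%02d" % divmod(total, 60)


if __name__ == "__main__":
    for name in ("client0001", "client0002", "client0003"):
        print(name, start_time(name))
```

A hash only spreads the starts statistically. It does not cap the
number of simultaneous runs, so a limit on the server side would still
be needed.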

> 
>>Does anyone have experience with such setups?
> 
> 
> Unlikely on that scale over that sort of link.
> 
> I'd suggest experimenting with -v and the --stats options turned on.
> 

I will, thanks.

Ron

-- 
Netland Internet Services
bedrijfsmatige internetoplossingen

http://www.netland.nl   Kruislaan 419              1098 VA Amsterdam
info: 020-5628282       servicedesk: 020-5628280   fax: 020-5628281

Useless Invention: Leather cutlery.