--fuzzy question

Julian Pace Ross linux at prisma.com.mt
Thu May 21 12:34:35 GMT 2009


Excellent... thanks again Ryan. I actually managed to get the sync time down
to a couple of minutes on a 300MB database using --fuzzy.

However, your explanation made me realise why one 10GB uncompressed
database.bak file (MSSQL) was not yielding any block matches at all... I
contacted the admin for this db and, surprise surprise, he insists on
reindexing every day... so that must be what was making me scratch my head
on that one. I'll try playing around with the block sizes for it, but I'm
not sure I'll manage to make it any better...
I tried using rsyncrypto on that file; its gzip --rsyncable pipe brought
the size down to 900MB... (still no matches, of course).
WinRAR compressed it to 500MB (5% of the original size!), so up to now I've
been WinRAR-ing and updating the whole thing, which turned out to be the
fastest approach, even though all potential block matches are lost of course.

Anyway thanks again for your very useful suggestions.

2009/5/20 Ryan Malayter <malayter at gmail.com>

> On Wed, May 20, 2009 at 2:26 AM, Julian Pace Ross <linux at prisma.com.mt>
> wrote:
> > Thanks Ryan!
> > In fact I found it's a combination of factors you mentioned... i.e. a
> > compressed SQL .bak file, so contrary to what I thought, the fuzzy file
> was
> > indeed being found but no matches were being found in the file... thanks
> > again for the info.
>
> If you have the disk space at both ends, I would suggest doing what I
> do for SQL backup synchronization.
>
> 1) Write *uncompressed* .bak files for your databases (with timestamps
> in the file name, such as those produced by the database maintenance
> plan engine). This enables the use of --fuzzy, as you have discovered.
> 2) Use rsync to transfer the uncompressed files, but with the -z
> option enabled. This compresses the data over the wire, but
> decompresses it at the receiving end.
> 3) Adjust the rsync block size to something smaller if necessary to
> find more matches. I basically went down to 32KB rsync blocks for one
> 15 GB database file (rsync would by default use something like 129KB
> on a file this big). This eats up a lot more CPU, but if rsync can
> still output data faster than your network connection can handle, it
> is the most time-efficient way to go. Use multiples of 8KB, as that is
> the internal page size inherent in MS SQL Server databases. Trial and
> error is your friend here. Run rsync with low priority (START /LOW
> rsync.exe) so the CPU usage doesn't impact SQL Server.
> 4) Minimize any jobs you have to automatically rebuild indexes. Use
> UPDATE STATISTICS instead on a daily basis, and rebuild only when
> index fragmentation gets heavy. There are lots of scripts out there on
> the net which will automate that for you.
> 5) Minimize the rebuilds of denormalized "reporting" tables or other
> non-essential data. Move these off into other databases that you don't
> replicate if possible.
> 6) Watch out for non-sequential clustered indexes. We use GUIDs for
> primary keys on many tables, and this causes updates and inserts to be
> spread randomly throughout the table as it is physically stored. Even
> changing just 5% of the data can result in a change to every database
> page in such a scenario. Hot tables which use emails or other VARCHAR
> fields as clustered index keys also result in similar behavior.
>
> Most of these suggestions would apply for rsyncing any sort of
> database backup file... Exchange, PostgreSQL, Oracle, or even
> (horror!) MySQL.
>
>
> --
> RPM
> --
> Please use reply-all for most replies to avoid omitting the mailing list.
> To unsubscribe or change options:
> https://lists.samba.org/mailman/listinfo/rsync
> Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html
>