Large file - match process taking days

Sat Aug 2 15:18:50 GMT 2008

I believe I've figured out why the process was taking so long...or at least
have a theory.  In the end it appears that much of the file was being sent
even though the "true" amount of changed data was less than 7% of the
file size.

Exchange uses a database page size of 4K.  Many times a page is deleted and
then new data is written to that page (delete a message, new message
arrives).  Exchange will try to keep the data file size constant by reusing
freed up space and it will do online "defragmentation" nightly by default.
Defragmentation might be the wrong term because online defragmentation
really "makes additional database space available by detecting and removing
database objects that are no longer being used."

Although only 7% of the file is changing, the number of changed data pages
would approach 1.5 million.  In all likelihood, these pages are spread
throughout the file.

So the usual approach of increasing the blocksize to speed up processing of
a large file actually makes rsync perform worse here.  A change in a single
4K data page (a likely occurrence) causes the entire block containing it to
be sent.  This is what I was seeing in the earlier tests: increasing the
blocksize decreased performance because more data was sent.
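
To put a rough number on this effect, here is a small Python sketch.  It is
only a model, not a measurement: it assumes changed pages are scattered
independently and uniformly through the file, and that blocks are aligned
to 4K page boundaries.  The function name and the list of blocksizes are
made up for illustration.

```python
PAGE_SIZE = 4096  # Exchange database page size as stated in the post


def expected_fraction_sent(block_size, changed_fraction, page_size=PAGE_SIZE):
    """Estimate the fraction of blocks holding at least one changed page.

    Assumption (not from the post): each page changes independently with
    probability `changed_fraction`, and block_size is a multiple of page_size.
    A block must be resent if any one of its pages changed.
    """
    pages_per_block = max(1, block_size // page_size)
    return 1.0 - (1.0 - changed_fraction) ** pages_per_block


if __name__ == "__main__":
    # Hypothetical blocksizes, all multiples of 4K.
    for block_size in (4096, 65536, 262144, 1048576):
        frac = expected_fraction_sent(block_size, 0.07)
        print(f"{block_size:>8} bytes/block: ~{frac:.1%} of the file resent")
```

Under these assumptions the fraction resent climbs quickly as the blocksize
grows.  The model ignores the cost of the extra checksums that a smaller
blocksize requires, so it does not by itself say which blocksize is best.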

When I changed the blocksize to be close to the default of sqrt(filesize)
but rounded down to a multiple of 4K, rsync performance was much better.  The
performance of a 4K-rounded blocksize is better than the default (in this
case, 262144).
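
The rounding described above can be sketched in Python as follows.  This is
only my reading of "sqrt(filesize) rounded down to a multiple of 4K"; the
function name and the example file sizes are hypothetical, and rsync's own
default calculation may differ in detail.

```python
import math

PAGE_SIZE = 4096  # Exchange database page size as stated in the post


def page_aligned_block_size(file_size, page_size=PAGE_SIZE):
    """Return floor(sqrt(file_size)) rounded down to a multiple of page_size.

    Never returns less than one page, so small files still get a usable value.
    """
    root = math.isqrt(file_size)           # integer square root, no float error
    aligned = (root // page_size) * page_size
    return max(page_size, aligned)


if __name__ == "__main__":
    # Hypothetical file sizes for illustration only.
    for gib in (10, 50, 64):
        size = gib * 1024 ** 3
        print(f"{gib} GiB -> --block-size={page_aligned_block_size(size)}")
```

The result would then be passed to rsync with its --block-size option (also
spelled -B), for example "rsync --block-size=229376 SRC DEST" for a 50 GiB
file.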

I'm continuing to test to find the "best" blocksize for these types of
files.  I'm just sending this info for future reference for those using
rsync for large Exchange files or other database files.

Rob