Rsync friendly zlib/gzip compression - revisited

Sun Feb 13 00:37:00 GMT 2005

Hi, all - 

I've been digging into the question of using rsync to sync up compressed data, and read with interest the conversations regarding the rsync friendly patch available for gzip (I belive Rusty provided this patch).  For anyone interested, the message thread is archived at http://lists.samba.org/archive/rsync/2002-June/thread.html.  The broad outline of this approach were mentioned in Andrew Tridgell's PhD thesis, and explained in more depth in a transcript of a talk he gave.

My initial thought on reading this was "no way this will work" - then I dug in, and convinced myself that it would indeed work.  After testing Rusty's patch, I am extremely convinced that the approach works - very elegent.

At that point, I started to consider the question raised in the discussion threads by Kevin Easton - primarily:  are the checksum and compression block trigger selected by Rusty optimal?

To refresh memories:  Rusty's checksum algorithm is to compute the sum of the previous 4096 bytes of source data, and his block trigger kicks in when the checksum is divisible by 4096 (i.e. SUM mod 4096 == 0).

Because the value of 4096 prior bytes seemed fairly arbitrary, and because modding by the same value as the window size seemed somewhat arbitrary, I decided to do some parametric tests with a couple of different checksum strategies, window sizes and block trigger mask sizes.

The following is pretty long - I apologize for making my first post on the topic so extensive - but the results are interesting, and I figured I'd lay out everything that I have so far to see if anyone a) is interested or b) has any ideas of directions to pursue...

Test Description
==========

Here are some definitions:

SUM = the value of the rolling checksum at a given point in the input stream
Window size = the number of bytes in the input stream that are considered for the rolling check sum
Block mask size = The size of value used in checking the current SUM to see if the current compression block should be wrapped up
Compression block size = The number of bytes from the input stream between checksum triggers

I decided to test using two forms of rolling checksum:

1.  Rusty's strategy of just adding the values from the input stream
2.  Andrew's rolling checksum that is used in the rsync algorithm itself

For starters (I'm still working on processing tests on a larger data set), I chose a 28 MB database file (dBase file) - fixed record size, lot's of compression opportunities.

I then constructed a parametric study to see the impact of window size, block mask size and algorithm based on the database file as an input file.  The output of the tests is the average compression block size, and the standard deviation of the compression block size.

For the initial tests here (more follow!), this is run in a test harness that just computes effective block sizes based on the two checksum algorithms - it is NOT running the patched zlib/gzip code directly (I do that testing later on).  The results should help us understand the dependencies between different parameters, which will help determine how to pursue future testing.

My thinking here is as follows:

Average compressed block size
------------------------------------
The average compression block size is going to pretty much dictate the overall effectiveness of the zlib/gzip underlying compression algorithm.  Ideally, we want to have a control that we can tune to adjust this value so we get decent performance from zlib, but not have blocks that are too big (which would reduce the effectiveness of the rsync operation itself).  The zlib algorithm's performance is non-linear with respect to the compression block size, so finding this balance point is particularly important from an optimization perspective.  In general, I find that block sizes around 4000 - 5000 seem to be about optimal - anything below that, and the compression drops off.  Anything above, and the compression doesn't improve much.

Std deviation of compressed block size
--------------------------------------------
I initially thought that the standard deviation of the compression block size would be important because of the non-linear behavior of the zlib compression algorithm with respect to block size.  If our standard deviation is too high, then we might have a great average block size, but most blocks will be significantly above or below the average, so the effective performance of both the compression and rsync pieces of the problem will be far from optimal.

Another way of putting this:  If I say that there isn't a good deal of improvement in compression ratio for compression block sizes above 4000 bytes, then I would optimally choose my checksum algorithm and parameters so that 4000 = (Xavg - Xsigma).  If I have a small Xsigma, that allows me to have a smaller Xavg.  And smaller Xavg *should* translate into better rsync performance.  I have NOT tested this yet, but it is certainly an obvious next step.

There is one more consideration:  If a small change is made to a file, what are the chances of that change messing up 2 checksum blocks instead of 1?  In straghtforward rsync, messing up one or two blocks is not that big of a deal.  Once the data is compressed, I'm thinking that it would tend to be amplified.  I haven't done any statistics on this, and I could be compeltely off base - but my initial intuition is that having a smaller window size would lead to better ultimate rsync performance...  I would be extremely interested in anyone else's thoughts on this point...

Results
=====

So - on to the testing:

For those of you who like making pretty graphs (actually, graphing this really helps to hilight the trends), here are some numbers from my tests (I apologize in advance for the formatting on these tables - fighting with formatting in an ASCII file is no fun - please bear with me!):

The following data is for window size = 1298

[Block Mask Size] [Avg Block Size (R )] [Std Dev Block Size (R)] [Ratio (R )] [Avg Block Size (S )] [Std Dev Block Size (S)] [Ratio (S)]
1024 2378 1173 0.49 2551 1393 0.55
1524 2767 1644 0.59 3049 1946 0.64
2024 3163 2039 0.64 3708 2639 0.71
2524 3783 2680 0.71 4354 3219 0.74
3024 4226 3151 0.75 4950 3928 0.79
3524 4535 3464 0.76 5598 4594 0.82
4024 5030 3824 0.76 6257 5190 0.83

Discussion of results
=============
My initial results (though far from complete), show some interesting things:

1.  For small window sizes (< 1000 bytes), the behavior of the simple addition checksum is quite nasty.  The block size standard deviation is huge and the average block size is all over the place.  As the window size grows above ~1000 bytes, things settly out and behave themselves much better.

In contrast, the rsync rolling checksum was very well behaved, even down to window sizes of 25.  The standard deviations for the rsync rolling checksum are a bit high (150% of the average block size), but taper off to a constant value of ~100% of the average block size when the window is > 100 bytes.

For the rest of the analysis, I've selected 1298 as the window size, although I could have used a value as low as 100 for the RSync checksum and still gotten the same results.

2.  Once you have sufficient window size, it stops becoming a parameter for consideration, and the Block Mask Size becomes the "dial" that allows you to control both the average compression block size and the standard deviation.  The correlation between the two is pretty much linear, with the RSync style checksum having a more controlled result.

3.  In general, the Rsync checksum had a lower average block size and a lower block size standard deviation for a given block mask.  The overall trend of the ratio between the average and standard deviation is pretty much independent of algorithm selected, but there is an approx 5-10% increase in standard deviation as a percentage of average block size for the simple chechsum calculation relative to the rsync checksum.

4.  The large scale behavior of the ratio between average and standard deviation is definitely dependent on Block Mask Size.  Higher mask values result in larger standard deviation - but is non-linear, and appears to taper off to about 85% for the simple checksum and 75% for the rsync checksum.

Given that we want to lower the standard deviation, this result implies that the block mask size should be selected to be as low as possible.

5.  Earlier, I mentioned that the value that probably impacts the performance of the compression algorithm the most is going to be (Xavg - Xstddev).  If we look at that value as a function of BlockMaskSize, we find something quite interesting - there is *almost* no dependency on Block Mask Size!  The effects of the increase in standard deviation pretty much wipe out any advantage of the increased average.  For example, for the RSync checksum, varying the block mask size from 1024 to 4024 causes the (Xavg - Xstddev) value to vary between 1200 to 1070 - not much!

If we look at (Xavg - Xstddev) as a function of window size, we find that there is no dependency at all once we go above some value of window size (the value of that cutoff is dependent on the checksum algorithm selected)

The conclusion here (it turns out that this conclusion is wrong - see below) is that changing the window size and block mask size probably will not have a significant effect on the performance of the compression algorithm, as long as the window size is above the point where the results are statistically unpredictable.  For the RSync checksum, this point is approximately 100 bytes.  For the simple checksum, this point is approximately 500 bytes.  

Here's the data crunched to show this (values marked #VALUE! are where the std deviation was too big to calculate using the 64 bit integer math in my test application):

[WinSize] [Xavg-Xstdev (R )] [Xavg-Xstdev (S )]
25 791.3509857 #VALUE!
30 854.7655263 #VALUE!
36 847.0310926 #VALUE!
43 963.7640702 #VALUE!
51 1013.541078 #VALUE!
61 997.480154 1301.475001
73 1036.707541 1624.38203
87 1068.007719 #VALUE!
104 1082.094327 #VALUE!
124 1113.082926 1084.712182
148 1113.28381 1137.304064
177 1106.720093 2215.686195
212 1116.509118 962.0990605
254 1113.412682 #VALUE!
304 1132.536022 1019.862558
364 1118.921432 991.9188742
436 1104.036368 1062.314435
523 1100.639225 1098.862742
627 1112.34673 1024.872196
752 1127.374869 1104.576236
902 1107.316805 1093.34639
1082 1078.665537 1104.471305
1298 1129.472404 1079.824598
1557 1106.018974 1060.80879
1868 1117.813407 1108.882505
2241 1120.976969 1101.694318
2689 1114.43462 1143.649274
3226 1106.447702 1121.075456
3871 1122.427955 1076.250929

This leads to one obvious test:  Try Rusty's algorithm, and vary the block size and window size , and see how the compression ratio is effected.  If the conclusion above is correct, then messing with the window size should have no impact at all, and messing with the block size should actually hurt the compression ratio a little bit with increasing block size.

I was quite surprised to find that the results of that test did not line up with my expectations.  Actually running the data through the zlib compression algorithm (adjusted to use the simple checksum calculation and allow for varying window size and block size independently), gives the following results:

1.  As expected, the compression ratio is independent of the window size, as long as the window size is above the minimum critical size for the the checksum algorithm.

2.  However, the compression ratio IS dependent on block mask size.  If we choose a window size of 450 (anything above 350 will do), we see that the compression ratio varies from 4.5 to 4.9 as the mask size increases from 1000 to 4000.  Masks above 4000 do not appear to have much of an impact on effective compression ratio.  I suspect that this test result is what led Rusty to choose the window/block size of 4096 in his patch.

The reason in the difference of #2 with my original expectations can probably be attributed to my lack of understanding of exactly how the zlib algorithms work.

RSync testing
========

Given the above results, I thought it would be interesting to test my hypothesis about the effect of window size on the efficiency of the rsync alogorithm.

I ran a suite of tests as follows:

1.  Original data used is the same dBase database file as for the prior tests
2.  Modified data is the original data with some miscellaneous changes throughout.  To give a feel for the extent of the changes, running the rsync algorithm against the uncompressed data in these two files with an RSBlock size of 1000 gives a speedup of approximately 4000:1 - so not a huge amount of changes.

I then ran a series of tests where I compute the speed up for a set of different rsync block sizes and window sizes.  In my initial testing, I adjusted the window size and block mask size in lock step (i.e. using Rusty's original algorithm).

Results
=====
I won't spend a lot of time looking at the effect of rsync block size on the speed up - that has been studied pretty well already, but just so we have the data points, here are the results assuming a fixed compression window size of 4000 (which is pretty close to Rusty's rsyncable patch implementation):

[RS Block Size] [Speed Up]
500 194.49103
1000 179.82353
1500 164.90042
2000 169.59175
2500 154.23415

For the rest of this analysis, I'm going to set the rsync block size to 1000 and leave it there.

If we look at RSync performance as a function of window size, we find, as expected, that the rsync performance decreases.  The curve isn't smooth (the compression algorithm causes all sorts of havoc here), but the general trend is clear:  Increasing the compressed window size and block mask size = decreased rsync performance.  What isn't clear (because I varied the window size and block mask size together), is which of the two actually is the driving force (window size or block mask size).  My initial results with fiddling with the compression performance indicate that it should be the block mask size that matters, and the window size shouldn't matter.

Here are the test results:

[Window AND block mask Size] [Speed Up]
100 335.6063
600 335.56348
1100 264.708
1600 273.0129
2100 220.42537
2600 226.33165
3000 219.47977
3500 225.90854
4000 179.82353
4500 195.95692
5000 230.86983

My initial thoughts here are:
1.  If the block mask size is mostly responsible for determining the performance of the zlib compression algorithm and if the window size is mostly responsible for determining the performance of the rsync algorithm, then we may have an opportunity to optimize the performance of the current --rsyncable patch.  The test results above imply that we could be looking at a 25% improvement in rsync performance without impacting compression or significantly adjusting Rusty's algorithm.  I think that the test results so far tend to support this as a possibility.

2.  Further, if we can get really small window sizes by switching to the original rsync rolling checksum (instead of using the simple checksum in the current patch), then we *might* be able to achieve even better optimization, without adding a lot of computation overhead (the overhead of the rsync algorithm compared to the simple computation used now should be pretty low compared to the rest of what's going on in the zlib computation).  

Anyway - that's where I'm headed with this.  I'll have results to indicate whether #1 is worth pursuing (i.e. testing with many other files) tomorrow.  #2 will require a bit more coding to the --rsyncable patch, but it should be relatively simple to do.

That's all I have for now - I'm running tests this evening to independently adjust the window and block mask size, and to test a different data file just to make sure I'm not way off-base (i.e. make sure the trends indicated by my results are not dependent on the input file).

I'd love to hear any thoughts on the matter!

- Kevin