Rsync friendly zlib/gzip compression - revisited

Shachar Shemesh rsync at shemesh.biz
Mon Feb 14 06:07:29 GMT 2005


Kevin Day wrote:

>At that point, I started to consider the question raised in the discussion threads by Kevin Easton - primarily:  are the checksum and compression block trigger selected by Rusty optimal?
>  
>
Oh, come on, Kevin. Couldn't you have sent this email just one week 
earlier, before I recruited all of my math skills to do precisely the 
same analysis (for a slightly different purpose, though)?

I am talking about rsyncrypto here, an rsync-friendly encryption 
scheme: http://sourceforge.net/projects/rsyncrypto. The trade-offs are 
slightly different from those for compression - for each restarted 
block we lose at most 15 bytes, with an expected value of 8, so the 
number of blocks doesn't matter that much. What does matter for us is 
the predictability of the switch: if an attacker can guess where a 
block will end, she has another possible known plaintext, practically 
free of charge.

Another major difference between us is that I am assuming the input 
data is random, since my input is the output of gzip :-). You, on the 
other hand, are up against an impossible foe, I fear: you have no way 
of testing whether the benchmark you chose is, in any way, indicative 
of the typical rsync user's file.

>To refresh memories:  Rusty's checksum algorithm is to compute the sum of the previous 4096 bytes of source data, and his block trigger kicks in when the checksum is divisible by 4096 (i.e. SUM mod 4096 == 0).
>  
>
Actually, the version committed to Debian uses slightly different 
values: the modulo is 8192, while summing over a span of 256 bytes.
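
For anyone who hasn't read the patch, the trigger boils down to 
something like the following. This is my own sketch of the idea with 
Debian's values plugged in, not code lifted from the patch, and the 
names are mine:

/* Rolling-sum trigger: keep the sum of the last SPAN bytes and
 * signal a block restart whenever it is divisible by MODULUS. */
#include <stddef.h>
#include <stdint.h>

#define SPAN    256     /* bytes summed over (Debian's value) */
#define MODULUS 8192    /* trigger when sum % MODULUS == 0    */

typedef struct {
    uint8_t  ring[SPAN];   /* last SPAN bytes seen         */
    size_t   pos;          /* next ring slot to overwrite  */
    size_t   seen;         /* total bytes consumed so far  */
    uint32_t sum;          /* rolling sum over the window  */
} rollsum;

/* Feed one byte; a nonzero return means "restart the block here". */
static int rollsum_feed(rollsum *r, uint8_t byte)
{
    if (r->seen >= SPAN)
        r->sum -= r->ring[r->pos];   /* drop the byte sliding out */
    r->ring[r->pos] = byte;
    r->pos = (r->pos + 1) % SPAN;
    r->sum += byte;
    r->seen++;
    return r->seen >= SPAN && r->sum % MODULUS == 0;
}

(Initialize the struct to all zeros before the first call.)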

>Because the value of 4096 prior bytes seemed fairly arbitrary, and because modding by the same value as the window size seemed somewhat arbitrary, I decided to do some parametric tests with a couple of different checksum strategies, window sizes and block trigger mask sizes.
>  
>
The only ones possible. I tried to create a formula describing the 
expected length of a block based on Rusty's three parameters (minimal 
block length, span of the sum, and modulo). Each byte has a known 
distribution (uniform over the values 0-255; like I said, since I'm 
assuming I'm getting the output of gzip, I can assume that - you 
cannot). The sum over the span is still fairly simple (approximately 
normal, with mean 255*span/2 and variance 255^2/12*span). From there 
things get awfully complicated: there is a huge interdependency 
between consecutive sums, but two sums taken span bytes apart are 
statistically unrelated (again, assuming random input, which you 
cannot).
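
To spell the easy step out in my notation (nothing deeper than the 
central limit theorem here):

    S = X_1 + ... + X_span,   each X_i uniform over {0, ..., 255}

    E[S]   = 255*span/2
    Var(S) = span*(256^2 - 1)/12  ~=  255^2/12*span

If one is also willing to pretend that S mod modulo is roughly uniform 
at every offset - a leap, since consecutive sums are anything but 
independent - then the trigger fires with probability about 1/modulo 
per byte, and the expected block length lands near min + modulo. Treat 
that strictly as a heuristic; the interdependency is exactly what 
makes the precise formula hard.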

>The following is pretty long - I apologize for making my first post on the topic so extensive - but the results are interesting, and I figured I'd lay out everything that I have so far to see if anyone a) is interested or b) has any ideas of directions to pursue...
>  
>
I'm sorry - I am very interested, but cannot read it all right now. I 
will comment on it in a few hours or tomorrow. In the meantime, here 
are a few of my insights on the matter:
1. If the expected length of each block is too similar to the minimal 
length, you run the risk of losing rsync's efficiency. Consider the 
following: minimal length=4096 (as in Rusty's case), expected 
length=4120 (totally hypothetical). The first run of the file has 
block breaks at offsets 4100, 8220, 12340, etc. You then add one byte 
towards the end of the first block. Obviously the first block will 
need to be transferred again, but you run the risk that, since the 
second block has now shifted, it will also end on a different 
boundary, and thus have to be transferred again as well. If the 
expected length is too similar to the minimal length, this can 
propagate for quite a few blocks.

How much is too similar? My original gut feeling was that "E=min*2" is a 
good value, but as I'm writing this email, I am getting the feeling that 
"E=min+span" would work even better.

>Test Description
>==========
>
>Here are some definitions:
>
>SUM = the value of the rolling checksum at a given point in the input stream
>Window size = the number of bytes in the input stream that are considered for the rolling check sum
>Block mask size = The size of the value used in checking the current SUM to see if the current compression block should be wrapped up
>Compression block size = The number of bytes from the input stream between checksum triggers
>
>I decided to test using two forms of rolling checksum:
>
>1.  Rusty's strategy of just adding the values from the input stream
>2.  Andrew's rolling checksum that is used in the rsync algorithm itself
>
>
>For starters (I'm still working on processing tests on a larger data set), I chose a 28 MB database file (dBase file) - fixed record size, lots of compression opportunities.
>  
>
You may wish to check out the rsyncrypto CVS. There is a utility there 
called "blocksizes" that is aimed exactly at checking this. It is 
still not released (I only wrote it last night), and gets its input 
data from /dev/urandom. I also have premade block sizes for 50000 
blocks, for (min, span, modulo) triplets of 8192, 256, 8192 (Debian's 
gzip default); 8192, 8192, 8192; and 8192, 8160, 8192 (where the 
modulo sits at the dead center of the Gaussian curve). Email me if you 
want the raw data.
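
Regarding Andrew's rolling checksum mentioned above: for anyone who 
wants to try it as the trigger, the rolling part is just two 16-bit 
sums. This is my sketch from memory of the rsync technical report - 
check rsync's own source before trusting the details:

#include <stdint.h>

#define W 256   /* window size; illustrative, not rsync's default */

typedef struct {
    uint16_t s1;   /* plain sum of the window's bytes             */
    uint16_t s2;   /* position-weighted sum (running total of s1) */
} rsum;

/* Prime the sums over an initial window buf[0..W-1]. */
static void rsum_init(rsum *r, const uint8_t *buf)
{
    r->s1 = 0;
    r->s2 = 0;
    for (int i = 0; i < W; i++) {
        r->s1 = (uint16_t)(r->s1 + buf[i]);
        r->s2 = (uint16_t)(r->s2 + r->s1);
    }
}

/* Slide the window one byte: `out` leaves, `in` enters. */
static void rsum_roll(rsum *r, uint8_t out, uint8_t in)
{
    r->s1 = (uint16_t)(r->s1 - out + in);
    r->s2 = (uint16_t)(r->s2 - (uint16_t)(W * out) + r->s1);
}

static uint32_t rsum_digest(const rsum *r)
{
    return (uint32_t)r->s1 | ((uint32_t)r->s2 << 16);
}

Testing rsum_digest() (or s1 alone) against the block mask would slot 
straight into a trigger loop like the one sketched earlier.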

>I then constructed a parametric study to see the impact of window size, block mask size and algorithm based on the database file as an input file.  The output of the tests is the average compression block size, and the standard deviation of the compression block size.
>
>For the initial tests here (more follow!), this is run in a test harness that just computes effective block sizes based on the two checksum algorithms - it is NOT running the patched zlib/gzip code directly (I do that testing later on).  The results should help us understand the dependencies between different parameters, which will help determine how to pursue future testing.
>
>My thinking here is as follows:
>
>Average compressed block size
>------------------------------------
>The average compression block size is going to pretty much dictate the overall effectiveness of the zlib/gzip underlying compression algorithm.  Ideally, we want to have a control that we can tune to adjust this value so we get decent performance from zlib, but not have blocks that are too big (which would reduce the effectiveness of the rsync operation itself).  The zlib algorithm's performance is non-linear with respect to the compression block size, so finding this balance point is particularly important from an optimization perspective.  In general, I find that block sizes around 4000 - 5000 seem to be about optimal - anything below that, and the compression drops off.  Anything above, and the compression doesn't improve much.
>
>Std deviation of compressed block size
>--------------------------------------------
>I initially thought that the standard deviation of the compression block size would be important because of the non-linear behavior of the zlib compression algorithm with respect to block size.  If our standard deviation is too high, then we might have a great average block size, but most blocks will be significantly above or below the average, so the effective performance of both the compression and rsync pieces of the problem will be far from optimal.
>  
>
Here gzip and rsyncrypto differ in objectives. For rsyncrypto, a large 
standard deviation is a plus, as then the attacker cannot guess where a 
block ends.
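
To make the compression side concrete: the reason block size matters 
at all is that each restart empties deflate's history, so short blocks 
never build up a useful dictionary. The --rsyncable patch does the 
reset inside deflate itself; the same effect can be approximated from 
outside the library with Z_FULL_FLUSH. A rough sketch (my code, zlib 
framing rather than gzip, with should_flush() standing in for a real 
rolling-sum trigger):

#include <stdio.h>
#include <zlib.h>

/* Stand-in for a rolling-sum trigger (see the earlier sketch);
 * here it just flushes on newline so the example links and runs. */
static int should_flush(unsigned char byte)
{
    return byte == '\n';
}

int main(void)
{
    z_stream z = {0};
    unsigned char in[1], out[16384];
    int c, flush;

    if (deflateInit(&z, Z_DEFAULT_COMPRESSION) != Z_OK)
        return 1;
    do {
        c = getchar();
        if (c == EOF) {
            z.avail_in = 0;
            flush = Z_FINISH;
        } else {
            in[0] = (unsigned char)c;
            z.next_in = in;
            z.avail_in = 1;
            /* A full flush empties the compressor's history, so the
             * bytes after it do not depend on the bytes before it. */
            flush = should_flush(in[0]) ? Z_FULL_FLUSH : Z_NO_FLUSH;
        }
        do {                      /* drain whatever deflate produced */
            z.next_out = out;
            z.avail_out = sizeof(out);
            deflate(&z, flush);
            fwrite(out, 1, sizeof(out) - z.avail_out, stdout);
        } while (z.avail_out == 0);
    } while (c != EOF);
    deflateEnd(&z);
    return 0;
}

Feeding one byte at a time is slow, of course; it just keeps the 
sketch short.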

>Another way of putting this:  If I say that there isn't a good deal of improvement in compression ratio for compression block sizes above 4000 bytes, then I would optimally choose my checksum algorithm and parameters so that 4000 = (Xavg - Xsigma).  If I have a small Xsigma, that allows me to have a smaller Xavg.  And smaller Xavg *should* translate into better rsync performance.  I have NOT tested this yet, but it is certainly an obvious next step.
>  
>
Please note my observation above about keeping the expected length 
away from the minimal length - they shouldn't be too similar. Also, a 
big standard deviation will probably decrease the chance that a small 
change will mess up two blocks. Then again, you do have compression to 
think of.

>There is one more consideration:  If a small change is made to a file, what are the chances of that change messing up 2 checksum blocks instead of 1?  In straightforward rsync, messing up one or two blocks is not that big of a deal.  Once the data is compressed, I'm thinking that it would tend to be amplified.  I haven't done any statistics on this, and I could be completely off base - but my initial intuition is that having a smaller window size would lead to better ultimate rsync performance...  I would be extremely interested in anyone else's thoughts on this point...
>  
>
Oops. That's what comes from not reading before replying (damn time 
constraints). You asked for them, you got them.

>2.  Once you have a sufficient window size, it stops being a parameter for consideration, and the Block Mask Size becomes the "dial" that allows you to control both the average compression block size and the standard deviation.  The correlation between the two is pretty much linear, with the RSync style checksum having a more controlled result.
>  
>
Except for my point above.

>3.  In general, the Rsync checksum had a lower average block size and a lower block size standard deviation for a given block mask.  The overall trend of the ratio between the average and standard deviation is pretty much independent of the algorithm selected, but there is an approx 5-10% increase in standard deviation as a percentage of average block size for the simple checksum calculation relative to the rsync checksum.
>
>4.  The large scale behavior of the ratio between average and standard deviation is definitely dependent on Block Mask Size.  Higher mask values result in larger standard deviation - but the relation is non-linear, and appears to taper off to about 85% for the simple checksum and 75% for the rsync checksum.
>
>Given that we want to lower the standard deviation, this result implies that the block mask size should be selected to be as low as possible.
>
>5.  Earlier, I mentioned that the value that probably impacts the performance of the compression algorithm the most is going to be (Xavg - Xstddev).  If we look at that value as a function of BlockMaskSize, we find something quite interesting - there is *almost* no dependency on Block Mask Size!  The effects of the increase in standard deviation pretty much wipe out any advantage of the increased average.  For example, for the RSync checksum, varying the block mask size from 1024 to 4024 causes the (Xavg - Xstddev) value to vary from 1200 to 1070 - not much!
>
>If we look at (Xavg - Xstddev) as a function of window size, we find that there is no dependency at all once we go above some value of window size (the value of that cutoff is dependent on the checksum algorithm selected).
>
>The conclusion here (it turns out that this conclusion is wrong - see below) is that changing the window size and block mask size probably will not have a significant effect on the performance of the compression algorithm, as long as the window size is above the point where the results are statistically unpredictable.  For the RSync checksum, this point is approximately 100 bytes.  For the simple checksum, this point is approximately 500 bytes.  
>
>Here's the data crunched to show this (values marked #VALUE! are where the std deviation was too big to calculate using the 64-bit integer math in my test application):
>
>[WinSize] [Xavg-Xstdev (R)] [Xavg-Xstdev (S)]
>25 791.3509857 #VALUE!
>30 854.7655263 #VALUE!
>36 847.0310926 #VALUE!
>43 963.7640702 #VALUE!
>51 1013.541078 #VALUE!
>61 997.480154 1301.475001
>73 1036.707541 1624.38203
>87 1068.007719 #VALUE!
>104 1082.094327 #VALUE!
>124 1113.082926 1084.712182
>148 1113.28381 1137.304064
>177 1106.720093 2215.686195
>212 1116.509118 962.0990605
>254 1113.412682 #VALUE!
>304 1132.536022 1019.862558
>364 1118.921432 991.9188742
>436 1104.036368 1062.314435
>523 1100.639225 1098.862742
>627 1112.34673 1024.872196
>752 1127.374869 1104.576236
>902 1107.316805 1093.34639
>1082 1078.665537 1104.471305
>1298 1129.472404 1079.824598
>1557 1106.018974 1060.80879
>1868 1117.813407 1108.882505
>2241 1120.976969 1101.694318
>2689 1114.43462 1143.649274
>3226 1106.447702 1121.075456
>3871 1122.427955 1076.250929
>
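
(A side note on the #VALUE! entries above: the overflow can be 
sidestepped by accumulating the variance incrementally in floating 
point - Welford's method - instead of summing squares in 64-bit 
integers. A minimal sketch:

#include <math.h>

typedef struct {
    long   n;      /* samples seen                      */
    double mean;   /* running mean                      */
    double m2;     /* running sum of squared deviations */
} stats;

static void stats_add(stats *s, double x)
{
    double d = x - s->mean;
    s->n++;
    s->mean += d / s->n;
    s->m2 += d * (x - s->mean);
}

static double stats_stddev(const stats *s)
{
    return s->n > 1 ? sqrt(s->m2 / (s->n - 1)) : 0.0;
}

Zero-initialize, call stats_add() once per block size, and read 
stats_stddev() at the end.)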
>
>
>This leads to one obvious test:  Try Rusty's algorithm, and vary the block size and window size, and see how the compression ratio is affected.  If the conclusion above is correct, then messing with the window size should have no impact at all, and messing with the block size should actually hurt the compression ratio a little bit with increasing block size.
>
>I was quite surprised to find that the results of that test did not line up with my expectations.  Actually running the data through the zlib compression algorithm (adjusted to use the simple checksum calculation and allow for varying window size and block size independently), gives the following results:
>
>1.  As expected, the compression ratio is independent of the window size, as long as the window size is above the minimum critical size for the checksum algorithm.
>
>2.  However, the compression ratio IS dependent on block mask size.  If we choose a window size of 450 (anything above 350 will do), we see that the compression ratio varies from 4.5 to 4.9 as the mask size increases from 1000 to 4000.  Masks above 4000 do not appear to have much of an impact on effective compression ratio.  I suspect that this test result is what led Rusty to choose the window/block size of 4096 in his patch.
>
>The reason for the difference between #2 and my original expectations can probably be attributed to my lack of understanding of exactly how the zlib algorithms work.
>
>
>
>
>RSync testing
>========
>
>Given the above results, I thought it would be interesting to test my hypothesis about the effect of window size on the efficiency of the rsync algorithm.
>
>I ran a suite of tests as follows:
>
>1.  Original data used is the same dBase database file as for the prior tests
>2.  Modified data is the original data with some miscellaneous changes throughout.  To give a feel for the extent of the changes, running the rsync algorithm against the uncompressed data in these two files with an RSBlock size of 1000 gives a speedup of approximately 4000:1 - so not a huge number of changes.
>
>I then ran a series of tests where I compute the speed up for a set of different rsync block sizes and window sizes.  In my initial testing, I adjusted the window size and block mask size in lock step (i.e. using Rusty's original algorithm).
>
>
>Results
>=====
>I won't spend a lot of time looking at the effect of rsync block size on the speed up - that has been studied pretty well already, but just so we have the data points, here are the results assuming a fixed compression window size of 4000 (which is pretty close to Rusty's rsyncable patch implementation):
>
>[RS Block Size] [Speed Up]
>500 194.49103
>1000 179.82353
>1500 164.90042
>2000 169.59175
>2500 154.23415
>
>
>For the rest of this analysis, I'm going to set the rsync block size to 1000 and leave it there.
>
>
>If we look at RSync performance as a function of window size, we find, as expected, that the rsync performance decreases.  The curve isn't smooth (the compression algorithm causes all sorts of havoc here), but the general trend is clear:  Increasing the compressed window size and block mask size = decreased rsync performance.  What isn't clear (because I varied the window size and block mask size together) is which of the two is actually the driving force (window size or block mask size).  My initial results from fiddling with the compression performance indicate that it should be the block mask size that matters, and the window size shouldn't matter.
>
>Here are the test results:
>
>[Window AND block mask Size] [Speed Up]
>100 335.6063
>600 335.56348
>1100 264.708
>1600 273.0129
>2100 220.42537
>2600 226.33165
>3000 219.47977
>3500 225.90854
>4000 179.82353
>4500 195.95692
>5000 230.86983
>
>
>
>My initial thoughts here are:
>1.  If the block mask size is mostly responsible for determining the performance of the zlib compression algorithm and if the window size is mostly responsible for determining the performance of the rsync algorithm, then we may have an opportunity to optimize the performance of the current --rsyncable patch.  The test results above imply that we could be looking at a 25% improvement in rsync performance without impacting compression or significantly adjusting Rusty's algorithm.  I think that the test results so far tend to support this as a possibility.
>
>2.  Further, if we can get really small window sizes by switching to the original rsync rolling checksum (instead of using the simple checksum in the current patch), then we *might* be able to achieve even better optimization, without adding a lot of computation overhead (the overhead of the rsync algorithm compared to the simple computation used now should be pretty low compared to the rest of what's going on in the zlib computation).  
>
>Anyway - that's where I'm headed with this.  I'll have results to indicate whether #1 is worth pursuing (i.e. testing with many other files) tomorrow.  #2 will require a bit more coding to the --rsyncable patch, but it should be relatively simple to do.
>
>
>
>That's all I have for now - I'm running tests this evening to independently adjust the window and block mask size, and to test a different data file just to make sure I'm not way off-base (i.e. make sure the trends indicated by my results are not dependent on the input file).
>
>I'd love to hear any thoughts on the matter!
>
>- Kevin
>  
>


-- 
Shachar Shemesh
Lingnu Open Source Consulting ltd.
Have you backed up today's work? http://www.lingnu.com/backup.html


