Rsync friendly zlib/gzip compression - revisited
kevin at trumpetinc.com
Mon Feb 14 19:21:21 GMT 2005
Hey - when the stars align, they align :-)
Anyway, I've continued my testing (with a broader selection of files, and doing a full zlib and rsync analysis), and I have some interesting results.
As would be expected, there is definitely a dependency of rsync performance (i.e. "bytes changed") on what I've called the block mask size - that makes sense, because the size of the mask determines the average size of the compressed output blocks, which makes the chunks that rsync has to work with bigger and less efficient.
Interestingly, there also appears to be an orthogonal dependency on the window size - increasing the window size while keeping the mask size constant results in worse rsync performance. My previous results (and testing with additional files) show that compression performance does NOT depend on the window size (as long as you are above the threshold where the checksum behaves itself). This suggests that there is definitely an opportunity to improve rsync performance by minimizing the window size.
As Shachar writes, this may very well have been done already in the patch released to the Debian build - it sounds like that release uses a window size of 256 and a mask size of 8192 (although I haven't actually looked at the code). Interestingly, for the simple checksum algorithm, these are pretty much the values that my analysis so far indicates should be used. With any window size less than about 250-300, the checksum misbehaves. I would question the decision to use 8192 as the mask size, though - anything above 4000 is probably not going to make much of a difference in the compression ratio, but it will have a big impact on the rsync performance (it effectively doubles the 'changed bytes' over the results with a 4096 mask, but gets you less than 0.1% on the compression ratio - these are big files, so 0.1% is not necessarily something to sneeze at, but the tradeoff in rsync performance is pretty nasty...).
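For what it's worth, the simple summation trigger under discussion can be sketched as follows (a rough model only - the function name and structure are mine, not the actual patch's; the defaults are the 256/8192 values mentioned above). Note that the expected spacing between trigger points, and hence the average compressed block size, is on the order of the mask size:

```python
def reset_points(data, window=256, mask=8192):
    """Offsets where the simple rolling sum over the last `window` bytes
    is divisible by `mask` -- i.e. where deflate would be reset."""
    points = []
    s = 0
    for i, b in enumerate(data):
        s += b
        if i >= window:
            s -= data[i - window]  # slide the window: drop the oldest byte
        if i >= window - 1 and s % mask == 0:
            points.append(i)
    return points
```

On data that is all 0x01 bytes with window=4 and mask=4, for example, the rolling sum is 4 at every full window, so every offset from 3 on triggers a reset - which also illustrates the pathological-input concern discussed further below.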
Some other interesting characteristics of the dependency on window size:
1. The dependency is linear above a certain threshold.
2. Below that threshold, the rsync performance improves dramatically with reduced window size. The improvement is especially large with large mask sizes. This makes sense if you think about it - large mask sizes will cause large compressed blocks, and if there is a change in the original data, it will be more likely to affect two adjacent blocks when the window size is large. I think that Shachar's description of this in his response to my original message hits it on the head.
By way of demonstration, here are some numbers if I use a large block mask size (5474 is the one my test happened to use) to get good compression:
[Window Size] [Normalized Data Transfer Required]
A couple of things to note:
a) A window size of 1219 requires the transfer of 1.58 times the number of bytes that a window size of 165 does.
b) The window size increases logarithmically in the table (this is just how I ran the test).
If you plot these, you'll see that the required data transfer drops off quickly for small window sizes, but increases gradually for larger window sizes (this is very common for a boundary condition problem, and I suppose that the nearness of a given change to a given window can be viewed in this context, so the results kind of make sense).
So - if we can improve the rsync performance by 60% by dropping the window size from 1200 to 165, and the improvements appear to get better and better as we get smaller, then going with a checksum that allows for very small window sizes would seem like a very good thing to do. In fact, we may be able to achieve the rsync performance that we would have gotten with a considerably smaller block mask size, without sacrificing efficiency in the zlib algorithm.
It would seem (but I'll have to actually program this before I say for sure) that you may very well be able to have your cake and eat it too. My testing so far indicates that the rolling checksum built into the rsync algorithm should allow considerably smaller window sizes than the simple summation checksum.
For Shachar's encryption studies, this may also bring the window size down to the same order of magnitude as his encryption block size, which may address some of the challenges he is likely to run into...
Just in case anyone is interested, the way I am using the rsync checksum is in its 16-bit form, except that I perform back calculation instead of forward calculation (i.e. I compute over the last WINDOWSIZE bytes instead of the next WINDOWSIZE bytes). Functionally, I don't think this makes any difference - it is just a difference of opinion about what the "current byte" is. I compute r1 and r2, then add the two together and check against the mask value.
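To make the back calculation concrete, here is a sketch of how that computation might look (hypothetical code, not what I actually ran; r1 and r2 follow the 16-bit rsync checksum definition, with the newest byte carrying weight 1 and the oldest carrying weight WINDOWSIZE in r2):

```python
M = 1 << 16  # 16-bit form of the rsync rolling checksum

def checksum_stream(data, window):
    """Return (offset, r1, r2) at each offset, computed over the LAST
    `window` bytes (back calculation, as described above). Entries
    before offset window-1 cover a partially filled window."""
    r1 = r2 = 0
    out = []
    for i, b in enumerate(data):
        if i >= window:
            old = data[i - window]
            r1 = (r1 - old) % M            # drop the oldest byte from the sum
            r2 = (r2 - window * old) % M   # it carried weight `window` in r2
        r1 = (r1 + b) % M
        r2 = (r2 + r1) % M
        out.append((i, r1, r2))
    return out

def is_block_boundary(r1, r2, mask):
    # add the two halves together and check against the mask value
    return (r1 + r2) % mask == 0
```

The rolling update costs the same per byte as the simple summation checksum (one subtract, one multiply-subtract, two adds), so switching checksums shouldn't hurt compression speed much.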
As an aside, the 32-bit rsync rolling checksum wouldn't work well for this, because the mod operation varies towards the least significant bytes, which would basically give us exactly the simple summation checksum already in use.
One final thought: Right now, I compare the modulo of the checksum and the mask size and look for when it equals zero. Given the propensity of long strings of zeros in certain types of files, it may be desirable to compare against something other than zero (both the simple checksum and the rsync checksum compute long strings of zero bytes to be 0). That is probably a micro-optimization, but it can't hurt. I'll probably choose to compare against something like binary 1010101... masked to the byte length of the mask size (this would require the mask size to be a power of 2 - we'll see if that's possible). I'd be interested if anyone has any comments on that...
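A sketch of what that comparison might look like (purely hypothetical - the function name and the exact alternating pattern are my own choices; as noted, this requires the mask size to be a power of 2):

```python
def trigger(checksum, mask_size):
    """Compare the masked checksum against an alternating 1010... pattern
    instead of zero, so long runs of zero bytes (which checksum to 0)
    don't produce a trigger at every offset."""
    assert mask_size & (mask_size - 1) == 0, "mask size must be a power of 2"
    target = 0xAAAAAAAA & (mask_size - 1)  # binary ...10101010, truncated
    return (checksum & (mask_size - 1)) == target
```

With a power-of-2 mask size, the modulo also becomes a bitwise AND, which is a tiny bonus.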
Finally - Shachar writes: "How much is too similar? My original gut feeling was that 'E=min*2' is a good value, but as I'm writing this email, I am getting the feeling that 'E=min+span' would work even better."
That is an excellent question. If it could be answered, it would allow us to statistically determine an optimal rsync block size. So far, my results indicate that larger rsync block sizes lead to worse performance. There doesn't appear (at least in my data so far) to be an optimum - smaller is better... Of course, my definition of "performance" may be in question, because I am only concerned with the 'bytes changed' value. If you consider the size of the checksum table, then there will certainly be an optimum rsync block size, but this kind of comes back to the original work on rsync - what we are talking about would be an optimization of the rsync algorithm as applied to resettable random (in a very loose sense) streams. If that's the measure, then I haven't seen such a dependence yet.
However, I just kicked off another parametric study varying the rsync block size and the block mask size, while holding the window size constant. If there is an optimization available for rsync block size, then we should see a dependency in these results.
Another way of thinking about this, that may lead to another approach: The ideal case, of course, would be if we knew how big each compressed block was, and computed the checksums on that. This would require us to compute a huge number of simultaneous rolling checksums, so we don't want to do that brute force type of approach - but maybe there is a direction someone can go with that as a starting point... It should be possible to compute the cost of each block changing as a function of rsync block size, then minimize the cost for a single rsync block size...
I'm thinking along the lines of:
Cost of a single block (i), where Xrsy is the rsync blocksize and Xcomp is the actual compressed (output) block size for block i:
if (Xrsy < Xcomp(i))
    cost(i) = Xcomp(i) - Xrsy
else
    cost(i) = Xcomp(i)
The above reflects the fact that if the rsync block is bigger than the compression block, we lose the entire compression block, but if it is smaller, then we only lose the difference. It's actually more complicated than that - we may have to include a parameter that takes the remainder of Xcomp(i)/Xrsy... we need to account for the fact that multiple rsync blocks could land inside a single compression block (and in fact, this may be desirable).
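The cost model above, plus the brute-force minimization it suggests, might look something like this (a hypothetical sketch - it implements only the naive per-block cost, not the multi-block/remainder refinement just mentioned, and the function names are mine):

```python
def block_cost(x_rsy, x_comp):
    """Bytes lost when a compressed block of size x_comp changes,
    under the naive cost model above."""
    if x_rsy < x_comp:
        return x_comp - x_rsy  # only the difference is lost
    return x_comp              # the whole compression block is lost

def total_cost(x_rsy, comp_block_sizes):
    return sum(block_cost(x_rsy, c) for c in comp_block_sizes)

def best_block_size(comp_block_sizes, candidates):
    """Pick the candidate rsync block size minimizing total cost - the
    kind of value a modified zlib could spit out alongside the stream."""
    return min(candidates, key=lambda x: total_cost(x, comp_block_sizes))
```

Note that this naive model rewards large rsync blocks (the per-block difference shrinks as Xrsy grows), which contradicts my empirical "smaller is better" results - more evidence that the remainder/multiple-blocks refinement really is needed.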
Heck, for a set of compressed block sizes, ideally we would have an rsync block size that is a factor of all of them. That's obviously only going to happen with an rsync block size of 1 - so it's out of the question - but I'm thinking of something along those lines...
The real question is if it is possible to do this calculation without a massive amount of CPU time... it would be slick to have the modified zlib algorithm spit out not only the compressed data stream, but also an optimal rsync block size...
One final comment: In Andrew's thesis, he found that the algorithm's performance was dependent on the rsync block size, but that the curve was relatively flat. This is true for nice, uncompressed data. However, when we start looking at this behavior for compressed data streams (rsync friendly, of course), the curve is nowhere near as flat. So optimization of the rsync block size for this one type of data may be a good idea.
> Kevin Day wrote:
> At that point, I started to consider the question raised in the discussion threads by Kevin Easton - primarily: are the checksum and compression block trigger selected by Rusty optimal?
Oh, come on, Kevin. Couldn't you have sent this email just one week ago, before I started to recruit all of my math skills to do the exact same analysis (for a slightly different purpose, though)?