[RFC] dynamic checksum size
jw schultz
jw at pegasys.ws
Sun Mar 23 22:46:34 EST 2003
On Sun, Mar 23, 2003 at 05:45:47PM +1100, Donovan Baarda wrote:
> On Sun, 2003-03-23 at 07:40, jw schultz wrote:
> [...]
> > The two attached patches implement per-file dynamic checksum
> > lengths.
>
> Nice one.
>
> > Here is a quick table showing the file sizes at which
> > transition occurs, the sum2 length and the jump in size of
> > the block sum table transmitted:
> >
> > filesize    sum2len    transition
> >  <125MB        2
> >   125MB        3           8KB
> >     1GB        4          32KB
> >     4GB        5         128KB
> >    16GB        6         523KB
> >    64GB        7           2MB
>
> My reading of the patch doesn't quite match with the above. The block
> size heuristic is 1/10000 the filesize, limited between some minimum
> and maximum. Ignoring the limits, this means the number of blocks is
> kept at a constant 10000, leaving you stuck at three bytes.
Your reading is correct. I miscounted the zeros.
I merely consolidated and cleaned up the code; I did
nothing to the heuristic. Had I written it I would have
done something that caused the block size to grow in
keeping with the sqrt of the file size, but within the
same range of block sizes.
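Something along these lines is what I have in mind (an
illustrative sketch only, not anything in the patches; the
700-byte floor and 16K ceiling are the limits already used
elsewhere in this thread, and the function name is mine):

#include <math.h>

/* Sketch: block length grows with sqrt(file length),
 * clamped to the same 700..16K range as the current code. */
static unsigned int sqrt_block_len(double len)
{
    unsigned int blen = (unsigned int)sqrt(len);

    blen &= ~15u;           /* keep it a multiple of 16 */
    if (blen < 700)
        blen = 700;
    if (blen > 16 * 1024)
        blen = 16 * 1024;
    return blen;
}

That reaches the 16K ceiling at 256MB instead of sitting at
a constant block count.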
> > This heuristic is a stab in the dark. It may be that it is
> > too aggressive in the rate of increase or insufficiently
> > aggressive at starting the increase. It may be that 8
> > bits of sum2 in conjunction with the 32-bit adler is enough
> > for smaller files. We may also wish to revisit the
> > heuristic for block sizes that fixes the block count at 1000
> > for files between 700KB and 16MB, or coordinate them so that
> > the transitions would be easier. Both heuristics can be
> > changed at will without breaking compatibility.
>
> I posted some more maths on this a while ago to the old rproxy list to
> someone inquiring about how this affects librsync. In summary, a
> conservative simplification of the relationship between block size,
> blocksum size, and file size is;
>
> p = (s^2)/(B.(2^b))
>
> where:
>
> p is the probability of a false match (i.e., a corrupt result)
> s is the file size.
> B is the block size.
> b is the number of bits in the block checksum (blocksum size)
>
> Note that blocksum size is the total blocksum including the rollsum. If
> you plug in an "acceptable probability of failure", you get a direct
> relationship between file size, block size, and blocksum size. Assuming
> a 1/2^n probability of failure is OK, you get;
>
> B = (s^2)/(2^(b-n))
>
> b = n + 2.log2(s) - log2(B)
>
> Note that assuming a fixed blocksum size, the block size must grow at
> the square of the file size. Assuming a fixed block size, the blocksum
> must grow at the log of the filesize.
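As a sanity check on one concrete case (my arithmetic): for
s = 2^27 (128MB), B = 2^14 (16K) and b = 50 bits, the first
formula gives p = 2^54 / (2^14 * 2^50) = 2^-10, i.e. the
1-in-1024 failure rate that the n=10 assumption below aims for.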
>
> I think it makes sense to grow the block size linearly with file size
> (which keeps a constant number of blocks), and then have blocksum size
> grow as necessary to make up the difference by default. It might also be
> good to have the ability to specify either blocksum size or block size
> and have the other estimated using the above equations. Of course you
> will want to round up the blocksum size to the nearest byte.
I'm not really inclined to add a user-specified blocksum size.
The need for such tuning would be indicative of a poor
heuristic. For most uses, users should not be specifying a
block size either, but there are circumstances when it
may be advantageous, and not allowing it would be viewed as
a shortcoming by DBAs. Besides, we already support a
user-specified block size and I'm not inclined to drop that.
> Using a heuristic that aims for 2^m number of blocks and plugging this
> in to the above equations gives;
>
> B = s/(2^m)
> b = n + log2(s) + m (Note: 32bits of this is the rollsum)
>
> Assuming a block count of 8K (m=13), and an acceptable error rate of
> less than 1 in 1024 file transfers falling back to whole file (n=10),
> you get;
>
> file size    block size    blocksum size
>      <8M         1K        46 (sum2len=2)
>     <32M         4K        48 (sum2len=2)
>    <128M        16K        50 (sum2len=3)
>    <512M        64K        52 (sum2len=3)
>      <2G       256K        54 (sum2len=3)
>      <8G         1M        56 (sum2len=3)
>     <32G         4M        58 (sum2len=4)
>    <128G        16M        60 (sum2len=4)
>
> As you can see from this, your heuristic is very conservative for large
> files, but optimistic for smaller files. Might I suggest the following
> code fragments;
That is due to my error in the magnitude of the block count
in the block size calculation.
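For anyone who wants to reproduce that table, a throwaway
sketch along these lines regenerates it from the two formulas
(my code, nothing from the patches):

#include <stdio.h>
#include <math.h>

/* Regenerate the quoted table from B = s/2^m and
 * b = n + log2(s) + m, with m = 13 (8K blocks) and n = 10
 * (at most 1 in 2^10 transfers falls back to whole-file).
 * Sizes print in MB and KB. */
int main(void)
{
    const int m = 13, n = 10;
    double s;

    for (s = 8 * 1048576.0; s <= 131072 * 1048576.0; s *= 4) {
        double B = s / (1 << m);        /* block size in bytes */
        int b = (int)(n + log2(s) + m); /* blocksum bits, incl. 32-bit rollsum */
        int sum2len = (b - 32 + 7) / 8; /* sum2 bytes, rounded up */
        printf("<%6.0fM %8.0fK %2d (sum2len=%d)\n",
               s / 1048576.0, B / 1024.0, b, sum2len);
    }
    return 0;
}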
> #define BLOCKLEN_EXP 13 /* block_len heuristic 'm' value */
> #define BLOCKSUM_EXP 10 /* blocksum size heuristic 'n' value */
>
> if (block_size) {
>     block_len = block_size;
> } else {
>     block_len = (len >> BLOCKLEN_EXP) & ~15; /* rough heuristic */
>     block_len = MIN(block_len, BLOCK_SIZE);
>     block_len = MAX(block_len, (CHUNK_SIZE / 2));
As in the patch, the MIN and MAX macros are reversed here; they
need swapping so that BLOCK_SIZE (700) is the floor and
CHUNK_SIZE/2 (16K) is the ceiling, otherwise block_len always
comes out 16K.
> }
>
> size_t b = BLOCKSUM_EXP;
> size_t l = len;
> while (l >> 1) {
>     b += 2;
> }
> size_t c = block_len;
> while (c >> 1) {
>     b--;
> }
> sum.s2length = (b+1-32+7)/8; /* add a bit, subtract rollsum, round up */
>
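Concretely, the fixup amounts to something like this (a
sketch: the two bounds swapped as noted above, and the shift
loops made to consume their operands so they terminate):

#define BLOCKLEN_EXP 13 /* block_len heuristic 'm' value */
#define BLOCKSUM_EXP 10 /* blocksum size heuristic 'n' value */

if (block_size) {
    block_len = block_size;
} else {
    block_len = (len >> BLOCKLEN_EXP) & ~15;      /* rough heuristic */
    block_len = MAX(block_len, BLOCK_SIZE);       /* floor:   700 */
    block_len = MIN(block_len, (CHUNK_SIZE / 2)); /* ceiling: 16K */
}

size_t b = BLOCKSUM_EXP;
size_t l = len;
while ((l >>= 1)) /* add 2 bits per doubling of file length */
    b += 2;
size_t c = block_len;
while ((c >>= 1)) /* drop 1 bit per doubling of block length */
    b--;
sum.s2length = (b + 1 - 32 + 7) / 8; /* add a bit, subtract rollsum, round up */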
With that modicum of fixup to get it to actually work, and
some min-max bounds, here is an approximate table of what it
would do:
file length  block_len  block_count  s2length  checksum array size
         50        700            1         2                    6
      8085K       1008         8215         2                  48K
        12M       1552         8214         2                  48K
        15M       2032         8255         2                  48K
        20M       2576         8240         2                  48K
        23M       3024         8211         2                  48K
        27M       3536         8215         2                  48K
        32M       4144         8200         3                  56K
        36M       4656         8210         3                  56K
        39M       5040         8203         3                  56K
        44M       5664         8211         3                  56K
        47M       6128         8209         3                  56K
        51M       6640         8194         3                  56K
        56M       7168         8210         3                  56K
        60M       7760         8202         3                  56K
        63M       8064         8209         3                  56K
        68M       8736         8196         3                  56K
        71M       9088         8193         3                  56K
        76M       9824         8198         3                  56K
        79M         9K         8205         3                  56K
        83M        10K         8199         3                  56K
        86M        10K         8194         3                  56K
        93M        11K         8198         3                  56K
        97M        12K         8197         3                  56K
       101M        12K         8198         3                  56K
       105M        13K         8201         3                  56K
       109M        13K         8197         3                  56K
       113M        14K         8197         3                  56K
       118M        14K         8200         3                  56K
       122M        15K         8198         3                  56K
       127M        15K         8193         3                  56K
      1063M        16K          66K         4                 531K
        16G        16K        1034K         5                9312K
       261G        16K          16M         6                 163M
I'm not qualified to argue the formulae. But I can't say
I'm overly fond of the numbers coming out.
Personally, I'd like to see it a bit more aggressive on the
block size and perhaps a closer relationship between
the checksum array size and the file size.
I do have a variant that scales the length divisor for block
size from 1024 to 16384 as the length progresses from 1MB to
128MB. This, in conjunction with your sum2 length formula,
produces this:
file length  block_len  block_count  s2length  checksum array size
         50        700            1         2                    6
       935K        928         1032         2                 6192
      1230K       1216         1037         2                 6222
      1619K       1616         1027         2                 6162
      2131K       2128         1026         2                 6156
      2804K       2800         1026         2                 6156
      3690K       3680         1027         2                 6162
      4151K       2064         2060         2                  12K
      5462K       2720         2057         2                  12K
      7188K       3584         2054         2                  12K
      9459K       4720         2053         2                  12K
        12M       6208         2054         2                  12K
        15M       8176         2052         2                  12K
        16M       4256         4099         2                  24K
        21M       5600         4100         2                  24K
        28M       7360         4105         2                  24K
        37M       9696         4100         2                  24K
        49M        12K         4098         2                  24K
        65M       8400         8196         3                  56K
        86M        10K         8194         3                  56K
       113M        14K         8197         3                  56K
      1063M        16K          66K         4                 531K
        16G        16K        1034K         5                9312K
       261G        16K          16M         6                 163M
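The variant amounts to roughly this (a reconstruction from
the numbers above, not the patch itself; the divisor doubles
each time the length quadruples, so block length again grows
with roughly the sqrt of the length, and the function name
is mine):

/* Reconstruction: divisor stays at 1024 below 4MB, then
 * doubles at 4MB, 16MB, 64MB... up to 16384, while block_len
 * itself stays clamped to the 700..16K range. */
static size_t variant_block_len(size_t len)
{
    size_t divisor = 1024;
    size_t threshold = 4 * 1024 * 1024;
    size_t block_len;

    while (len >= threshold && divisor < 16384) {
        divisor *= 2;
        threshold *= 4;
    }
    block_len = (len / divisor) & ~(size_t)15;
    if (block_len < 700)
        block_len = 700;
    if (block_len > 16 * 1024)
        block_len = 16 * 1024;
    return block_len;
}

Past 64MB or so the 16K ceiling dominates anyway, so the top
end of the divisor range barely matters.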
Block size seems to me more of a fragmentation issue.
--
________________________________________________________________
J.W. Schultz Pegasystems Technologies
email address: jw at pegasys.ws
Remember Cernan and Schmitt