[RFC] dynamic checksum size

jw schultz jw at pegasys.ws
Sun Mar 23 22:46:34 EST 2003


On Sun, Mar 23, 2003 at 05:45:47PM +1100, Donovan Baarda wrote:
> On Sun, 2003-03-23 at 07:40, jw schultz wrote:
> [...]
> > The two attached patches implement per-file dynamic checksum
> > lengths.
> 
> Nice one.
>  
> > Here is a quick table showing the file sizes at which
> > transition occurs, the sum2 length and the jump in size of
> > the block sum table transmitted:
> > 
> > 	filesize     sum2len	table jump
> > 	<125MB		2
> > 	 125MB		3	  8KB
> > 	   1GB		4	 32KB
> > 	   4GB		5	128KB
> > 	  16GB		6	523KB
> > 	  64GB		7	  2MB
> 
> My reading of the patch doesn't quite match with the above. The block
> size heuristic is 1/10000 the filesize, limited between some minimum
> and maximum. Ignoring the limits, this means the number of blocks is
> kept at a constant 10000, leaving you stuck at three bytes.

Your reading is correct.  I miscounted the zeros.
I merely consolidated and cleaned up the code; I did
nothing to the heuristic.  Had I written it, I would have
done something that caused the block size to grow in
keeping with the square root of the file size, but within
the same range of block sizes.
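
As a sketch of that idea (not code from either patch): tie
the block length to the square root of the file length and
clamp it to the same 700-16K range the current heuristic
uses.

#include <math.h>
#include <stddef.h>

static size_t sqrt_block_len(size_t len)
{
	/* block_len grows as sqrt(len), rounded down to a
	 * multiple of 16 and clamped to 700..16K */
	size_t blen = (size_t)sqrt((double)len) & ~(size_t)15;

	if (blen < 700)
		blen = 700;
	if (blen > 16 * 1024)
		blen = 16 * 1024;
	return blen;
}

That gives 1024-byte blocks at 1MB and hits the 16K ceiling
at 256MB.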


> > This heuristic is a stab in the dark.  It may be that it is
> > too aggressive in the rate of increase or insufficiently
> > aggressive at starting the increase.  It may be that 8
> > bits of sum2 in conjunction with the 32 bit adler is enough
> > for smaller files.  We may also wish to revisit the
> > heuristic for block sizes that fixes the block count at 1000
> > for files between 700KB and 16MB, or coordinate them so that
> > the transitions would be easier.  Both heuristics can be
> > changed at will without breaking compatibility.
> 
> I posted some more maths on this a while ago to the old rproxy list to
> someone inquiring about how this affects librsync. In summary, a
> conservative simplification of the relationship between block size,
> blocksum size, and file size is;
> 
>   p = (s^2)/(B.(2^b))
> 
> where:
>   
>   p is the probability of a false match (i.e., a corrupt result)
>   s is the file size.
>   B is the block size.
>   b is the number of bits in the block checksum (blocksum size)
> 
> Note that blocksum size is the total blocksum including the rollsum. If
> you plug in an "acceptable probability of failure", you get a direct
> relationship between file size, block size, and blocksum size. Assuming
> a 1/2^n probability of failure is OK, you get;
> 
> B = (s^2)/(2^(b-n))
> 
> b = n + 2.log2(s) - log2(B)
> 
> Note that assuming a fixed blocksum size, the block size must grow at
> the square of the file size. Assuming a fixed block size, the blocksum
> must grow at the log of the filesize.
> 
> I think it makes sense to grow the block size linearly with file size
> (which keeps a constant number of blocks), and then have blocksum size
> grow as necessary to make up the difference by default. It might also be
> good to have the ability to specify either blocksum size or block size
> and have the other estimated using the above equations. Of course you
> will want to round up the blocksum size to the nearest byte.

I'm not really inclined to add a user-specified blocksum
size.  The need for such tuning would be indicative of a
poor heuristic.  For most uses, users should not be
specifying a block size either; but there are circumstances
where it may be advantageous, and not allowing it would be
viewed as a shortcoming by DBAs.  Besides, we already
support a user-specified block size, and I'm not inclined to
drop that.
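
For what it's worth, here is a quick numeric check of the
quoted relationship (an illustration only; s, B and n are
taken from the first row of the table below):

#include <math.h>
#include <stdio.h>

int main(void)
{
	double s = 8.0 * 1024 * 1024;	/* file size: 8MB */
	double B = 1024.0;		/* block size: 1K */
	double n = 10.0;		/* 1-in-2^10 failure rate */

	/* b = n + 2*log2(s) - log2(B) = 10 + 46 - 10 = 46 */
	double b = n + 2.0 * log2(s) - log2(B);
	/* strip the 32-bit rollsum and round up to bytes */
	double sum2 = ceil((b - 32.0) / 8.0);

	printf("bits %.0f, sum2 bytes %.0f\n", b, sum2);
	return 0;	/* prints "bits 46, sum2 bytes 2" */
}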

> Using a heuristic that aims for 2^m blocks and plugging this
> in to the above equations gives;
> 
> B = s/(2^m)
> b = n + log2(s) + m   (Note: 32bits of this is the rollsum)
> 
> Assuming 8K blocks (m=13) and an acceptable error rate of
> less than 1 in 1024 file transfers falling back to whole file (n=10),
> you get;
> 
> file size	block size	blocksum size
> <8M		1K              46 (sum2len=2)
> <32M		4K		48 (sum2len=2)
> <128M		16K		50 (sum2len=3)
> <512M		64K		52 (sum2len=3)
> <2G		256K		54 (sum2len=3)
> <8G		1M		56 (sum2len=3)
> <32G		4M		58 (sum2len=4)
> <128G		16M		60 (sum2len=4)
> 
> As you can see from this, your heuristic is very conservative for large
> files, but optimistic for smaller files. Might I suggest the following
> code fragments;

That is due to my order-of-magnitude error in the block
count of the block size calculation.

> #define BLOCKLEN_EXP 13 /* block_len heuristic 'm' value */
> #define BLOCKSUM_EXP 10 /* blocksum size heuristic 'n' value */
> 
> if (block_size) {
>   block_len = block_size;
> } else {
>   block_len = (len >> BLOCKLEN_EXP) & ~15; /* rough heuristic */  
>   block_len = MIN(block_len, BLOCK_SIZE);
>   block_len = MAX(block_len, (CHUNK_SIZE / 2));

As in the patch, the MIN and MAX macros are reversed.

> }
> 
> size_t  b = BLOCKSUM_EXP;
> size_t  l = len;
> while (l >>= 1) {	/* b += 2*log2(len) */
>         b += 2;
> }
> size_t  c = block_len;
> while (c >>= 1) {	/* b -= log2(block_len) */
>         b--;
> }
> sum.s2length = (b+1-32+7)/8; /* add a bit, subtract rollsum, round up */
> 

With a modicum of fixup to get it to actually work, and some
min-max bounds, here is an approximate table of what that
would do.

 file length   block_len   block_count  s2length    checksum array size
      50          700            1           2           6 
    8085K        1008         8215           2          48K
      12M        1552         8214           2          48K
      15M        2032         8255           2          48K
      20M        2576         8240           2          48K
      23M        3024         8211           2          48K
      27M        3536         8215           2          48K
      32M        4144         8200           3          56K
      36M        4656         8210           3          56K
      39M        5040         8203           3          56K
      44M        5664         8211           3          56K
      47M        6128         8209           3          56K
      51M        6640         8194           3          56K
      56M        7168         8210           3          56K
      60M        7760         8202           3          56K
      63M        8064         8209           3          56K
      68M        8736         8196           3          56K
      71M        9088         8193           3          56K
      76M        9824         8198           3          56K
      79M           9K        8205           3          56K
      83M          10K        8199           3          56K
      86M          10K        8194           3          56K
      93M          11K        8198           3          56K
      97M          12K        8197           3          56K
     101M          12K        8198           3          56K
     105M          13K        8201           3          56K
     109M          13K        8197           3          56K
     113M          14K        8197           3          56K
     118M          14K        8200           3          56K
     122M          15K        8198           3          56K
     127M          15K        8193           3          56K
    1063M          16K          66K          4         531K
      16G          16K        1034K          5        9312K
     261G          16K          16M          6         163M
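
For reference, the last column is just block_count times
(4-byte rollsum + s2length).  A spot check of the 1063M
row, purely as illustration:

#include <stdio.h>

int main(void)
{
	long len = 1063L << 20;		/* 1063M */
	long block_len = 16L << 10;	/* 16K */
	long count = (len + block_len - 1) / block_len;
	int s2length = 4;		/* from the 1063M row */

	/* ~66K blocks at 8 bytes each => ~531K */
	printf("%ldK\n", (count * (4 + s2length)) >> 10);
	return 0;
}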

I'm not qualified to argue the formulae, but I can't say
I'm overly fond of the numbers coming out.  Personally, I'd
like to see it a bit more aggressive on the block size, and
perhaps a closer relationship between the checksum array
size and the file size.

I do have a variant that scales the length divisor for the
block size from 1024 to 16384 as the length progresses from
1MB to 128MB (a sketch follows the second table).  This, in
conjunction with your sum2 length formula, produces this:

 file length   block_len   block_count  s2length    checksum array size
      50          700            1           2           6 
     935K         928         1032           2        6192 
    1230K        1216         1037           2        6222 
    1619K        1616         1027           2        6162 
    2131K        2128         1026           2        6156 
    2804K        2800         1026           2        6156 
    3690K        3680         1027           2        6162 
    4151K        2064         2060           2          12K
    5462K        2720         2057           2          12K
    7188K        3584         2054           2          12K
    9459K        4720         2053           2          12K
      12M        6208         2054           2          12K
      15M        8176         2052           2          12K
      16M        4256         4099           2          24K
      21M        5600         4100           2          24K
      28M        7360         4105           2          24K
      37M        9696         4100           2          24K
      49M          12K        4098           2          24K
      65M        8400         8196           3          56K
      86M          10K        8194           3          56K
     113M          14K        8197           3          56K
    1063M          16K          66K          4         531K
      16G          16K        1034K          5        9312K
     261G          16K          16M          6         163M
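
I haven't included the variant itself, but inferring the
steps from the table above (the divisor doubles each time
the length quadruples), it might look something like the
sketch below.  The exact thresholds are my guesses, not
code from the patch:

#include <stddef.h>

static size_t scaled_block_len(size_t len)
{
	size_t divisor = 1024;			/* ~1024 at 1MB */
	size_t threshold = (size_t)4 << 20;	/* first step at 4MB */

	/* double the divisor per quadrupling of len, up to 16384 */
	while (len >= threshold && divisor < 16384) {
		divisor *= 2;
		threshold *= 4;
	}
	return (len / divisor) & ~(size_t)15;
}

The usual 700 and 16K clamps would still apply on top of
this, and len/divisor then grows roughly as the sqrt of the
file size, which is the shape I mentioned at the top of this
message.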

Block size seems to me more of a fragmentation issue.




-- 
________________________________________________________________
	J.W. Schultz            Pegasystems Technologies
	email address:		jw at pegasys.ws

		Remember Cernan and Schmitt

