[RFC] dynamic checksum size

Sun Mar 23 17:45:47 EST 2003

On Sun, 2003-03-23 at 07:40, jw schultz wrote:
[...]
> The two attached patches implement per-file dynamic checksum
> lengths.

Nice one.

> Here is a quick table showing the file sizes at which
> transition occurs, the sum2 length and the jump in size of
> the block sum table transmitted:
> 
> 	filesize     sum2len	transition
> 	<125MB		2
> 	 125MB		3	  8KB
> 	   1GB		4	 32KB
> 	   4GB		5	128KB
> 	  16GB		6	523KB
> 	  64GB		7	  2MB

My reading of the patch doesn't quite match with the above. The block
size heuristic is 1/10000 the filesize, limited between some miniumum
and maximum. Ignoring the limits, this means the number of blocks is
kept at a constant 10000, leaving you stuck at three bytes.

> This heuristic is a stab in the dark.  It may be that it is
> too aggressive in the rate of increase or insufficiently
> aggressive at starting the increase.  It may be that a 8
> bits of sum2 in conjunction with the 32 bit adler is enough
> for smaller files.  We may also wish to revisit the
> heuristic for block sizes that fixes the block count at 1000
> for files between 700KB and 16MB, or coordinate them so that
> the transitions would be easier.  Both heuristics can be
> changed at will without breaking compatibility.

I posted some more maths on this a while ago to the old rproxy list to
someone inquiring about how this affects librsync. In summary, a
conservative simplification of the relationship between block size,
blocksum size, and file size is;

  p = (s^2)/(B.(2^b))

where:

  p is the probablility of a false match (ie, corrupt result)
  s is the file size.
  B is the block size.
  b is the number of bits in the block checksum (blocksum size)

Note that blocksum size is the total blocksum including the rollsum. If
you plug in an "acceptable probability of failure", you get a direct
relationship between file size, block size, and blocksum size. Assuming
a 1/2^n probability of failure is OK, you get;

B = (s^2)/(2^(b-n))

b = n + 2.log2(s) - log2(B)

Note that assuming a fixed blocksum size, the block size must grow at
the square of the file size. Assuming a fixed bock size, the blocksum
must grow at the log of the filesize.

I think it makes sense to grow the block size linearly with file size
(which keeps a constant number of blocks), and then have blocksum size
grow as necessary to make up the difference by default. It might also be
good to have the ability to specify either blocksum size or block size
and have the other estimated using the above equations. Of course you
will want to round up the blocksum size to the nearest byte.

Using a heuristic that aims for 2^m number of blocks and plugging this
in to the above equations gives;

B = s/(2^m)
b = n + log2(s) + m   (Note: 32bits of this is the rollsum)

Assuming 8K number of blocks (m=13), and an acceptable error rate of
less than 1 in 1024 file transfers falling back to whole file (n=10),
you get;

file size	block size	blocksum size
<8M		1K              46 (sum2len=2)
<32M		4K		48 (sum2len=2)
<128M		16K		50 (sum2len=3)
<512M		64K		52 (sum2len=3)
<2G		256K		54 (sum2len=3)
<8G		1M		56 (sum2len=3)
<32G		4M		58 (sum2len=4)
<128G		16M		60 (sum2len=4)

As you can see from this, your heuristic is very conservative for large
files, but optimistic for smaller files. Might I suggest the following
code fragments;

#define BLOCKLEN_EXP 13 /* block_len heuristic 'm' value */
#define BLOCKSUM_EXP 10 /* blocksum size heuristic 'n' value */

if (block_size) {
  block_len = block_size;
} else {
  block_len = (len >> BLOCKLEN_EXP) & ~15; /* rough heuristic */  
  block_len = MIN(block_len, BLOCK_SIZE);
  block_len = MAX(block_len, (CHUNK_SIZE / 2));
}

size_t  b = BLOCKSUM_EXP;
size_t  l = len;
while (l >> 1) {
        b+=2;
}
size_t  c = block_len;
while (c >> 1) {
  b--;
}
sum.s2length = (b+1-32+7)/8; /* add a bit, subtract rollsum, round up */

-- 
----------------------------------------------------------------------
ABO: finger abo at minkirri.apana.org.au for more info, including pgp key
----------------------------------------------------------------------