[RFC] dynamic checksum size

Donovan Baarda abo at minkirri.apana.org.au
Mon Mar 24 00:54:26 EST 2003


On Sun, Mar 23, 2003 at 03:46:34AM -0800, jw schultz wrote:
> On Sun, Mar 23, 2003 at 05:45:47PM +1100, Donovan Baarda wrote:
> > On Sun, 2003-03-23 at 07:40, jw schultz wrote:
> > [...]
> > > The two attached patches implement per-file dynamic checksum
> > > lengths.
[...]
> > I think it makes sense to grow the block size linearly with file size
> > (which keeps a constant number of blocks), and then have blocksum size
> > grow as necessary to make up the difference by default. It might also be
> > good to have the ability to specify either blocksum size or block size
> > and have the other estimated using the above equations. Of course you
> > will want to round up the blocksum size to the nearest byte.
> 
> I'm not really inclined to add user-specified blocksum size.
> The need for such tuning would be indicative of a poor
> heuristic.  For most uses, users should not be specifying a
> block size, but there are circumstances when it may be
> advantageous, and not allowing it would be viewed as a
> shortcoming by DBAs.  Besides, we already support a
> user-specified block size and i'm not inclined to drop that.
[...]

fair enough... the only reason to allow specifying the blocksum size would be
to manually tweak the probability of success. Specifying the block size is
useful if you know the characteristics of the data being synced.

> > As you can see from this, your heuristic is very conservative for large
> > files, but optimistic for smaller files. Might I suggest the following
> > code fragments;
> 
> That is due to my error in the magnitude of the block count
> in the block size calculation.

It's more than that: you are increasing the blocksum size by a byte every
time the block count quadruples, without taking into account how the block
size and file size affect the failure probability. That will only work well
for one particular block size heuristic.
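
To restate the reasoning (roughly, glossing over constants): with a 32-bit
rolling checksum plus blocksum_size bytes of strong sum, the expected number
of false matches is about;

  block_count * file_len * 2^-(32 + 8*blocksum_size)

so holding the failure probability at 2^-n requires something like;

  blocksum_size = ceil((n + log2(block_count) + log2(file_len) - 32) / 8)

which is where the 'n' in BLOCKSUM_EXP below comes from. A heuristic keyed
off the block count alone misses the file_len term.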

> > #define BLOCKLEN_EXP 13 /* block_len heuristic 'm' value */
> > #define BLOCKSUM_EXP 10 /* blocksum size heuristic 'n' value */
> > 
> > if (block_size) {
> >   block_len = block_size;
> > } else {
> >   block_len = (len >> BLOCKLEN_EXP) & ~15; /* rough heuristic */  
> >   block_len = MIN(block_len, BLOCK_SIZE);
> >   block_len = MAX(block_len, (CHUNK_SIZE / 2));
> 
> As in the patch, the MIN and MAX macros are reversed.
[...]

Ahh yeah. I copied these from the patch on the premise that they were
required for other limitations in the implementation (what is CHUNK_SIZE?).
I would prefer it if block_len could grow indefinitely as the file size
increases.
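
Something like the following is what I mean (untested sketch; BLOCK_SIZE is
rsync's 700-byte default, the MIN/MAX bounds are the right way around, and
the upper cap is simply dropped);

  if (block_size) {
    block_len = block_size;
  } else {
    block_len = (len >> BLOCKLEN_EXP) & ~15; /* ~len/8K, multiple of 16 */
    block_len = MAX(block_len, BLOCK_SIZE);  /* lower bound only */
    /* no MIN() cap, so block_len keeps growing with the file size */
  }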

> With a modicum of fixup to get it to actually work and some
> min-max bounds, here is an approximate table what that would
> do.
[...]
>  file length   block_len   block_count  s2length    checksum array size
[...]
>      122M          15K        8198           3          56K
>      127M          15K        8193           3          56K
>     1063M          16K          66K          4         531K
>       16G          16K        1034K          5        9312K
>      261G          16K          16M          6         163M

This tail end shows why I'd like the block size to be able to grow beyond
some arbitrary limit.

> I'm not qualified to argue the formulae.  But i can't say
> i'm overly fond of the numbers coming out.
> Personally, i'd like to see it a bit more aggressive on the
> block size and perhaps a closer relationship between
> the checksum array size and the file size.

The block_size heuristic is pretty arbitrary, but the blocksum_size
calculation is not: it calculates the minimum blocksum size needed to achieve
a particular probability of success for given file and block sizes. You can
replace the block_size heuristic with anything, and the second part will
still calculate the required blocksum size.
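
As a sketch of that second part (untested; integer log2 done with shifts,
names illustrative);

  int bits, blocksum_size;
  int64 l;
  int32 c;

  /* bits = n + log2(block_count * file_len)
   *      = n + 2*log2(len) - log2(block_len) */
  bits = BLOCKSUM_EXP;            /* 'n': -log2(acceptable failure rate) */
  for (l = len; l >>= 1; )
    bits += 2;                    /* two bits per doubling of file length */
  for (c = block_len; (c >>= 1) && bits; )
    bits--;                       /* one bit less per doubling of block length */
  blocksum_size = (bits - 32 + 7) / 8;   /* rolling sum covers 32 bits; round up */
  blocksum_size = MAX(blocksum_size, 2); /* keep a sane minimum */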

> I do have a variant that scales the length divisor for block
> size from 1024 to 16538 as the length progresses from 1MB to
> 128MB.  This in conjunction with your sum2 length formula
> produces this:

That sounds pretty arbitrary to me... I'd prefer something like having it
grow with the square root of the file size: 1K for 1M, 8K for 64M, 16K for
256M, 32K for 1G, etc.
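
In integer terms that's just one bit of block_len for every two bits of file
length, e.g. (untested sketch);

  int32 c;
  int64 l;

  /* power-of-two square root: one bit of block_len per two bits of len */
  for (c = 1, l = len; l >>= 2; )
    c <<= 1;
  block_len = MAX(c, BLOCK_SIZE);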

> Block size seems to me more of a fragmentation issue.

yeah... once you adjust the blocksum size based on the file and block sizes,
the block size only needs to be tweaked to match the characteristics of the
data and to control the checksum array size.
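
For example, with the square-root heuristic a 1G file gets 32K-byte blocks
and hence 32K of them; at roughly 7 bytes each on the wire (4 bytes of
rolling sum plus about 3 bytes of sum2 from the formula above), the whole
checksum array is only around 224K.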

-- 
----------------------------------------------------------------------
ABO: finger abo at minkirri.apana.org.au for more info, including pgp key
----------------------------------------------------------------------

