[RFC] dynamic checksum size

Donovan Baarda abo at minkirri.apana.org.au
Mon Mar 24 10:56:42 EST 2003

On Mon, 2003-03-24 at 03:36, jw schultz wrote:
> On Mon, Mar 24, 2003 at 12:54:26AM +1100, Donovan Baarda wrote:
> > On Sun, Mar 23, 2003 at 03:46:34AM -0800, jw schultz wrote:
> CHUNK_SIZE is used in mapping the file into memory.  I'm
> not absolutely sure (haven't spelunked that part of the code) what
> would happen if block size were to exceed it.  I suspect it
> would be OK because i don't recall hearing anyone complain
> as a result of using the --block-size option with a huge
> value.

Did the old code limit it to 16K even after people manually set it to
something larger?

> The problem with growing the block size indefinitely is that
> you must balance the cost of the checksums with the cost of
> sending large blocks for small changes.

Yeah, but you also need to keep it in perspective; if you are syncing a
4G file, do you really care much that it transmitted a 64K block instead
of a 16K block? And that's assuming that you could actually have sent a
16K block instead.

The ideal block size to minimize data transfer depends more on the
nature of the changes than on the file size. However, given no further
info than the size of the file, I would guess it's a fair heuristic to
assume a large file will have large changes, which means larger blocks
don't lose you much.
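The square-root growth is easy to sketch. Something like the following
(the 512-byte floor and 64K ceiling are illustrative values I've picked
for the example, not anything rsync currently does) reproduces the
1K/1M, 8K/64M, 16K/256M, 32K/1G progression;

```c
#include <assert.h>

/* Sketch of the square-root heuristic: pick the smallest power of two
 * whose square covers the file length, then clamp.  The bounds here
 * are illustrative only. */
static unsigned int block_size_for(unsigned long long len)
{
	unsigned long long b = 1;

	while (b * b < len)
		b <<= 1;
	if (b < 512)		/* illustrative lower bound */
		b = 512;
	if (b > 64 * 1024)	/* 64K keeps it continuous up to 4G */
		b = 64 * 1024;
	return (unsigned int)b;
}
```

So a 1M file gets 1K blocks, a 1G file gets 32K blocks, and anything
past 4G hits the 64K ceiling.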

> How many 260GB files do you have?   On the other hand, with
> the fixed block length we see a diminishing advantage.
> We also take a hit with the hashing, match lookups and
> memory footprint.

I think this is more of a concern than the cost of sending slightly
larger blocks.

> > That sounds pretty arbitary to me... I'd prefer something like having it
> > grow at the square-root of filesize... so 1K for 1M, 8K for 64M, 16K for
> > 256M, 32K for 1G, etc.
> I most strongly lean toward a bounded square-root of file
> size.  It produces a nice smooth curve on resource
> consumption diminishing load/MB as the files grow.  I just

...until you hit the boundary. I dislike boundaries and step functions
because they break those nice smooth curves :-)

> hadn't a decent fast integer sqrt function at hand.  I don't
> really care to start bringing in the math lib just for this
> one calculation.  I've cobbled together one that is OK.

In case you didn't notice, the "count non-zero shifts" is an approximate
integer log2 function. There are integer approximations for sqrt that
you can use, including Newton-Raphson;

int isqrt(int R) {
  int x, x1;

  if (R < 2)
    return R;
  x = R; /* any starting guess >= sqrt(R) will do */
  x1 = (x + R/x) >> 1;
  while (x1 < x) { /* stop once it stops shrinking, so it can't
                      oscillate around the root forever */
    x = x1;
    x1 = (x + R/x) >> 1;
  }
  return x;
}
> Clearly we are getting into the realms of sci-fi here.  Even
> with the 16KB block length limit files in the 1-16GB range
> should be manageable on almost any system that is likely
> to have them.

If we must have an upper limit on block size, I'd prefer it to be
continuous up to the 4G mark. This means an upper limit of 64K.

> The s2length calculation is using the algorithm you provided:
> 	#define BLOCKSUM_EXP 10

Might be worth putting a comment here;

/* The following implements the function;
  blocksum_bits = BLOCKSUM_EXP + 2*log2(len) - log2(block_len)

which gives a probability of the rsync algorithm corrupting data and
falling back to whole file transfer of 1/2^BLOCKSUM_EXP */

This is something future generations might want to tweak or perhaps
grok :-)

>                 b = BLOCKSUM_EXP;
>                 l = len;
>                 while (l >>= 1) {
>                         b += 2;
>                 }
>                 c = block_len;
>                 while (c >>= 1 && b) {
>                         b--;
>                 }
>                 s2length = (b + 1 - 32 + 7) / 8;
>                 s2length = MAX(s2length, 2);

Probably want to add;

                  s2length = MIN(s2length, MD4SUM_SIZE);

just to be on the safe side :-)
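Pulling the quoted calculation and that clamp together into a
self-contained sketch (I'm assuming MD4SUM_SIZE is 16, MD4's digest
size; also note the parentheses around (c >>= 1) -- && binds tighter
than >>=, so without them the shift amount comes out wrong);

```c
#include <assert.h>

#define BLOCKSUM_EXP 10
#define MD4SUM_SIZE 16	/* assumed: MD4 digest is 16 bytes */
#define MAX(a, b) ((a) > (b) ? (a) : (b))
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Sketch of the s2length calculation being discussed, plus the
 * suggested MIN clamp. */
static int s2length_for(unsigned long len, unsigned long block_len)
{
	int b = BLOCKSUM_EXP;
	unsigned long l = len, c = block_len;
	int s2length;

	while (l >>= 1)
		b += 2;			/* b += 2*log2(len) */
	while ((c >>= 1) && b)
		b--;			/* b -= log2(block_len) */
	/* bits beyond the 32-bit rolling checksum, rounded up to bytes */
	s2length = (b + 1 - 32 + 7) / 8;
	s2length = MAX(s2length, 2);
	s2length = MIN(s2length, MD4SUM_SIZE);
	return s2length;
}
```

For example, a 1G file with 32K blocks comes out at b = 10 + 60 - 15 =
55 bits, i.e. 3 bytes of strong checksum per block.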

ABO: finger abo at minkirri.apana.org.au for more info, including pgp key