weak checksum question

Donovan Baarda abo at minkirri.apana.org.au
Tue Nov 12 12:53:01 EST 2002


On Mon, Nov 11, 2002 at 08:06:40PM -0500, Jeff Abrahamson wrote:
> The weak checksum in checksum.c (see snippet below) differs
> substantially from the one discussed in Andrew Tridgell's doctoral
> thesis on rsync and elsewhere that I've been able to find. I didn't
> find discussion of the change in the mailing list archives. Well, so
> I'm curious what the benefit of the change is.
> 
> (I'm using a rolling checksum in another app, thus the reason for this
> nitty inspection.)
> 
> Can anyone provide a pointer to help me understand why this changed?

There are quite a few variations of the rolling checksum. All that I know of
are varients of what I know of as a "Fletcher checksum". I believe the rsync
documentation describes a straight Fletcher checksum, but also shows how it
can be efficiently "rolled".

I believe rsync uses a straight Fletcher checksum. librsync adds an offset
to each byte before it is fed into the Fletcher checksum. xdelta uses a
random lookup table before each byte is fed into the Fletcher checksum.
Pysync includes an Adler32 implementation in Python, which is the same as a
Fletcher except s1 and s2 are mod'ed by the largest 16 bit prime each time.
pysync also includes a C implementation of the librsync version except
implemented using a very fast unrolled macro idea stolen from the adler32
implementation in zlib. The C implementation in pysync is probably the
fastest one currently around.

I'm unconvinced the extra effort of Adler32's "mod-prime", librsync's
"offset", or xdelta's "random lookup" actually buy you anything. I have a
feeling that they all end up with pretty much the same probability of a
"checksum clash" occuring. This would suggest that a pure fletcher checksum
is the best because it has the lowest overhead, but librsync's offset
doesn't hurt much anyway.

> In checksum.c:
> 
>     uint32 get_checksum1(char *buf1,int len)
>     {
>         int i;
>         uint32 s1, s2;
>         schar *buf = (schar *)buf1;
> 
>         s1 = s2 = 0;
>         for (i = 0; i < (len-4); i+=4) {
>             s2 += 4*(s1 + buf[i]) + 3*buf[i+1] + 2*buf[i+2] + buf[i+3] + 
>               10*CHAR_OFFSET;
>             s1 += (buf[i+0] + buf[i+1] + buf[i+2] + buf[i+3] + 4*CHAR_OFFSET); 
>         }
>         for (; i < len; i++) {
>             s1 += (buf[i]+CHAR_OFFSET); s2 += s1;
>         }
>         return (s1 & 0xffff) + (s2 << 16);
>     }

This elaborate confusion is an attempt to optimise the "fletcher with
offset" varient by processing the input buffer 4 bytes at a time.
Unfortunately, this doesn't buy you much because all those multiplies end up
costing more than an unravelled series of additions. 

I posted a patch somewhere that replaced this implementation with the
rollsum.c code taken from pysync... If you want to see that implementation
have a look at pysync on Freshmeat.

> What I understood from documentation:
> 
>     /* Rolling checksum from Tridgell's PhD thesis. */
>     int get_weak_checksum(signed char *buf, int len)
>     {
>             int s;
>             int s1, s2;
>             int len4;
> 
>             s1 = s2 = 0;
>             len4 = len / 4;
>             for(i = 0; i < len - 4; i += 4) {
>                     s = (buf[i] << 24) + (buf[i+1] << 16)
>                             + (buf[i+2] << 8) + (buf[i+3]);
>                     s1 += s;
>                     s2 += (len4 - 4 * i) * s;
>             }
>             return (s2 << 16) + (s1 & 0xffff);
>     }

I think you got the documentation wrong... although maybe I remember it
wrong :-) I thought the rsync documentation described this;

{
  s1=s2=0;
  for (i=0, i< len; i++) {
    s1+=buf[i];
    s2+=s1;
  }
  return (s2 << 16) | (s1 & 0xffff);
} 

The "fletcher with offset" varient is exactly the same, it just replaces
"s1+=buf[i]" with "s1+=buf[i]+OFFSET". The xdelta version replaces it with
"s1+=LOOKUP[buff[i]]" where LOOKUP is constant array with 256 random integers. 

-- 
----------------------------------------------------------------------
ABO: finger abo at minkirri.apana.org.au for more info, including pgp key
----------------------------------------------------------------------



More information about the rsync mailing list