MD4 checksum_seed

Donovan Baarda abo at minkirri.apana.org.au
Tue Mar 16 22:07:40 GMT 2004


G'day,

From: "Eran Tromer" <rsync2eran at tromer.org>
> Donovan Baarda wrote:
[...]
> > I thought that without using some sort of known vulnerability the
> > formula would be;
> >   avg number of random attempts = number of possible sum values / 2
>
> Uhm, no -- just 2^(64/2)=2^32 random blocks, by the Birthday paradox.
> Roughly speaking, after looking at N blocks you have N*(N-1)/2 unordered
> pairs of observed blocks, and each pair gives a collision with
> probability 2^-64, so you need N ~= 2^32.
[...]

Doh! Of course... it was exactly the same "paradox" that showed how high the
probability of a collision was in rsync for large files. You are not
searching for a collision with a particular block, just between any two
blocks.
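
To put a number on that (a back-of-the-envelope sketch, nothing to do
with the rsync sources): with N random blocks and a 64-bit sum,
P(collision) ~= 1 - exp(-N*(N-1)/2 * 2^-64), which for N = 2^32 comes
out just under 0.4:

    /* Birthday bound for a 64-bit checksum: with N random blocks
     * there are ~N*(N-1)/2 unordered pairs, each colliding with
     * probability 2^-64.
     */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double n = ldexp(1.0, 32);            /* N = 2^32 blocks */
        double pairs = n * (n - 1.0) / 2.0;   /* ~2^63 pairs */
        double p = 1.0 - exp(-pairs * ldexp(1.0, -64));
        printf("P(collision) ~= %.2f\n", p);  /* prints ~0.39 */
        return 0;
    }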

> > I have seen several variants for the rollsum;
> >
> > rsync: uses a basic fletcher.
>
> Except that strictly speaking, Fletcher uses one's-complement arithmetic.
[...]

Ahh, someone who actually knows :-) I wasn't sure of the exact definitions
and names.
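
For anyone reading along, the core of the rsync-style sum is roughly
the following (a from-memory sketch, not the actual rsync source):

    /* Fletcher-style weak checksum, roughly as rsync computes it. */
    #include <stdint.h>
    #include <stddef.h>

    uint32_t weak_sum(const unsigned char *buf, size_t len)
    {
        uint32_t s1 = 0, s2 = 0;
        size_t i;

        for (i = 0; i < len; i++) {
            s1 += buf[i];   /* running sum of bytes */
            s2 += s1;       /* sum of the running sums */
        }
        return (s1 & 0xffff) | ((s2 & 0xffff) << 16);
    }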

> Quoth RFC 1950:
> "That 65521 is prime is important to avoid a possible large class of
> two-byte errors that leave the check unchanged."
> Not much of a difference in adversarial setting, though.

And not much of a difference for rsync in general, I suspect... I wonder
just how large the "possible large class" actually is?
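
For comparison, the Adler-32 loop from RFC 1950 reduces modulo that
prime instead of masking to 16 bits; in its simplest un-unrolled form
it's something like this (zlib's real adler32.c defers the modulo for
speed):

    /* Simplified Adler-32 as per RFC 1950. */
    #include <stdint.h>
    #include <stddef.h>

    #define ADLER_MOD 65521u   /* largest prime below 2^16 */

    uint32_t adler32_simple(const unsigned char *buf, size_t len)
    {
        uint32_t s1 = 1, s2 = 0;
        size_t i;

        for (i = 0; i < len; i++) {
            s1 = (s1 + buf[i]) % ADLER_MOD;
            s2 = (s2 + s1) % ADLER_MOD;
        }
        return (s2 << 16) | s1;
    }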

> > librsync's offset probably does nothing, but I have a feeling it might
> > help a bit for blocksizes less than 256 bytes, because it at least helps
> > use more of the checksum domain for the "s1" part. Without it you don't
> > end up using the high bits of s1 at all for small blocks.
>
> It doesn't matter. You just shift the small range of possible values
> -- instead of using a small range of small numbers you'll use a small
> range of large numbers.
[...]

Yeah... that makes sense...
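
For the record, the offset is just a constant added per byte (31 in
librsync, if I remember the constant right), so for an L-byte block it
only shifts s1 by a fixed 31*L:

    /* librsync-style per-byte update with the character offset.
     * Adding a constant per byte shifts the range of s1 values
     * without enlarging it.
     */
    #include <stdint.h>

    #define CHAR_OFFSET 31

    static void roll_in(uint32_t *s1, uint32_t *s2, unsigned char c)
    {
        *s1 += c + CHAR_OFFSET;
        *s2 += *s1;
    }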

> > It might be possible to craft a mapping like xdelta's but using a
> > special sequence calculated for best effect, much like the polynomials
> > in crc32 and crc64.
>
> Possibly, for "natural" or random data (as opposed to adversarial). But
> in such contexts, random choice is often close to optimal with high
> probability.

I'll take your word for it :-). Actually I think the polynomials in CRCs
are crafted for common line errors. I'm not sure you can do much "crafting"
without making certain assumptions about the types of errors, and in our
case I don't think we can make any assumptions about the types of "errors".
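
As I understand it, the xdelta-style mapping amounts to pushing each
byte through a fixed 256-entry table before summing, something like
the sketch below. The table fill here is an arbitrary LCG purely for
illustration; xdelta ships its own constants.

    /* Byte-mapped rolling sum in the style of xdelta. */
    #include <stdint.h>
    #include <stddef.h>

    static uint16_t byte_map[256];

    static void init_byte_map(void)   /* call once before use */
    {
        uint32_t x = 0x12345678u;     /* arbitrary seed */
        int i;

        for (i = 0; i < 256; i++) {
            x = x * 1664525u + 1013904223u;   /* simple LCG step */
            byte_map[i] = (uint16_t)(x >> 16);
        }
    }

    uint32_t mapped_sum(const unsigned char *buf, size_t len)
    {
        uint32_t s1 = 0, s2 = 0;
        size_t i;

        for (i = 0; i < len; i++) {
            s1 += byte_map[buf[i]];   /* mapped byte, not raw byte */
            s2 += s1;
        }
        return (s1 & 0xffff) | ((s2 & 0xffff) << 16);
    }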

> > Have you looked at the librsync rollsum.c implementation in CVS? It
> > replaced rs_calc_weak_sum and should be a fair bit faster.
>
> Uhm. Usually loop unrolling is best left for the compiler, and it's
> counterproductive with some modern processors. Is rollsum.c faster than
> the old implementation with -O3?

I can't remember... there were other cleanups associated with the change to
rollsum that may also have had an impact.

The manual unrolling was taken straight from the adler32 implementation in
zlib. I come from an embedded-systems background where the C compilers are
often pretty basic, and manual unrolling makes a big difference.
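
The zlib trick, for reference, is a chain of DOn macros that handle 16
bytes per loop iteration. A sketch along those lines (zlib's adler32.c
also has to interleave the mod-65521 reductions, omitted here):

    #include <stdint.h>
    #include <stddef.h>

    #define DO1(buf, i)  { s1 += (buf)[i]; s2 += s1; }
    #define DO2(buf, i)  DO1(buf, i) DO1(buf, (i) + 1)
    #define DO4(buf, i)  DO2(buf, i) DO2(buf, (i) + 2)
    #define DO8(buf, i)  DO4(buf, i) DO4(buf, (i) + 4)
    #define DO16(buf)    DO8(buf, 0) DO8(buf, 8)

    uint32_t weak_sum_unrolled(const unsigned char *buf, size_t len)
    {
        uint32_t s1 = 0, s2 = 0;

        while (len >= 16) {   /* bulk of the input, 16 at a time */
            DO16(buf);
            buf += 16;
            len -= 16;
        }
        while (len--) {       /* leftover tail bytes */
            s1 += *buf++;
            s2 += s1;
        }
        return (s1 & 0xffff) | ((s2 & 0xffff) << 16);
    }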

I've been thinking of putting together tests for librsync using the "check"
framework so that we can easily test and time things like this.
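
Something along these lines, say, testing the weak_sum sketch from
earlier in this mail (a hypothetical test; the names are mine):

    #include <check.h>
    #include <stdint.h>
    #include <stddef.h>
    #include <stdlib.h>

    uint32_t weak_sum(const unsigned char *buf, size_t len);

    START_TEST(test_weak_sum_abc)
    {
        const unsigned char abc[3] = { 'a', 'b', 'c' };

        /* s1 = 97+98+99 = 294, s2 = 97+195+294 = 586 */
        fail_unless(weak_sum(abc, 3) == (294u | (586u << 16)),
                    "weak_sum(\"abc\") gave the wrong value");
    }
    END_TEST

    int main(void)
    {
        Suite *s = suite_create("rollsum");
        TCase *tc = tcase_create("core");
        SRunner *sr;
        int failed;

        tcase_add_test(tc, test_weak_sum_abc);
        suite_add_tcase(s, tc);
        sr = srunner_create(s);
        srunner_run_all(sr, CK_NORMAL);
        failed = srunner_ntests_failed(sr);
        srunner_free(sr);
        return failed ? EXIT_FAILURE : EXIT_SUCCESS;
    }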

----------------------------------------------------------------
Donovan Baarda                http://minkirri.apana.org.au/~abo/
----------------------------------------------------------------


