MD4 checksum_seed

Thu Mar 18 02:46:31 GMT 2004

G'day,

cc'ing to the librsync-dev list as this is pretty relevant there.

On Thu, 2004-03-18 at 04:41, Eran Tromer wrote:
> Hi,
[...]
> >>>Have you looked at the librsync rollsum.c implementation in CVS? It
> >>>replaced rs_calc_weak_sum and should be a fair bit faster.
> 
> Here are run times for 70000000 calculations over the same all-zero block.
> "checksum.c" is librsync's current code (which differs from rsync's only
> in the chars are signed).
> "rollsum.c" is librsync's rollsumc.c from CVS.
> "XDELTA" is checksum.c changed to add char-to-short lookup table
> (I blindly converted every occurance of "buf[foo]" into
> "lookup_table[buf[foo]]"; the lookup table is xdelta's).
> Numbers in seconds; tested on a Pentium Xeon 1700MHz, gcc 3.2.3.
> 
> checksum.c -O2 CHAR_OFFSET=0:  273.42
> checksum.c -O2 CHAR_OFFSET=31: 300.63
> checksum.c -O3 CHAR_OFFSET=0:   31.42
> checksum.c -O3 CHAR_OFFSET=31:  35.32
> rollsum.c -O2 CHAR_OFFSET=0:    99.47
> rollsum.c -O2 CHAR_OFFSET=31:   99.22
> rollsum.c -O3 CHAR_OFFSET=0:   100.18
> rollsum.c -O3 CHAR_OFFSET=31:   99.77
> XDELTA -O2:                    386.07
> XDELTA -O3:                     32.17

Why would rollsum.c be slightly _slower_ for CHAR_OFFSET=0? 

> If you use the trivial one-char-at-a-time Fletcher algorithm (e.g.,
> disable the first loop in lib/rsync's code and remember to initialize
> i=0) you get the following:
> 
> checksum.c -O2 CHAR_OFFSET=0:  259.81
> checksum.c -O2 CHAR_OFFSET=31: 259.41
> checksum.c -O3 CHAR_OFFSET=0:  134.06
> checksum.c -O3 CHAR_OFFSET=31: 137.44
> XDELTA -O2:                    300.28
> XDELTA -O3:                    135.31
> 
> Some things to observe:
> * The fastest variant is checksum.c with -O3.
> * The extra cost of xdelta-type lookups is acceptable for -O2 and
>   virtually nonexistant for -O3.

The optimiser must something pretty amazing to minimise the table
lookups at compile-time; pretty impressive.

The next question is; is the table lookup benefits enough to justify the
extra complexity... given there is negligible performance hit?

I don't think the lookups actually improve the probability of avoiding
random collisions by much, and although they make it slightly more
complicated, they don't make it much more robust against intentional
attacks either. The only true protection against intentional attacks is
randomising the seed. So do we really benefit much?

> * Compared to checksum.c, rollsum.c.c is thrice faster
>   when using -O2 and 3 thrice *slower* when using -O3.
> * With the current default -O2, checksum.c's four-chars-at-a-time
>   code is worse than the trivial code.
> * Nonzero CHAR_OFFSET has a small but measurable cost.

The strange thing about this is -O3 seems to prefer partial unrolling...
the unroll 4 at a time is 3x faster than unroll all, but is 4x slower
than unroll none.

I wonder if -O3 has similar levels of impact on other platforms? Based
on these results alone, I'm tempted to remove the unrolling from
rollsum.c

The other difference between rollsum.c and checksum.c is rollsum.c uses
while loops and pointer incrementing, whereas checksum.c uses for loops
and array indexing. I wonder it -O3 significantly prefers one over the
other? I suspect this would be very platform dependent.

> BTW, in both rsync and librsync's checksum.c, the loop
>     for (i = 0; i < (len-4); i+=4) {
> has an off-by-one limit. It should be "i <= (len-4)".

Hmm, what kind of impact would that have? 

I just looked and it should not be a problem, because it only means the
last 4 bytes get handled one at a time, the result should still be
correct.

-- 
Donovan Baarda <abo at minkirri.apana.org.au>
http://minkirri.apana.org.au/~abo/