Rsyncing really large files

Wayne Davison wayned at samba.org
Sat Mar 5 16:55:14 GMT 2005


On Sat, Mar 05, 2005 at 02:18:01PM +0200, Shachar Shemesh wrote:
> That's not what I read into it. It seems to me that the checksum 
> function gives a 32bit result, and we are squashing that into a 16bit 
> hash table. Can you point me to the code? Wayne?

Look in match.c.  The hash table is "tag_table", and it's built in the
function "build_hash_table()".  It uses the macro gettag() to transform
the weak checksum into a 16-bit index into the hash table by adding the
two halves of the 32-bit checksum together.

> >However, if you choose to start with the 32 bit rolling hash and mod 
> >that, you will have problems.  The rolling checksum has two distinct 
> >parts, and modding will only pull info from the low order bits,
> 
> Why? This may be something I missed within the code.

He's talking about potentially losing an even distribution of values if
the lowest order bits aren't random enough.  I think we're all only
guessing that it might cluster too much if you just take the value%N
(at least I haven't tried to look at this aspect of the checksum).

> If the source is 16 bit, doing any hash table size bigger than 65536 
> buckets would make no sense, true. Is it 16bit?

The source value is 32 bits.  The "two parts" you ask about refers to
the gettag() macro's current algorithm of how it transforms the 32-bit
checksum into a 16-bit hash index.

..wayne..


More information about the rsync mailing list