Dynamic hash table size (with static has load)

Wayne Davison wayned at samba.org
Sun Feb 26 00:53:44 GMT 2006


On Sat, Feb 25, 2006 at 08:42:37PM +0200, Shachar Shemesh wrote:
> I disagree.

I think you may have meant to say was, "Utterly, totally, and in all
meaningful ways false." :-)  In my hasty read-through of your patch I
was obviously only thinking about powers of 2, so my second critique
completely missed the mark.

In looking at your revised patch, I was thinking that we could change
the way the hash data is layed out and do away with the tag_table array
completely.  The following patch reorganizes the data to have a hash
array that just saves off the indexes into the sum array, and uses a
new "chain" value in the sum_buf structure itself to link collisions
together.  This not only saves memory (even compared to the original),
but does away with the extra comparing of tag values (which will always
be the same when we're in a particular chain of collisions).

    http://rsync.samba.org/ftp/unpacked/rsync/patches/dynamic_hash.diff

One thing this patch does is to (1) leave the array allocated to its
largest size, (2) use realloc() if we need to make it bigger, (3) make
the minimum hash-table size 65537 (a prime).  Some of these decisions
are debatable:

The 1st item makes us more efficient in our malloc calls when sending
large files, but could waste sender-side memory when transferring a
single large file in the middle of a bunch of normal-sized files.

The 3rd item might be a bit over the top, but we used to always allocate
a tag array of 65536 elements, and since I noticed some hash collisions
occurred in small files using a hash-table size of 11 items, I figured
it would be an acceptable overhead to make normal-sized files much more
likely to have no collisions at all.

Comments welcomed.  Thanks again for your patch!

..wayne..


More information about the rsync mailing list