Dynamic hash table size (with static has load)

Shachar Shemesh rsync at shemesh.biz
Sat Feb 25 18:42:37 GMT 2006


Wayne Davison wrote:

>Thanks for the patch!  Here's some comments:
>
> - You didn't change the size of the "tag" typedef (an unsigned short),
>   and your patch makes the value potentially overflow.
>  
>
Gotcha. I'm sending an amended patch.

> - For smaller hash-table sizes, your algorithm does a lookup in the
>   table based only on the s1 value (due to the (s2 << 16) value being
>   too large to have any remainder less than the tablesize).
>
>So, I think this probably needs to leave gettag() calling gettag2(), and
>change gettag2() to factor both s1 and s2 into some kind of an improved
>tag-generating computation.
>  
>
I disagree.

Let's begin with an example. Suppose that we only have 7 hashes
(s->count=7). 7/8=0. 0*10=0. 0+11=11. Our hash table size is 11, which
is the absolute minimum it will ever get.

Now let's suppose all 7 hashes have "1234" as their lower hash value,
and have the number 1000 through 1006 as their high value. They will be
filed to:
1000:1234 -> 4
1001:1234 -> 2
1002:1234 -> 0
1003:1234 -> 9
1004:1234 -> 7
1005:1234 -> 5
1006:1234 -> 3

Obviously, the higher checksum DID get a chance to affect the cell we
land in.

For the more general case, our function (s1+s2*65536)%ts (where ts is
the table size). Modern algebra dictates that this is the same as saying
(s1%ts + (s2%ts) * (65536%ts))%ts. In other words, you can first mod
each element individually, and only then do the actual addition and
subtraction. It's easy to see that s2 will not get nullified ever,
unless 65536%ts is zero. As 65536 is 2^16, and as ts is guarenteed to be
odd, this is impossible.

Venturing deeper into modern algebra, we know it is theoretically
possible that s2 will have some affect on the hash cell chosen, but will
not be able to choose any cell at all. This can be seen in the case of
(s1+s2*15)%9. If s1 is, say, 3, the different s2 values can select cells
3, 6 and 0. This will happen if and only if the factor (15) and the
modulo (9) have a greatest common divisor (gcd - open office calc
actually has a function of that name) which is larger than 1 (3, in this
case). In jargon, we will say that two number that have a gcd of 1 are
coprime.

Since ts is always odd (we multiply a number by 10 and then add 11), it
will always be coprime to 65536 (which is only divided by even numbers).
This means that s2 has as much a chance to select the hash cell we end
up in as s1. I don't think it is necessary to change that aspect of the
code.

I did change the comment in the patch to summarize this point.

>..wayne..
>  
>
       Shachar

-------------- next part --------------
A non-text attachment was scrubbed...
Name: dynamic_hash.patch
Type: text/x-patch
Size: 1502 bytes
Desc: not available
Url : http://lists.samba.org/archive/rsync/attachments/20060225/6d5cb385/dynamic_hash.bin


More information about the rsync mailing list