Dynamic hash table size (with static hash load)

Shachar Shemesh rsync at shemesh.biz
Mon Feb 27 10:56:34 GMT 2006


John Van Essen wrote:

>Inasmuch as I can follow some of the simpler optimizations, I'm at a
>loss as to what is being so dramatically improved for large files.
>
>Can you write up the little piece that you would add to the NEWS file
>and describe (in layman's terms is possible) what the benefit is and
>under what conditions an improvement will be observed by the user?
>'Cause otherwise I wouldn't know what is it that is being tested.
>
>Thanks.
>
>    John
>  
>
Being the originator of this patch, I'll pick it up:

Rsync uses per-block hashes (checksums) in order to find blocks in the
destination that already match blocks in the source. This is the heart
of how rsync manages to speed up transfers.
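
(To make that a bit more concrete, here is a simplified sketch of the
kind of per-block checksum involved. It illustrates the idea only; it
is not rsync's actual code.)

    #include <stdint.h>
    #include <stddef.h>

    /* Simplified weak checksum, in the spirit of what rsync computes
     * per block. Illustration only; the real code differs. The point
     * is that the sender looks a value like this up in a hash table
     * for (nearly) every byte offset of the source file, so that
     * lookup has to stay cheap. */
    static uint32_t weak_checksum(const unsigned char *buf, size_t len)
    {
        uint32_t s1 = 0, s2 = 0;
        for (size_t i = 0; i < len; i++) {
            s1 += buf[i];
            s2 += s1;
        }
        return (s1 & 0xffff) | (s2 << 16);
    }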


Now here's the confusing part. These hashes are stored in a hash table
(which hashes the hashes). A hash table is a data structure that more
or less guarantees constant-time access to any member, as long as a
certain load threshold (called Alpha, the ratio of entries to buckets)
is not exceeded.
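
A rough picture of such a table in C, just to fix the terms
(illustration only, not rsync's actual structures):

    #include <stdint.h>
    #include <stddef.h>

    struct entry {
        uint32_t      weak;   /* weak checksum of a block        */
        int           block;  /* index of that block             */
        struct entry *next;   /* chain of entries in this bucket */
    };

    struct table {
        struct entry **buckets;
        size_t         nbuckets;
        size_t         nentries;
    };

    /* "Alpha": entries per bucket, i.e. the average chain length. */
    static double load_factor(const struct table *t)
    {
        return (double)t->nentries / (double)t->nbuckets;
    }

    /* A lookup walks one chain. While Alpha is small the chain is a
     * handful of entries; once nentries dwarfs nbuckets, every chain
     * (and therefore every lookup) grows linearly with file size.  */
    static struct entry *find(const struct table *t, uint32_t weak)
    {
        struct entry *e = t->buckets[weak % t->nbuckets];
        while (e != NULL && e->weak != weak)
            e = e->next;
        return e;
    }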


Up until now, this hash table had a fixed size of 65536 buckets.
Assuming you did not override the block size, the Alpha threshold was
exceeded for files of roughly 2.5GB and up, and rsync began spending
time searching through its own data structure for the correct hash
instead of simply finding it. The bigger the file, the more time spent
on this unnecessary work. As people (Isk, in this case) are talking
about transferring over 500GB of files, a potentially large amount of
CPU time is wasted on this.
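
Some back-of-the-envelope numbers (these assume a block size of
roughly the square root of the file length, which recent rsync
versions pick for large files when --block-size is not given; treat
the figures as estimates, not measurements):

    #include <math.h>
    #include <stdio.h>

    /* Estimate how crowded a fixed 65536-bucket table gets, assuming
     * block size ~= sqrt(file length), so block count ~= sqrt(length).
     * Rough illustration only; exact figures vary by rsync version.   */
    int main(void)
    {
        const double sizes_gb[] = { 2.5, 10.0, 100.0, 500.0 };
        const double buckets    = 65536.0;

        for (int i = 0; i < 4; i++) {
            double bytes  = sizes_gb[i] * 1e9;
            double blocks = sqrt(bytes);
            printf("%6.1f GB -> ~%7.0f blocks, ~%5.1f entries per bucket\n",
                   sizes_gb[i], blocks, blocks / buckets);
        }
        return 0;
    }

So around 2.5GB the table is already near its design load, and in the
hundreds-of-gigabytes range every lookup has to walk a chain of ten or
more entries, once for nearly every byte of the source file.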


The new patch uses a dynamic number of buckets that depends on the
number of blocks in the file, so the hash load stays constant. We thus
hope to eliminate this bottleneck completely.
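
For the curious, the idea boils down to something like this (a minimal
sketch of the approach, not the code from the actual patch):

    #include <stdint.h>

    /* Size the table from the number of blocks in the file, so the
     * average chain length stays roughly constant regardless of file
     * size. Rounding up to a power of two keeps the bucket computation
     * cheap. Illustration only; the patch may choose the size
     * differently. */
    static uint32_t choose_bucket_count(uint32_t nblocks)
    {
        uint32_t buckets = 65536;          /* old fixed size as a floor */

        while (buckets < nblocks && buckets < (1u << 31))
            buckets <<= 1;
        return buckets;
    }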


          Shachar


