Dynamic hash table size (with static has load)

Shachar Shemesh rsync at shemesh.biz
Sun Feb 26 08:34:25 GMT 2006


Wayne Davison wrote:

>    http://rsync.samba.org/ftp/unpacked/rsync/patches/dynamic_hash.diff
>  
>
A line of credit would have been nice :-)

>One thing this patch does is to (1) leave the array allocated to its
>largest size, (2) use realloc() if we need to make it bigger, (3) make
>the minimum hash-table size 65537 (a prime).  Some of these decisions
>are debatable:
>
>The 1st item makes us more efficient in our malloc calls when sending
>large files, but could waste sender-side memory when transferring a
>single large file in the middle of a bunch of normal-sized files.
>  
>
With a minimal struct size of 65537, I doubt we would have too many
reallocations. After all, a file needs to be over about 2.5 GB for us to
need a larger array. Such files are rare enough, and the time it takes
to allocate the array is fairly insignificant compared to their total
handling time, that I think we can save the memory when handling smaller
files.

As for point 2 - isn't "realloc" potentially less efficient than just
"malloc" if we intend to erase the array's content anyways?

>The 3rd item might be a bit over the top, but we used to always allocate
>a tag array of 65536 elements, and since I noticed some hash collisions
>occurred in small files using a hash-table size of 11 items, I figured
>it would be an acceptable overhead to make normal-sized files much more
>likely to have no collisions at all.
>  
>
Personally, I agree that a minimum is a good idea.

>Comments welcomed.  Thanks again for your patch!
>..wayne..
>  
>
Ok. Here are a few comments:

1. I guess it's a matter of taste, but when you want to make sure a type
has enough states to keep count of the number of elements in an array, I
prefer using "size_t" to "int32". It's more upwards compatible.

2. In "sum_buf", sum1 is defined to be unsigned. It seems dangerous to
me to hash it into a signed index, even if it's almost guarenteed to be ok.


I'm attaching my proposed patch, incorporating all above comments. I
also did some style suggestions (use sizeof(sum_table[0]) instead of
sizeof(int32), initialize the chain element to -1 instead of 0, as
that's the "null" value).


Shachar

-------------- next part --------------
A non-text attachment was scrubbed...
Name: dynamic_hash.patch
Type: text/x-patch
Size: 5551 bytes
Desc: not available
Url : http://lists.samba.org/archive/rsync/attachments/20060226/08c0139b/dynamic_hash.bin


More information about the rsync mailing list