Rsyncing really large files

Shachar Shemesh rsync at
Thu Mar 3 08:18:01 GMT 2005

Kevin Day wrote:

> Shachar-
> True enough - with one additional thought - if the block size is set 
> to be the square root of the file size, then the load factor on the 
> hash table becomes dynamic in and of itself (bigger block size = fewer 
> master table entries = fewer hash collisions).

And I'm suggesting making the load factor static, by adjusting the hash 
table's size according to the number of blocks. Just do 
"hashtablesize=(numblocks/8+1)*10;" and you should be set.

> In the case of relatively low bandwidth connections, you will get MUCH 
> better performance by messing with the block size than with the size of 
> the hash table, because the hash table isn't sent over the wire - the 
> block table IS sent over the wire, so reducing its size can have a big 
> impact on performance if your file isn't changing much.

True, but irrelevant. The hash table performance does not come at the 
cost of extra bandwidth, so I see no reason not to optimize both.

> In Andrew's original thesis, he looked at several very large


> The problem you face, of course, is


Kevin, I think you are confusing a couple of things:
1. It's not me with the big files. It's Lars. I can't run tests on files 
I don't have. I am merely trying to figure out what stopped Lars so that 
rsync can be better.
2. The size of each block has nothing to do with the question of hash 
table size. Once you've chosen the number of blocks your file will have, 
in whatever way you did, there is an unrelated question of what hash 
table size you should use. Using a 65536-bucket hash table on a 500GB 
file divided into 64KB blocks (as Lars is using) means you have, on 
average, 125 entries per bucket. Regardless of the question of whether 
this block size is smart or not, rsync could handle it better. That's 
what I'm talking about.
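
For the record, the arithmetic behind that figure (using the sizes 
above):

    /* Back-of-envelope check of the average chain length. */
    #include <stdio.h>

    int main(void)
    {
        unsigned long long file_size  = 500ULL << 30;   /* 500 GiB */
        unsigned long long block_size = 64ULL << 10;    /* 64 KiB blocks */
        unsigned long long buckets    = 65536;          /* fixed table size */

        unsigned long long blocks = file_size / block_size;  /* 8,192,000 */
        printf("blocks: %llu, entries per bucket: %llu\n",
               blocks, blocks / buckets);                /* prints 125 */
        return 0;
    }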

> Trying to increase the size of the hash table may just not be worth it 
> - are you certain that the performance hit you are experiencing

I'm not. Lars is.

> is caused by processing on the recipient side, and not data transfer 
> of the block table?  In my testing (which is actually with my own 
> implementation of the algorithm, so I may have optimizations or lack 
> thereof compared to the rsync you are running), the block table 
> transfer is the biggest cause of elapsed time for big files that don't 
> change much.

Don't forget that Lars manages to transfer the whole file in 17 hours. I 
doubt that transferring a few bytes of information about each block costs 
more than transferring the 64KB block itself (which is Lars' block size).
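
A rough estimate, assuming something like 20 bytes of checksum data per 
block (a 4-byte rolling checksum plus a 16-byte strong checksum; the 
exact wire format may differ, but the order of magnitude holds):

    /* Rough size of the block table for Lars' 500 GiB / 64 KiB case. */
    #include <stdio.h>

    int main(void)
    {
        unsigned long long blocks = (500ULL << 30) / (64ULL << 10); /* 8,192,000 */
        unsigned long long table_bytes = blocks * (4 + 16);  /* ~20 B per block */
        printf("block table: ~%llu MiB, file: 500 GiB\n",
               table_bytes >> 20);            /* prints ~156 MiB */
        return 0;
    }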

> It may be that the square root of file size for block size isn't 
> appropriate for files as big as you are working with...

It certainly is, since I don't work with such large files at all - Lars 
does. While we're at it, he is not using the square-root heuristic; he is 
using 64KB blocks.
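
Just to put numbers on the difference: for a 500GB file, the square-root 
heuristic would suggest a block more than ten times larger than what Lars 
actually uses.

    /* Square-root heuristic vs. Lars' actual 64 KiB block size. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double file_size = 500.0 * 1024 * 1024 * 1024;   /* 500 GiB */
        printf("sqrt block size: ~%.0f KiB, actual: 64 KiB\n",
               sqrt(file_size) / 1024);                  /* ~716 KiB */
        return 0;
    }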


Kevin, I'm trying to make rsync better. Lars' problem is an opportunity 
to find a potential bottleneck. Trying to solve his use of possibly 
non-optimal values won't help rsync, though it will help him. Let's keep 
this part of the thread on its subject - is the hash table size optimal? 
I don't see how modifying the block size is going to change that 
question.


Shachar Shemesh
Lingnu Open Source Consulting ltd.
Have you backed up today's work?
