Rsyncing really large files
Shachar Shemesh
rsync at shemesh.biz
Thu Mar 3 08:18:01 GMT 2005
Kevin Day wrote:
> Shachar-
>
> True enough - with one additional thought - if the block size is set
> to be the square root of the file size, then the load factor on the
> hash table becomes dynamic in and of itself (bigger block size = fewer
> master table entries = fewer hash collisions).
And I'm suggesting making it static, by adjusting the hash table's size
according to the number of blocks. Just do
"hashtablesize=(numblocks/8+1)*10;", and you should be set.
> In the case of relatively low bandwidth connections, you will get MUCH
> better performance improvement by messing with the block size than the
> size of the hash table, because the hash table isn't sent over the wire -
> the block table IS sent over the wire, so reducing its size can have
> a big impact on performance if your file isn't changing much.
True, but irrelevant. The hash table performance does not come at the
cost of extra bandwidth, so I see no reason not to optimize both.
> In Andrew's original thesis, he looked at several very large
...
> The problem you face, of course, is
...
Kevin, I think you are confusing a couple of things:
1. It's not me with the big files. It's Lars. I can't run tests on files
I don't have. I am merely trying to figure out what stopped Lars so that
rsync can be better.
2. The size of each block has nothing to do with the question of hash
table size. Once you've chosen the number of blocks your file will have,
in whatever way you did, there is a separate question of what hash table
size to use. Using a 65536-bucket hash table on a 500GB file divided
into 64KB blocks (as Lars is doing) means an average of about 125
entries per bucket. Regardless of whether that block size is a smart
choice, rsync could handle it better. That's what I'm talking about.
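To spell the arithmetic out (my own back-of-the-envelope program,
assuming exactly 500GB and 64KB):

  #include <stdio.h>

  int main(void)
  {
      unsigned long long file_size  = 500ULL << 30;  /* 500 GB */
      unsigned long long block_size = 64ULL << 10;   /* 64 KB  */
      unsigned long long numblocks  = file_size / block_size;
      unsigned long long fixed_buckets   = 65536;
      unsigned long long dynamic_buckets = (numblocks / 8 + 1) * 10;

      printf("blocks: %llu\n", numblocks);            /* 8192000 */
      printf("avg chain, fixed 65536-bucket table: %.1f\n",
             (double)numblocks / fixed_buckets);      /* ~125    */
      printf("avg chain, dynamically sized table:  %.1f\n",
             (double)numblocks / dynamic_buckets);    /* ~0.8    */
      return 0;
  }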
> Trying to increase the size of the hash table may just not be worth it
> - are you certain that the performance hit you are experiencing
I'm not. Lars is.
> is caused by processing on the recipient side, and not data transfer
> of the block table? In my testing (which is actually with my own
> implementation of the algorithm, so I may have optimizations, or lack
> thereof, compared to the rsync you are running), the block table
> transfer is the biggest cause of elapsed time for big files that don't
> change much.
Don't forget that Lars manages to transfer the whole file in 17 hours. I
doubt that transferring a little information about each block costs more
than transferring the 64KB block itself (which is the block size Lars is
using).
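To put rough numbers on it (my estimate - a few tens of bytes of
checksum data per block, which is the right order of magnitude rather
than the exact wire format):

  ~8,000,000 blocks * ~20 bytes  =  ~160MB of block table
  vs. 500GB of file data         =  well under 0.1% of a full transfer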
> It may be that the square root of file size for block size isn't
> appropriate for files as big as you are working with...
It certainly is, as I don't work with such large files at all; Lars
does. While we're at it, he is not using the square root; he is using
64KB blocks.
>
Kevin, I'm trying to make rsync better. Lars' problem is an opportunity
to find a potential bottleneck. Trying to work around his use of possibly
non-optimal values won't help rsync, though it will help him. Let's keep
this part of the thread on its subject: is the hash table size optimal?
I don't see how changing the block size affects that question.
Shachar
--
Shachar Shemesh
Lingnu Open Source Consulting ltd.
Have you backed up today's work? http://www.lingnu.com/backup.html