Block size optimization - let rsync find the optimal blocksize by itself.

Donovan Baarda abo at minkirri.apana.org.au
Sun Jun 30 16:20:02 EST 2002


On Sun, Jun 30, 2002 at 06:23:10PM +0200, Olivier Lachambre wrote:
[...]
>   Well, the first comment: during my work I wanted to verify that the
> theoretical optimal block size sqrt(24*n/Q) given by Andrew Tridgell in his
> PhD thesis was actually the right one. When doing tests on randomly
> generated & modified files, I found that sqrt(78*n/Q) is the actual optimal
> block size. I tried to understand this by reading the whole thesis and then
> quite a lot of rsync documentation, but I just can't figure out why the
> theoretical & experimental optimal block sizes differ so much. I _really_
> don't think it's coming from my tests; there must be something else. Maybe
> the rsync developers have changed some part of the algorithm. Also, even
> without using data compression during the sync, rsync is always more
> efficient than it should be theoretically, actually between 1.5 and 2 times
> more efficient. Nobody will complain about that, but I'd be happy if
> someone would be nice enough to explain this to me.
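
For reference, the two expressions quoted above differ only in the constant
under the square root, so they always disagree by the same fixed factor. A
quick sketch (my reading of the quoted formulas; treating n as the file size
in bytes and Q as the number of differences is an assumption on my part):

    import math

    def block_size(n, q, k):
        """Optimal block size of the form sqrt(k * n / Q)."""
        return math.sqrt(k * n / q)

    n = 1000000   # example file size in bytes (arbitrary)
    q = 100       # example number of differences (arbitrary)

    theoretical = block_size(n, q, 24)   # sqrt(24*n/Q) from the thesis
    empirical = block_size(n, q, 78)     # sqrt(78*n/Q) from the experiments

    print(theoretical, empirical, empirical / theoretical)
    # the ratio is sqrt(78/24), about 1.8, independent of n and Q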

I believe that the compression option turns on compression of transferred
data, including file lists, instruction streams, etc. Even with compression
turned off, the miss data in the delta is still compressed.

Another thing to be aware of is that when rsync compresses miss data, it also
compresses all the hit data to prime the compressor with context, and throws
away the compressed output for the hit data.
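
To illustrate the mechanism (a rough sketch of the idea only, not rsync's
actual code; the token format and function name are made up):

    import zlib

    def compress_miss_data(tokens):
        """Context compression sketch: feed every token (hit or miss)
        through one compressor so its dictionary sees the full file
        context, but only keep the output produced for miss tokens.
        The receiver must run the same trick over its own copy of the
        hit data to keep its decompressor in sync."""
        comp = zlib.compressobj()
        wire = b""
        for kind, data in tokens:  # kind is "hit" or "miss"
            out = comp.compress(data) + comp.flush(zlib.Z_SYNC_FLUSH)
            if kind == "miss":
                wire += out        # literal data, compressed with context
            # hit output is thrown away; the receiver already has that data
        return wire

    example = [("hit", b"unchanged block"), ("miss", b"new data")]
    print(len(compress_miss_data(example)))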

Perhaps this "context compression" is affecting the optimal block size? 

>   Now the auto-optimization algorithm for updating many files at a time.
> Consider a set of files to be updated. We only look at the files which have
> changed since the last update (e.g. we can find the unchanged ones by
> sending an MD5 sum for each file and trying to match it). We sync the first
> file, but the client keeps the old local version, so it can find how many
> differences there are between the two files and then guess the optimal
> block size. We assume that the percentage of differences between files is
> roughly the same within a set of files. So for the second file we use the
> optimal size found for the first file. Then for the third file we use the
> (arithmetic or geometric?) average of the first two files, and so on... Once
> we have synced a certain number of files (10? 100?) we always use the same
> size, which is supposed to be the best one.
>   Sorry I'm too long, hope you'll understand everything,
> Olivier

This relies on the optimal block size being correlated across a set of files.
I'm not sure that heuristic actually holds, and I don't know how much benefit
it would buy for all the complexity it would add.
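
For what it's worth, the bookkeeping itself would be cheap; a minimal sketch
of the running-estimate idea (names are hypothetical, and the geometric mean
and the 10-sample cutoff are picked arbitrarily):

    import math

    class BlockSizeEstimator:
        """After each file is synced, record the block size that turned
        out to work best for it; suggest the running geometric mean for
        the next file, and freeze the estimate after max_samples files."""

        def __init__(self, default=700, max_samples=10):
            self.default = default        # fallback (roughly rsync's default)
            self.max_samples = max_samples
            self.log_sum = 0.0
            self.count = 0

        def record(self, observed_block_size):
            if self.count < self.max_samples:
                self.log_sum += math.log(observed_block_size)
                self.count += 1

        def suggest(self):
            if self.count == 0:
                return self.default
            return int(round(math.exp(self.log_sum / self.count)))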

> P.S. I am not a programmer of any kind, so don't expect me to write any
> line of C (I know I'm a bad boy).

It might be an interesting experiment to write a wrapper for rsync that
accumulated stats on all transfers it was used for and "learned" the optimal
block size to use. This would not require any changes to rsync, and could be
written in something like Python.
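
Something along these lines, say (a sketch only, not tested; it leans on the
--stats output and the -B/--block-size option, and the way the stats are
mapped onto n and Q in the formula is my own guess):

    import math
    import re
    import subprocess

    STATS = re.compile(r"^(Literal data|Total file size):\s+([\d,]+) bytes",
                       re.MULTILINE)

    def run_rsync(src, dst, block_size):
        """Run one transfer with a fixed block size and parse --stats."""
        cmd = ["rsync", "-a", "--stats", "-B", str(block_size), src, dst]
        out = subprocess.run(cmd, capture_output=True, text=True,
                             check=True).stdout
        return {name: int(value.replace(",", ""))
                for name, value in STATS.findall(out)}

    def learn_block_size(transfers, start=700):
        """Naive learner: after each transfer, re-estimate the block size
        from sqrt(24*n/Q), using literal data / block size as a rough
        stand-in for the number of changed blocks Q."""
        block = start
        for src, dst in transfers:
            stats = run_rsync(src, dst, block)
            n = stats.get("Total file size", 0)
            q = max(1, stats.get("Literal data", block) // block)
            block = max(256, int(math.sqrt(24 * n / q)))
        return block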


-- 
----------------------------------------------------------------------
ABO: finger abo at minkirri.apana.org.au for more info, including pgp key
----------------------------------------------------------------------



