Block size optimization - let rsync find the optimal blocksize by itself.

Sun Jun 30 09:24:04 EST 2002

Hello,
  Another French student on the rsync mailing list. I have been working on
rsync this year for a school documentation project. I would first like to
make a comment about rsync block size optimization, and then to suggest a
way to let rsync choose the optimal block size by itself when updating a
large number of files.

  Well, the first comment: during my work, I wanted to verify that the
theoretical optimal block size sqrt(24*n/Q) given by Andrew Tridgell in his
PhD thesis was actually the right one. When running tests on randomly
generated and modified files, I found that sqrt(78*n/Q) is the actual
optimal block size. I tried to understand this by reading the whole thesis,
then quite a lot of documentation about rsync, but I just can't figure out
why the theoretical and experimental optimal block sizes differ so much. I
_really_ don't think it comes from my tests; there must be something else.
Maybe the rsync developers have just changed some part of the algorithm.
Also, even without data compression during the sync, rsync is always more
efficient than it should be in theory, by a factor of between 1.5 and 2.
Nobody will complain about that, but I'd be happy if someone would be nice
enough to explain this to me.
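
  To make the gap between the two formulas concrete, here is a small Python
sketch. It assumes n is the file size in bytes and Q is the number of
differences between the two versions; the function name block_size and the
sample values of n and Q are made up for illustration and are not taken from
rsync or from the tests described above. Whatever n and Q are, the ratio of
the two sizes is sqrt(78/24), about 1.8.

```python
import math


def block_size(n, q, constant=24.0):
    """Return sqrt(constant * n / q).

    n: file size in bytes (assumed meaning of n)
    q: number of differences between the two versions (assumed meaning of Q)
    constant: 24 for the formula from the thesis, 78 for the measured one
    """
    if n <= 0 or q <= 0:
        raise ValueError("n and q must be positive")
    return math.sqrt(constant * n / q)


# Hypothetical example: a 1 MB file with 100 differences.
n = 1_000_000
q = 100

theoretical = block_size(n, q, 24.0)
experimental = block_size(n, q, 78.0)
ratio = experimental / theoretical  # equals sqrt(78 / 24) for any n and q

print(f"theoretical:  {theoretical:.1f} bytes")
print(f"experimental: {experimental:.1f} bytes")
print(f"ratio:        {ratio:.2f}")
```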

  Now the auto-optimization algorithm for updating many files at a time.
Let's consider a set of files to be updated. We will consider only the files
which have changed since the last update (e.g. we can find the unchanged
ones by sending an MD5 sum for each file and trying to match it). We sync
the first file, but the client keeps the old local version, so it can count
the differences between the two versions and then estimate the optimal
block size. We assume that the percentage of differences is roughly the same
for all files in the set. So for the second file we use the optimal size
found for the first file. Then for the third file we use the (arithmetic or
geometric?) average of the optimal sizes found for the first two files, and
so on... Once we have synced a certain number of files (10? 100?) we always
use the same size, which is supposed to be the best one.
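
  The scheme above could be sketched in Python roughly as follows. This is
only an illustration under stated assumptions, not rsync code: the class
BlockSizeEstimator, its method names, the placeholder default size used for
the very first file, and the choice to clamp Q to at least 1 are all
hypothetical. The constant (24 or 78) and the kind of average are left as
parameters, since the proposal leaves both open.

```python
import math


class BlockSizeEstimator:
    """Running estimate of a block size for a set of files (hypothetical)."""

    def __init__(self, default_size=700, freeze_after=10,
                 constant=24.0, geometric=False):
        self.default_size = default_size  # placeholder size for the first file
        self.freeze_after = freeze_after  # stop updating after this many files
        self.constant = constant          # 24 (thesis) or 78 (measured)
        self.geometric = geometric        # geometric instead of arithmetic mean
        self.samples = []                 # per-file optimal sizes seen so far

    def record(self, file_size, num_differences):
        """Store the optimal size computed after syncing one file."""
        if len(self.samples) >= self.freeze_after or file_size <= 0:
            return  # estimate is frozen, or the file gives no information
        q = max(1, num_differences)  # assumption: avoid dividing by zero
        self.samples.append(math.sqrt(self.constant * file_size / q))

    def next_block_size(self):
        """Block size to use for the next file in the set."""
        if not self.samples:
            return self.default_size
        if self.geometric:
            logs = [math.log(s) for s in self.samples]
            mean = math.exp(sum(logs) / len(logs))
        else:
            mean = sum(self.samples) / len(self.samples)
        return max(1, round(mean))


# Usage with made-up file sizes and difference counts.
est = BlockSizeEstimator(freeze_after=2)
first = est.next_block_size()    # no data yet: the placeholder default
est.record(60_000, 144)          # sqrt(24 * 60000 / 144) = 100
second = est.next_block_size()
est.record(240_000, 144)         # sqrt(24 * 240000 / 144) = 200
third = est.next_block_size()    # arithmetic mean of 100 and 200
est.record(10**9, 1)             # ignored: two files already recorded
frozen = est.next_block_size()
print(first, second, third, frozen)
```

  With the geometric mean the same two files would give about 141 instead of
150, so the choice matters mostly when the per-file optimal sizes are far
apart.
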
  Sorry for being so long, I hope you'll understand everything,
Olivier

P.S. I am not a programmer of any kind so don't expect me to write a single
line of C (I know I'm a bad boy).
_______

Olivier Lachambre
2, rue Roger Courtois
25 200 MONTBELIARD
FRANCE

e-mail : lachambre at club-internet.fr