Block size optimization - let rsync find the optimal blocksize by itself.

jw schultz jw at pegasys.ws
Sun Jun 30 17:46:01 EST 2002


I dislike flaming and I don't intend my comments as a flame,
but you have made some statements that, at face value, are
problematic.  Perhaps you can correct my misunderstanding.

On Sun, Jun 30, 2002 at 06:23:10PM +0200, Olivier Lachambre wrote:
> Hello,
>   Another French student on the rsync mailing list.  I have been working
> on rsync this year for a school documentation project, and I would like
> first to offer some comments about rsync block size optimization, and
> then to suggest a way for rsync to choose the optimal block size by
> itself when updating a large number of files.
> 
>   Well, the first comment: during my work I wanted to verify that the
> theoretical optimal block size sqrt(24*n/Q) given by Andrew Tridgell in
> his PhD thesis was actually the right one.  When running the tests on
> randomly generated & modified files I found that sqrt(78*n/Q) is the
> actual optimal block size.  I tried to understand this by reading the
> whole thesis and then quite a lot of documentation about rsync, but I
> just can't figure out why the theoretical & experimental optimal block
> sizes disagree so much.  I _really_ don't think it's coming from my
> tests; there must be something else.  Maybe the rsync developers have
> just changed some part of the algorithm.  Also, even without using data
> compression during the sync, rsync is always more efficient than it
> should be theoretically, actually between 1.5 and 2 times more
> efficient.  Nobody will complain about that, but I'd be happy if
> someone would be kind enough to explain this to me.

Firstly, I'll give Andrew the benefit of the doubt.  His
track record has been very good. 
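
Just to put rough numbers on it, here is a quick sketch (mine, not
from the thesis) of what the two expressions give for a couple of
cases, taking n as the file size in bytes and Q as the number of
differences, as in your notation:

    from math import sqrt

    def block_size(n, q, k):
        """Block size of the form sqrt(k * n / Q)."""
        return sqrt(k * n / q)

    for n, q in ((1000000, 100), (10000000, 500)):
        print(n, q,
              int(block_size(n, q, 24)),   # sqrt(24*n/Q) from the thesis
              int(block_size(n, q, 78)))   # sqrt(78*n/Q), your empirical fit

Note that the two differ by a constant factor of sqrt(78/24), about
1.8, so whatever the cause, it is a constant-factor disagreement
rather than a different scaling law.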

In the real world, file generation and modification are not
random.  The nearest any data gets to random is when it has
been encrypted.  Compression does increase the apparent
randomness of new data somewhat, but modifications are hardly
random.

Most real files contain mostly text, and as any cryptographer
can tell you, text is decidedly non-random.  Executables are
similarly non-random: they are encoded communication of
nouns and verbs in the language of the CPU.  Databases are
highly structured and again contain mostly text.  The
nearest thing to random data you will find is files stored
in a compressed format, such as audio, images or newer
office (not MS-Office) files such as OpenOffice.
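
A crude way to see this for yourself: compression ratio is a decent
proxy for randomness.  Something along these lines (my own sketch,
using zlib as the yardstick and a few example paths you would swap
for your own files) shows text and executables shrinking a lot
while already-compressed data barely moves:

    import zlib

    def compressibility(path):
        """Compressed size / original size; values near 1.0 look random."""
        data = open(path, "rb").read()
        return len(zlib.compress(data, 9)) / float(len(data))

    # e.g. a text file, an executable, and a compressed image
    for f in ("/etc/services", "/bin/ls", "photo.jpg"):
        print(f, round(compressibility(f), 2))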

To test your ideas you would do much better to observe live
systems.
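
For example, something like the following (a sketch, nothing that
exists in rsync itself) would let you sweep the block size against
real data and compare what --stats reports.  SRC and DEST are
placeholders, and you would restore DEST from the same snapshot
between runs so every block size sees an identical starting point:

    import re, subprocess

    SRC = "remotehost:/some/tree/"     # placeholder source
    DEST = "/scratch/mirror/"          # placeholder throwaway destination

    def bytes_sent(block_size):
        """Run rsync with a fixed block size and return 'Total bytes sent'."""
        out = subprocess.run(
            ["rsync", "-a", "--stats", "--block-size=" + str(block_size),
             SRC, DEST],
            capture_output=True, text=True, check=True).stdout
        m = re.search(r"Total bytes sent:\s*([\d,]+)", out)
        return int(m.group(1).replace(",", ""))

    for bs in (700, 1024, 2048, 4096, 8192):
        # NOTE: restore DEST from the snapshot here before each run,
        # otherwise only the first transfer has anything to send.
        print(bs, bytes_sent(bs))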

> 
> P.S. I am not a programmer of any kind, so don't expect me to write a
> single line of C (I know, I'm a bad boy).

How does someone who is "not a programmer of any kind"
randomly generate and modify files and run all these tests?

On a more pleasant note: you talk of rsync being between 1.5 and
2 times more efficient.  That is rather vague.  How much does
this difference in efficiency affect overall network bandwidth,
disk I/O, and memory and CPU utilisation?  I.e. changing x
decreased bandwidth utilisation by 0.2% under ABC conditions.
Doubling the efficiency of something that accounts for only 2%
of the load on a plentiful resource (CPU), reducing it to about
1% of the total, might be worthwhile but hardly exciting.
Trimming network load by 12% would be very interesting to those
rsyncing over slow internet links.
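
(The back-of-the-envelope rule behind those figures, for what it
is worth: overall saving = the component's share of the load times
(1 - 1/speedup).)

    def overall_saving(share, speedup):
        """Fraction of the total load saved when one component gets faster."""
        return share * (1.0 - 1.0 / speedup)

    print(overall_saving(0.02, 2.0))        # 2% CPU share, twice as efficient -> 0.01
    print(overall_saving(1.00, 1.0 / 0.88)) # whole network stream trimmed 12% -> 0.12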


-- 
________________________________________________________________
	J.W. Schultz            Pegasystems Technologies
	email address:		jw at pegasys.ws

		Remember Cernan and Schmitt



