Block size optimization - let rsync find the optimal blocksize by itself.

Olivier Lachambre lachambre at club-internet.fr
Tue Jul 2 02:07:02 EST 2002


At 17:45 30/06/2002 -0700, you wrote:
>
>I dislike flaming and I don't intend my comments as a flame,
>but you have made some statements that at face value are
>problematic.  Perhaps you could correct my misunderstanding.

[...]
>>   Well, the first comment: during my work, I wanted to verify that the
>> theoretical optimal block size sqrt(24*n/Q) given by Andrew Tridgell in his
>> PhD thesis was actually the right one. When doing the tests on randomly
>> generated & modified files, I discovered that sqrt(78*n/Q) is the actual
>> optimal block size. I tried to understand this by reading the whole thesis,
>> then quite a lot of documentation about rsync, but I just can't figure out
>> why the theoretical & experimental optimal block sizes differ so much. I
>> _really_ don't think it comes from my tests; there must be something else.
>> Maybe the rsync developers have just changed some part of the algorithm.
>> Also, even without using data compression during the sync, rsync is always
>> more efficient than it should be theoretically, actually between 1.5 and 2
>> times more efficient. Nobody will complain about that, but I'd be happy if
>> someone would be nice enough to explain this to me.
>
>Firstly, I'll give Andrew the benefit of the doubt.  His
>track record has been very good. 
>

I think that the theoretical optimal block size sqrt(24*n/Q) given by Andrew
in his PhD thesis was indeed the right one for the first version of rsync.
Rsync has been deeply modified since then, as he explained yesterday, which
is why the former optimal block size is no longer accurate. That is not a
problem, but it is still interesting to know.
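
Just to give an idea of the gap, here is an illustrative calculation (my own
example values, n = 1,000,000 bytes and Q = 100 differences, roughly the
scale of the tests described below):

  echo 'sqrt(24*1000000/100)' | bc -l    # ~ 490 bytes  (thesis formula)
  echo 'sqrt(78*1000000/100)' | bc -l    # ~ 883 bytes  (my empirical fit)

So for a file of that size the block size I measure as optimal is almost
twice (sqrt(78/24) ~ 1.8 times) the one the thesis predicts.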

>
>In the real world, file generation and modification are not
>random.  The nearest to random any data gets is when it has
>been encrypted.  Compression does increase the randomness
>of new data somewhat, but modifications are hardly random.
>
>Most real files contain mostly text and as any cryptographer
>can tell you text is decidedly non-random.  Executables are
>similarly non-random as they are encoded communication of
>nouns and verbs in the language of the CPU.  Databases are
>highly structured and again contain mostly text.  The
>nearest thing to random data you will find are files stored
>in a compressed format such as audio, image or newer office
>(not MS-Office) files such as Open Office.
>
>To test your ideas you would do much better to observe live
>systems.

I know that -fortunately- the real world is not random, but I used
random files because this is the only way to test Andrew's
theoretical work, which may later lead to applications in the
"real world". For example, in the "algorithm" I gave to let rsync
auto-optimize its block size, you must choose a model for the
optimal size of a whole set of files; otherwise, knowing the optimal
size for each individual file does not let you calculate the optimal
size for the set (one possible model is sketched just below). And to
validate such a model, dealing with random files first might not be
the worst idea.
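
To be concrete, a minimal sketch of one such model (my own assumption, not
something taken from the thesis): if the traffic for a single file behaves
roughly like Q*B + c*n/B, so that the per-file optimum is sqrt(c*n/Q), then
minimising the total traffic over one shared block size B for the whole set
gives B = sqrt(c * sum(n_i) / sum(Q_i)). With a hypothetical files.dat
listing "<size in bytes> <number of differences>", one line per file, that
is just:

  # c = 78 is the constant fitted from my tests; use c = 24 for the thesis value.
  awk -v c=78 '{ n += $1; q += $2 } END { printf "%.0f\n", sqrt(c * n / q) }' files.dat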

>
>> 
>> P.S. I am not a programmer of any kind, so don't expect me to write any
>> line of C (I know I'm a bad boy).
>
>How does someone who is "not a programmer of any kind"
>randomly generate and modify files and run all these tests?
>

Let me reassure you: I said I was not a programmer because I write
almost exclusively CaML code (you may know OCaML, which is, I heard,
a wonderful & powerful language, but I use its ancestor CaML Light,
http://caml.inria.fr/, which is rather experimental, so I don't do big
programming, only algorithmics), and that is (in my case) more
maths/theory than real computing. This is the difference between
theoretical & practical computing. But I can still write enough code
to generate random files, which is more maths than real programming.
To run all the tests, I just write bash scripts running under Linux.
I synced a 1 MB random file with 100 sets of differences (from 10 to
1000 differences) and 200 block sizes (from 10 to 2000 bytes) for
each set. To count how many bytes went through the network, I
used the --stats flag. In the end, CaML programs & a C.A.S. process
all the test results for me. (A stripped-down sketch of one such run
is below.)
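
Here is that sketch, for a single set of differences (illustrative only: the
directory and file names are made up, and on recent rsync versions a purely
local copy skips the delta algorithm unless you add --no-whole-file):

  #!/bin/bash
  # Sweep block sizes from 10 to 2000 bytes in steps of 10 for one modified file.
  # work/file.bin starts as a copy of the old version; new/file.bin is the new one.
  for bs in $(seq 10 10 2000); do
    cp old/file.bin work/file.bin
    rsync --stats --block-size=$bs new/file.bin work/file.bin > "stats-$bs.log"
  done
  # The "Total bytes ..." lines of each stats-*.log give the traffic for that
  # block size; the CaML programs then pick out the minimum over all runs.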


>
>On a more pleasant level: you talk of between 1.5 and 2
>times more efficient.  That is rather vague.  How much
>does this difference in efficiency affect overall
>network bandwidth, disk IO, and memory and CPU utilisation?
>i.e. changing x has decreased bandwidth utilisation by 0.2%
>under ABC conditions.  Doubling the efficiency of something
>that affects only 2% (reducing it to 1.1%) of the load on a
>plentiful resource (CPU) might be worthwhile but hardly
>exciting.  Trimming network load by 12% would be very
>interesting to those rsyncing over slow internet links.
>

I have decided that efficiency is only the speedup rate,
which means I only care about network traffic, so being between
1.5 & 2 times more efficient is simply "extremely exciting".
The main goal of rsync is to save bandwidth, so I guess
my definition of efficiency is quite a good one
(I know what a very slow link is).
I didn't look at CPU, disk I/O (although I broke one
of my disks the other day) or memory this time.
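
For reference, rsync itself reports this ratio: with --stats or -v the
summary ends with a "total size is ...  speedup is ..." line, the speedup
being roughly the total file size divided by the bytes that actually went
over the link. Reusing the hypothetical file names from the sketch above:

  rsync --stats --block-size=700 new/file.bin work/file.bin | grep speedup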

As I wrote in my last mail, you will get all the details online
one day (let's say end of August), and you will see it is quite serious.
You can ask me for the complete document right now, but it is not
finished yet and still in French... (PDF, ~50 pages, ~700 kB)

Kind regards,

Olivier
_______

Olivier Lachambre
2, rue Roger Courtois
25 200 MONTBELIARD
FRANCE

e-mail : lachambre at club-internet.fr




