Proposed tweaks

Carlos Carvalho carlos at fisica.ufpr.br
Thu Sep 17 20:20:02 MDT 2009


Lee Winter (lee.j.i.winter at gmail.com) wrote on 16 September 2009 01:16:
 >The use case that needs some optimization is that of online
 >repositories -- mirrors.  In contrast to other kinds of usage such as
 >file synchronization, replication, backup, etc., mirrors present a
 >quite different set of needs.

Yes. And the solution, if you want to go to the many pains of
optimizing this case, is to use a script to drive rsync.

 >The issue with the current implementation of rsync is that it imposes
 >a heavy load on the source mirror (sender in rsync terminology).  The
 >load is composed of two components, one being IO necessary to scan the
 >file system and the other being the computational cost of the delta
 >calculations.

The computational cost is negligible, as you admit yourself (quoted
below). The scan is caused by mirror admins' lack of knowledge; it can
be avoided completely if the client script and the upstream admin do
their jobs well.

This is an old problem that was already very well solved. It dates
from the big ftp sites on the internet some 15-20 years ago. Look at
the mirror script by Lee McLoughlin (and an important patch by Ian
Maclaine-Cross). In Debian it's the mirror package.

However rsync is so much better than ftp that everybody started to use
it and forgot to keep the previous technology that should still be
used, just in combination with rsync instead of ftp. That's what I do
here (and only discovered the previous work afterwards...).
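
A minimal sketch of how a driver script could avoid the full scan,
assuming the upstream admin publishes a manifest with one
"<size> <mtime> <path>" line per file. The manifest format and the
names parse_manifest, plan_update and rsync_command are hypothetical
and are not taken from the mirror package or from the script described
here; only rsync's -a and --files-from options are real. The client
compares the published manifest with its own and hands rsync only the
paths that changed:

```python
from typing import Dict, List, Tuple

# Hypothetical manifest: path -> (size in bytes, mtime in seconds).
Manifest = Dict[str, Tuple[int, int]]


def parse_manifest(text: str) -> Manifest:
    """Parse lines of the assumed form '<size> <mtime> <path>'."""
    manifest: Manifest = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split at most twice so that paths containing spaces survive.
        size, mtime, path = line.split(" ", 2)
        manifest[path] = (int(size), int(mtime))
    return manifest


def plan_update(upstream: Manifest, local: Manifest) -> Tuple[List[str], List[str]]:
    """Return (paths to fetch, paths to delete) without scanning any disk."""
    fetch = sorted(p for p, meta in upstream.items() if local.get(p) != meta)
    delete = sorted(p for p in local if p not in upstream)
    return fetch, delete


def rsync_command(source: str, dest: str, list_file: str) -> List[str]:
    """Build an rsync invocation restricted to the paths in list_file."""
    return ["rsync", "-a", "--files-from=" + list_file, source, dest]
```

The script would write the fetch list to a file, run the command, and
handle the delete list itself. Since only the listed paths are
requested, the sender has no reason to walk the whole tree.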

 >1.  CPU performance is increasing faster than disk performance, so
 >eliminating the IO burden is the bigger win.
 >
 >2.  Repositories tend to have files that are already fairly dense.  So
 >they probably don't benefit all that much from the delta handling.  So
 >if the "basis file" can't be swapped to the sender then the
 >computational load can still be eliminated by using --whole-file mode
 >despite the small loss in transport efficiency.  I admit that I have
 >not tested this premise.

It's easily noticeable. If you had tested it, you wouldn't have said it :-) :-)

Lee Winter (lee.j.i.winter at gmail.com) wrote on 16 September 2009 10:13:
 >On Wed, Sep 16, 2009 at 1:44 AM, Matt McCutchen <matt at mattmccutchen.net> wrote:
 >> Both of your proposals have been discussed before (see below),
 >
 >Good.
 >
 >> but neither has been taken very far because they would both
 >> involve large changes to rsync.
 >
 >The delta computations are addressed below.  Externalizing the file
 >list should have quite minimal impact on the existing code.

The big problem with this approach is not the file list, which can
(and should) be easily generated separately by mirrors, as I said
above. The difficulty is that it makes the update process
transactional; to have reliability you have to deal with all sorts of
failures.
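
One way to picture the transactional part, as a sketch only: fetch into
a staging directory, verify everything against the published list, and
touch the live tree only once the whole batch checks out. The function
name commit_staged and the size-only check are assumptions made for
illustration; a real script has many more failure modes to handle
(interrupted runs, files changing upstream mid-transfer, deletions).

```python
import os
from typing import Dict


def commit_staged(staging: str, live: str, expected_sizes: Dict[str, int]) -> None:
    """Move verified files from staging into live, or fail before touching live."""
    # Phase 1: verify the whole batch before changing the live tree.
    for rel, size in expected_sizes.items():
        path = os.path.join(staging, rel)
        if not os.path.isfile(path) or os.path.getsize(path) != size:
            raise RuntimeError("incomplete transfer: " + rel)
    # Phase 2: move files into place. Each os.replace is atomic for that
    # one file (staging and live are assumed to be on the same
    # filesystem), but the batch as a whole is not atomic.
    for rel in sorted(expected_sizes):
        target = os.path.join(live, rel)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        os.replace(os.path.join(staging, rel), target)
```

Even this small version shows where the subtleties start: a crash
during phase 2 leaves the mirror half updated, so the script has to be
able to resume or roll back.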

The vast majority of mirrors use a ~10-line script, which is easy to
write and maintain. Ours is much more efficient and (I believe) as
reliable as plain rsync, but it is an 800-line monster full of
subtleties, where a seemingly innocent change may corrupt your mirror
and go unnoticed for months. How many admins would use it?

Forget it...

That's why rsync doesn't do it, even though there have been demands
for a long time.

 >> This is a good idea that has been discussed before and is implemented in
 >> another tool called "zsync".   Wayne noted some of its drawbacks here:
 >>
 >> http://markmail.org/message/pt354zo4njgmupj
 >>
 >> Perhaps you could respond to his concerns.

 >Second, I looked at the comments, and while technically valid I don't
 >think they are reasonable.

They're really excellent and reflect very well what happens in practice.

 >----- quote start
 >(2) This increases the disk I/O on the sending side because it would
 >need to read the file twice (assuming that cached checksums aren't
 >available). Since rsync is running in a pipelined mode, the sender
 >will be iterating over several future files by the time it gets back a
 >request for data chunks from the receiving side. Hopefully the data
 >will still be in the disk cache, but if it is not, the transfer would
 >bog down to a significant degree.
 >----- quote stop
 >
 >Again we have a non-quantitative statement.  The absolute worst case
 >effect is to double the disk IO.  I have a hard time imagining an
 >rsync system that is within a factor of two of being disk-bound.

But that's exactly what happens.

<quoting out of order>

 >I was actually proposing a more ambitious change which would make
 >the directionality the subject of a user option. That seriously
 >increases the scope of the change, but it also gains you the benefit
 >of the feature (flexible direction), so it might be justified on
 >that basis.

The matter is so important that I'd block it at the server side, even
if I had to patch the program.

