Rsync: Re: patch to enable faster mirroring of large filesystems

Lenny Foner foner-rsync at
Wed Nov 28 06:34:22 EST 2001

    Date: Tue, 27 Nov 2001 10:49:11 -0600
    From: Dave Dykstra <dwd at>

    Thank you very much for doing the test Alberto.  I didn't have any set of
    files that large on which I could do a test, and as I said when I tested
the worst case I could think of with my application I couldn't measure an
    appreciable difference.

    First, I want to make sure that you really did get the optimization turned
    on.  [ . . . 3 paragraphs of clues on verifying optimization omitted . . . ]

I know you're trying to get reliable statistics so it's clear what
sort of performance we're talking about here.  But may I respectfully
suggest that -having- to be so careful about whether optimization
actually got turned on is a clue that there is still a big problem

Seriously, even if --files-from= was -not- as efficient as the
optimized case, if it's so difficult to ensure that you -are- in the
optimized case, what's the point?  If 90% of the users get it wrong---
and 90% of -those- can't even figure out how to -tell-, even if
they're trying to be careful---then clearly the optimization isn't
as useful as it might be.  (And btw, if it's that hard to figure out,
there should be a debugging switch that -tells- the user whether it
got turned on.  Yet another out-of-control command-line option, or
perhaps an addition to one of the verbose modes, but not one that
forces the user to drown in lots of other output, or causes unpatched
rsyncs to hang, or...  People shouldn't have to patch their local
rsync just to be sure this is happening.)

Meanwhile, people are tying themselves in knots trying to figure out
how to specify which files to transfer.  As I pointed out months ago when
this subject first came up, it seemed that about half the traffic on
the list was from people who were confused about how to specify the
list of files that rsync was supposed to handle.  Letting them use
other tools (e.g., find, or some perl script they just wrote) that
were more transparent and with which they were more familiar seemed
like it would dramatically decrease their learning curve.
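To make the idea concrete, a find-driven transfer along these lines might
look like the following (a sketch using the --files-from= option as
proposed; the paths and find criteria are placeholders, not anything from
an actual setup):

```shell
# Build the transfer list with find (or any tool that emits one
# path per line), then hand the list to rsync instead of wrestling
# with include/exclude patterns.
cd /path/to/source
# e.g. only regular files modified in the last day:
find . -type f -mtime -1 > /tmp/filelist.txt
# Paths in the list are read relative to the source argument:
rsync --files-from=/tmp/filelist.txt . user@mirror:/path/to/dest/
```

The appeal is exactly the transparency argued for above: the list can be
inspected, diffed, or generated by any script the user already trusts.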

I would propose that, -whether or not- the use of --files-from= was a
performance-killer, rsync should have it.  It -would- allow people to
quickly debug a working setup.  -If- for some reason its performance
was bad compared to include/exclude, -then- they could go from a
known-working configuration that might not run at full speed to
a more-difficult-to-debug one that did.  This is the right direction.
(If life was really that bad, it might not be hard for the statistics
from a run to indicate how much time was spent traversing the file
system vs moving files over the connection, which would be a clue
that it was time to move to the "optimized" case.  But it'd be nice
to just avoid having to think about this hair in the first place.)
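For what it's worth, the statistics suggested here are not far-fetched:
later rsync releases grew exactly this kind of breakdown in their --stats
output (the flag and output line names below are from modern rsync, not
the version under discussion here):

```shell
# --stats prints, among other counters, "File list generation time"
# and "File list transfer time" lines; a generation time that is
# large relative to the whole run is the hint that walking the
# filesystem, not moving data, is what dominates.
rsync -a --stats /path/to/source/ user@mirror:/path/to/dest/ \
    | grep -i 'file list'
```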

And, of course, if the data we've seen -was- generated with
optimization, then obviously there's no downside to --files-from=.
It seems pretty clear that the data presented paints a bad picture.
It's hard to believe that --files-from= could be worse.

P.S.  Would --files-from= reduce rsync's large memory consumption
as well, or does it still imply rsync caching some info about every
file it sees during its entire run, and never flushing this info until
the end?  Not remembering something about each file for the entire run
would alone be a powerful reason to include it---there are some tasks
for which finishing -at all- is more important than waiting a while.
It sucks to tell the user, "You can't use the slower approach at all
because we think you should always be fast.  Go buy more memory
instead---if the machine is under your control, can take more memory
in the first place, etc."  I don't recall whether -both- ends of the
connection are so memory-intensive; if so, this is even more important.
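If memory really does stay proportional to the total file count, one
obvious workaround that --files-from= would enable is batching: split the
list and run rsync once per chunk, so no single run ever holds the whole
file list.  (A sketch; the batch size is arbitrary and the paths are
placeholders.)

```shell
# Split the full list into 10000-line batches and sync each batch
# separately, bounding per-run memory at the cost of extra
# connections.  find emits absolute paths here, so the source
# argument is "/" and rsync reads the paths relative to it.
find /path/to/source -type f > /tmp/all-files.txt
split -l 10000 /tmp/all-files.txt /tmp/batch.
for list in /tmp/batch.*; do
    rsync --files-from="$list" / user@mirror:/dest/
done
```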
