batch-mode fixes [was: [PATCH] fix read-batch SEGFAULT]
Chris Shoemaker
chris.shoemaker at cox.net
Tue May 18 19:55:01 GMT 2004
On Mon, May 17, 2004 at 08:10:57PM -0700, Wayne Davison wrote:
> On Mon, May 17, 2004 at 05:18:10PM -0400, Chris Shoemaker wrote:
> > The "knowledge" or "memory" of that exact state is more likely to
> > reside with the receiver (who just left that state) than with the
> > sender (who may never have been in that state). Therefore it is more
> > likely to be useful to the receiver than to sender.
>
> This is only true if you imagine a receiver doing one pull and then
> forwarding the update on to multiple hosts. For instance, if you
> use a pull to create the batch files and then make them available
> for people to download, which would help to alleviate load from the
> original server. That said, I think most of the time a receiver is
> going to be a leaf node, so the server tends to be the place where
> a batch is more likely to be useful, IMO.
I can see the "push" pattern for creating batch sets, and I definitely agree
that a receiver is likely to be a leaf node, but I'm submitting that on the "big
tree" the expectation of 1) finding another _identical_ leaf and 2) knowing
about that identity, is MUCH better the closer you are to that first leaf node
than _anywhere_ else, server/sender included.
I know there are counter-examples to my proposition -- I just don't think
they're likely. If they were, then there would be more people using and
considering using batch-mode for the sender-side batch-write than people doing
what I'm doing -- making two local mirrors just so I can be the sender for a
write-batch.
I suppose there are two theoretical explanations for what's going on. After all,
the two receivers are not identical by chance; they were made so, but how?
Case A) The destinations were created by pushing batch-sets from
a server and only ever modified by pushing batch-sets from a server. The
receivers are not necessarily "close" to each other with respect to any
communication path. The receivers are only "related" through the server. In
this scenario, batch-sets should be created by sender.
Case B) The destinations are identical because they are "close"
with respect to some communications path and they were made identical. E.g.
one is a copy of the other, they are both copied from the same physical source
media, they have agreed to synchronize to each other. In this scenario,
batch-sets belong with receiver.
I admit Case A probably really happens sometimes. (I mean the
information transfer pattern; it sounds like rsync batch-mode maybe isn't
actually used for this purpose.) But, I think that Case B must be much more
common.
Of course, I don't have any real usage data to back this theory up, so
I could be full of it. But after all, isn't it intuitive that the _average_
"communications distance" between two _identical_ copies would be much smaller
than the _average_ "communications distance" between two _similar_ copies that
want to synchronize? On average.
>
> In thinking about batch mode, it seems like its restrictions make
> it useful in only a very small set of circumstances. Since the
> receiving systems must all have identical starting hierarchies, it
> really does limit how often it can be used.
Well, yes.
>
> I'm wondering if batch mode should be removed from the main rsync
> release and relegated to a parallel project? It seems to me that a
I'd be sad to see batch-mode bitrot, but from a purely technical
viewpoint, perhaps the (simple?) task of capturing an output stream of the
protocol shouldn't be so strongly coupled to the rsync project.
> better feature for the mainstream utility would be something that
> optimized away some of the load on the sending system when it is
> serving lots of users. So, having the ability to cache a directory
> tree's information, and the ability to cache checksums for files
> would be useful (especially if the data was auto-updated as it
> became stale). That would make all transfers more optimal,
> regardless of what files the receiving system started from.
That's a very good idea. Optimizing the common case makes sense.
Cache invalidation could be hard, I think. Something like FAM might be
expensive. Recaching on a signal is easy though.
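To make the invalidation question concrete, here is a minimal sketch of a
checksum cache that treats an entry as stale when the file's size or mtime
changes. This is not how rsync works today; the class name ChecksumCache, its
methods, and the use of a whole-file SHA-256 (rather than rsync's per-block
checksums) are all assumptions made for illustration.

```python
import hashlib
import os


class ChecksumCache:
    """Hypothetical whole-file checksum cache, keyed by path."""

    def __init__(self):
        self._entries = {}   # path -> ((size, mtime_ns), hex digest)
        self.recomputed = 0  # how many times a file was actually hashed

    def get(self, path):
        """Return the cached digest, rehashing only if the stat data changed."""
        st = os.stat(path)
        stamp = (st.st_size, st.st_mtime_ns)
        entry = self._entries.get(path)
        if entry is not None and entry[0] == stamp:
            return entry[1]  # cache hit: size and mtime are unchanged
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        self._entries[path] = (stamp, digest)
        self.recomputed += 1
        return digest

    def invalidate_all(self):
        """Drop every entry, e.g. from a handler run when a signal arrives."""
        self._entries.clear()
```

The stat comparison is cheap but can miss a change that preserves both size
and mtime, so a "recache on signal" path such as invalidate_all() would still
be wanted as a fallback.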
>
> Such a new feature would probably best be added to an rsync
> replacement project, though.
I don't know, it could be a simple performance enhancement with no new
visible features.
-Chris
>
> ..wayne..