batch-mode fixes [was: [PATCH] fix read-batch SEGFAULT]

Tue May 18 19:55:01 GMT 2004

On Mon, May 17, 2004 at 08:10:57PM -0700, Wayne Davison wrote:
> On Mon, May 17, 2004 at 05:18:10PM -0400, Chris Shoemaker wrote:
> > The "knowledge" or "memory" of that exact state is more likely to
> > reside with the receiver (who just left that state) than with the
> > sender (who may never have been in that state).  Therefore it is more
> > likely to be useful to the receiver than to sender.
> 
> This is only true if you imagine a receiver doing one pull and then
> forwarding the update on to multiple hosts.  For instance, if you
> use a pull to create the batch files and then make them available
> for people to download, which would help to alleviate load from the
> original server.  That said, I think most of the time a receiver is
> going to be a leaf node, so the server tends to be the place where
> a batch is more likely to be useful, IMO.

I can see the "push" pattern for creating batch sets, and I definitely agree
that a receiver is likely to be a leaf node, but I'm submitting that on the "big
tree" the expectation of 1) finding another _identical_ leaf and 2) knowing
about that identity, is MUCH better the closer you are to that first leaf node
than _anywhere_ else, server/sender included.

I know there are counter-examples to my proposition -- I just don't think 
they're likely.  If they were, then there would be more people using and 
considering using batch-mode for the sender-side batch-write than people doing 
what I'm doing -- making two local mirrors just so I can be the sender for a 
write-batch.

I suppose there are two theoretical explanations for what's going on.  After all, 
the two receivers are not identical by chance; they were made so, but how?

	Case A) The destinations were created by pushing batch-sets from
a server and only ever modified by pushing batch-sets from a server.  The 
receivers are not necessarily "close" to each other with respect to any 
communication path.  The receivers are only "related" through the server.  In 
this scenario, batch-sets should be created by sender.

	Case B) The destinations are identical because they are "close" 
with respect to some communications path and they were made identical.  E.g. 
one is a copy of the other, they are both copied from the same physical source 
media, or they have agreed to synchronize with each other.  In this scenario, 
batch-sets belong with receiver.

	I admit Case A probably really happens sometimes.  (I mean the
information transfer pattern; it sounds like rsync batch-mode maybe isn't
actually used for this purpose.)  But, I think that Case B must be much more
common. 

	Of course, I don't have any real usage data to back this theory up, so
I could be full of it.  But after all, isn't it intuitive that the _average_
"communications distance" between two _identical_ copies would be much smaller
than the _average_ "communications distance" between two _similar_ copies that
want to synchronize?  On average.

> 
> In thinking about batch mode, it seems like its restrictions make
> it useful in only a very small set of circumstances.  Since the
> receiving systems must all have identical starting hierarchies, it
> really does limit how often it can be used.

	Well, yes.

> 
> I'm wondering if batch mode should be removed from the main rsync
> release and relegated to a parallel project?  It seems to me that a

	I'd be sad to see batch-mode bitrot, but from a purely technical 
viewpoint, perhaps the (simple?) task of capturing an output stream of the 
protocol shouldn't be so strongly coupled to the rsync project.
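
	To illustrate what "capturing an output stream" could mean in the
abstract, here is a minimal sketch.  It is not rsync's code or its batch file
format; TeeWriter and the in-memory streams are hypothetical stand-ins for the
real socket and batch file.

```python
import io


class TeeWriter:
    """Hypothetical wrapper: forwards writes to the wire and keeps a copy."""

    def __init__(self, primary, capture):
        self.primary = primary   # where the protocol stream normally goes
        self.capture = capture   # where the copy (the "batch") is stored

    def write(self, data):
        # Record first, then send, so the capture never lags the wire.
        self.capture.write(data)
        return self.primary.write(data)


# Stand-ins for a socket and a batch file.
wire = io.BytesIO()
batch = io.BytesIO()

out = TeeWriter(wire, batch)
out.write(b"file-list...")
out.write(b"deltas...")

# Replaying the capture to a second, identical receiver is then just a copy.
second_receiver = io.BytesIO()
second_receiver.write(batch.getvalue())
```

The capture side needs to know nothing about the protocol itself, which is
the sense in which the task might be only loosely coupled to the rest of the
project.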

> better feature for the mainstream utility would be something that
> optimized away some of the load on the sending system when it is
> serving lots of users.  So, having the ability to cache a directory
> tree's information, and the ability to cache checksums for files
> would be useful (especially if the data was auto-updated as it
> became stale).  That would make all transfers more optimal,
> regardless of what files the receiving system started from.

	That's a very good idea.  Optimizing the common case makes sense.  
Cache invalidation could be hard, I think.  Something like FAM might be 
expensive.  Recaching on a signal is easy though.
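
	A rough sketch of the cheap end of that spectrum, assuming staleness
is detected by comparing a file's size and mtime against what was recorded
when the checksum was computed.  Nothing here is rsync code: ChecksumCache is
a hypothetical name, and SHA-256 is only a stand-in for whatever checksum the
sender would really cache.

```python
import hashlib
import os


class ChecksumCache:
    """Hypothetical per-file checksum cache, validated by (size, mtime)."""

    def __init__(self):
        self._entries = {}   # path -> ((size, mtime_ns), hex digest)
        self.hits = 0
        self.misses = 0

    def checksum(self, path):
        st = os.stat(path)
        stamp = (st.st_size, st.st_mtime_ns)
        cached = self._entries.get(path)
        if cached is not None and cached[0] == stamp:
            # File looks unchanged: reuse the stored digest.
            self.hits += 1
            return cached[1]
        # New or stale entry: recompute and remember the new stamp.
        self.misses += 1
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        self._entries[path] = (stamp, digest)
        return digest

    def flush(self):
        """Drop every entry, forcing recomputation on next use."""
        self._entries.clear()
```

Recaching on a signal would then amount to calling flush() from a handler,
e.g. one installed for SIGHUP with Python's signal.signal() on a Unix system.
The size/mtime check can miss an edit that preserves both, which is part of
why invalidation is the hard bit.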

> 
> Such a new feature would probably best be added to an rsync
> replacement project, though.

	I don't know, it could be a simple performance enhancement with no new 
visible features.

	-Chris
> 
> ..wayne..