batch-mode fixes [was: [PATCH] fix read-batch SEGFAULT]

Mon May 17 21:18:10 GMT 2004

On Mon, May 17, 2004 at 10:15:23AM -0400, Alberto Accomazzi wrote:
> 
> Chris,
> 
> to put things in the right perspective, you should read (if you haven't 
> done so already) the original paper describing the design behind batch 
> mode.  The design and implementation of this functionality goes back to 
> a project called the Internet2 Distributed Storage Infrastructure 
> (I2-DSI).  As part of that project, the authors created a modified 
> version of rsync (called rsync+) which had the capability of creating 
> these batch sets for mirroring.  Here are a couple of URLs describing 
> the ideas and motivation behind it:
> http://www.ils.unc.edu/i2dsi/unc_rsync+.html
> http://www.ils.unc.edu/ils/research/reports/TR-1999-01.pdf

	Ah, thank you.  I had seen the first, but not the second.  It was an
interesting read, and it explains a lot.  I see now why the write-batch hooks
are in the _sender_ paths.  This seems a reasonable design decision when the
intention is to replicate changes to many remote copies.
	I can see some justification for wanting write-batch functionality
with both sender and receiver.  However, several things in the report seem to
confirm my growing opinion that, if it has to be in only one, receiver is
sufficient, while sender is not.

> >use.  I figure, there are three cases:
> >
> >   A) If you have access to both source and dest, it doesn't really matter
> >too much who writes the batch -- this is like the local copy case.
> >   B) If you have access to the dest but not the source, then you need the
> >client to write the batch -- and it's not far-fetched that you might have
> >other copies of dest to update.
> >   C) However, having access to source but not dest is the only case that
> >_requires_ the sender to write the batch -- now what's the chance that you'll
> >have another identical dest to apply the batch to?  And if you did, why
> >wouldn't you generate the batch on that dest as in case A, above?
> >
> >   So, it seems to me that it's much more useful to have the
> >receiver/client write the batch than sender/client, or receiver/server, or
> >sender/server.  But, maybe I'm just not appreciating what the potential
> >uses of batch-mode are.
> >
> >  Survey: so who uses batch-mode and what for?
> 
> I haven't used the feature but back when I read the docs on rsync+ I 
> thought it was a clever way to do multicasting on the cheap.  I think 
> the only scenario where batch mode makes sense is when you need to 
> distribute updates from a particular archive to a (large) number of 
> mirror sites and you have tight control on the state of both client and 
> server (so that you know exactly what needs to be updated on the mirror 
> sites).  This ensures that you can create a set of batch files that 
> contain *all* the changes necessary for updating each mirror site.
> 
> So basically I would use batch mode if I had a situation in which:
> 
> 1) all mirror sites have the same set of files
> 2) rsync is invoked from each mirror site in exactly the same way (i.e. 
> same command-line options) to pull data from a master server
> 
> then instead of having N sites invoke rsync against the same archive, I 
> would invoke it once, make it write out a set of batch files, then 
> transfer the batch files to each client and run rsync locally using the 
> batch set.  The advantage of this is that the server only performs its 
> computations once.  An example of this usage would be using rsync to 
> upgrade a linux distribution, say going from FC 1 to FC 2.  All files 
> from each distribution are frozen, so you should be able to create a 
> single batch which incorporates all the changes and then apply that on 
> each site carrying the distro.

	Indeed, what you describe seems to have been the design motivation.  I
can share what my desired application is: I want to create a mirror of a
public server onto my local machine, which is physically disconnected from the
Internet, and keep it current.  So, I intend to first rsync update my own copy
which _is_ networked while creating the batch set.  Then I can sneakernet the
batch set to the unnetworked machine and use rsync --read-batch to update it. 
This keeps the batch sets smallish even though the mirror is largish. 
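
	A minimal sketch of that workflow, written as a Python helper that only
assembles the two rsync command lines and never runs them.  It assumes an rsync
release in which the receiver of a remote pull can write the batch (which is
exactly what does not work at the time of this thread), and it uses the
--write-batch=FILE / --read-batch=FILE spelling of later releases.  The server
URL, directory names and batch file name are hypothetical.

```python
import shlex


def sneakernet_commands(source, networked_copy, offline_copy, batch_file):
    """Return (write_cmd, read_cmd) as argv lists; nothing is executed."""
    # Step 1, on the networked machine: update the local mirror from the
    # public server and record the applied changes in a batch file.
    write_cmd = ["rsync", "-a", f"--write-batch={batch_file}",
                 source, networked_copy]
    # Step 2, on the disconnected machine: replay the recorded changes.
    # No source argument is given; the batch file takes its place.
    read_cmd = ["rsync", "-a", f"--read-batch={batch_file}", offline_copy]
    return write_cmd, read_cmd


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    commands = sneakernet_commands(
        "rsync://mirror.example.org/pub/",
        "/srv/mirror/",
        "/mnt/offline/mirror/",
        "/media/usb/update.batch",
    )
    for cmd in commands:
        print(shlex.join(cmd))
```

	The replay is only meaningful if the offline copy was identical to the
networked copy before step 1, and if both commands use the same transfer
options (here just -a).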

> 
> The question of whether the batch files should be on the client or 
> server side is not easy to answer and in the end depends on exactly what 
> you're trying to do.  In general, I would say that since the contents of 
> the batch mode depend on the status of both client and server, there is 
> not a "natural" location for it.

	While I agree there is some symmetry in the _origin_ of the batch set
that would suggest that there is no natural location for it, I think the
_intended use_ of the batch set strongly suggests that it will usually belong
with the _receiver_ (irrespective of client/server).  Specifically, the batch
set is only useful for other receivers that are identical to the original
receiver.  The "knowledge" or "memory" of that exact state is more likely to
reside with the receiver (who just left that state) than with the sender (who
may never have been in that state).  Therefore it is more likely to be useful
to the receiver than to the sender.
	Consider that even in the report's example of pushing the batch sets
out to multiple mirrors, the authors recommend creating the batch set while
updating a "near" or local copy to reduce network load.  So, even when the
initiator has full control over the replication _source_, the act of creating
a batch set presumes such a degree of knowledge of, interest in, and control
over, the _destination_, that creating batch sets at the destination
(receiver) is not inappropriate. 
	Of course, the clincher is a case such as mine, where I have _no_
control or access to the sender/server.  I am only a client/receiver of a
public anonymous rsyncd server, and the batch set I create is obviously only
useful to me, so I'd like my receiver to create it.  It's probably a good
thing that --write-batch crashes the server's sender child.  The mirror
maintainers would probably be annoyed if I were filling their server hard
drives with my batch sets.  :-)

	All that said, I have no intention to remove write-batch hooks from
sender paths.  I figure, let the tool do what the tool does, and someone
smarter than I will figure out what to use it for.  However, IMHO, other than
for pure local updates, batch-mode is pretty close to useless unless the
receiver can write the batch set, and if I can pull it off, I will
provide a patch that does that.

	BTW, there is a work-around.  If you don't mind duplicating the mirror
twice, one solution is to do a regular (no --write-batch) rsync update of one
copy of the mirror, and then do the --write-batch during a local to local
rsync update of another copy of the mirror.  Actually, this has some real
advantages if your network connection is unreliable. 
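
	The work-around can be sketched the same way.  Again this is only a
Python helper that builds the command lines without running them; the server
URL and paths are hypothetical, and the --write-batch=FILE spelling is that of
later rsync releases.

```python
import shlex


def workaround_commands(source, first_copy, second_copy, batch_file):
    """Return (network_cmd, local_cmd) as argv lists; nothing is executed."""
    # Pass 1: ordinary network update of the first copy, with no batch
    # option.  If the connection drops, simply rerun this command.
    network_cmd = ["rsync", "-a", source, first_copy]
    # Pass 2: local-to-local update of the second copy from the first,
    # writing the batch file during this purely local transfer.
    local_cmd = ["rsync", "-a", f"--write-batch={batch_file}",
                 first_copy, second_copy]
    return network_cmd, local_cmd


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    commands = workaround_commands(
        "rsync://mirror.example.org/pub/",
        "/srv/mirror-a/",
        "/srv/mirror-b/",
        "/media/usb/update.batch",
    )
    for cmd in commands:
        print(shlex.join(cmd))
```

	Because the batch is written only in the second, local pass, an
interrupted network transfer never leaves a partial batch set behind.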

	Thanks for your input.

	-Chris
> 
> 
> -- Alberto
> 
> ********************************************************************
> Alberto Accomazzi                      aaccomazzi(at)cfa harvard edu
> NASA Astrophysics Data System                        ads.harvard.edu
> Harvard-Smithsonian Center for Astrophysics      www.cfa.harvard.edu
> 60 Garden St, MS 31, Cambridge, MA 02138, USA
> ********************************************************************
>