batch-mode fixes [was: [PATCH] fix read-batch SEGFAULT]

Mon May 17 14:15:23 GMT 2004

Chris,

To put things in the right perspective, you should read (if you haven't 
done so already) the original paper describing the design behind batch 
mode.  The design and implementation of this functionality go back to 
a project called the Internet2 Distributed Storage Infrastructure 
(I2-DSI).  As part of that project, the authors created a modified 
version of rsync (called rsync+) which had the capability of creating 
these batch sets for mirroring.  Here are a couple of URLs describing 
the ideas and motivation behind it:
http://www.ils.unc.edu/i2dsi/unc_rsync+.html
http://www.ils.unc.edu/ils/research/reports/TR-1999-01.pdf

Chris Shoemaker wrote:

> 	Yes, I think you're right about the original design.  And I guess we'd
> want to preserve that capability.  Or would we?
> 	I'm having a little trouble seeing why this was the intended 
> use.  I figure, there are three cases:
> 
>    A) If you have access to both source and dest, it doesn't really matter too
> much who writes the batch -- this is like the local copy case.
>    B) If you have access to the dest but not the source, then you need the
> client to write the batch -- and it's not far-fetched that you might have
> other copies of dest to update.
>    C) However, having access to source but not dest is the only case that
> _requires_ the sender to write the batch -- now what's the chance that you'll
> have another identical dest to apply the batch to?  And if you did, why
> wouldn't you generate the batch on that dest as in case A, above?
> 
>    So, it seems to me that it's much more useful to have the receiver/client 
> write the batch than sender/client, or receiver/server, or sender/server.  
> But, maybe I'm just not appreciating what the potential uses of batch-mode 
> are.
> 
>    Survey: so who uses batch-mode and what for?

I haven't used the feature but back when I read the docs on rsync+ I 
thought it was a clever way to do multicasting on the cheap.  I think 
the only scenario where batch mode makes sense is when you need to 
distribute updates from a particular archive to a (large) number of 
mirror sites and you have tight control over the state of both client and 
server (so that you know exactly what needs to be updated on the mirror 
sites).  This ensures that you can create a set of batch files that 
contain *all* the changes necessary for updating each mirror site.

So basically I would use batch mode if I had a situation in which:

1) all mirror sites have the same set of files
2) rsync is invoked from each mirror site in exactly the same way (i.e. 
same command-line options) to pull data from a master server

then instead of having N sites invoke rsync against the same archive, I 
would invoke it once, make it write out a set of batch files, then 
transfer the batch files to each client and run rsync locally using the 
batch set.  The advantage of this is that the server only performs its 
computations once.  An example of this usage would be using rsync to 
upgrade a Linux distribution, say going from Fedora Core (FC) 1 to FC 2.  
All files in each release are frozen, so you should be able to create a 
single batch which incorporates all the changes and then apply it at 
each site carrying the distro.
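
The workflow above can be sketched as a toy model in Python.  This is 
only an illustration under stated assumptions: a "tree" is a plain dict 
mapping path to file contents, and make_batch/apply_batch are 
hypothetical helpers, not rsync's batch file format or its delta 
algorithm (real batch files carry deltas against the destination's 
existing files).  It shows the batch being computed once, replayed on 
several identical mirrors, and falling short on a mirror whose state has 
drifted from the one the batch was generated against.

```python
# Toy model of the batch idea: a "tree" is a dict of path -> file contents.
# This is NOT rsync's batch format; it only illustrates the workflow.

def make_batch(source, dest):
    """Record what must change in dest to make it match source."""
    batch = {"write": {}, "delete": []}
    for path, data in source.items():
        if dest.get(path) != data:      # new or modified file
            batch["write"][path] = data
    for path in dest:
        if path not in source:          # file no longer on the master
            batch["delete"].append(path)
    return batch

def apply_batch(batch, mirror):
    """Replay a recorded batch on a mirror without consulting the source."""
    updated = dict(mirror)
    updated.update(batch["write"])
    for path in batch["delete"]:
        updated.pop(path, None)
    return updated

# Hypothetical frozen releases (names and contents are made up).
old_release = {"README": "v1", "kernel.rpm": "2.4", "old-tool.rpm": "1.0"}
new_release = {"README": "v2", "kernel.rpm": "2.6", "new-tool.rpm": "1.0"}

# The expensive comparison happens once, against one copy of the old release.
batch = make_batch(new_release, old_release)

# Mirrors in the same state as that copy all end up matching the master.
mirrors = [dict(old_release) for _ in range(3)]
results = [apply_batch(batch, m) for m in mirrors]
all_in_sync = all(r == new_release for r in results)

# A mirror that drifted (an extra local file) is not fully brought in sync,
# because the batch knows nothing about the unexpected file.
drifted = dict(old_release)
drifted["local-notes.txt"] = "x"
drifted_in_sync = apply_batch(batch, drifted) == new_release
```

In this model the identical mirrors converge on the new release while 
the drifted one keeps its stray file, which is why condition 1 above 
matters: the batch encodes assumptions about the destination as well as 
the source.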

The question of whether the batch files should be on the client or 
server side is not easy to answer and in the end depends on exactly what 
you're trying to do.  In general, I would say that since the contents of 
the batch files depend on the state of both client and server, there is 
no "natural" location for them.

-- Alberto

********************************************************************
Alberto Accomazzi                      aaccomazzi(at)cfa harvard edu
NASA Astrophysics Data System                        ads.harvard.edu
Harvard-Smithsonian Center for Astrophysics      www.cfa.harvard.edu
60 Garden St, MS 31, Cambridge, MA 02138, USA
********************************************************************