batch-mode fixes [was: [PATCH] fix read-batch SEGFAULT]

Tue May 18 15:11:51 GMT 2004

Wayne Davison wrote:
> 
>>> The "knowledge" or "memory" of that exact state is more likely to
>>> reside with the receiver (who just left that state) than with the
>>> sender (who may never have been in that state).  Therefore it is more
>>> likely to be useful to the receiver than to sender.
> 
> 
> This is only true if you imagine a receiver doing one pull and then
> forwarding the update on to multiple hosts.  For instance, if you
> use a pull to create the batch files and then make them available
> for people to download, which would help to alleviate load from the
> original server.  That said, I think most of the time a receiver is
> going to be a leaf node, so the server tends to be the place where
> a batch is more likely to be useful, IMO.
> 
> In thinking about batch mode, it seems like its restrictions make
> it useful in only a very small set of circumstances.  Since the
> receiving systems must all have identical starting hierarchies, it
> really does limit how often it can be used.

I completely agree with Wayne's assessment here.  But just to make things 
clear, let's restate what batch mode provides:

1. a (partial) set of metadata about the state of the sender
2. a (partial) set of metadata about the state of the receiver
3. an rsync-style patch for files that differ in 1. and 2.

So while 1+2+3 together may be too restrictive to be useful in mirroring 
datasets, having the capability to create and cache just 1 or 2 may be a 
big win for busy servers.
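
To make the split concrete, here is a minimal Python sketch of how 1 and 2 
can be captured independently and only combined when 3 is needed.  This is 
not rsync code: the names snapshot() and files_needing_patch(), and the 
choice of size plus mtime as the "partial metadata", are my own assumptions 
for illustration.

```python
import os
import stat


def snapshot(root):
    """Return {relative_path: (size, mtime)} for regular files under root.

    Hypothetical stand-in for the partial metadata of one side (item 1 or 2).
    """
    meta = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.lstat(full)
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks, devices, etc. in this sketch
            meta[os.path.relpath(full, root)] = (st.st_size, int(st.st_mtime))
    return meta


def files_needing_patch(sender_meta, receiver_meta):
    """Paths on the sender that are missing or different on the receiver.

    These are the files for which item 3 (the patch) would be generated.
    """
    return sorted(
        path for path, info in sender_meta.items()
        if receiver_meta.get(path) != info
    )
```

The point is that snapshot() of the sender depends only on the sender, so 
it can be computed once and reused for any number of receivers; only the 
comparison step depends on a particular receiver's state.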

> I'm wondering if batch mode should be removed from the main rsync
> release and relegated to a parallel project?  It seems to me that a
> better feature for the mainstream utility would be something that
> optimized away some of the load on the sending system when it is
> serving lots of users.  So, having the ability to cache a directory
> tree's information, and the ability to cache checksums for files
> would be useful (especially if the data was auto-updated as it
> became stale).  That would make all transfers more optimal,
> regardless of what files the receiving system started from.

First of all, I have a feeling that the number of people who have 
*considered* using batch mode is quite small, and the number who have 
actually used it in the recent past is certainly even smaller (I'm 
thinking zero, actually).  So removing the functionality from the 
mainstream rsync would not be a problem, in fact I think it would be a 
good thing.  It doesn't make sense to keep something in the code that is 
not used and cannot be reliably supported.  Although I applaud Jos's 
efforts in providing this functionality to rsync, I was surprised to see 
it included in the main distribution, especially since it underwent 
virtually no testing as far as I can tell.

There's no doubt that caching the file list on the server side would 
indeed be a very useful feature for all those who use rsyncd as a 
distribution method.  We all know how difficult it can be to reliably 
rsync a large directory tree because of the memory and I/O costs in 
keeping a huge filelist in memory.  This may best be done by creating a 
separate helper application (say rsyncd-cache or such) that can be run 
on a regular basis to create a cached version of a directory tree 
corresponding to an rsyncd "module" on the server side.  The trick in 
getting this right will be to separate out the client-supplied options 
concerning file selection, checksumming, etc., so that the cache is as 
general as possible and can be used for a large set of connections so as 
to minimize the number of times that the actual filesystem is scanned.
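
A rough Python sketch of what such a helper could look like follows.  
Everything in it is hypothetical: the function names, the JSON cache 
format, and the use of fnmatch patterns as a stand-in for client-supplied 
exclude rules (rsync's real exclude semantics are richer than this).  It is 
only meant to show the cache being built without any client options, with 
selection applied per connection afterwards.

```python
import fnmatch
import json
import os
import time


def build_cache(module_root):
    """Scan the tree once, recording every file regardless of client options."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(module_root):
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            st = os.lstat(full)
            entries.append({
                "path": os.path.relpath(full, module_root),
                "size": st.st_size,
                "mtime": int(st.st_mtime),
            })
    return {"built_at": time.time(), "entries": entries}


def save_cache(cache, cache_file):
    """Write to a temporary file, then rename it into place."""
    tmp = cache_file + ".tmp"
    with open(tmp, "w") as fh:
        json.dump(cache, fh)
    os.replace(tmp, cache_file)  # readers never see a half-written cache


def load_cache(cache_file, max_age_seconds):
    """Return the cached tree, or None if it is missing, corrupt, or too old."""
    try:
        with open(cache_file) as fh:
            cache = json.load(fh)
    except (OSError, ValueError):
        return None
    if time.time() - cache["built_at"] > max_age_seconds:
        return None
    return cache


def select_for_client(cache, exclude_patterns):
    """Apply one connection's exclude patterns to the shared cache."""
    return [
        entry for entry in cache["entries"]
        if not any(fnmatch.fnmatch(entry["path"], pat)
                   for pat in exclude_patterns)
    ]
```

The helper would run build_cache() and save_cache() from cron, while each 
connection would call load_cache() and select_for_client() and fall back 
to a real filesystem scan only when load_cache() returns None.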

> Such a new feature would probably best be added to an rsync
> replacement project, though.

Hmmm... "replacement"?  Why not make this a utility that can be run 
alongside an rsync daemon?  Or are you thinking of a design for a "new" 
rsync?

-- Alberto

********************************************************************
Alberto Accomazzi                      aaccomazzi(at)cfa harvard edu
NASA Astrophysics Data System                        ads.harvard.edu
Harvard-Smithsonian Center for Astrophysics      www.cfa.harvard.edu
60 Garden St, MS 31, Cambridge, MA 02138, USA
********************************************************************