batch-mode fixes [was: [PATCH] fix read-batch SEGFAULT]
Alberto Accomazzi
aaccomazzi at cfa.harvard.edu
Tue May 18 15:11:51 GMT 2004
Wayne Davison wrote:
>
>>> The "knowledge" or "memory" of that exact state is more likely to
>>> reside with the receiver (who just left that state) than with the
>>> sender (who may never have been in that state). Therefore it is more
>>> likely to be useful to the receiver than to sender.
>
>
> This is only true if you imagine a receiver doing one pull and then
> forwarding the update on to multiple hosts. For instance, if you
> use a pull to create the batch files and then make them available
> for people to download, which would help to alleviate load from the
> original server. That said, I think most of the time a receiver is
> going to be a leaf node, so the server tends to be the place where
> a batch is more likely to be useful, IMO.
>
> In thinking about batch mode, it seems like its restrictions make
> it useful in only a very small set of circumstances. Since the
> receiving systems must all have identical starting hierarchies, it
> really does limit how often it can be used.
I completely agree with Wayne's assessment here. But just to make things
clear, let's restate what batch mode provides:
1. a (partial) set of metadata about the state of the sender
2. a (partial) set of metadata about the state of the receiver
3. an rsync-style patch for files that differ in 1. and 2.
so while 1+2+3 may be too restrictive to be useful in mirroring
datasets, having the capability to create and cache just 1 or 2 may be a
big win for busy servers.
> I'm wondering if batch mode should be removed from the main rsync
> release and relegated to a parallel project? It seems to me that a
> better feature for the mainstream utility would be something that
> optimized away some of the load on the sending system when it is
> serving lots of users. So, having the ability to cache a directory
> tree's information, and the ability to cache checksums for files
> would be useful (especially if the data was auto-updated as it
> became stale). That would make all transfers more optimal,
> regardless of what files the receiving system started from.
First of all, I have a feeling that the number of people who have
*considered* using batch mode is quite small, and the number who have
actually used it in the recent past is certainly even smaller (I'm
thinking zero, actually). So removing the functionality from the
mainstream rsync would not be a problem; in fact, I think it would be a
good thing. It doesn't make sense to keep something in the code that is
not used and cannot be reliably supported. Although I applaud Jos's
efforts in providing this functionality to rsync, I was surprised to see
it included in the main distribution, especially since it underwent
virtually no testing as far as I can tell.
There's no doubt that caching the file list on the server side would
indeed be a very useful feature for all those who use rsyncd as a
distribution method. We all know how difficult it can be to reliably
rsync a large directory tree because of the memory and I/O costs in
keeping a huge filelist in memory. This may best be done by creating a
separate helper application (say rsyncd-cache or such) that can be run
on a regular basis to create a cached version of a directory tree
corresponding to an rsyncd "module" on the server side. The trick in
getting this right will be to separate out the client-supplied options
concerning file selection, checksumming, etc., so that the cache is as
general as possible and can be used for a large set of connections so as
to minimize the number of times that the actual filesystem is scanned.
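A minimal sketch of what such a helper might look like, under stated assumptions: the function names (build_cache, save_cache, load_cache, select), the JSON cache format, and the use of fnmatch patterns are all hypothetical illustrations, not part of rsync or rsyncd, and fnmatch does not reproduce rsync's own exclude-pattern rules. The point it tries to show is the separation described above: the filesystem is scanned once into an option-independent list, and per-connection file selection is applied to the cached list afterwards.

```python
import fnmatch
import json
import os


def build_cache(module_root):
    """Scan the tree once and record option-independent metadata per file."""
    entries = []
    for dirpath, dirnames, filenames in os.walk(module_root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            st = os.lstat(full)  # do not follow symlinks
            entries.append({
                "path": os.path.relpath(full, module_root),
                "size": st.st_size,
                "mtime": int(st.st_mtime),
                "mode": st.st_mode,
            })
    return entries


def save_cache(entries, cache_file):
    """Write the cache to a temporary file, then rename it into place."""
    tmp = cache_file + ".tmp"
    with open(tmp, "w") as f:
        json.dump(entries, f)
    os.replace(tmp, cache_file)


def load_cache(cache_file):
    """Read a previously saved cache without touching the module tree."""
    with open(cache_file) as f:
        return json.load(f)


def select(entries, exclude_patterns=()):
    """Apply per-connection file selection to the cached list only."""
    return [
        e for e in entries
        if not any(fnmatch.fnmatch(e["path"], p) for p in exclude_patterns)
    ]
```

In this sketch a periodic job would call build_cache and save_cache for each module, while each connection would call load_cache followed by select with its own patterns. Checksums are deliberately left out; caching them and detecting stale entries would need additional design.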
> Such a new feature would probably best be added to an rsync
> replacement project, though.
Hmmm... "replacement"? Why not make this a utility that can be run
alongside an rsync daemon? Or are you thinking of a design for a "new"
rsync?
-- Alberto
********************************************************************
Alberto Accomazzi aaccomazzi(at)cfa harvard edu
NASA Astrophysics Data System ads.harvard.edu
Harvard-Smithsonian Center for Astrophysics www.cfa.harvard.edu
60 Garden St, MS 31, Cambridge, MA 02138, USA
********************************************************************