batch-mode fixes [was: [PATCH] fix read-batch SEGFAULT]

Alberto Accomazzi aaccomazzi at cfa.harvard.edu
Wed May 19 14:40:26 GMT 2004


Chris Shoemaker wrote:

>>There's no doubt that caching the file list on the server side would 
>>indeed be a very useful feature for all those who use rsyncd as a 
>>distribution method.  We all know how difficult it can be to reliably 
>>rsync a large directory tree because of the memory and I/O costs in 
>>keeping a huge filelist in memory.  This may best be done by creating a 
>>separate helper application (say rsyncd-cache or such) that can be run 
>>on a regular basis to create a cached version of a directory tree 
>>corresponding to an rsyncd "module" on the server side.  The trick in 
>>getting this right will be to separate out the client-supplied options 
>>concerning file selection, checksumming, etc., so that the cache is as 
>>general as possible and can be used for a large set of connections so as 
>>to minimize the number of times that the actual filesystem is scanned.
> 
> 
> 	What client options are you thinking will be tricky?  Wouldn't the 
> helper app just cache _all_ the metadata for the module, and then rsync would 
> query only the subset it needed?  It's not like the client can change the 
> checksum stride.  [That would hurt.]

What I'm referring to are the options a client passes to the server 
that influence file selection, checksumming, and block generation.  I 
haven't looked at the rsync source code in quite a while, but off the 
top of my head here are the issues to look at when considering caching a 
filesystem scan:

1. Exclude/include patterns:
  -C, --cvs-exclude           auto ignore files in the same way CVS does
      --exclude=PATTERN       exclude files matching PATTERN
      --exclude-from=FILE     exclude patterns listed in FILE
      --include=PATTERN       don't exclude files matching PATTERN
      --include-from=FILE     don't exclude patterns listed in FILE
      --files-from=FILE       read FILE for list of source-file names

These should be easy to deal with: I would simply have the cache creator 
ignore any --exclude options passed by the client (but probably honor 
the ones defined in a daemon config file).
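To sketch what I mean (a Python illustration, with fnmatch as a crude stand-in for rsync's real pattern matcher, which also handles anchoring, directory-only patterns, and so on): build the cache with only the daemon-config excludes applied, then filter the cached list per-connection using whatever patterns the client sent.

```python
# Hypothetical sketch: the cache holds the *full* module file list
# (daemon-config excludes already applied at cache-build time); client
# --exclude/--include patterns are applied by filtering at query time.
import fnmatch

def filter_cached_list(cached_paths, excludes, includes=()):
    """Return the subset of a cached file list visible to this client.

    fnmatch is only a stand-in: rsync's matcher has more rules
    (anchored patterns, trailing '/', '**', etc.).
    """
    result = []
    for path in cached_paths:
        if any(fnmatch.fnmatch(path, pat) for pat in includes):
            result.append(path)      # --include overrides --exclude
        elif any(fnmatch.fnmatch(path, pat) for pat in excludes):
            continue                 # client-requested exclusion
        else:
            result.append(path)
    return result

cache = ["src/main.c", "src/main.o", "doc/README", "CVS/Entries"]
print(filter_cached_list(cache, excludes=["*.o", "CVS/*"]))
```

The point is that none of these client patterns needs to invalidate the cache; they only narrow what is served from it.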

2. Other file selection options:
  -x, --one-file-system       don't cross filesystem boundaries
  -S, --sparse                handle sparse files efficiently
  -l, --links                 copy symlinks as symlinks
  -L, --copy-links            copy the referent of all symlinks
      --copy-unsafe-links     copy the referent of "unsafe" symlinks
      --safe-links            ignore "unsafe" symlinks

It's possible that these can also be dealt with easily, but I'm not so 
sure.  Clearly -x influences what gets scanned, so how do you decide 
what to cache?  The other options are probably easier to deal with.
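One possible answer for -x, sketched below in Python (the helper names are invented for illustration): scan everything once, but record each entry's device number, so that a --one-file-system request can be answered by filtering the cache rather than rescanning.

```python
# Hypothetical sketch: cache st_dev alongside each path so -x becomes
# a query-time filter instead of a scan-time decision.
import os

def scan_with_devices(root):
    """Walk root and return [(relative_path, st_dev)] for every entry."""
    entries = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            st = os.lstat(full)
            entries.append((os.path.relpath(full, root), st.st_dev))
    return entries

def apply_one_file_system(entries, root_dev):
    """Emulate -x: keep only entries on the same device as the module root."""
    return [path for path, dev in entries if dev == root_dev]
```

The symlink options could get similar treatment by caching both the link itself and what it points at, though that roughly doubles the stat() work at cache-build time.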

3. File checksums:
  -c, --checksum              always checksum

Should the caching operation always checksum so that the checksums are 
readily available when a client sets -c?  That could mean a lot of 
computation and disk I/O which is wasted if clients never use this 
option.
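A middle ground would be to compute whole-file checksums lazily, on the first -c request, and key each cache entry on (size, mtime) so a changed file invalidates its own entry.  A rough Python sketch (MD5 is just a stand-in here; rsync itself uses MD4 or MD5 depending on version):

```python
# Hypothetical sketch: lazy whole-file checksum cache, invalidated by
# a change in the file's size or mtime.
import hashlib
import os

_sum_cache = {}   # path -> (size, mtime, hexdigest)

def cached_checksum(path):
    st = os.stat(path)
    entry = _sum_cache.get(path)
    if entry and entry[0] == st.st_size and entry[1] == st.st_mtime:
        return entry[2]               # cache hit: no file read needed
    with open(path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    _sum_cache[path] = (st.st_size, st.st_mtime, digest)
    return digest
```

That way the first -c client pays the I/O cost and later ones get the sums for free, while non-checksum clients never trigger it at all.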

4. Block checksums:
  -B, --block-size=SIZE       checksum blocking size (default 700)

It would be great if we could cache the rolling block checksums as they 
are computed but this may be even harder (or impossible) to deal with. 
And it looks like soon we'll have a new checksum-seed option which will 
further complicate the issue (in fact I admit I have no idea about how 
all of this works beyond versions 2.5.x; maybe somebody with more 
knowledge on the subject will chime in).


So I'm just pointing out that in order to create a cache with a high hit 
probability you have to make assumptions and choices that may be 
non-trivial.  The best solution is probably to reduce the scope of the 
cache so that it contains only the initial file list generated under 
default settings, or else to keep a set of different caches created 
with different options.  I, for one, have consistently been using the 
--checksum option when distributing some sensitive data to our mirror 
sites, so I would want that to be covered by a cache.
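The "set of different caches" idea could be as simple as deriving the cache file name from the subset of options that actually affect the cached list.  Purely illustrative Python (the option subset and naming scheme here are invented):

```python
# Hypothetical sketch: one cache file per "option signature", so a
# --checksum scan and a default scan can coexist for the same module.
import hashlib

def cache_key(module, options):
    """Derive a stable cache-file name from the options that matter."""
    relevant = sorted(opt for opt in options
                      if opt in ("--checksum", "--one-file-system", "--links"))
    sig = hashlib.md5(" ".join(relevant).encode()).hexdigest()[:12]
    return "%s.%s.cache" % (module, sig)

print(cache_key("mirror", ["--checksum", "-v"]))
```

Options that don't change the cached data (like -v) are deliberately ignored, so they all hit the same cache file.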


-- Alberto


********************************************************************
Alberto Accomazzi                      aaccomazzi(at)cfa harvard edu
NASA Astrophysics Data System                        ads.harvard.edu
Harvard-Smithsonian Center for Astrophysics      www.cfa.harvard.edu
60 Garden St, MS 31, Cambridge, MA 02138, USA
********************************************************************


