batch-mode fixes [was: [PATCH] fix read-batch SEGFAULT]
Alberto Accomazzi
aaccomazzi at cfa.harvard.edu
Wed May 19 14:40:26 GMT 2004
Chris Shoemaker wrote:
>>There's no doubt that caching the file list on the server side would
>>indeed be a very useful feature for all those who use rsyncd as a
>>distribution method. We all know how difficult it can be to reliably
>>rsync a large directory tree because of the memory and I/O costs in
>>keeping a huge filelist in memory. This may best be done by creating a
>>separate helper application (say rsyncd-cache or such) that can be run
>>on a regular basis to create a cached version of a directory tree
>>corresponding to an rsyncd "module" on the server side. The trick in
>>getting this right will be to separate out the client-supplied options
>>concerning file selection, checksumming, etc., so that the cache is as
>>general as possible and can be used for a large set of connections so as
>>to minimize the number of times that the actual filesystem is scanned.
>
>
> What client options are you thinking will be tricky? Wouldn't the
> helper app just cache _all_ the metadata for the module, and then rsync would
> query only the subset it needed? It's not like the client can change the
> checksum stride. [That would hurt.]
What I'm referring to are those options that a client passes to the
server which influence file selection, checksum and block generation. I
haven't looked at the rsync source code in quite a while, but off the
top of my head here are the issues to look at when considering caching a
filesystem scan:
1. Exclude/include patterns:
 -C, --cvs-exclude           auto ignore files in the same way CVS does
     --exclude=PATTERN       exclude files matching PATTERN
     --exclude-from=FILE     exclude patterns listed in FILE
     --include=PATTERN       don't exclude files matching PATTERN
     --include-from=FILE     don't exclude patterns listed in FILE
     --files-from=FILE       read FILE for list of source-file names
These should be easy to deal with: I would simply have the cache creator
ignore any --exclude options passed by the client (but probably honor
the ones defined in a daemon config file).
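
To make the idea concrete, here is a minimal sketch in Python: the cache holds every path in the module, and client-supplied patterns are applied only when a request is served. It assumes fnmatch as a stand-in for rsync's own pattern matching, which follows different rules (anchoring, trailing slashes, "**"). The function select and the sample paths are hypothetical.

```python
from fnmatch import fnmatch

def select(cached_paths, rules):
    """Filter a cached path list with client rules; the first matching rule wins.

    rules is a list of ("include" | "exclude", pattern) pairs checked in order.
    A path that matches no rule is kept.
    """
    selected = []
    for path in cached_paths:
        name = path.rsplit("/", 1)[-1]  # match on the final path component only
        keep = True
        for action, pattern in rules:
            if fnmatch(name, pattern):  # stand-in for rsync's matcher
                keep = (action == "include")
                break
        if keep:
            selected.append(path)
    return selected

# Hypothetical cache contents and client rules.
cache = ["src/main.c", "src/main.o", "doc/README", "doc/keep.o"]
client_rules = [("include", "keep.o"), ("exclude", "*.o")]
print(select(cache, client_rules))
```

With these rules the cached list is reduced to src/main.c, doc/README and doc/keep.o without rescanning the filesystem.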
2. Other file selection options:
 -x, --one-file-system       don't cross filesystem boundaries
 -S, --sparse                handle sparse files efficiently
 -l, --links                 copy symlinks as symlinks
 -L, --copy-links            copy the referent of all symlinks
     --copy-unsafe-links     copy the referent of "unsafe" symlinks
     --safe-links            ignore "unsafe" symlinks
It's possible that these can also be dealt with easily, but I'm not so
sure. Clearly -x influences what gets scanned, so how do you decide
what to cache? The other options are probably easier to deal with.
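
One possible answer for -x, sketched in Python under stated assumptions: scan across all mounts, record each entry's device number, and filter at request time. Entry, scan and one_file_system are hypothetical names, the device numbers in the example are made up, and how rsync treats the mount-point directory itself is not modeled.

```python
import os
from collections import namedtuple

# One cached record per filesystem entry; dev is st_dev from lstat().
Entry = namedtuple("Entry", ["path", "dev"])

def scan(root):
    """Walk root across every mount, recording each entry's device number."""
    entries = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            entries.append(Entry(os.path.relpath(full, root), os.lstat(full).st_dev))
    return entries

def one_file_system(entries, root_dev):
    """Keep only the entries that live on the same device as the module root."""
    return [e for e in entries if e.dev == root_dev]

# Hand-built entries so the example does not depend on real mount points.
cached = [Entry("a.txt", 2049), Entry("mnt/b.txt", 2065), Entry("c/d.txt", 2049)]
print([e.path for e in one_file_system(cached, root_dev=2049)])
```

The cost is that the cache always pays for the widest possible scan, even if most clients pass -x.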
3. File checksums:
 -c, --checksum              always checksum
Should the caching operation always checksum so that the checksums are
readily available when a client sets -c? This can lead to a lot of
computation and disk I/O, which may be unnecessary if clients do not
use this option.
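
A middle ground would be to compute whole-file checksums lazily, the first time a client asks for them, and reuse them afterwards. The Python sketch below assumes a hypothetical ChecksumCache class and uses MD5 only because hashlib always provides it; rsync of this era uses MD4 for file checksums.

```python
import hashlib
import os
import tempfile

class ChecksumCache:
    """Whole-file checksums, computed on first request and reused while
    the file's size and mtime are unchanged."""

    def __init__(self):
        self._store = {}   # path -> (size, mtime_ns, hex digest)
        self.computed = 0  # number of times a file was actually read

    def get(self, path):
        st = os.stat(path)
        hit = self._store.get(path)
        if hit and hit[0] == st.st_size and hit[1] == st.st_mtime_ns:
            return hit[2]
        h = hashlib.md5()  # stand-in for rsync's MD4
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        self.computed += 1
        self._store[path] = (st.st_size, st.st_mtime_ns, h.hexdigest())
        return h.hexdigest()

# Two requests for the same unchanged file read it only once.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
cache = ChecksumCache()
first = cache.get(tmp.name)
second = cache.get(tmp.name)
print(first == second, cache.computed)
os.unlink(tmp.name)
```

Trusting size and mtime reintroduces exactly the assumption that -c is meant to avoid, so a real cache would need a stricter invalidation rule.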
4. Block checksums:
 -B, --block-size=SIZE       checksum blocking size (default 700)
It would be great if we could cache the rolling block checksums as they
are computed, but this may be even harder (or impossible) to deal with.
And it looks like soon we'll have a new checksum-seed option which will
further complicate the issue (in fact I admit I have no idea about how
all of this works beyond versions 2.5.x; maybe somebody with more
knowledge on the subject will chime in).
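
The difficulty can be illustrated with a small Python sketch: a stored set of block sums is reusable only when both the block size and the seed match, so both have to be part of the cache key. Everything here is an assumption made for illustration: zlib.adler32 stands in for rsync's rolling checksum, MD5 for MD4, and the way the seed is mixed into the strong sum is not rsync's actual scheme. block_sums and BlockSumCache are hypothetical names.

```python
import hashlib
import zlib

def block_sums(data, block_size, seed):
    """Return one (weak, strong) pair per block of data."""
    seed_bytes = seed.to_bytes(4, "little")
    sums = []
    for off in range(0, len(data), block_size):
        block = data[off:off + block_size]
        weak = zlib.adler32(block)                           # stand-in weak sum
        strong = hashlib.md5(block + seed_bytes).hexdigest() # stand-in strong sum
        sums.append((weak, strong))
    return sums

class BlockSumCache:
    """Block sums keyed by (file name, block size, seed)."""

    def __init__(self):
        self._store = {}

    def get(self, name, data, block_size=700, seed=0):
        key = (name, block_size, seed)
        if key not in self._store:
            self._store[key] = block_sums(data, block_size, seed)
        return self._store[key]

data = bytes(range(256)) * 10  # 2560 bytes of sample content
bcache = BlockSumCache()
a = bcache.get("f", data)                   # default block size, seed 0
b = bcache.get("f", data, block_size=1024)  # different block size
c = bcache.get("f", data, seed=1)           # different seed
print(len(a), len(b), len(bcache._store))
```

Each distinct block size or seed produces a separate entry, which is why a per-connection seed would make such a cache nearly useless for the strong sums.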
So I'm just pointing out that in order to create a cache with a high hit
probability you have to make assumptions and choices that may be
non-trivial. Probably the best solution is reducing the scope of the
cache so that it contains only the initial file list generation under
default settings, or maybe you want to have a set of different caches
created using different options. I, for one, have consistently been
using the --checksum option when distributing some sensitive data to our
mirror sites, so I would want that to be included in a cache.
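
The "set of different caches" variant could look like the following Python sketch. SCAN_OPTIONS is an assumed, non-exhaustive set of options that change what a scan produces, and the cache names are hypothetical. A request whose options match no pre-built cache falls back to a live scan.

```python
# Options assumed to change the result of a filesystem scan (not exhaustive).
SCAN_OPTIONS = {"--checksum", "--one-file-system", "--copy-links"}

def cache_key(client_options):
    """Reduce a client's option list to the options that affect the scan."""
    return frozenset(o for o in client_options if o in SCAN_OPTIONS)

def pick_cache(caches, client_options):
    """Return the cache built for these options, or None to force a live scan."""
    return caches.get(cache_key(client_options))

# Hypothetical pre-built caches for one module.
caches = {
    frozenset(): "filelist.default",
    frozenset({"--checksum"}): "filelist.checksum",
}
print(pick_cache(caches, ["--checksum", "--exclude=*.o"]))
print(pick_cache(caches, ["--one-file-system"]))
```

Here the first request is served from the checksum cache, because the exclude pattern can be applied afterwards, while the second has no matching cache and needs a real scan.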
-- Alberto
********************************************************************
Alberto Accomazzi aaccomazzi(at)cfa harvard edu
NASA Astrophysics Data System ads.harvard.edu
Harvard-Smithsonian Center for Astrophysics www.cfa.harvard.edu
60 Garden St, MS 31, Cambridge, MA 02138, USA
********************************************************************