filelist caching optimization proposal
wayned at samba.org
Mon May 23 16:21:53 GMT 2005
On Mon, May 23, 2005 at 03:24:07PM +0200, Edwin Eefting wrote:
> My idea is to create a patch for something like a --cache option that
> will use a cached version of the filelist:
Something like that would be fairly easy to write, but only if there are
no conflicts between the cache and the live disk. One would simply need
an on-disk representation for the file-list's in-memory data structure,
and a way to save/restore it. If you limited the code to a single
source hierarchy, it might even be possible to use the current send &
receive code for the file-list (with just a little touch-up of the
dir.root value that is sender-side only, and thus not properly set by
the receive code). Every time the server updates, it would want to use
an atomic-update algorithm, like the one implemented in the atomic-rsync
perl script in the "support" dir (which uses a parallel hierarchy and
the --link-dest option to update all the files at the same time).
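As a rough illustration, the parallel-hierarchy trick can be sketched in
a few lines of shell (the function name and paths are mine, not from
atomic-rsync itself; it assumes a local source and GNU/Linux tools):

```shell
#!/bin/sh
# Sketch of the atomic-update idea: build the new tree next to the
# live one, hard-linking unchanged files via --link-dest, then swap
# the directories so readers never see a half-updated hierarchy.
atomic_update() {
    src=$1 live=$2
    new=$live.new old=$live.old
    rm -rf "$new" "$old"
    # --link-dest hard-links files identical to the live copy, so only
    # changed files consume transfer time and new disk space.
    rsync -a --delete --link-dest="$live" "$src"/ "$new"/ || return 1
    # The two renames below are the only non-atomic window.
    mv "$live" "$old" && mv "$new" "$live" && rm -rf "$old"
}
```

The real atomic-rsync script is more careful about error handling, but
the core idea is the same: all files change "at the same time" because
only directory renames touch the live path.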
An alternative to this --cache idea is to use the existing batch-file
mechanism to provide a daily (or twice daily, etc.) update method for
users. It would work like this:
- A master cache server would maintain its files using a batch-writing
rsync transfer that updates them atomically (as mentioned above) so that (1)
the batch-creating process can be restarted from scratch if the rsync
run doesn't finish successfully, and so that (2) users have a source
hierarchy that exactly matches the last batch-file's end-state.
- The resulting batch file would be put into a place where it could be
downloaded via some file-transfer protocol, such as on a webserver.
- As long as the user didn't modify the portage hierarchy between
batched runs, it would be possible to just apply each batched
transfer, one after the other, to update to the latest hierarchy. If
something goes wrong with the receive, it is safe to just run the
batch-reading command again (since rsync skips the updates that were
already applied; N.B. --partial must NOT be enabled.) As a fall-back,
a normal rsync command to fetch files from the server would update any
defects and get you back in sync with the batched updates.
- I'd imagine using something like an HTTP-capable perl script to grab
the data and output it on stdout -- this would let the batch be
processed as it arrived instead of being written out to disk first.
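The server and client halves of the scheme above might look roughly
like this (all paths and function names are illustrative, and I've
used curl where the post imagines an HTTP-capable perl script):

```shell
#!/bin/sh
# Server side: update a hard-linked working copy while recording the
# delta with --write-batch, then swap it in atomically so the
# published tree always matches the last batch's end-state.
publish_batch() {
    src=$1 live=$2 batch=$3
    new=$live.new
    rm -rf "$new"
    cp -al "$live" "$new"                 # hard-linked working copy
    # If this rsync fails, discard the copy and restart from scratch.
    rsync -a --delete --write-batch="$batch" "$src"/ "$new"/ \
        || { rm -rf "$new"; return 1; }
    mv "$live" "$live.old" && mv "$new" "$live" && rm -rf "$live.old"
}

# Client side: stream a batch straight into rsync.  --read-batch=-
# reads the batch data from stdin, so it is processed as it arrives
# instead of being written out to disk first.  Safe to re-run after a
# failure since rsync skips already-applied updates (remember that
# --partial must NOT be enabled).
fetch_and_apply() {
    url=$1 tree=$2
    curl -fsS "$url" | rsync --read-batch=- "$tree"/
}
```

Note that --write-batch also drops a companion *.sh helper script next
to the batch file; clients that download only the batch itself can
ignore it.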
Such an update mechanism would work quite well for a consistent N
batched updates per day (where N is not overly large). A set of source
servers could even use this method to mirror the N-update hierarchy
throughout the day. As long as the batch files are named uniquely, the
end-user doesn't need to run the command on a regular schedule: the
script could be smart enough to notice when the local portage hierarchy
was last updated and choose either to perform one or more batch-reading
runs or to fall back to doing a normal rsync update.
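Such a "smart" client script might be sketched like this (the state
file, the batch naming scheme with a trailing numeric stamp, and the
fall-back argument are all assumptions of mine):

```shell
#!/bin/sh
# Apply every uniquely-named batch newer than the last one we applied,
# in order; if any batch fails (or we are too far behind the available
# batches), fall back to a normal full rsync to get back in sync.
catch_up() {
    batchdir=$1 tree=$2 state=$3 fallback=$4
    last=$(cat "$state" 2>/dev/null)
    [ -n "$last" ] || last=0
    for batch in "$batchdir"/*.batch; do
        [ -e "$batch" ] || break                    # nothing pending
        # Batch names end in a monotonically increasing stamp,
        # e.g. portage-200505231200.batch (an assumed convention).
        stamp=$(basename "$batch" .batch | sed 's/.*-//')
        [ "$stamp" -gt "$last" ] || continue        # already applied
        if rsync --read-batch="$batch" "$tree"/; then
            echo "$stamp" > "$state"
        else
            rsync -a --delete "$fallback/" "$tree"/
            return
        fi
    done
}
```

Because the batch names are unique and ordered, the user can run this
whenever convenient; it simply applies whatever has accumulated since
the recorded stamp.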