filelist caching optimization proposal

Wayne Davison wayned at samba.org
Mon May 23 16:21:53 GMT 2005


On Mon, May 23, 2005 at 03:24:07PM +0200, Edwin Eefting wrote:
> My idea is to create a patch for something like a --cache option that
> will use a cached version of the filelist:

Something like that would be fairly easy to write, but only if there are
no conflicts between the cache and the live disk.  One would simply need
an on-disk representation for the file-list's in-memory data structure,
and a way to save/restore it.  If you limited the code to a single
source hierarchy, it might even be possible to use the current send &
receive code for the file-list (with just a little touch-up of the
dir.root value that is sender-side only, and thus not properly set by
the receive code).  Every time the server updates, it would want to use
an atomic-update algorithm, like the one implemented in the atomic-rsync
perl script in the "support" dir (which uses a parallel hierarchy and
the --link-dest option to update all the files at the same time).
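
For illustration, here is a rough sketch of that atomic-update idea
with hypothetical paths (the real atomic-rsync script manages the
parallel hierarchy and the final swap more carefully):

    # Build the new hierarchy next to the old one, hard-linking
    # unchanged files from the current tree via --link-dest:
    rsync -a --delete --link-dest=/srv/portage \
        rsync://master/portage/ /srv/portage.new/
    # Swap the trees into place and discard the old one:
    mv /srv/portage /srv/portage.old &&
        mv /srv/portage.new /srv/portage &&
        rm -rf /srv/portage.old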

An alternative to this --cache idea is to use the existing batch-file
mechanism to provide a daily (or twice daily, etc.) update method for
users.  It would work like this:

- A master cache server would maintain its files using a batch-writing
  rsync transfer that updates the hierarchy atomically (as mentioned
  above), so that (1) the batch-creating process can be restarted from
  scratch if the rsync run doesn't finish successfully, and (2) users
  have a source hierarchy that exactly matches the last batch-file's
  end-state.  (See the command sketch after this list.)

- The resulting batch file would be put into a place where it could be
  downloaded via some file-transfer protocol, such as on a webserver.

- As long as the user didn't modify the portage hierarchy between
  batched runs, it would be possible to just apply each batched
  transfer, one after the other, to update to the latest hierarchy.  If
  something goes wrong with the receive, it is safe to just run the
  batch-reading command again (since rsync skips the updates that were
  already applied; N.B. --partial must NOT be enabled).  As a
  fall-back, a normal rsync command to fetch files from the server
  would repair any defects and get you back in sync with the batched
  updates.
  
- I'd imagine using something like an HTTP-capable perl script to grab
  the data and output it on stdout -- this would let the batch be
  processed as it arrived instead of being written out to disk first,
  as in the pipe sketched below.
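
To make those steps concrete, here is a rough sketch of both halves
using rsync's batch options; the host names, paths, and batch-naming
scheme are only placeholders, and wget stands in for the HTTP-capable
script:

    # On the master cache server: update the mirror hierarchy and
    # record the changes in a batch file published on the webserver:
    rsync -a --delete --write-batch=/var/www/batches/portage-42 \
        rsync://master/portage/ /srv/portage/

    # On the client: pipe the batch straight into rsync; --read-batch=-
    # reads the batch data from stdin, so it is applied as it arrives:
    wget -qO - http://server/batches/portage-42 |
        rsync -a --read-batch=- /usr/portage/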

Such an update mechanism would work quite well for a consistent N
batched updates per day (where N is not overly large).  A set of source
servers could even use this method to mirror the N-update hierarchy
throughout the day.  As long as the batch files are named uniquely, the
end-user doesn't need to run the command on a regular schedule:  the
script could be smart enough to notice when the local portage hierarchy
was last updated and choose either to perform one or more batch-reading
runs or to fall back to doing a normal rsync update.
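
As a rough sketch of that client-side logic (the sequential batch
numbers, the LATEST index file, and the .last-batch state file are all
hypothetical):

    #!/bin/sh
    # Apply any batches we have missed, or fall back to a normal
    # rsync run if we are too far behind (or have never updated).
    last=`cat /usr/portage/.last-batch 2>/dev/null || echo 0`
    latest=`wget -qO - http://server/batches/LATEST`
    if [ $(($latest - $last)) -le 10 ]; then
        i=$(($last + 1))
        while [ $i -le $latest ]; do
            wget -qO - http://server/batches/portage-$i |
                rsync -a --read-batch=- /usr/portage/ || exit 1
            echo $i > /usr/portage/.last-batch
            i=$(($i + 1))
        done
    else
        # Too far behind: fall back to a normal rsync update.
        rsync -a --delete rsync://server/portage/ /usr/portage/ &&
            echo $latest > /usr/portage/.last-batch
    fi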

..wayne..

