Pipelined rsync proposal (was Re: superlifter design notes)

Sun Jul 28 00:46:04 EST 2002

On Sat, Jul 27, 2002 at 02:04:02PM -0700, Wayne Davison wrote:
> On Sun, 21 Jul 2002, jw schultz wrote:
> > What i am seeing is a Multi-stage pipeline.
> 
> This is quite an interesting design idea.  Let me comment on a few
> things that I've been mulling over since first reading it:
> 
> One thing you don't discuss in your data flow is auxiliary data flow.
> For instance, error messages need to go somewhere (perhaps mixed into
> the main data flow), and they need to get back to the side where the
> user resides.  This can add an extra network transfer after the update
> stage (6) to send errors back to the user (if the user is not on the
> same side as stage 6).

Yes, when the final stage is on a remote node the status
messages from that stage would need to be transmitted.

I haven't made up my mind on whether the status (and error)
messages should be in the primary pipeline (at least until
they reach the initiator) or on a separate stream.

> Another open issue is what we do when a file changes while we're
> transferring it.  Rsync sends a "redo" request to the generator process
> and it reruns all changed files at the end of the run.  If such a thing
> is desirable in this utility (instead of just warning the user that the
> file was unable to be updated), then this "redo" data flow also needs to
> be mapped out.  If this protocol remains more batch oriented, then it
> probably won't need to redo files -- just warn the user.

The redo is a neat idea but does require detecting that a
file has changed.  Some checks should probably be made but
we really can't guarantee that we would even detect a file
changing between scan and transmit.  I think that if a
change is detected we should emit a warning.  A more
elaborate framework could rerun changed files as rsync does
now.

Files should change under us much less often because the
delay between the initial scan and the transmit will be
shorter, but I know from experience that it will still
happen.
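
A minimal sketch of the kind of check described above, assuming the
scan stage records size and mtime for each file and the transmit stage
re-stats the file afterwards.  ScanRecord, scan, changed_since_scan and
warn_if_changed are hypothetical names, not part of rsync.  As noted, a
change that preserves both attributes would go undetected.

```python
import os
import sys
from typing import NamedTuple


class ScanRecord(NamedTuple):
    """Hypothetical per-file record captured by the tree-scan stage."""
    size: int
    mtime_ns: int


def scan(path: str) -> ScanRecord:
    """Record the attributes used later to detect a change."""
    st = os.stat(path)
    return ScanRecord(st.st_size, st.st_mtime_ns)


def changed_since_scan(path: str, record: ScanRecord) -> bool:
    """Return True if size or mtime differ from the scan record.

    Best effort only: a rewrite that keeps both size and mtime
    is not detected.  A file that has vanished counts as changed.
    """
    try:
        return scan(path) != record
    except FileNotFoundError:
        return True


def warn_if_changed(path: str, record: ScanRecord) -> bool:
    """Emit a warning instead of redoing the file; return True if warned."""
    if changed_since_scan(path, record):
        print(f"warning: {path} changed during transfer", file=sys.stderr)
        return True
    return False
```

A more elaborate framework could collect the paths for which
warn_if_changed returns True and feed them back through the pipeline.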

> One of the really nice features of your design is that it is easy to
> interrupt the flow of data at any point and continue it later.  This is
> a useful thing if the cached information remains valid and thus saves us
> time/resources on either the next run or on multiple updates to
> different destination systems.
> 
> One downside to your protocol is that it requires several socket
> connections between systems.  This either mandates using multiple
> rsh/ssh connections (possibly with multiple password prompts for a
> single transfer) OR using some kind of socket-forwarding protocol (such
> as the one provided by ssh).  When I proposed adding extra sockets to
> the rsync protocol a while back, at least one fellow mentioned that a
> requirement of using ssh would not be an acceptable solution to him, so
> this area could be a little controversial (depending on what kind of a
> solution we can come up with).

I don't want to tie it to ssh's port forwarding.  I'm more
inclined to have two communication supervisor processes on
each node that would multiplex the streams.  Where stages
are on the same node, their interconnect could be simple
pipes.
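
To illustrate what multiplexing several streams over one connection
could look like, here is a minimal sketch.  The frame layout (a 1-byte
channel id followed by a 4-byte length) and the channel numbers are
assumptions made for the example, not a proposed wire format.

```python
import struct
from typing import Dict, Iterator, Tuple

# Hypothetical frame header: channel id (1 byte), payload length (4 bytes),
# both in network byte order.
HEADER = struct.Struct("!BI")


def pack_frame(channel: int, payload: bytes) -> bytes:
    """Wrap one chunk of a stream so it can share a connection."""
    return HEADER.pack(channel, len(payload)) + payload


def unpack_frames(data: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (channel, payload) pairs from a buffer of complete frames."""
    offset = 0
    while offset < len(data):
        channel, length = HEADER.unpack_from(data, offset)
        offset += HEADER.size
        yield channel, data[offset:offset + length]
        offset += length


def demultiplex(data: bytes) -> Dict[int, bytes]:
    """Reassemble each channel's byte stream from interleaved frames."""
    streams: Dict[int, bytes] = {}
    for channel, payload in unpack_frames(data):
        streams[channel] = streams.get(channel, b"") + payload
    return streams
```

For example, a supervisor could interleave a main stream on channel 1
with status messages on channel 2, and the peer would split them again:

```python
wire = pack_frame(1, b"file list ") + pack_frame(2, b"warning") + pack_frame(1, b"more")
streams = demultiplex(wire)  # one byte string per channel
```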

> Another question is whether we need to support the bi-directional
> transfer of files in a single connection.  My rZync test app supports
> sending files in both directions just because it was so simple to add --
> having a message-based protocol makes this a breeze.
> 
> Your first protocol (the one without any backchannels) looks like it
> would be a snap to set up using separate processes.  It does, as you
> note, add quite a bit of extra data transmission (such as an extra 2x
> hit in filename transfer alone).  The backchannels add some complicating
> factors to the file I/O that will need to be carefully designed to avoid
> deadlocks.  Since the data is strictly ordered with one chunk for pipe-A
> and one chunk for pipe-B (for each file), the code should be fairly
> straight-forward, though, so hopefully this won't be a big problem.
> 
> Caching off data from the backchannel utility might be pretty complex,
> though -- think about interrupting the stream after step 3, you'd need
> to buffer off the backchannel data from step 1 plus the main output and
> backchannel data from step 3 and then restart things at steps 4 and 5
> with the appropriate main-stream input and backchannel flows.  That
> would be much harder than saving off the one single output flow from
> step 3 and starting up step 4 later on using it, so either the
> backchannel algorithm may not be very useful in a batch scenario, or
> we'd need to have a helper script that can figure out how to interrupt
> and restart the chain of processes at any point.

The backchannel idea was simply a way to send less data
over the network: one possible (probably bad) optimization,
if you will.  The functional code would read and write
fully populated streams.  The network I/O routines would
set up the backchannels and split out the changed fields
for transmit.

> I find your idea to allow the first 4 steps of the scan/compare/checksum
> sequence to be reversed intriguing.  At first I thought that it would be
> too fragile since the server's data tends to be updating constantly (and
> this protocol needs to have the server data remain constant from the
> moment the checksum blocks are created until the client(s) all fetch the
> updated data).  However, I can see that this may well be a really nice
> way to update an archive and let multiple (non-identical) clients
> request updates.  This will require an extension to librsync that would
> allow a reversed rolling-checksum diff option, and an option to separate
> the diff and transmit stages (which are currently done at the same
> time), so this idea has a bigger overhead than the rest of the tool as
> far as the rsync protocol is concerned.

Once I broke the process down into its component stages it
became obvious that the direction of file transfer only
becomes an issue in the transmit and update stages.  Which
side computes the block checksums and which side compares
them against rolling checksums is independent of the
destination.

The first two stages (tree scan and scan compare) generate
lists of files that are on node a only, on node b only, the
same on both, or different.  The block checksum stage and the
subsequent comparison of rolling checksums with the block
checksums produce a similar description for extents in each
differing file: offset a, length a, offset b, length b.  With
this info and knowledge of which nodes are a and b, I can
update node a from b or b from a with the same descriptions.
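
A minimal sketch of the two descriptions, to show why they do not
depend on the direction of transfer.  It assumes the scan yields a
(size, mtime) pair per path and that each extent marks a region that
differs between the two copies.  compare_scans, Extent, orient and
update are hypothetical names used only for illustration.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple

Meta = Tuple[int, int]  # assumed scan attributes: (size, mtime)


class Extent(NamedTuple):
    """One differing region: where it sits in a's copy and in b's copy."""
    offset_a: int
    length_a: int
    offset_b: int
    length_b: int


def compare_scans(a: Dict[str, Meta], b: Dict[str, Meta]) -> Dict[str, List[str]]:
    """Sort every path into a_only, b_only, same or different."""
    result: Dict[str, List[str]] = {"a_only": [], "b_only": [], "same": [], "different": []}
    for path in sorted(a.keys() | b.keys()):
        if path not in b:
            result["a_only"].append(path)
        elif path not in a:
            result["b_only"].append(path)
        elif a[path] == b[path]:
            result["same"].append(path)
        else:
            result["different"].append(path)
    return result


def orient(extent: Extent, source: str) -> Tuple[int, int, int, int]:
    """Return (src_offset, src_length, dst_offset, dst_length) for a direction."""
    if source == "a":
        return (extent.offset_a, extent.length_a, extent.offset_b, extent.length_b)
    if source == "b":
        return (extent.offset_b, extent.length_b, extent.offset_a, extent.length_a)
    raise ValueError("source must be 'a' or 'b'")


def update(dst: bytes, src: bytes, extents: Iterable[Extent], source: str) -> bytes:
    """Rewrite dst so its differing regions match src."""
    out = bytearray(dst)
    # Apply from the highest destination offset down so that earlier
    # offsets stay valid when a replacement changes the length.
    oriented = sorted((orient(e, source) for e in extents),
                      key=lambda t: t[2], reverse=True)
    for s_off, s_len, d_off, d_len in oriented:
        out[d_off:d_off + d_len] = src[s_off:s_off + s_len]
    return bytes(out)
```

The same extent list serves both directions; only the source argument
changes:

```python
a, b = b"hello world", b"hello brave world"
extents = [Extent(6, 0, 6, 6)]
update(a, b, extents, source="b")  # a brought up to date from b
update(b, a, extents, source="a")  # b brought up to date from a
```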

I have not factored bidirectional updates into this.  The
key for bidirectionality is assigning master status on a
file-by-file basis, an issue dealt with at length in
InterMezzo and Coda.  Bidirectional updates either require
some other metadata that is reset by each transfer, or must
assume that system clocks are reliable enough for mtime
tests to be more than equality based.
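
As a rough illustration of the mtime-based option only: the sketch
below picks a master per file by comparing mtimes and refuses to decide
when they are within an allowed clock skew.  pick_master and the
max_skew tolerance are assumptions for the example; nothing like this
has been specified.

```python
from typing import Optional


def pick_master(mtime_a: float, mtime_b: float, max_skew: float = 2.0) -> Optional[str]:
    """Return "a" or "b" for the newer copy of a file.

    Returns None when the timestamps are within max_skew seconds of
    each other, since the clocks cannot then be trusted to order them.
    """
    delta = mtime_a - mtime_b
    if abs(delta) <= max_skew:
        return None
    return "a" if delta > 0 else "b"
```

A None result is where some other metadata, reset by each transfer,
would have to break the tie.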

> The most efficient multi-server duplication process would be to save off
> the output of the transmit phase and send it to multiple systems for
> just the final update phase.  This does require that the destination
> machines all have identical file trees for the updating to work, though,
> so this only works on tightly-controlled mirrors.  The advantage is that
> the server expends no further resources than to just get the update
> stream transmitted to the clients (who can duplicate the stream without
> the server's help).
> 
> Since your proposed protocol seems to fit so well with batch-oriented
> scenarios while potentially having problems in the more interactive
> scenarios, I'm wondering if this should be a separate utility-set from a
> more interactive program (which I think should use a message oriented
> protocol over a single 2-way socket/pipe).  The alternative is to add
> batch-output code to an interactive program (like what was done with
> rsync), which would probably be harder to maintain and less flexible
> than a set of batch-oriented utilities.
> 
> What do you think?

I'm glad you liked my idea.  Perhaps you could be more
explicit as to the problems that you see in interactive
scenarios that would not be present in batch.

I haven't actually specified a protocol but have focused on
the process stage definitions that determine the protocol
requirements.  I see the actual protocol as being very much
message oriented.  The distinction I would make is that I
would break up the process: instead of one monolithic piece
(or two connected by sockets), use smaller, simpler
components in a framework.

I acknowledge that I am very UNIX-centric.  I have no problem
with the idea of forking or doing the equivalent with
threads.  Some platforms may have difficulty with this and
need to use a different framework, perhaps incorporating an
event loop.  If the components and protocol are well defined
and implemented with portable functions, the framework they
operate in becomes a matter of choice.  Some platforms may
want to use spawn, others clone or fork+exec, and still
others may need to build a monolith incorporating all parts
and having an internal dispatcher.
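
A minimal sketch of the idea that the framework is a matter of choice:
each stage is written as a plain function from a stream of messages to
a stream of messages, and the framework only decides how stages are
connected.  Only the monolithic, in-process variant is shown; the stage
names and message fields are hypothetical.

```python
from functools import reduce
from typing import Callable, Dict, Iterable, Iterator, List

Message = Dict[str, object]
Stage = Callable[[Iterable[Message]], Iterator[Message]]


def drop_unchanged(messages: Iterable[Message]) -> Iterator[Message]:
    """Hypothetical compare stage: pass on only files whose sizes differ."""
    for msg in messages:
        if msg["size_a"] != msg["size_b"]:
            yield msg


def mark_for_transmit(messages: Iterable[Message]) -> Iterator[Message]:
    """Hypothetical later stage: annotate each surviving message."""
    for msg in messages:
        yield {**msg, "action": "transmit"}


def run_in_process(stages: List[Stage], source: Iterable[Message]) -> Iterator[Message]:
    """Monolithic framework: chain the stages as generators in one process."""
    return reduce(lambda stream, stage: stage(stream), stages, source)
```

A fork+exec framework could run the same stage functions in separate
processes joined by pipes, with each message serialized on the wire;
the stages themselves would not need to change.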

-- 
________________________________________________________________
	J.W. Schultz            Pegasystems Technologies
	email address:		jw at pegasys.ws

		Remember Cernan and Schmitt