superlifter design notes (was Re: Latest rZync release: 0.06)

Sun Jul 21 21:48:02 EST 2002

On Mon, Jul 22, 2002 at 02:00:21PM +1000, Martin Pool wrote:
> On 21 Jul 2002, jw schultz <jw at pegasys.ws> wrote:
> > From what i can see rsync is very clever.  The biggest
> > problems i see are its inability to scale for large trees,
> > a little bit of accumulated cruft and featuritis, and
> > excessively tight integration.
> 
> Yes, I think that's basically the problem.
> 
> One question that may (or may not) be worth considering is to what
> degree you want to be able to implement new features by changing only
> the client.  So with NFS (I'm not proposing we use it, only an
> example), you can implement any kind of VM or database or whatever on
> the client, and the server doesn't have to care.  The current protocol
> is just about the opposite: the two halves have to be quite intimately
> involved, so adding rename detection would require not just small
> additions but major surgery on the server.
> 
> > What i am seeing is a Multi-stage pipeline.  Instead of one
> > side driving the other with command and response codes each
> > side (client/server) would set up a pipeline containing
> > those components that are needed with the appropriate
> > plumbing.  Each stage would largely look like a simple
> > utility reading from input; doing one thing; writing to
> > output, error and log.  The output of each stage is sent to
> > the next uni-directionally with no handshake required.
> 
> So it's like a Unix pipeline?  (I realize you're proposing pipelines
> as a design idea, rather than as an implementation.)

I'm kinda, sorta proposing both.  What i'm looking at is to
keep each stage as simple as possible without sharing
datastructures with other stages.  And that it should be
possible to break/intercept the pipeline at any point.
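
A minimal sketch of that idea, assuming Python generators stand
in for the stages.  The names scan, stat_stage, tap and changed
are made up for illustration and are not part of rsync or of any
superlifter code.  Each stage reads from its input, does one
thing, and yields to the next without sharing state:

```python
import os

def scan(root):
    """Stage 1: walk the tree lazily, yielding one path at a time."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()                      # deterministic traversal order
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)

def stat_stage(paths):
    """Stage 2: attach size and mtime; knows nothing about stage 1."""
    for path in paths:
        st = os.lstat(path)
        yield (path, st.st_size, int(st.st_mtime))

def tap(records, log):
    """Optional stage: intercept the stream at any point, unchanged."""
    for record in records:
        log.append(record)
        yield record

def changed(records, known):
    """Stage 3: pass on only entries that differ from what we know."""
    for path, size, mtime in records:
        if known.get(path) != (size, mtime):
            yield path
```

Something like changed(tap(stat_stage(scan(".")), log), known)
would then be consumed one file at a time, so nothing has to
traverse the whole tree up front.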

> 
> So, we could in fact prototype it using plain Unix pipelines?

For local-to-local yes.

> 
> That could be interesting.
> 
>   Choose some files:
>     find ~ | lifter-makedirectory > /tmp/local.dir
>   Do an rdiff transfer of the remote directory to here:
>     rdiff sig /tmp/local.dir /tmp/local.dir.sig
>     scp /tmp/local.dir.sig othermachine:/tmp
>     ssh othermachine 'find ~ | lifter-makedirectory | rdiff delta /tmp/local.dir.sig - ' >/tmp/remote.dir.delta
>     rdiff patch /tmp/local.dir /tmp/remote.dir.delta /tmp/remote.dir
> 
>   For each of those files, do whatever
>     for file in $(lifter-dirdiff /tmp/local.dir /tmp/remote.dir)
>     do
>       ...
>     done
> 
> Of course the commands I've sketched there don't fix one of the key
> problems, which is that of traversing the whole directory up front,
> but you could equally well write them as a pipeline that is gradually
> consumed as it finds different files.  Imagine
> 
>   lifter-find-different-files /home/mbp/ othermachine:/home/mbp/ | \
>     xargs -n1 lifter-move-file ....
> 
> (I'm just making up the commands as I go along; don't take them too
> seriously.)
> 
> That could be very nice indeed.

I'm not seriously suggesting that each stage be a separate
utility but there would be times when being able to treat
them as such would be advantageous.

> 
> I am just a little concerned that a complicated use of pipelines in
> both directions will make us prone to deadlock.  It's possible to
> cause local deadlocks if e.g. you have a child process with both stdin
> and stdout connected to its parent by pipes.  It gets potentially more
> hairy when all the pipes are run through a single TCP connection.

Where in+out are connected to the same parent (multiplexing
TCP) that parent would have to use poll or select.  In the
ssh case it might be possible to use the port forwarding
features of ssh or borrow the code from there.  We should
plagiarise where sensible.
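
As a rough illustration of the select/poll approach (a sketch
only, assuming a Unix-like system; pump is a made-up helper, not
an rsync function), a parent that owns both ends of a child's
pipes can avoid deadlock by only reading or writing when the
descriptor is ready:

```python
import os
import selectors
import subprocess

def pump(cmd, data, chunk=65536):
    """Feed data to cmd's stdin and collect its stdout without deadlock."""
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    os.set_blocking(proc.stdin.fileno(), False)
    os.set_blocking(proc.stdout.fileno(), False)
    sel = selectors.DefaultSelector()
    sel.register(proc.stdout, selectors.EVENT_READ)
    if data:
        sel.register(proc.stdin, selectors.EVENT_WRITE)
    else:
        proc.stdin.close()
    view = memoryview(data)
    sent = 0
    out = []
    while sel.get_map():                     # loop until both pipes are done
        for key, _ in sel.select():
            if key.fileobj is proc.stdin:
                try:
                    sent += os.write(key.fd, view[sent:sent + chunk])
                except BlockingIOError:
                    continue                 # not ready after all; wait again
                except BrokenPipeError:
                    sent = len(view)         # child stopped reading
                if sent >= len(view):
                    sel.unregister(proc.stdin)
                    proc.stdin.close()       # signals EOF to the child
            else:
                block = os.read(key.fd, chunk)
                if block:
                    out.append(block)
                else:                        # EOF from the child
                    sel.unregister(proc.stdout)
                    proc.stdout.close()
    sel.close()
    proc.wait()
    return b"".join(out)
```

A naive version that writes everything and only then reads can
hang once the data exceeds the pipe buffers in both directions,
which is the local deadlock described above.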

One key advantage of the looser coupling and of stages is that
they are immune to changes in the plumbing.

> 
> I don't think that concern rules this design out by any means, but we
> need to think about it.

Absolutely! 

> 
> One of the design criteria I'd like to add is that it should
> preferably be obvious by inspection that deadlocks are not possible.
> 
> > 	timestamps should be represented as seconds from
> > 	Epoch (SuS) as unsigned 32-bit int.  It will be >90 years
> > 	before we exceed this by which time the protocol
> > 	will be extended to use uint64 for milliseconds.
> 
> I think we should go to milliseconds straight away: if I remember
> correctly, NTFS already stores files with sub-second precision, and
> some Linux filesystems are going the same way.  A second is a long
> time in modern computing!  (For example, it's possible for a command
> started by Make to complete in less than a second, and therefore
> apparently not change a timestamp.)  
> 
> I think there will be increasing pressure for sub-second precision in
> much less than 90 years, and it would be sensible for us to support it
> from the beginning.  The Java file APIs, for example, already work in
> milliseconds.
> 
> Transmitting the precision of the file sounds good.
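
To make the timestamp discussion concrete, here is a sketch
under assumed conventions: a hypothetical wire layout of a
uint64 millisecond count followed by a uint32 granularity in
milliseconds.  Neither the layout nor the function names come
from any existing protocol.

```python
import struct
from datetime import datetime, timezone

# Hypothetical layout: uint64 ms since Epoch, uint32 granularity in ms.
WIRE = struct.Struct("!QI")

# Last second representable as unsigned 32-bit seconds since Epoch.
UINT32_LIMIT = datetime.fromtimestamp(2**32 - 1, tz=timezone.utc)

def pack_mtime(mtime_ms, granularity_ms):
    """Encode a timestamp together with the sender's precision."""
    return WIRE.pack(mtime_ms, granularity_ms)

def unpack_mtime(blob):
    """Decode to a (mtime_ms, granularity_ms) tuple."""
    return WIRE.unpack(blob)

def same_mtime(a, b):
    """Compare two (mtime_ms, granularity_ms) pairs at the coarser precision."""
    coarse = max(a[1], b[1])
    return a[0] // coarse == b[0] // coarse
```

Knowing the granularity lets a receiver with whole-second
timestamps treat 1027252082123 ms and 1027252082000 ms as the
same time instead of re-sending the file.
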
> 
> > 	I think by default users and groups should only be handled
> > 	numerically.
> 
> I think by default we should use names, because that will be least
> surprising to most people.  I agree we need to support both.

And for id squashing of unknown users.

> 
> Names are not universally unique, and need to be qualified, by a NIS
> domain or NT domain, or some other means.  I want to be able to say:
> 
>   map "MAPOOL2 at ASIAPAC" <-> "mbp at samba.org" <-> "s123123 at student.anu.edu.au"
> 
> when transferring across machines.
> 
> We probably cannot assume UIDs are any particular length; on NT they
> correspond to SIDs (?) which are 128-bit(?) things, typically
> represented by strings like
> 
>  S1-212-123-2323-232323
> 
> So on the whole I think I would suggest following NFSv4 and just using
> strings, with the interpretation of them up to the implementation,
> possibly with guidance from the admin.

I haven't been following the NFSv4 internals.  I'd rather
avoid variable length fields when we can.  It sounds to me
like the thing to do is to use 32 bit ints in our structure
and have a translation mechanism that can turn those into
local or remote numbers and names.  On systems where UIDs
fit in 32 bits we can use the local UID; where they don't we
store the native number in the table along with the name.
In this way we only have to transmit the variable length
ID values once each way.
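
One possible shape for that translation mechanism, as a sketch
only: IdTable and decode are hypothetical names, and the tuples
stand in for whatever chunk format the datastream ends up with.
The variable-length identity is sent once, immediately before
its first use, and every later use is a fixed-size number:

```python
class IdTable:
    """Sender side: map variable-length identities to small integers."""

    def __init__(self):
        self._numbers = {}

    def encode(self, identity):
        """Yield the records needed for one use of identity."""
        number = self._numbers.get(identity)
        if number is None:
            number = len(self._numbers)
            self._numbers[identity] = number
            yield ("define", number, identity)   # variable length, sent once
        yield ("use", number)                    # fixed-size reference

def decode(records, local_map):
    """Receiver side: turn references back into local names.

    local_map is an admin-supplied mapping from remote identities to
    local ones; unmapped identities are passed through unchanged.
    """
    names = {}
    for record in records:
        if record[0] == "define":
            _, number, identity = record
            names[number] = local_map.get(identity, identity)
        else:
            yield names[record[1]]
```

The same table would work for any interned string, which is one
argument for a single generalized mechanism.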

> 
> > 	When textual names are used a special chunk in the
> > 	datastream would specify a "node+ID -> name"
> > 	equivalency immediately before the first use of that
> > 	number.
> 
> It seems like in general there is a need to have a way of "interning"
> strings (users, files, ...?) to shorter representations.  
> 
> On the other hand, perhaps this is an overoptimization and just using
> compression, at least at first, would be more sensible.

I'm not convinced filenames need enough persistence to worry
about it.  Especially if they only need to be CWD relative
and can be passed on backchannels or otherwise shared
between stages within a node.  I'm less inclined toward
compressing the whole stream.

You do bring up the idea though that we might want a more
generalized string indirection mechanism instead of one for
UIDs, another for directories, etc.  I'm not sure yet.

> 
> > 	In the tree scan stage the first time we hit a given
> > 	inode with st_nlink > 1 we add it to a hardlink list and
> > 	decrement st_nlink.  Each time we find another path
> > 	that references the inode we indicate it is a link
> > 	in the datastream and decrement st_nlink of the one
> > 	in our list.  When the entry in the list has
> > 	st_nlink == 0 we remove it from the list.
> 
> Yes, that's the right algorithm.  It may need some refinement to be
> safe with filesystems changing underneath us.

I'm not going to worry about that.  If filesystem changes
are someone's problem they either need another tool or can
run rsync on a versioning filesystem or a snapshot.
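
For reference, the hardlink bookkeeping quoted above could look
roughly like this.  It is a sketch that assumes a quiescent
filesystem; scan_links and the record tuples are invented for
the example.

```python
import os

def scan_links(root):
    """Walk root, yielding ("file", path, None) or ("link", path, first_path)."""
    pending = {}  # (st_dev, st_ino) -> [first_path, links_not_yet_seen]
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()                      # deterministic traversal order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            key = (st.st_dev, st.st_ino)
            if st.st_nlink > 1 and key in pending:
                entry = pending[key]
                yield ("link", path, entry[0])
                entry[1] -= 1
                if entry[1] == 0:            # every link has been seen
                    del pending[key]
            else:
                if st.st_nlink > 1:
                    # First sighting: remember it, one link accounted for.
                    pending[key] = [path, st.st_nlink - 1]
                yield ("file", path, None)
```

The list only ever holds inodes with links still outstanding, so
its size tracks the number of partially seen hardlink groups
rather than the size of the tree.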

I suspect that processing files as we walk the tree with the
stages in parallel will be much less prone to the "file
moved out from under me" problems than what we have
currently.

Thanks for your response.

-- 
________________________________________________________________
	J.W. Schultz            Pegasystems Technologies
	email address:		jw at pegasys.ws

		Remember Cernan and Schmitt