superlifter design notes (was Re: Latest rZync release: 0.06)

Martin Pool mbp at samba.org
Sun Jul 21 21:02:02 EST 2002


On 21 Jul 2002, jw schultz <jw at pegasys.ws> wrote:
> From what i can see rsync is very clever.  The biggest
> problems i see are its inability to scale for large trees,
> a little bit of accumulated cruft and featuritis, and
> excessively tight integration.

Yes, I think that's basically the problem.

One question that may (or may not) be worth considering is to what
degree you want to be able to implement new features by changing only
the client.  So with NFS (I'm not proposing we use it, only an
example), you can implement any kind of VM or database or whatever on
the client, and the server doesn't have to care.  The current protocol
is just about the opposite: the two halves have to be quite intimately
involved, so adding rename detection would require not just small
additions but major surgery on the server.

> What i am seeing is a multi-stage pipeline.  Instead of one
> side driving the other with command and response codes, each
> side (client/server) would set up a pipeline containing
> those components that are needed with the appropriate
> plumbing.  Each stage would largely look like a simple
> utility reading from input; doing one thing; writing to
> output, error and log.  The output of each stage is sent to
> the next uni-directionally with no handshake required.

So it's like a Unix pipeline?  (I realize you're proposing pipelines
as a design idea, rather than as an implementation.)

So, we could in fact prototype it using plain Unix pipelines?

That could be interesting.

  Choose some files:
    find ~ | lifter-makedirectory > /tmp/local.dir
  Do an rdiff transfer of the remote directory to here:
    rdiff sig /tmp/local.dir /tmp/local.dir.sig
    scp /tmp/local.dir.sig othermachine:/tmp
    ssh othermachine 'find ~ | lifter-makedirectory | rdiff delta /tmp/local.dir.sig - ' >/tmp/remote.dir.delta
    rdiff patch /tmp/local.dir /tmp/remote.dir.delta /tmp/remote.dir

  For each of those files, do whatever:
    for file in $(lifter-dirdiff /tmp/local.dir /tmp/remote.dir)
    do
      ...
    done

Of course the commands I've sketched there don't fix one of the key
problems, which is traversing the whole directory tree up front, but
you could equally well write them as a pipeline that is gradually
consumed as it finds different files.  Imagine

  lifter-find-different-files /home/mbp/ othermachine:/home/mbp/ | \
    xargs -n1 lifter-move-file ....

(I'm just making up the commands as I go along; don't take them too
seriously.)

That could be very nice indeed.

I am just a little concerned that a complicated use of pipelines in
both directions will make us prone to deadlock.  It's possible to
cause local deadlocks if e.g. you have a child process with both stdin
and stdout connected to its parent by pipes: once both pipe buffers
fill up, each process blocks writing while neither is reading, and
nothing can make progress.  It gets potentially more hairy when all
the pipes are run through a single TCP connection.

I don't think that concern rules this design out by any means, but we
need to think about it.
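
Just to make that failure mode concrete, here is a tiny local
reproduction in Python (nothing superlifter-specific about it; the
child is just cat):

  # Sketch of the classic parent/child pipe deadlock: the parent writes a
  # large input to the child's stdin and only reads stdout afterwards.
  import subprocess

  child = subprocess.Popen(["cat"], stdin=subprocess.PIPE,
                           stdout=subprocess.PIPE)

  data = b"x" * (1 << 22)   # a few MB, far larger than a pipe buffer

  # DEADLOCK HAZARD: this write blocks once the stdin pipe fills, because
  # cat is itself blocked writing into the stdout pipe nobody is reading:
  #   child.stdin.write(data)
  #   output = child.stdout.read()

  # Safe version: communicate() interleaves the writing and reading.
  output, _ = child.communicate(data)
  assert output == data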

One of the design criteria I'd like to add is that it should
preferably be obvious by inspection that deadlocks are not possible.

> 	timestamps should be represented as seconds from the
> 	Epoch (SuS) as an unsigned 32-bit int.  It will be >90 years
> 	before we exceed this, by which time the protocol
> 	will be extended to use uint64 for milliseconds.

I think we should go to milliseconds straight away: if I remember
correctly, NTFS already stores file timestamps with sub-second
precision, and some Linux filesystems are going the same way.  A
second is a long time in modern computing!  (For example, it's
possible for a command started by Make to complete in less than a
second, and therefore apparently not change a timestamp.)

I think there will be increasing pressure for sub-second precision in
much less than 90 years, and it would be sensible for us to support it
from the beginning.  The Java file APIs, for example, already work in
milliseconds.

Transmitting the precision of each file's timestamps sounds good.
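
For instance, a record carrying a uint64 millisecond timestamp plus a
precision field might be enough.  Just a sketch; the field layout, and
the crude way I guess the filesystem's granularity, are made up:

  # Sketch only: one possible wire layout for timestamp-plus-precision.
  import os, struct

  def pack_mtime(path):
      st = os.stat(path)
      msec = int(st.st_mtime * 1000)   # uint64 milliseconds since the Epoch
      # Crude guess at the granularity the local filesystem gave us:
      # 1000 ms (whole seconds) unless a fractional part is present.
      precision_ms = 1000 if st.st_mtime == int(st.st_mtime) else 1
      return struct.pack("!QI", msec, precision_ms)

  def unpack_mtime(buf):
      msec, precision_ms = struct.unpack("!QI", buf)
      return msec, precision_ms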

> 	I think by default users and groups should only be handled
> 	numerically.

I think by default we should use names, because that will be least
surprising to most people.  I agree we need to support both.

Names are not universally unique, and need to be qualified by a NIS
domain or NT domain, or some other means. I want to be able to say:

  map "MAPOOL2 at ASIAPAC" <-> "mbp at samba.org" <-> "s123123 at student.anu.edu.au"

when transferring across machines.

We probably cannot assume UIDs are any particular length; on NT they
correspond to SIDs (?) which are 128-bit(?) things, typically
represented by strings like

 S1-212-123-2323-232323

So on the whole I think I would suggest following NFSv4 and just using
strings, with the interpretation of them up to the implementation,
possibly with guidance from the admin.
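
If owners travel as opaque strings, the mapping above could be plain
admin-supplied configuration applied at either end.  A sketch (the
table format and the function are invented, not a proposal):

  # Admin-supplied identity map, applied to an owner string carried in
  # the NFSv4 style described above.  Purely illustrative.
  IDENTITY_MAP = {
      "MAPOOL2@ASIAPAC": "mbp@samba.org",
      "s123123@student.anu.edu.au": "mbp@samba.org",
  }

  def map_owner(remote_owner):
      # Pass the string through untouched if the admin gave no mapping.
      return IDENTITY_MAP.get(remote_owner, remote_owner)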

> 	When textual names are used, a special chunk in the
> 	datastream would specify a "node+ID -> name"
> 	equivalency immediately before the first use of that
> 	number.

It seems like in general there is a need to have a way of "interning"
strings (users, files, ...?) to shorter representations.  

On the other hand, perhaps this is an overoptimization and just using
compression, at least at first, would be more sensible.
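
The interning itself is not much code; a sketch of the sending side
(the record names are invented here):

  # Sender-side "interning": the first use of a long string emits a DEFINE
  # record binding it to a small integer; later uses send only the integer.
  class Interner:
      def __init__(self):
          self.table = {}

      def emit(self, s, out):
          token = self.table.get(s)
          if token is None:
              token = self.table[s] = len(self.table)
              out.append(("DEFINE", token, s))   # first use: send the mapping
          out.append(("REF", token))             # every use: send the short id

  records = []
  interner = Interner()
  for owner in ["mbp", "mbp", "jw", "mbp"]:
      interner.emit(owner, records)
  # records now holds two DEFINE records and four REF records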

> 	In the tree scan stage the first time we hit a given
> 	inode with st_nlink > 1 we add it to a hardlink list and
> 	decrement st_nlink.  Each time we find another path
> 	that references the inode we indicate it is a link
> 	in the datastream and decrement st_nlink of the one
> 	in our list.  When the entry in the list has
> 	st_nlink == 0 we remove it from the list.

Yes, that's the right algorithm.  It may need some refinement to be
safe with filesystems changing underneath us.
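
For what it's worth, the bookkeeping is small; a sketch of the scan
side in Python (the records emitted are invented, and this assumes we
only do it for regular files):

  # Track each (st_dev, st_ino) with st_nlink > 1 until every path that
  # shares it has been seen, as described above.
  import os

  pending = {}   # (st_dev, st_ino) -> number of further paths still expected

  def scan_entry(path, emit):
      st = os.lstat(path)
      if st.st_nlink <= 1:
          emit(("FILE", path))
          return
      key = (st.st_dev, st.st_ino)
      if key not in pending:
          # First path to reach this inode: send the file normally and
          # remember how many more paths should still turn up.
          pending[key] = st.st_nlink - 1
          emit(("FILE", path))
      else:
          emit(("HARDLINK", path, key))
          pending[key] -= 1
          if pending[key] == 0:
              del pending[key]   # all links seen; stop tracking the inode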

-- 
Martin 



