superlifter design notes and rZync feedback

Martin Pool mbp at samba.org
Thu Jul 18 21:05:02 EST 2002


On 18 Jul 2002, Wayne Davison <wayned at users.sourceforge.net> wrote:

> (definitely NOT rzync).

Great.  (Excuse my overreaction :-)

> Re: rzync's variable-length fields:  Note that my code allows more
> variation than just 2 or 4 bytes -- e.g., I size the 8-byte file-size
> value to only as many bytes as needed to actually store the length.  I
> agree that we should question whether this complexity is needed, but I
> don't agree that it is wrong on principle.  There are two areas where
> field-sizing is used:  in the directory-info compression (which is very
> similar to what rsync does, but with some extra field-sizing thrown in
> for good measure), and in the transmission protocol itself:

OK.  If the protocol said that all integers are encoded in a
UTF-8-ish or BER-ish variable-length scheme, that would sound
perfectly reasonable to me.  I had misunderstood the document as
suggesting that some fields should be defined to be different lengths
from others, which would worry me.

There is still a question about the relative merits of known-length
headers (easier to manage buffers, to know how much to read, etc.)
versus making them as small as possible.
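For concreteness, here is a rough sketch (in Python, purely
illustrative) of one such variable-length integer scheme: seven data
bits per byte, with the high bit marking "more bytes follow".  It's
little-endian base-128 rather than literal BER, but the same family
of tricks:

  def encode_varint(n):
      # Emit seven bits per byte, least-significant group first; the
      # high bit of each byte is set if more bytes follow.
      out = bytearray()
      while True:
          byte = n & 0x7f
          n >>= 7
          if n:
              out.append(byte | 0x80)
          else:
              out.append(byte)
              return bytes(out)

  def decode_varint(data, offset=0):
      # Returns (value, next_offset).
      value = shift = 0
      while True:
          byte = data[offset]
          offset += 1
          value |= (byte & 0x7f) << shift
          shift += 7
          if not (byte & 0x80):
              return value, offset

A value under 128 costs a single byte; a full 64-bit file size never
costs more than ten.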

I think I mentioned this -- I'd like to have a reasonable means to
choose a compression scheme at connection time.  bzip2 would be good
for modems; lzo for 100Mbps.  (I think of bzip2 as simmering on the
stove all day, and lzo as lightly blanching :-)
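A minimal sketch of the kind of negotiation I mean, assuming a
hypothetical hello message in which each side lists the codecs it
supports in preference order (all the names here are made up):

  # Each side advertises the codecs it supports, in preference order,
  # and the connection uses the first one the client wants that the
  # server also offers.
  CLIENT_CODECS = ["bzip2", "zlib", "none"]   # modem: favour heavy compression
  SERVER_CODECS = ["lzo", "zlib", "none"]     # fast LAN: favour light compression

  def choose_codec(client_prefs, server_prefs):
      for codec in client_prefs:
          if codec in server_prefs:
              return codec
      return "none"

  print(choose_codec(CLIENT_CODECS, SERVER_CODECS))   # -> "zlib"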

> I still have questions about how best to handle the transfer of
> directory info.  I'm thinking that it might be better to remove the
> rsync-like downsizing of the data and to use a library like zlib to
> remove the huge redundancies in the dir data during its
> transmission.

Ben Escoto suggested a stack like this:

> > 1.  The specification for an abstract protocol designed to allow a
> > single threaded application get good performance using a single,
> > possibly low bandwidth/high latency pipe.  No specific file commands
> > would enter in at this stage, but error reporting and recovery, some
> > kind of security policy, and some other stuff I'm omitting would be
> > included.

> > 2.  A library to make it easy for applications to work with protocols
> > that have the form in 1.  A well-written interface to a scripting
> > language (probably python) would be considered a core part of this.

> > 3.  Specification for a more specific, rsync-like protocol, and maybe
> > another library (again with at least a scripting wrapper) to make it
> > easy for applications to implement the protocol.

> > 4.  The model application rsync3 which shows off what the protocol can
> > do.  Ideally this part should be really short and sweet.

I think that's a good way to play it, because there is enough work in
each section that they're non-trivial layers, but they're also
sufficiently separate to allow a lot of good experimentation or
adaptation.

I'd hope that by getting a good foundation in #1 and #2, we would be
able to experiment with doing binary deltas on directories, or not, or
something else again.  I would hope that working only at layer 4,
you'd be able to implement a client that could detect remote renames
(by scanning for files with the same size, looking at their checksums,
etc.)
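As a sketch of that last point, a layer-4 rename detector could be as
dumb as this (the data layout is hypothetical, not a proposed
interface):

  def find_probable_renames(deleted, created):
      # deleted/created: lists of (path, size, checksum) tuples for
      # files that vanished from, or appeared on, the remote side.
      by_size = {}
      for path, size, checksum in deleted:
          by_size.setdefault(size, []).append((path, checksum))
      renames = []
      for new_path, size, checksum in created:
          for old_path, old_sum in by_size.get(size, []):
              if old_sum == checksum:
                  renames.append((old_path, new_path))
                  break
      return renames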

I wonder if this layering is excessive, but I think that all the
layers are necessary, and a first implementation could be simple in
many cases.  For example, 2 could initially be trivially implemented
in a way that only supports non-pipelined operation.
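As a sketch of what that trivial, non-pipelined layer 2 might look
like (the length-prefixed framing is just an assumption to make the
point):

  def call(sock, request):
      # Trivial layer 2: length-prefix the request, send it, then block
      # until the matching reply arrives.  No pipelining at all.
      payload = request.encode()
      sock.sendall(len(payload).to_bytes(4, "big") + payload)
      reply_len = int.from_bytes(recv_exactly(sock, 4), "big")
      return recv_exactly(sock, reply_len).decode()

  def recv_exactly(sock, n):
      buf = b""
      while len(buf) < n:
          chunk = sock.recv(n - len(buf))
          if not chunk:
              raise ConnectionError("peer closed the connection")
          buf += chunk
      return buf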

> In the protocol itself, there are only two variable-size elements that
> go into each message header.  While this increases complexity quite a
> bit over a fixed-length message header, it shouldn't be too hard to
> automate a test that ensures that the various header combinations
> (particularly boundary conditions) encode and decode properly.  I don't
> know if this level of message header complexity is actually needed (this
> is one of the things that we can use the test app to check out), but if
> we decide we want it, I believe we can adequately test it to ensure that
> it will not be a sinkhole of latent bugs.

OK, good.
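Agreed that the boundary conditions are the interesting part; a
round-trip test over the values where the encoded length changes
(reusing the varint sketch from above) would cover most of it:

  def test_varint_boundaries():
      # Exercise the values just below and above each length boundary.
      for n in [0, 127, 128, 16383, 16384, 2**31 - 1, 2**32, 2**63 - 1]:
          data = encode_varint(n)
          value, end = decode_varint(data)
          assert value == n and end == len(data), n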

> Re: rzync's name cache.  I've revamped it to be a very dependable design
> that no longer depends on lock-step synchronization in the expiration of
> old items (just in the creation of new items, which is easy to achieve).
> 
> Some comments on your registers:
> 
> You mention having something like 16 registers to hold names.  I think
> you'll find this to be inadequate, but it does depend on exactly how
> much you plan to cache names outside of the registers, how much
> retransmission of names you consider to be acceptable, and whether you
> plan to have a "move mode" where the source file is deleted.

Yes, I agree that 16 is probably too small; the next round number
would be 256.  If we use something like BER it could be unboundedly
large.  However, since using a name causes server-side resources to be
allocated, that's probably no good.  We don't want somebody abusing a
public server by allocating a zillion names; on the other hand, I have
a feeling that giving the client sole control over name allocation may
make things simpler.

One potentially interesting point about pipelining is that the
pipeline is not unboundedly deep.  I think the TCP window will never
be much bigger than about 64kB, though I need to check this.  That
puts a limit on the depth of our pipeline.  Of course, if the
operations were very small, 64kB would still be a lot of operations.
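As a back-of-envelope figure (the per-request size is a guess, just to
get an order of magnitude):

  # Assuming a small request is roughly 32 bytes on the wire, a 64kB
  # window bounds the number of requests in flight at about:
  print(64 * 1024 // 32)    # -> 2048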

Another interesting case to consider for pipelining is error handling.
For example, consider the case where the remote disk runs out of
space.  (rsync currently handles this gracelessly.)

Suppose the client sends a command to open the file, and then a very
long literal data stream, which causes an error.  Ideally, as soon as
the server runs out of space, it should tell the client to stop
sending.  I can see a few approaches:

 1- Just let the client keep sending the file, and return an error at
    the end.  Kind of wasteful.

 2- Drop the connection, so that the client eventually gets an EPIPE
    message.  Some protocols do this, but I don't think we can because
    we may want to do some more work.

 3- Have a means to return an error for a request before transmission
    is complete.  Structuring the client so that it can see the error
    and interrupt its work may be hard, but not impossible.  Ideally,
    it would not merely generate all the data and throw it away, but
    actually stop generation early.  Also, the client cannot just
    interrupt the request at some arbitrary point: there has to be a
    way for the server to see when the interrupted request ends, so
    that they can both resynchronize for the next one.

 4- Have some kind of smaller connection inside the TCP stream, and
    drop that.  I think this is really ugly and error-prone; I don't
    want to do it.

 5- Make the client send several smaller requests, and when it sees an
    error stop sending them.  So, say, multiple 8kB WRITE commands,
    rather than one 100MB one.

At the moment I think 5 is the best tradeoff.
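A rough sketch of option 5 from the client's point of view;
write_chunk and poll_for_error are hypothetical names standing in for
whatever layer 2 ends up providing:

  CHUNK_SIZE = 8 * 1024

  def send_file(conn, reg, data):
      # Option 5: issue many small WRITE requests rather than one huge
      # one, and stop issuing new chunks as soon as an error comes back.
      offset = 0
      while offset < len(data):
          error = conn.poll_for_error()     # non-blocking check of replies
          if error is not None:
              return error                  # e.g. ENOSPC from the server
          conn.write_chunk(reg, offset, data[offset:offset + CHUNK_SIZE])
          offset += CHUNK_SIZE
      return None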

I really like the NFSv4 protocol spec.  I don't think we could just
adopt it wholesale, but I think the general idea of e.g. chained
operations is quite promising.

> My first test app had no name-cache whatsoever.  It relied on external
> commands to drive it, and it sent the source/destination/basis trio of
> names from side to side before every step of the file's progress.  While
> this was simple, the increased bandwidth necessary to retransmit the
> names was not acceptable to me.
> 
> If we just register the active items that are currently being sent over
> the wire, the name will need to live through the entire sig, delta,
> patch, and (optionally) source-side-delete steps.  When the files are
> nearly up-to-date, having only 16 of them will, I believe, be overly
> restrictive.  Part of the problem is that the buffered data on the
> sig-generating side delays the source-side-delete messages quite a bit.
> If we had a high-priority delete channel, that would help to alleviate
> things, but I think you'll find that having several hundred active names
> will be a better lower limit in your design thinking.
> 
> Another question is whether names are sent fully-qualified or relative
> to some directory.  My protocol caches directory names in the name cache
> and allows you to send filenames relative to a cached directory.  Just
> having a way to "chdir" each side (even if the chdir is just virtual)
> and send names relative to the current directory should help a lot.

I think a nice way might be to have a LOOKUP operation that takes an
optional "working directory" register (using my term), and a name to
look up inside that directory.

With one line per request:

  $0 = ROOT
  $1 = LOOKUP($0, "home")
  $2 = LOOKUP($1, "mbp")
  $3 = LOOKUP($2, "hello.c")
  STAT($3)
  SIGNATURE($3)
  ...

So the client knows that after it has sent the third instruction, it
can rely on register $2 as a way to refer to /home/mbp, at least until
it assigns something else.  This is pretty similar to your name cache,
as I understand it, but with explicit IDs, assigned by the client.

Assigning IDs in this way allows the client to go ahead and send all
of these requests without needing to wait for the response to the
first one.  In that way, it's very vaguely like a superpipelined CPU,
which explains the name.  In the example I just gave, the client could
issue all of those requests without blocking until it needed the
results of the signature.  While it was waiting for that, it could
continue on to examine other files or make modifications.

In general the client wouldn't need to know the contents of the
registers; it would just know that there's something in them that lets
the server find the right file.
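In code, the client side might look vaguely like this -- everything
here is hypothetical, a sketch of the idea rather than a proposed API:

  class Client:
      def __init__(self, conn):
          self.conn = conn
          self.next_reg = 0

      def issue(self, op, *args):
          # Assign the output register locally and send the request at
          # once; later requests can use the register without waiting
          # for the server's reply.
          reg = self.next_reg
          self.next_reg += 1
          self.conn.send((reg, op, args))   # conn.send() is hypothetical
          return reg

  # The example above, issued without blocking:
  #   c = Client(conn)
  #   root = c.issue("ROOT")
  #   home = c.issue("LOOKUP", root, "home")
  #   mbp  = c.issue("LOOKUP", home, "mbp")
  #   f    = c.issue("LOOKUP", mbp, "hello.c")
  #   c.issue("STAT", f)
  #   c.issue("SIGNATURE", f)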

I had been wondering about handling failed requests by setting the
"output register" to a value meaning "invalid".  If that register was
later used on the right-hand side of an operation, that operation
would fail.  If the client sends these requests along the pipeline:

  $0 = ROOT
  $1 = LOOKUP($0, "hoome")
  $2 = LOOKUP($1, "mbp")

then $1 and $2 will be set to INVALID, and the third operation will
always fail.  Later, perhaps much later, the client will find all of
this out, and presumably try to recover somehow, or at least stop trying.
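On the server side the rule could be as simple as this sketch, where
perform() stands in for whatever actually executes an operation and
args mixes register numbers with literal names (all made up):

  INVALID = object()

  def execute(registers, out_reg, op, args):
      # If any input register is already INVALID, the failure
      # propagates: skip the operation and poison its output register.
      if any(registers.get(a) is INVALID for a in args if isinstance(a, int)):
          registers[out_reg] = INVALID
          return "failed: depends on an earlier failure"
      try:
          registers[out_reg] = perform(op, registers, args)  # hypothetical
          return "ok"
      except OSError as err:
          registers[out_reg] = INVALID
          return "failed: %s" % err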
  
> An additional source of cached names is in the directory scanning when
> doing a recursive transfer.  My protocol has specific commands that
> refer to a name index within a specified directory so that the receiving
> side can request changed files using a small binary value instead of a
> full pathname.

That makes sense; I don't really have anything to add at the moment.
 
> One more area of complexity that you don't mention (and I don't either
> in my new-protocol doc):  there are some operations where 2 names need
> to be associated with one operation.  This happens when we have both a
> destination file and a basis file.  My current cache implementation
> allows both of these names to be associated with a single cache element
> (though I need to improve this a bit in rzync) and lets the sig/patch
> stage snag them both.

I would suggest:

  ...
  $2 = LOOKUP($1, "oldfile")
  $3 = NAME($1, "newfile")
  $4 = RENAME($2, $3, overwrite)

The NAME operation assigns a register without requiring that the file
already exist.  The $4 register in this case probably ends up the same
as $3 -- I just want it there as a way to make later operations depend
on the successful outcome.

The Perl-ish terminology is kind of rough, I admit, but hopefully you
can see where I'm heading.

-- 
Martin 



