superlifter design notes and a new proposal

Sun Aug 4 02:52:06 EST 2002

On  4 Aug 2002, Wayne Davison <wayned at users.sourceforge.net> wrote:

> Your previous proposal sounded quite a bit more fine-grained than what
> rZync is doing.  For instance, it sounded like you would have much more
> primitive building-block messages and move much of the controlling
> smarts into something like a python-language scripting layer.  While
> rZync allows ftp-level control (such as "send this file", "send this
> directory tree", "delete this file", "create this directory") it does
> this with a small number of higher-level command messages.

OK, good.

> I think that's a good idea.  My rZync app currently operates on each arg
> independently, but I recently discovered that this makes it incompatible
> with rsync when merging directories and such.  For instance, the command
> "rsync -r dir1/ dir2/ dir3" merges the file list and removes duplicates
> before starting the transfer to dir3.

This is a substantial source of cruft in the current code, and one of
the reasons claimed to make an up-front traversal necessary.

I think a more efficient, and possibly simpler, solution would be to
first examine all of the source directories and determine their
relationships.  Basically, you might discover that dir2 is in fact a
subdirectory of dir1 (or vice versa), or that the two are the same
directory, in which case you can eliminate the redundant one.  Or you
might discover that they're disjoint.  Given that directories are
trees, I don't think there are any other possibilities.

Doing this in a way that properly respects various symlink options
will be a little complex, but I think it is in principle possible.  It
is also something quite amenable to being thoroughly exercised in
isolation as a unit test.

I am pretty sure that you can do this by just examining dir1 and dir2.
You do need to look at the filesystem to find out about symlinks and
so on, but I think you do not need to traverse their contents.

It is pretty complex, so there might be some case I've missed.
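
As a rough illustration of the idea above, here is a minimal Python
sketch that classifies two source directories by looking only at their
own paths, without traversing their contents.  The function name
classify_dirs and the follow_symlinks switch are hypothetical, and the
sketch only covers symlinks in the path leading to each directory; the
various rsync symlink options would need more care than this.

```python
import os


def classify_dirs(a, b, follow_symlinks=True):
    """Classify the relationship between directories a and b.

    Returns "same", "a_contains_b", "b_contains_a" or "disjoint".
    Only the two paths themselves are examined, never their contents.
    (Hypothetical helper, not part of rsync or rZync.)
    """
    # realpath consults the filesystem to resolve symlinks in the path;
    # abspath only normalizes the path textually.
    resolve = os.path.realpath if follow_symlinks else os.path.abspath
    ra, rb = resolve(a), resolve(b)
    if ra == rb:
        return "same"
    try:
        # commonpath compares whole components, so "dir1" is not
        # mistaken for a parent of "dir10".
        common = os.path.commonpath([ra, rb])
    except ValueError:
        # e.g. paths on different drives on Windows
        return "disjoint"
    if common == ra:
        return "a_contains_b"
    if common == rb:
        return "b_contains_a"
    return "disjoint"
```

With a result of "same" or a containment result, the redundant
argument can be dropped before the transfer starts; only "disjoint"
pairs need both directories to be walked.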

> I got rid of the "multi-IO" idiom of rsync in favor of sending all
> data via messages and limiting each chunk to 32K to allow other
> messages to be mixed into the middle of a large file's data-stream
> (such as verbose output).

OK, that makes sense.  I guess 32k is as good a number as any.
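
To make the chunking idea concrete, here is a small Python sketch of
splitting a file's data into messages of at most 32K so that other
messages can be slipped in between them.  The header layout (one type
byte plus a two-byte length) and the message type numbers are
assumptions made up for this example; they are not rZync's actual
wire format.

```python
import struct

CHUNK_LIMIT = 32 * 1024          # at most 32K of payload per message
HEADER = struct.Struct("!BH")    # hypothetical header: type, length
MSG_DATA, MSG_VERBOSE = 1, 2     # hypothetical message types


def frame(msg_type, payload, limit=CHUNK_LIMIT):
    """Yield framed messages carrying at most `limit` payload bytes each.

    An empty payload yields no messages.
    """
    for off in range(0, len(payload), limit):
        chunk = payload[off:off + limit]
        yield HEADER.pack(msg_type, len(chunk)) + chunk


def parse(stream):
    """Yield (msg_type, payload) pairs from a byte string of messages."""
    pos = 0
    while pos < len(stream):
        msg_type, length = HEADER.unpack_from(stream, pos)
        pos += HEADER.size
        yield msg_type, stream[pos:pos + length]
        pos += length
```

Because every chunk carries its own header, a sender could emit the
first data message, then a verbose-output message, then the remaining
data messages, and the receiver can still reassemble the file by
concatenating the MSG_DATA payloads in order.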

> I think the basic idea of how rZync envisions a new protocol working is
> a good one -- not so much the specifics of the bytes sent in the
> message-header format, but how the messages flow, how each side handles
> the messages in a single process, how all I/O is handled by a single
> function, etc.  There's certainly lots of room for improvement,
> though.

I've started looking at the code, and it looks very nice.  It's
certainly easier to read than rsync.  Would you mind putting in some
more comments to help me along though?

I had a couple of "internal" thoughts about how the code for a next
release ought to go.  Please don't take them as criticisms of your
right to write experimental code however you want, or as an attempt to
dictate how we run things.  I just want to raise the issues.

Global names should be distinguished with some kind of prefix, as in
librsync: "rz_" or whatever.  If this ever turns into a library that
gets linked into something else it will help; in the meantime it helps
keep clear what is part of the project and what's pulled in from
elsewhere.

I really liked mkproto.awk when I first saw it, but now I'm not so
keen.  I think maintaining header files by hand is in some ways a
good thing, because it forces you to think about whether a particular
function really needs to be exported to the rest of the program, or to the
world at large.