Improving the rsync protocol (RE: Rsync dies)

Fri May 17 13:46:02 EST 2002

On Fri, 17 May 2002, Allen, John L. wrote:
> In my humble opinion, this problem with rsync growing a huge memory
> footprint when large numbers of files are involved should be #1 on
> the list of things to fix.

I have certainly been interested in working on this issue.  I think it
might be time to implement a new algorithm, one that would let us
correct a number of flaws that have shown up in the current approach.

Toward this end, I've been thinking about adding a second process on the
sending side and hooking things up in a different manner:

The current protocol has one sender process on the sending side, while
the receiving side has both a generator process and a receiver process.
There is only one bi-directional pipe/socket, which carries data from
the generator to the sender in one direction and from the sender to the
receiver in the other.  The receiver also has a couple of pipes
connecting it to the generator so that it can get data to the sender.

I'd suggest changing things so that a (new) scanning process on the
sending side would have a bi-directional link with the generator process
on the receiving side.  This would let both processes descend through
the tree incrementally and simultaneously (working on a single directory
at a time) and figure out which files differ.  The list of files
that need to be transferred PLUS a list of the files that need to be
deleted (if any) would be piped from the scanner process to the sender
process, which would have a bi-directional link to the receiver process
(perhaps using ssh's multi-channel support?).  There would be no link
between the receiver and the generator.
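
To make the directory-at-a-time idea concrete, here is a rough Python
sketch (not rsync code).  It fakes the scanner/generator conversation by
comparing two local trees, and it assumes a simple size+mtime check to
decide that a file differs.  The names list_dir and scan and the
"update"/"delete" action tuples are made up for illustration.

```python
import os

def list_dir(root, rel):
    """Return {name: (is_dir, size, mtime)} for a single directory only."""
    entries = {}
    with os.scandir(os.path.join(root, rel)) as it:
        for entry in it:
            st = entry.stat(follow_symlinks=False)
            is_dir = entry.is_dir(follow_symlinks=False)
            entries[entry.name] = (is_dir, st.st_size, int(st.st_mtime))
    return entries

def scan(src_root, dst_root, rel=""):
    """Yield ("update" | "delete", relative_path), one directory at a time.

    Only the listings of the directories currently being descended are
    held in memory, never a file list for the whole tree.
    """
    src = list_dir(src_root, rel)
    dst_path = os.path.join(dst_root, rel)
    dst = list_dir(dst_root, rel) if os.path.isdir(dst_path) else {}

    for name in sorted(src):
        child = os.path.join(rel, name)
        is_dir, size, mtime = src[name]
        if is_dir:
            # Descend into the subdirectory before moving on.
            yield from scan(src_root, dst_root, child)
        elif dst.get(name) != (False, size, mtime):
            # Missing on the far side, or size/mtime differ (assumed check).
            yield ("update", child)

    # Anything present only on the receiving side goes on the delete list.
    for name in sorted(set(dst) - set(src)):
        yield ("delete", os.path.join(rel, name))
```

Because scan is a generator, its output could be piped to the sender as
it is produced instead of being collected first.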

The advantage of this is that the sender and the receiver become really
very simple.  The sending process receives a list of file actions on
stdin, which indicates which files to update and which files to delete.
(It might even be possible to let the sender be controlled by other
programs.)  The sender and receiver would not need to know about
exclusion lists, delete options, or any of the more esoteric options,
but would get told things like the timeout settings via the stdin pipe.
In this scenario, all error messages would get sent to the sender
process, which would output them on stdout (flushed).
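
As an illustration of how little the sender would need to know, here is
a minimal Python sketch of its main loop.  The line format ("timeout N",
"update PATH", "delete PATH") is purely an assumption of mine, as are the
run_sender name and the handlers dictionary; nothing like this exists in
rsync today.

```python
def run_sender(actions_in, log_out, handlers):
    """Read one action per line and dispatch it; report errors on log_out.

    Hypothetical line format (an assumption, not an rsync wire format):
        timeout SECONDS
        update PATH
        delete PATH
    """
    settings = {"timeout": None}
    for line in actions_in:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the first space only, so paths may contain spaces.
        verb, _, arg = line.partition(" ")
        if verb == "timeout":
            settings["timeout"] = int(arg)
        elif verb in ("update", "delete"):
            try:
                handlers[verb](arg, settings)
            except OSError as exc:
                # Errors go out right away (flushed), as described above.
                log_out.write(f"error {verb} {arg}: {exc}\n")
                log_out.flush()
        else:
            log_out.write(f"error unknown action: {line}\n")
            log_out.flush()
    return settings
```

A real sender would be started as run_sender(sys.stdin, sys.stdout,
handlers), and any other program able to write such lines could drive
it.  Note that there is no exclude or delete-option logic anywhere in
the loop.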

The scanner/generator pair would be what parses the command line,
communicates the exclude list to its opposite process, and figures out
exactly what to do.  The scanner would spawn the sender, and field all
the error messages that the sender generates.  It would then either
output the errors locally or send them over to the generator for output
(depending on whether we're pushing or pulling files).
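
Here is a small Python sketch of the scanner spawning a sender and
fielding its errors.  The child is a stand-in that just complains about
every "delete" line, and the routing rule (print locally when pushing,
forward to the generator when pulling) is my reading of where the user
is sitting.  CHILD, run_scanner, and send_to_generator are all
hypothetical names.

```python
import subprocess
import sys

# Stand-in sender: reads actions on stdin, reports errors on stdout.
CHILD = r'''
import sys
for line in sys.stdin:
    verb, _, arg = line.rstrip("\n").partition(" ")
    if verb == "delete":
        print(f"error cannot delete {arg}", flush=True)
'''

def run_scanner(actions, pushing, send_to_generator):
    """Spawn the sender, feed it actions, and route its error messages.

    Returns the messages to be output locally.  When pulling, each
    message is handed to send_to_generator instead (assumed routing).
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    # communicate() writes all input and reads all output without deadlock.
    out, _ = proc.communicate("".join(a + "\n" for a in actions))

    local = []
    for message in out.splitlines():
        if pushing:
            local.append(message)       # user is on the sending side
        else:
            send_to_generator(message)  # user is on the receiving side
    return local
```

A real scanner would read the sender's output as it arrives rather than
waiting for it to finish; communicate() is used here only to keep the
sketch short.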

As for who spawns the receiver, it would be nice if this were done by the
sender (so they could work alone), but an alternative would be to have
the generator spawn the receiver and then let the receiver hook up
with the sender via the existing ssh connection.

This idea is still in its early stages, so feel free to tell me exactly
where I've missed the boat.

..wayne..