Improving the rsync protocol (RE: Rsync dies)

jw schultz jw at pegasys.ws
Fri May 17 15:17:01 EST 2002


On Fri, May 17, 2002 at 01:42:31PM -0700, Wayne Davison wrote:
> On Fri, 17 May 2002, Allen, John L. wrote:
> > In my humble opinion, this problem with rsync growing a huge memory
> > footprint when large numbers of files are involved should be #1 on
> > the list of things to fix.
> 
> I have certainly been interested in working on this issue.  I think it
> might be time to implement a new algorithm, one that would let us
> correct a number of flaws that have shown up in the current approach.
> 
> Toward this end, I've been thinking about adding a 2nd process on the
> sending side and hooking things up in a different manner:
> 
> The current protocol has one sender process on the sending side, while
> the receiving side has both a generator process and a receiver process.
> There is only one bi-directional pipe/socket that lets data flow from
> the generator to the sender in one direction, and from the sender to the
> receiver in the other direction.  The receiver also has a couple pipes
> connecting itself to the generator in order to get data to the sender.
> 
> I'd suggest changing things so that a (new) scanning process on the
> sending side would have a bi-directional link with the generator process
> on the receiving side.  This would let both processes descend through
> the tree incrementally and simultaneously (working on a single directory
> at a time) and figure out what files were different.  The list of files
> that needed to be transferred PLUS a list of what files need to be
> deleted (if any) would be piped from the scanner process to the sender
> process, who would have a bi-directional link to the receiver process
> (perhaps using ssh's multi-channel support?).  There would be no link
> between the receiver and the generator.

With 4 stages I don't know that there need to be any
bidirectional pipes.  Below I diagram this unidirectional
structure.

	scanner - output to generator
		generates stat info (one directory at a time)

	generator - input from scanner, output to sender
		compares stat info from the scanner and
		generates ADD, DEL and CHANGE orders, with
		checksums for CHANGE orders or for --checksum

	sender - input from generator, output to receiver
		sends ADD, DEL and CHANGE orders, generates
		checksums and transmits file contents

	receiver - input from sender, output is logging
		carries out the ADD, DEL and CHANGE orders
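
To make that concrete, here is a rough sketch of what one
order record on the generator->sender pipe might look like.
The names and layout are mine, not anything in the current
rsync source, and a real wire format would marshal each field
explicitly rather than writing the struct raw:

	/* Hypothetical order record, generator -> sender.  Not the
	 * real rsync protocol; padding/endianness ignored for the
	 * sake of the sketch. */
	#include <stdio.h>
	#include <stdint.h>
	#include <string.h>

	enum order_type { ORD_ADD, ORD_DEL, ORD_CHANGE };

	struct order {
		enum order_type type;
		uint32_t path_len;   /* length of the path that follows */
		uint32_t csum_count; /* block checksums to follow (CHANGE only) */
	};

	/* Write one order header plus its path to the sender's pipe. */
	static int send_order(FILE *pipe, enum order_type t,
			      const char *path, uint32_t csum_count)
	{
		struct order o = { t, (uint32_t)strlen(path), csum_count };

		if (fwrite(&o, sizeof o, 1, pipe) != 1)
			return -1;
		if (fwrite(path, 1, o.path_len, pipe) != o.path_len)
			return -1;
		return 0;
	}

	int main(void)
	{
		/* For illustration: one ADD order written to stdout. */
		return send_order(stdout, ORD_ADD, "dir/file.txt", 0) ? 1 : 0;
	}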

> 
> The advantage of this is that the sender and the receiver are really
> very simple.  There is a list of file actions that is being received on
> stdin by the sending process, and this indicates what files to update
> and which files to delete.  (It might even be possible to make sender be
> controlled by other programs.)  These programs would not need to know
> about exclusion lists, delete options, or any of the more esoteric
> options, but would get told things like the timeout settings via the
> stdin pipe.  In this scenario, all error messages would get sent to the
> sender process, who would output them on stdout (flushed).

In most ways I like this description much better.

The scanner+generator pair creates a dataset that can be
captured, or created some other way.  Similarly, the sender
output could be captured, or broadcast to update multiple
locations, or replayed later, somewhat like the --batch*
options.  To summarize the outputs:

	scanner+generator -- changeset without data
	sender		-- changeset with data.

This means that it doesn't matter where the scanner or
generator runs, except that you must invert the changeset
directives.
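
For what it's worth, here is one way to read that inversion,
again with made-up names and no claim that this is how it
would really be done:

	/* Sketch of flipping directives, reusing the order_type
	 * enum from the sketch above.  This assumes "invert" means:
	 * a path the comparison saw as ADD becomes a DEL when the
	 * changeset is applied from the other side, and vice versa;
	 * CHANGE stays CHANGE but the data flows the other way. */
	#include <stdio.h>

	enum order_type { ORD_ADD, ORD_DEL, ORD_CHANGE };

	static enum order_type invert_order(enum order_type t)
	{
		switch (t) {
		case ORD_ADD:
			return ORD_DEL;
		case ORD_DEL:
			return ORD_ADD;
		default:
			return ORD_CHANGE;
		}
	}

	int main(void)
	{
		printf("ADD inverts to %s\n",
		       invert_order(ORD_ADD) == ORD_DEL ? "DEL" : "?");
		return 0;
	}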

> The scanner/generator process would be the thing that parses the
> commandline, communicates the exclude list to its opposite process, and
> figures out exactly what to do.  The scanner would spawn the sender, and
> field all the error messages that it generates.  It would then either
> output the errors locally or send them over to the generator for output
> (depending on whether we're pushing or pulling files).

I would describe it more as: we parse the commandline and set
up the communication channels, then split up into whatever
parts are needed based on the options and argv[0].
	scanner+generator | sender | receiver
	scanner+generator | sender > csetdata
	scanner+generator > cset
	receiver < csetdata
	sender < cset | receiver
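
A loose sketch of that split; the program name and the flag it
checks are invented for illustration, not existing rsync
options:

	/* Sketch of choosing roles after the command line is
	 * parsed.  "rsync-receive" and --scan-only are hypothetical
	 * names, used only to show the dispatch on argv[0] and
	 * options. */
	#include <stdio.h>
	#include <string.h>

	int main(int argc, char **argv)
	{
		const char *name = strrchr(argv[0], '/');
		name = name ? name + 1 : argv[0];

		if (strcmp(name, "rsync-receive") == 0)
			printf("role: receiver (apply changeset from stdin)\n");
		else if (argc > 1 && strcmp(argv[1], "--scan-only") == 0)
			printf("role: scanner+generator (changeset, no data)\n");
		else
			printf("role: scanner+generator | sender\n");
		return 0;
	}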



> As for who spawns the receiver, it would be nice if this was done by the
> sender (so they could work alone), but an alternative would be to have
> the generator spawn the receiver and then let the receiver hook up
> with the sender via the existing ssh connection.
> 
> This idea is still in its early stages, so feel free to tell me exactly
> where I've missed the boat.
> 
> ..wayne..

I really like this idea.

I haven't been stung by these issues yet, but changing the
scanner+generator to work on one directory at a time will not
only remove the memory footprint problem but should also take
care of the network timeouts.
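
Here is a minimal sketch of such a walk; the point is only that
each directory's entries are emitted before we descend, nothing
else about it is meant to be definitive:

	/* One-directory-at-a-time scan.  Each directory's entries
	 * are emitted (and could be flushed to the generator) before
	 * descending, so memory stays proportional to a single
	 * directory and the peer sees steady traffic instead of one
	 * long silent pass over the whole tree.  Error handling and
	 * most stat fields are omitted. */
	#include <stdio.h>
	#include <string.h>
	#include <dirent.h>
	#include <sys/stat.h>

	static void scan_dir(const char *path)
	{
		DIR *d = opendir(path);
		struct dirent *e;
		char full[4096];
		struct stat st;

		if (!d)
			return;

		/* first pass: emit stat info for this directory */
		while ((e = readdir(d)) != NULL) {
			if (!strcmp(e->d_name, ".") || !strcmp(e->d_name, ".."))
				continue;
			snprintf(full, sizeof full, "%s/%s", path, e->d_name);
			if (lstat(full, &st) == 0)
				printf("%s %lld %ld\n", full,
				       (long long)st.st_size,
				       (long)st.st_mtime);
		}

		/* second pass: descend into subdirectories */
		rewinddir(d);
		while ((e = readdir(d)) != NULL) {
			if (!strcmp(e->d_name, ".") || !strcmp(e->d_name, ".."))
				continue;
			snprintf(full, sizeof full, "%s/%s", path, e->d_name);
			if (lstat(full, &st) == 0 && S_ISDIR(st.st_mode))
				scan_dir(full);
		}
		closedir(d);
	}

	int main(int argc, char **argv)
	{
		scan_dir(argc > 1 ? argv[1] : ".");
		return 0;
	}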

Breaking it up into 4 processes in 3 distinct stages with
clearly defined ABIs will vastly increase the flexibility.

Sadly, this will require my moving the link-dest functionality
into the receiver, where it belongs anyway.  New patch
forthcoming.


-- 
________________________________________________________________
	J.W. Schultz            Pegasystems Technologies
	email address:		jw at pegasys.ws

		Remember Cernan and Schmitt



