Improving the rsync protocol (RE: Rsync dies)
jw at pegasys.ws
Fri May 17 15:17:01 EST 2002
On Fri, May 17, 2002 at 01:42:31PM -0700, Wayne Davison wrote:
> On Fri, 17 May 2002, Allen, John L. wrote:
> > In my humble opinion, this problem with rsync growing a huge memory
> > footprint when large numbers of files are involved should be #1 on
> > the list of things to fix.
> I have certainly been interested in working on this issue. I think it
> might be time to implement a new algorithm, one that would let us
> correct a number of flaws that have shown up in the current approach.
> Toward this end, I've been thinking about adding a 2nd process on the
> sending side and hooking things up in a different manner:
> The current protocol has one sender process on the sending side, while
> the receiving side has both a generator process and a receiver process.
> There is only one bi-directional pipe/socket that lets data flow from
> the generator to the sender in one direction, and from the sender to the
> receiver in the other direction. The receiver also has a couple pipes
> connecting itself to the generator in order to get data to the sender.
> I'd suggest changing things so that a (new) scanning process on the
> sending side would have a bi-directional link with the generator process
> on the receiving side. This would let both processes descend through
> the tree incrementally and simultaneously (working on a single directory
> at a time) and figure out what files were different. The list of files
> that needed to be transferred PLUS a list of what files need to be
> deleted (if any) would be piped from the scanner process to the sender
> process, who would have a bi-directional link to the receiver process
> (perhaps using ssh's multi-channel support?). There would be no link
> between the receiver and the generator.
With 4 stages I don't know that there need to be any bidirectional pipes.
Below I will dis-recommend this unidirectional structure.
    scanner   - output to generator.
                generates stat info (one directory at a time)
    generator - input from scanner and output to sender
                compare stat info from scanner and
                generate ADD, DEL and CHANGE orders with
                checksums for change or --checksum
    sender    - input from generator and output to receiver
                send ADD, DEL and CHANGE orders + generate
                checksums and transmit file contents
    receiver  - input from sender output is logging
                do the ADD, DEL and CHANGEs
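
To make the generator stage concrete, here is a minimal Python sketch of
comparing stat info for a single directory and emitting orders. The names
StatInfo and generate_orders are hypothetical, as is the choice of size
and mtime as the compared fields; the checksums that would accompany
CHANGE orders are left out. It is an illustration under those
assumptions, not rsync code.

```python
from typing import Dict, List, NamedTuple, Tuple


class StatInfo(NamedTuple):
    """Hypothetical per-file stat record produced by the scanner."""
    size: int
    mtime: int


def generate_orders(
    src: Dict[str, StatInfo], dst: Dict[str, StatInfo]
) -> List[Tuple[str, str]]:
    """Compare one directory's stat info and return (order, name) pairs.

    src is the listing from the sending side, dst the listing on the
    receiving side. Checksums for CHANGE orders are omitted here.
    """
    orders = []
    for name in sorted(src.keys() | dst.keys()):
        if name not in dst:
            orders.append(("ADD", name))      # only on the sending side
        elif name not in src:
            orders.append(("DEL", name))      # only on the receiving side
        elif src[name] != dst[name]:
            orders.append(("CHANGE", name))   # present on both, stat differs
    return orders
```

Because only one directory's listings are held at a time, the working
set is bounded by the largest directory rather than the whole tree.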
> The advantage of this is that the sender and the receiver are really
> very simple. There is a list of file actions that is being received on
> stdin by the sending process, and this indicates what files to update
> and which files to delete. (It might even be possible to make sender be
> controlled by other programs.) These programs would not need to know
> about exclusion lists, delete options, or any of the more esoteric
> options, but would get told things like the timeout settings via the
> stdin pipe. In this scenario, all error messages would get sent to the
> sender process, who would output them on stdout (flushed).
In most ways I like this description much better.
scanner+generator create a dataset that can be captured
or created another way. Similarly the sender output could
be captured or broadcast to update multiple locations or
redo somewhat like --batch*. To summarize the outputs:
    scanner+generator -- changeset without data
    sender            -- changeset with data.
This means that it doesn't matter where scanner or generator
run except that you must invert the changeset directives.
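
One possible reading of "invert the changeset directives", sketched in
Python: if the comparison was made from the opposite side's point of
view, ADD and DEL swap while CHANGE stays as it is. The function name
invert_changeset and the (order, name) tuple layout are assumptions made
for this illustration only.

```python
from typing import Iterable, List, Tuple

# Assumed mapping: a file that one side would ADD is one the other side
# would DEL; a CHANGE is a CHANGE from either side.
_INVERSE = {"ADD": "DEL", "DEL": "ADD", "CHANGE": "CHANGE"}


def invert_changeset(orders: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return the changeset as seen from the opposite side."""
    return [(_INVERSE[op], name) for op, name in orders]
```

Applying the function twice gives back the original changeset, which is
a quick sanity check on the mapping.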
> The scanner/generator process would be the thing that parses the
> commandline, communicates the exclude list to its opposite process, and
> figures out exactly what to do. The scanner would spawn the sender, and
> field all the error messages that it generates. It would then either
> output the errors locally or send them over to the generator for output
> (depending on whether we're pushing or pulling files).
I would describe it more as a case that we parse the
commandline and set up the communication channels then split
up into whatever parts are needed, having options for
    scanner+generator | sender | receiver
    scanner+generator | sender > csetdata
    scanner+generator > cset
    sender < cset | receiver
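
For the pipelines above to work, the changeset has to be something that
can be written to a file or pipe and read back later. A minimal Python
sketch, assuming a made-up line format of "ORDER<TAB>path" (the real
format is not specified in this thread, and write_cset / read_cset are
hypothetical names):

```python
import io
from typing import Iterable, Iterator, TextIO, Tuple

Order = Tuple[str, str]
VALID_OPS = {"ADD", "DEL", "CHANGE"}


def write_cset(orders: Iterable[Order], out: TextIO) -> None:
    """Write one order per line, e.g. for `scanner+generator > cset`."""
    for op, path in orders:
        # Assumed format; paths containing newlines would need escaping.
        out.write(f"{op}\t{path}\n")


def read_cset(inp: TextIO) -> Iterator[Order]:
    """Read orders back lazily, e.g. for `sender < cset`."""
    for line in inp:
        op, path = line.rstrip("\n").split("\t", 1)
        if op not in VALID_OPS:
            raise ValueError(f"unknown order: {op!r}")
        yield op, path
```

Reading lazily means the sender can start acting on the first orders
before the scanner+generator has finished producing the rest.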
> As for who spawns the receiver, it would be nice if this was done by the
> sender (so they could work alone), but an alternative would be to have
> the generator spawn the receiver and then let the receiver hook up
> with the sender via the existing ssh connection.
> This idea is still in its early stages, so feel free to tell me exactly
> where I've missed the boat.
I really like this idea.
I haven't been stung yet by these issues but changing the
scanner+generator to work on one directory at a time will
not only remove the memory footprint problem but also should
take care of the network timeouts.
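
As an illustration of the one-directory-at-a-time idea, a Python sketch
of a scanner that yields each directory's stat info as soon as it has
been read, instead of building a file list for the whole tree first. The
name scan_tree and the (size, mtime) tuple are assumptions made for this
example, not part of rsync.

```python
import os
from typing import Dict, Iterator, Tuple


def scan_tree(root: str) -> Iterator[Tuple[str, Dict[str, Tuple[int, int]]]]:
    """Yield (directory, {name: (size, mtime)}) one directory at a time."""
    pending = [root]
    while pending:
        dirpath = pending.pop()
        entries: Dict[str, Tuple[int, int]] = {}
        subdirs = []
        with os.scandir(dirpath) as it:
            for entry in it:
                st = entry.stat(follow_symlinks=False)
                entries[entry.name] = (st.st_size, int(st.st_mtime))
                if entry.is_dir(follow_symlinks=False):
                    subdirs.append(entry.path)
        # Hand this directory downstream before descending further.
        yield dirpath, entries
        # Reverse-sorted so pop() visits subdirectories in name order.
        pending.extend(sorted(subdirs, reverse=True))
```

Only the current directory's entries and the list of directories still
to visit are held in memory, and something is emitted for every
directory visited, which is what would keep the link from sitting idle
during a long scan.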
Breaking it up into 4 processes in 3 distinct stages with
clearly defined ABIs will vastly increase the flexibility.
Sadly, this will require my moving the link-dest
functionality into receiver where it belongs anyway.
New patch forthcoming.
J.W. Schultz Pegasystems Technologies
email address: jw at pegasys.ws
Remember Cernan and Schmitt