Improving the rsync protocol (RE: Rsync dies)

Mon May 20 04:39:01 EST 2002

On 17 May 2002, Wayne Davison <wayned at users.sourceforge.net> wrote:
> On Fri, 17 May 2002, Allen, John L. wrote:
> > In my humble opinion, this problem with rsync growing a huge memory
> > footprint when large numbers of files are involved should be #1 on
> > the list of things to fix.
> 
> I have certainly been interested in working on this issue.  I think it
> might be time to implement a new algorithm, one that would let us
> correct a number of flaws that have shown up in the current
> approach.

(Only my opinion, all of this is debatable, etc.  In particular, I
have deep reservations about proposing a rewrite, because I know
rewrites always seem attractive but rarely work out well.
<http://www.joelonsoftware.com/articles/fog0000000348.html>)

I've been thinking about this too.  I think the top-level question is

  Start from scratch with a new protocol, or try to work within the
  current one?

This largely determines whether we'll be able to implement a new
algorithm or codebase, or need to evolve the current one.  I think the
nature of the current protocol is that it will be hard to make really
fundamental improvements without rewriting it.

rsync3.txt in CVS contains some ideas and features people have
proposed for a reimplementation.

If we're going to change the protocol, I think it would be good to
move to one that allows us to experiment with changing the
implementation without breaking compatibility.  You can see the way
people have written very diverse implementations of HTTP or SMTP, but
rsync doesn't really encourage that.

Just one example of wanting flexibility in implementation is that
having two processes at one end of the pipe has caused several
problems in

 - making a native W32 port
 - various hangs on Linux
 - porting to VMS and other potential non-Unix systems

I'm not saying that we shouldn't ever decide that forking on one end
was a good solution, but rather that we shouldn't require it in the
protocol.  It would seem to make sense to do the first version with
the traditional setup of one client and one daemon.

Beyond that, I think there are a couple of things about the protocol
we can be pretty sure about:

 - try to use constant memory regardless of tree size
 - try to use time & traffic proportional to deltas
 - no upfront tree traversal
 - pipelining
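
To make the "no upfront tree traversal" point a bit more concrete,
here is a rough sketch of a sender-side walk that hands out one file
at a time instead of building the whole file list first.  This is
only an illustration, not a proposal for the actual code: the name
walk_incrementally and the (path, size, mtime) tuple are made up for
the example.  Memory is bounded by the largest single directory plus
the list of directories still waiting to be visited, rather than by
the size of the whole tree.

```python
import os
from typing import Iterator, Tuple


def walk_incrementally(root: str) -> Iterator[Tuple[str, int, float]]:
    """Yield (relative path, size, mtime) for one regular file at a time.

    Hypothetical helper: entries are sorted by name so that both ends of
    a connection could walk their trees in the same order.
    """
    stack = [root]  # directories still to be visited
    while stack:
        directory = stack.pop()
        # Only one directory's entries are held in memory at a time.
        with os.scandir(directory) as it:
            entries = sorted(it, key=lambda e: e.name)
        subdirs = []
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                subdirs.append(entry.path)
            elif entry.is_file(follow_symlinks=False):
                st = entry.stat(follow_symlinks=False)
                yield os.path.relpath(entry.path, root), st.st_size, st.st_mtime
        # Reverse so the alphabetically first subdirectory is popped next.
        stack.extend(reversed(subdirs))
```

Because this is a generator, whatever consumes it can start sending
or comparing the first file before the rest of the tree has been
looked at, which is also what makes pipelining possible.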

I wrote librsync.  There is some documentation and I can add more if
there's anything undocumented.

I haven't looked at pysync as much as it deserves, but it could be a
good foundation.

I think Tim said he'd written his own program, and there are also
others around from which we might scrounge ideas or even code.

-- 
Martin