[librsync-devel] Re: state of the rsync nation? (revisited 6/2003 from 11/2000)

Wed Jun 11 16:13:35 EST 2003

On 11 Jun 2003, Donovan Baarda <abo at minkirri.apana.org.au> wrote:
> On Wed, 2003-06-11 at 13:59, Martin Pool wrote:
> > On 11 Jun 2003, Donovan Baarda <abo at minkirri.apana.org.au> wrote:
> > 
> > > The vcdiff standard is available as RFC3284, and Josh is listed as one
> > > of the authors. 
> > 
> > Yes, I've just been reading that.
> > 
> > I seem to remember that it was around as an Internet-Draft when I
> > started, but it didn't seem clear that it would become standard so I
> > didn't use it.
> 
> I'm not sure if this is the same one... I vaguely recall something like
> this too, but I think it was an attempt to add delta support to http and
> had the significant flaw of not supporting rsync's
> "delta-from-signature". It may have come out of the early xdelta http
> proxy project. IMHO rproxy's http extensions for delta support were
> better because they were more general.

Yes, the most recent version of the Mogul delta-http proposal I read
assumed that the server had a complete history of the document to
generate diffs.  This is fine if you're serving e.g. software
distributions or content from a version control system and have the
history, but not very general.
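
For contrast, here is a minimal Python sketch of rsync-style "delta-from-signature", where the side generating the delta never sees the old file, only per-block checksums of it. This is an illustration under stated assumptions, not librsync's or rproxy's actual code or wire format: the function names, the tiny BLOCK size, and the choice of Adler-32 plus SHA-1 are all hypothetical, and the weak checksum is recomputed at each offset rather than rolled.

```python
import hashlib
import zlib

BLOCK = 8  # unrealistically small block size, chosen so the example is easy to trace


def signature(old: bytes) -> dict:
    """Holder of the old file: weak checksum -> {strong checksum: block index}."""
    sig = {}
    for start in range(0, len(old) - BLOCK + 1, BLOCK):
        block = old[start:start + BLOCK]
        strong = hashlib.sha1(block).digest()
        sig.setdefault(zlib.adler32(block), {}).setdefault(strong, start // BLOCK)
    return sig


def delta(sig: dict, new: bytes) -> list:
    """Holder of the new file: build copy/literal ops using only the signature."""
    ops, literal, pos = [], bytearray(), 0
    while pos + BLOCK <= len(new):
        block = new[pos:pos + BLOCK]
        # A real implementation rolls the weak checksum instead of recomputing it.
        candidates = sig.get(zlib.adler32(block))
        index = candidates.get(hashlib.sha1(block).digest()) if candidates else None
        if index is None:
            literal.append(new[pos])
            pos += 1
            continue
        if literal:
            ops.append(("literal", bytes(literal)))
            literal.clear()
        ops.append(("copy", index))
        pos += BLOCK
    literal += new[pos:]
    if literal:
        ops.append(("literal", bytes(literal)))
    return ops


def patch(old: bytes, ops: list) -> bytes:
    """Holder of the old file: apply the ops to rebuild the new file."""
    out = bytearray()
    for op in ops:
        if op[0] == "literal":
            out += op[1]
        else:
            out += old[op[1] * BLOCK:(op[1] + 1) * BLOCK]
    return bytes(out)
```

Only the output of signature() and the ops list cross the network, so the server needs no document history.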

> I forget if I saw this in Tridge's thesis, but I definitely noticed that
> librsync uses a modified zlib to make feeding data to the compressor and
> throwing away the compressed output more efficient. I have implemented
> this in pysync too, though I don't use a modified zlib... I just throw
> the compressed output away.

Yes, I remember that, but that's not rzip.

By the way, the gzip hack is an example of a place where I think a bit
of extra compression doesn't justify cluttering up the code.  I think
I'd rather just compress the whole stream with plain gzip and be done.
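
For readers who have not seen the hack, a minimal sketch using Python's stock zlib (roughly the "just throw the compressed output away" variant, not the modified zlib). The function names are hypothetical, and the receiver-side trick of replaying the matched data as stored deflate blocks is an assumption about one workable approach, not a description of what librsync or pysync does.

```python
import struct
import zlib


def send_literal(comp, matched: bytes, literal: bytes) -> bytes:
    """Sender: prime the compressor with data the receiver already holds,
    discard that output, and return only the compressed literal data."""
    comp.compress(matched)
    comp.flush(zlib.Z_SYNC_FLUSH)  # byte-aligns the stream; output thrown away
    return comp.compress(literal) + comp.flush(zlib.Z_SYNC_FLUSH)


def receive_literal(decomp, matched: bytes, wire: bytes) -> bytes:
    """Receiver: put the matched data into the inflate window by feeding it
    as uncompressed "stored" blocks, then decompress what actually arrived."""
    for start in range(0, len(matched), 0xFFFF):
        chunk = matched[start:start + 0xFFFF]
        # Stored block: header byte 0x00 (not final, type 00), LEN, ~LEN, data.
        header = b"\x00" + struct.pack("<HH", len(chunk), len(chunk) ^ 0xFFFF)
        decomp.decompress(header + chunk)  # yields chunk again; ignored
    return decomp.decompress(wire)


# Raw deflate (negative wbits) so there is no zlib header or checksum in the way.
matched = b"block the receiver already has, found via the signature. " * 4
literal = b"new data that repeats: block the receiver already has"
wire = send_literal(zlib.compressobj(level=9, wbits=-15), matched, literal)
assert receive_literal(zlib.decompressobj(wbits=-15), matched, wire) == literal
```

The gain is that literal data can back-reference matched blocks. The cost is the extra bookkeeping on both sides, which is the clutter I mean.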

See http://samba.org/~tridge/phd_thesis.pdf pg 86ff

rzip is about using block search algorithms to find widely-separated
identical blocks in a file.  (I won't go into detail because tridge's
explanation is quite clear.)

I am pretty sure you could encode rzip into VCDIFF.  I am not sure if
VCDIFF will permit an encoding as efficient as you might get from a
format natively designed for rzip, but perhaps it will be good enough
that using a standard format is a win anyhow.  Perhaps building a
VCDIFF and then using bzip/gzip/lzop across the top would be
acceptable.

In fact rzip has more in common with xdelta than rsync, since it works
entirely locally and can find blocks of any length. 

rzip's advantage compared to gzip/bzip2 is that it can use compression
windows of unlimited size, as compared to a maximum of 900kB for
bzip2.  Holding an entire multi-100MB file in memory and compressing
it in a single window is feasible on commodity hardware.
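
A rough sketch of the general idea, with instructions named after VCDIFF's ADD and COPY. It is not tridge's algorithm (see the thesis for that) and it does not produce the RFC 3284 byte format. The BLOCK size and function names are hypothetical, and raw block contents are used as dictionary keys where a real implementation would use a rolling hash.

```python
BLOCK = 16  # hypothetical minimum match length


def find_long_matches(data: bytes) -> list:
    """Return ("add", bytes) and ("copy", offset, length) instructions that
    rebuild data, matching against everything seen so far (no window limit)."""
    index = {}    # contents of an aligned block -> earliest offset it occurred at
    indexed = 0   # aligned blocks that start below this offset are in the index
    ops, literal, pos = [], bytearray(), 0
    while pos < len(data):
        # Index every aligned block that ends at or before the current position.
        while indexed + BLOCK <= pos:
            index.setdefault(data[indexed:indexed + BLOCK], indexed)
            indexed += BLOCK
        src = index.get(data[pos:pos + BLOCK]) if pos + BLOCK <= len(data) else None
        if src is None:
            literal.append(data[pos])
            pos += 1
            continue
        # Extend the match forward as far as it goes; it may be any length.
        length = BLOCK
        while pos + length < len(data) and data[src + length] == data[pos + length]:
            length += 1
        if literal:
            ops.append(("add", bytes(literal)))
            literal.clear()
        ops.append(("copy", src, length))
        pos += length
    if literal:
        ops.append(("add", bytes(literal)))
    return ops


def rebuild(ops: list) -> bytes:
    """Apply the instructions; COPY offsets refer to the output produced so far."""
    out = bytearray()
    for op in ops:
        if op[0] == "add":
            out += op[1]
        else:
            _, offset, length = op
            for i in range(length):  # byte at a time, so overlapping copies work
                out.append(out[offset + i])
    return bytes(out)
```

The "add" payloads are what you would then hand to bzip2 or gzip, since the long-range redundancy has already been removed.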

> The self referencing compression idea is neat but would be a...
> challenge to implement. For it to be effective, the self-referenced
> matches would need to be non-block aligned like xdelta, which tends to
> suggest using xdelta to do the self-reference matches on top of rsync
> for the block aligned remote matches. Fortunately xdelta and rsync have
> heaps in common, so implementing both in one library would be easy (see
> pysync for an example).
> 
> If I didn't have paid work I would be prototyping it in pysync right
> now. If anyone wanted to fund something like this I could make myself
> available :-)

I may get a chance to work full time on replication again soon, so I'm
trying to work out where we're up to.

> Yeah, my big complaint about librsync at the moment is it is messy. Just
> cleaning up the code alone will be a big improvement. I would guess that
> at least 30% of the code could be trimmed away, leaving a cleaner and
> more extensible core, and because "messy" leads to "inefficient", it
> would be faster too.

"If I'd had more time this letter would have been shorter."

-- 
Martin