[librsync-devel] Re: state of the rsync nation? (revisited 6/2003 from 11/2000)

Donovan Baarda abo at minkirri.apana.org.au
Wed Jun 11 19:20:12 EST 2003


On Wed, 2003-06-11 at 16:13, Martin Pool wrote:
> On 11 Jun 2003, Donovan Baarda <abo at minkirri.apana.org.au> wrote:
> > On Wed, 2003-06-11 at 13:59, Martin Pool wrote:
[...]
> > > On 11 Jun 2003, Donovan Baarda <abo at minkirri.apana.org.au> wrote:
> > I forget if I saw this in Tridge's thesis, but I definitely noticed that
> > librsync uses a modified zlib to make feeding data to the compressor and
> > throwing away the compressed output more efficient. I have implemented
> > this in pysync too, though I don't use a modified zlib... I just throw
> > the compressed output away.
> 
> Yes, I remember that, but that's not rzip.

Just had a look at rzip... interesting... haven't yet digested its
impact on things like xdelta yet though... hence the rambling
explorations in this email.

> By the way the gzip hack is an example of a place where I think a bit
> of extra compression doesn't justify cluttering up the code.  I think
> I'd rather just compress the whole stream with plain gzip and be done.

I originally had the same thoughts... but then testing on real-world
data showed me a 5~10% improvement on compression when using it, so I
included it in pysync. "priming" the compressor with match data does
make a difference when you have a lot of match data interspersed with
your miss data.

However, I agree that it is probably much less efficient than using a
compression scheme designed around the "match encoding" rsync already
uses.

> See http://samba.org/~tridge/phd_thesis.pdf pg 86ff
> 
> rzip is about using block search algorithms to find widely-separated
> identical blocks in a file.  (I won't go into detail because tridge's
> explanation is quite clear.)
> 
> I am pretty sure you could encode rzip into VCDIFF.  I am not sure if
> VCDIFF will permit an encoding as efficient as you might get from a
> format natively designed for rzip, but perhaps it will be good enough
> that using a standard format is a win anyhow.  Perhaps building a
> VCDIFF and then using bzip/gzip/lzop across the top would be
> acceptable.

I'm sure you could encode rzip into vcdiff, and you would loose very
little, and possibly gain a bit (if you add byte-extending matches
backwards, matches are never byte aligned, and vcdiff can efficiently
encode any short or long offsets _and_ lengths).

> In fact rzip has more in common with xdelta than rsync, since it works
> entirely locally and can find blocks of any length. 

Yeah, I noticed it was very similar in some ways... though xdelta is a
bit more "brute force" in that it builds a massive hash table for the
whole file using a small block size. rzip is a very nice way to minimise
the size of the hash table without sacrificing too much. 

Also Tridge doesn't have byte-extending matches backwards, which xdelta
does. rzip can't extend matches backwards because the exponent based
"offset encoding" can't support arbitary byte aligned long offsets. 

Right now I'm thinking the big innovation rzip provides is not the
exponent based offset coding, but a neat way of minimising the hash
table.

> rzip's advantage compared to gzip/bzip2 is that it can use compression
> windows of unlimited size, as compared to a maximum of 900kB for
> bzip2.  Holding an entire multi-100MB file in memory and compressing
> it in a single window is feasible on commodity hardware.

I just realised you already said what I typed above :-)

> > The self referencing compression idea is neat but would be a...
> > challenge to implement. For it to be effective, the self-referenced
> > matches would need to be non-block aligned like xdelta, which tends to
> > suggest using xdelta to do the self-reference matches on top of rsync
> > for the block aligned remote matches. Fortunately xdelta and rsync have
> > heaps on common, so implementing both in one library would be easy (see
> > pysync for an example).

After looking at rzip, I'm convinced xdelta needs to use rzip's
"expiring hash table" approach to minimise the hash table.... xdelta is
evil on gigabyte size files.

-- 
Donovan Baarda <abo at minkirri.apana.org.au>
http://minkirri.apana.org.au/~abo/




More information about the rsync mailing list