Compressed backup

Donovan Baarda abo at minkirri.apana.org.au
Sat Jun 1 06:51:01 EST 2002


On Sat, Jun 01, 2002 at 04:57:15AM -0700, jw schultz wrote:
> On Sat, Jun 01, 2002 at 08:51:26PM +1000, Donovan Baarda wrote:
> > On Fri, May 31, 2002 at 05:25:15PM -0700, jw schultz wrote:
[...]
> When I said "content-aware compressor", what I meant was
> that the compressor would actually analyze the plaintext to
> find semantically identifiable blocks.  For example, a large
> HOWTO could be broken up by the level-2 headings.  This would
> be largely (not always) consistent across plaintext changes
> without requiring any awareness of file history.  Have rsync
> be compression aware.  When rsync hits a gzipped file it
> could treat that file as multiple streams in series where it
> would restart the checksumming each time the compression
> table is reset.

That's actually pretty clever... It provides a neat way around the "XML
meta-data at the front changing" problem... just have the XML compression
reset at suitable boundaries in the XML file...

You wouldn't need rsync to be compression aware. Provided the adjacent
unchanged segments between compression resets were larger than the rsync
block size, you would only miss block fragments at the beginning and end of
each matching sequence (as rsync already does). However, by making rsync
gzip-reset aware, you could do interesting things with variable block sizes,
where the file itself specifies the rsync block boundaries. Hmmm, more
thought needed on how this would integrate with the rolling checksum
though... I suspect you could toss the rolling checksum and just look for
matching blocks as defined by the resets, because it's not going to match at
an arbitrary byte boundary anyway.
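
Roughly what I have in mind (a quick Python sketch, purely illustrative;
none of these helpers are real rsync internals): each side hashes whole
reset-delimited plaintext segments, and matching is done per segment rather
than by rolling a weak checksum over every byte offset.

import hashlib

def segment_digest(seg):
    """Strong checksum of one plaintext segment between compression resets."""
    return hashlib.md5(seg).hexdigest()

def match_segments(basis_segments, target_segments):
    """Match whole reset-delimited segments instead of rolling a weak
    checksum over every byte offset."""
    basis_index = {segment_digest(s): i for i, s in enumerate(basis_segments)}
    delta = []
    for seg in target_segments:
        d = segment_digest(seg)
        if d in basis_index:
            delta.append(("copy", basis_index[d]))   # destination already has it
        else:
            delta.append(("literal", seg))           # new data, send verbatim
    return delta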

> I can't see this actually happening but it could work where
> the compression is done by the application that creates the
> file.  If, and only if, that were to be done so that there
> were enough to be worthwhile then rsync could be made
> compression aware in this way but that would require a
> protocol change.

Putting content-aware compression resets at appropriate points will make
files (a bit) rsyncable with rsync as it already stands. Making rsync
compression-reset aware would only improve things a little bit. However, the
resets _must_ be content-aware: they must occur on "likely to match"
boundaries, not at every 4K of compressed text as the gzip-rsyncable patch
currently does.
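
To show the sort of thing I mean by a content-aware reset, here's a rough
Python/zlib sketch; the level-2 heading marker is just an example of a
"likely to match" boundary:

import zlib

def gzip_with_content_resets(text, boundary=b"\n## "):
    """Compress, forcing a full flush (dictionary reset) at each
    likely-to-match boundary instead of every N KB of compressed output."""
    comp = zlib.compressobj(9, zlib.DEFLATED, 31)   # wbits=31 -> gzip wrapper
    out = []
    pieces = text.split(boundary)
    for i, piece in enumerate(pieces):
        chunk = piece if i == 0 else boundary + piece
        out.append(comp.compress(chunk))
        # Z_FULL_FLUSH resets the compressor state, so identical plaintext
        # after a matching boundary compresses to identical bytes.
        out.append(comp.flush(zlib.Z_FULL_FLUSH))
    out.append(comp.flush(zlib.Z_FINISH))
    return b"".join(out)

Because the compressor state is reset at each boundary, an unchanged
section of plaintext after a matching boundary produces an unchanged run of
compressed bytes, which is what gives rsync something to match on.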

> Your idea of having rsync actually do the checksums on the
> plaintext of compressed files might have some merit in
> future.  It would mean essentially that we would zcat the
> source twice and the destination would be ungzipped, merged
> and then regzipped.  Ghastly as far as CPU goes but would
> help save us network bandwidth which is growing at a lower
> rate.  The questions are, what is the mean offset of first
> change as a proportion of file size and are enough files
> gzipped to merit the effort?

This is what I believe xdelta already does for compressed files. The deltas
are computed on the uncompressed data of files that are identified as
compressed.

I don't see why you would zcat the source twice... 

The destination would ungzip the basis file, calculate and send a sig. The
source would zcat the target, feeding it through the rolling checksum,
looking for matches and sending the delta to the destination. The
destination would receive and apply the delta, then gzip the result.
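
Something like this (a rough Python sketch, all in one process for brevity;
in reality the signature and delta travel over the wire, and a real
implementation would use the weak rolling checksum as well as a strong one):

import gzip
import hashlib

BLOCK = 4096  # plaintext block size; just a stand-in for rsync's block size

def compute_signature(plain):
    """Destination side: per-block strong checksums of the ungzipped basis."""
    return {hashlib.md5(plain[i:i + BLOCK]).hexdigest(): i
            for i in range(0, len(plain), BLOCK)}

def compute_delta(signature, plain):
    """Source side: walk the zcat'd target and emit copy/literal instructions."""
    delta, i = [], 0
    while i < len(plain):
        block = plain[i:i + BLOCK]
        d = hashlib.md5(block).hexdigest()
        if d in signature:
            delta.append(("copy", signature[d], len(block)))
        else:
            delta.append(("literal", block))
        i += BLOCK
    return delta

def apply_delta(delta, basis_plain):
    """Destination side: rebuild the plaintext; the caller regzips it."""
    out = []
    for op in delta:
        if op[0] == "copy":
            _, offset, length = op
            out.append(basis_plain[offset:offset + length])
        else:
            out.append(op[1])
    return b"".join(out)

def sync(basis_gz, target_gz, out_gz):
    with gzip.open(basis_gz, "rb") as f:          # destination: ungzip basis
        basis_plain = f.read()
    sig = compute_signature(basis_plain)          # ...and send the signature
    with gzip.open(target_gz, "rb") as f:         # source: zcat the target once
        delta = compute_delta(sig, f.read())      # ...and send the delta
    with gzip.open(out_gz, "wb") as f:            # destination: merge and regzip
        f.write(apply_delta(delta, basis_plain))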

-- 
----------------------------------------------------------------------
ABO: finger abo at minkirri.apana.org.au for more info, including pgp key
----------------------------------------------------------------------



