Compressed backup

jw schultz jw at pegasys.ws
Sat Jun 1 17:26:02 EST 2002


On Sat, Jun 01, 2002 at 11:46:37PM +1000, Donovan Baarda wrote:
> On Sat, Jun 01, 2002 at 04:57:15AM -0700, jw schultz wrote:
> > On Sat, Jun 01, 2002 at 08:51:26PM +1000, Donovan Baarda wrote:
> > > On Fri, May 31, 2002 at 05:25:15PM -0700, jw schultz wrote:
> [...]
> > When I said "content-aware compressor", what I meant was
> > that the compressor would actually analyze the plaintext to
> > find semantically identifiable blocks.  For example, a large
> > HOWTO could be broken up by the level-2 headings.  This would
> > be largely (not always) consistent across plaintext changes
> > without requiring any awareness of file history.  Have rsync
> > be compression aware.  When rsync hits a gzipped file it
> > could treat that file as multiple streams in series where it
> > would restart the checksumming each time the compression
> > table is reset.
> 
> That's actually pretty clever... It provides a neat way around the "XML
> meta-data at the front changing" problem... just have the XML compression
> reset at suitable boundaries in the XML file...
> 
> You wouldn't need rsync to be compression aware. Provided the adjacent
> unchanged segments between compression resets were larger than the rsync
> blocksize, you would only miss block fragments at the beginning and end of
> each matching sequence (as rsync already does). However, by making rsync
> gzip-reset aware, you could do interesting things with variable block sizes,
> where the file itself specifies the rsync block boundaries. Hmm, more
> thought needed on how this would integrate with the rolling checksum,
> though... I suspect you could toss the rolling checksum and just look for
> matching blocks as defined by the resets, because it's not going to match at
> an arbitrary byte boundary anyway.
> 
> > I can't see this actually happening but it could work where
> > the compression is done by the application that creates the
> > file.  If, and only if, that were done for enough files
> > to be worthwhile, then rsync could be made
> > compression aware in this way, but that would require a
> > protocol change.
> 
> Putting content-aware compression resets at appropriate points will make
> files (a bit) rsyncable with rsync as it already stands. Making rsync
> compression-reset aware would only improve things a little bit. However, the
> resets _must_ be content-aware: they must occur on "likely to match"
> boundaries, not at every 4K of compressed text as the gzip-rsyncable patch
> currently does.

chunk: section of file gzipped without a reset.
block: section of file that the destination has checksummed.

What the rolling checksums buy us is the ability for subsequent
blocks to match at different offsets when a length change
occurs mid-file.
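
For anyone who hasn't stared at the algorithm recently, here is a
minimal sketch (Python, not rsync's actual code) of the weak rolling
checksum idea: once one window's sum is known, sliding the window a
byte costs almost nothing, which is what lets a block be recognised at
any byte offset after an insertion or deletion earlier in the file.

    # A minimal sketch of an rsync-style weak rolling checksum
    # (illustrative only; constants and details differ from rsync's
    # real implementation).

    def weak_sum(block):
        """Checksum of one window, computed from scratch."""
        s1 = sum(block) & 0xffff
        s2 = sum((len(block) - i) * b for i, b in enumerate(block)) & 0xffff
        return (s2 << 16) | s1

    def roll(old_sum, out_byte, in_byte, blocksize):
        """Slide the window one byte: drop out_byte, take in in_byte."""
        s1 = old_sum & 0xffff
        s2 = (old_sum >> 16) & 0xffff
        s1 = (s1 - out_byte + in_byte) & 0xffff
        s2 = (s2 - blocksize * out_byte + s1) & 0xffff
        return (s2 << 16) | s1

    # Sliding one byte gives the same value as recomputing from scratch.
    data = bytes(range(200)) * 3
    B = 64
    assert roll(weak_sum(data[0:B]), data[0], data[B], B) == weak_sum(data[1:B+1])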

Without gzip-reset-aware variable block sizes, rsync wouldn't
gain anything from a gzip reset, because the reset would land in
the middle of a block.  The rolling checksums would be droppable,
however, because matching blocks within a gzip chunk would not be
relocatable within that chunk.  For any gain, rsync blocks need to
start at block-aligned offsets within the gzip chunks.
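
To make the reset idea concrete, here is a rough sketch (Python with
the zlib module; it produces a bare zlib stream rather than a real
gzip file, and it is not something rsync does today).  A full flush
resets the compressor's state, so each chunk compresses independently,
and an unchanged plaintext chunk comes out byte-identical; that is
what stock rsync could match, and what a reset-aware rsync could align
its blocks to.

    import zlib

    def compress_chunks(chunks):
        """Compress chunks with a full flush (state reset) after each one."""
        comp = zlib.compressobj()
        pieces = [comp.compress(c) + comp.flush(zlib.Z_FULL_FLUSH) for c in chunks]
        tail = comp.flush()            # terminate the stream
        return pieces, tail

    # Pretend each chunk is one level-2 section of a HOWTO.
    old = [b"section A " * 500, b"section B " * 500, b"section C " * 500]
    new = [old[0], b"section B, edited " * 500, old[2]]

    old_pieces, _ = compress_chunks(old)
    new_pieces, _ = compress_chunks(new)

    # Unchanged sections compress to byte-identical output, so checksums
    # over the *compressed* file can still find them -- provided the
    # chunks are bigger than the block size, or the blocks are aligned
    # to the reset boundaries.
    assert old_pieces[0] == new_pieces[0]
    assert old_pieces[2] == new_pieces[2]
    assert old_pieces[1] != new_pieces[1]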

> 
> > Your idea of having rsync actually do the checksums on the
> > plaintext of compressed files might have some merit in
> > future.  It would mean essentially that we would zcat the
> > source twice and the destination would be ungzipped, merged
> > and then regzipped.  Ghastly as far as CPU goes, but it would
> > help save us network bandwidth, which is growing at a lower
> > rate.  The questions are, what is the mean offset of first
> > change as a proportion of file size and are enough files
> > gzipped to merit the effort?
> 
> This is what I believe xdelta already does for compressed files. The deltas
> are computed for the uncompressed data on files that are identified as
> compressed.
> 
> I don't see why you would zcat the source twice... 
> 
> The destination would ungzip the basis file, calculate and send a sig. The
> source would zcat the target, feeding it through the rolling checksum,
> looking for matches and sending the delta to the destination. The
> destination would receive and apply the delta, then gzip the result.

I misstated that slightly; it takes a while for
rsync's algorithm to sink in.  Let me rephrase.

The destination would zcat to generate block checksums; the
source would zcat and buffer to generate rolling checksums and
send deltas; the destination would then zcat again to merge the
deltas into the new file.
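
A rough sketch of those three passes (Python, standard library only;
matching is simplified to whole fixed-size, block-aligned blocks with
none of the rolling-checksum machinery, and the function names are
mine, not rsync's):

    import gzip, hashlib

    BLOCK = 8192

    def zcat_blocks(path):
        """Stream fixed-size plaintext blocks out of a gzipped file."""
        with gzip.open(path, "rb") as f:
            while True:
                block = f.read(BLOCK)
                if not block:
                    return
                yield block

    # Pass 1, destination: zcat the basis file and checksum its blocks.
    def make_signature(basis_gz):
        return {hashlib.md5(b).digest(): i
                for i, b in enumerate(zcat_blocks(basis_gz))}

    # Source: zcat the target, emit ("copy", index) for blocks the
    # destination already has and ("data", bytes) for everything else.
    def make_delta(target_gz, signature):
        for block in zcat_blocks(target_gz):
            i = signature.get(hashlib.md5(block).digest())
            yield ("copy", i) if i is not None else ("data", block)

    # Pass 2, destination: zcat the basis again, merge the delta, regzip.
    def apply_delta(basis_gz, delta, out_gz):
        basis = list(zcat_blocks(basis_gz))   # toy version: hold plaintext in memory
        with gzip.open(out_gz, "wb") as out:
            for kind, payload in delta:
                out.write(basis[payload] if kind == "copy" else payload)

    # The plaintext is only ever streamed or held in memory; no
    # uncompressed copy is written to disk.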

The first reason the destination zcats twice is that we might lack
the disk space to store the uncompressed files.  The only
reason to go to this much work for gzipped files is that
there are many large ones.  Therefore, we dare not leave
them around uncompressed even when rsync works on one
directory at a time.  The other reason is that the scanning
and generation processes should not alter the trees.

-- 
________________________________________________________________
	J.W. Schultz            Pegasystems Technologies
	email address:		jw at pegasys.ws

		Remember Cernan and Schmitt



