Compressed backup

jw schultz jw at pegasys.ws
Sat Jun 1 05:01:01 EST 2002


On Sat, Jun 01, 2002 at 08:51:26PM +1000, Donovan Baarda wrote:
> On Fri, May 31, 2002 at 05:25:15PM -0700, jw schultz wrote:
> > On Fri, May 31, 2002 at 11:45:43AM +1000, Donovan Baarda wrote:
> > > On Thu, May 30, 2002 at 03:35:05PM -0700, jw schultz wrote:
[...]
> > > I don't think it is possible to come up with a scheme where the reset
> > > windows could re-sync after a change and then stay sync'ed until the next
> > > change, unless you dynamically alter the compression at sync time... you may
> > > as well rsync the decompressed files.
> > 
> > The only way to do it is to make a content-aware compressor
> > that compresses large chunks and then pads the compresstext
> > to an aligned offset.  That would be too much waste to be a
> > good compression system.
> 
> even this wouldn't do it... the large chunks would have to be split on
> identical boundaries over unchanged uncompressedtext in the basis and the
> target. The only way this could be achieved would be if the target was
> compressed using resets on boundaries determined by analysing the changes and
> boundaries used when the basis was compressed. If the end that has the
> target file has that degree of intimate knowledge about the other end's basis
> file, then you can toss the whole rsync algorithm and revert to some sort of
> compressed xdelta.

I guess I wasn't clear enough, but that's OK because your
response made me think a bit more on the subject, so ignore
my idea of padding the compresstext blocks.

When I said "content-aware compressor" what I meant was
that the compressor would actually analyze the plaintext to
find semantically identifiable blocks.  For example, a large
HOWTO could be broken up at the level-2 headings.  This would
be largely (though not always) consistent across plaintext
changes without requiring any awareness of file history.
Then have rsync be compression-aware: when rsync hits a
gzipped file it could treat that file as multiple streams in
series, where it would restart the checksumming each time
the compression table is reset.
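
For illustration only: if the compressing application wrote the file
as a series of concatenated gzip members (one way of "resetting" the
table), the receiving side could checksum each member's plaintext
separately, roughly like the Python sketch below.  The function name
and details are made up for the example; this is not anything rsync
does today.

    import zlib, hashlib

    def member_checksums(path):
        # Checksum the plaintext of each gzip member in the file,
        # restarting the checksum at every member boundary.
        sums = []
        with open(path, "rb") as f:
            data = f.read()
        while data:
            d = zlib.decompressobj(wbits=31)   # 31: expect a gzip header/trailer
            plain = d.decompress(data)
            if not d.eof:                      # truncated member; give up
                break
            sums.append(hashlib.md5(plain).hexdigest())
            data = d.unused_data               # bytes belonging to later members
        return sums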

I can't see this actually happening, but it could work where
the compression is done by the application that creates the
file.  If, and only if, that were done widely enough to be
worthwhile, then rsync could be made compression-aware in
this way, but that would require a protocol change.

Your idea of having rsync actually do the checksums on the
plaintext of compressed files might have some merit in the
future.  It would mean essentially that we would zcat the
source twice, and the destination would be ungzipped, merged
and then regzipped.  Ghastly as far as CPU goes, but it would
help save network bandwidth, which is growing at a lower
rate.  The questions are: what is the mean offset of the
first change as a proportion of file size, and are enough
files gzipped to merit the effort?
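
As a rough sketch of what that receive path might look like (Python,
with apply_rsync_delta standing in as a purely hypothetical
placeholder for the normal checksum/merge step, not a real rsync
interface):

    import gzip

    def receive_compressed(basis_gz_path, apply_rsync_delta, out_gz_path):
        # Ungzip the destination's basis file, merge the plaintext delta
        # the sender built from its zcat'ed source, then regzip the result.
        with gzip.open(basis_gz_path, "rb") as f:
            basis_plain = f.read()
        merged_plain = apply_rsync_delta(basis_plain)   # hypothetical plaintext merge
        with gzip.open(out_gz_path, "wb") as f:
            f.write(merged_plain)

The CPU cost is the two gzip passes on the receiver plus the extra
zcat on the sender; the bandwidth saving is whatever the plaintext
delta buys over shipping the recompressed file whole.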


-- 
________________________________________________________________
	J.W. Schultz            Pegasystems Technologies
	email address:		jw at pegasys.ws

		Remember Cernan and Schmitt



