Compressed backup

jw schultz jw at pegasys.ws
Thu May 30 15:39:02 EST 2002


This whole discussion on the efficiency of rsyncing
pre-compressed files is probably pointless for Matthias
Munnich.  He is trying to do backups.  Therefore, he doesn't
want the originals compressed.

On Thu, May 30, 2002 at 03:45:16PM -0500, Dave Dykstra wrote:
> On Thu, May 23, 2002 at 04:03:56PM -0400, David Bolen wrote:
> > Matthias Munnich [munnich at atmos.ucla.edu] writes:
> > 
> > > No! Only the sender side has to compress the data. The comparison
> > > could be done in the compressed data format. With the -z option 
> > > the sender compresses the data anyway. The checksum test should
> > > be faster for the smaller compressed pieces.

Except that with the -z option only the changed blocks are
compressed.  The checksums are done at both ends on the
uncompressed files.
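
A minimal sketch of that division of labour, not rsync's actual code: checksums are taken over the uncompressed blocks, and only blocks the receiver lacks are compressed for the wire. The names `BLOCK`, `block_sums` and `make_delta` are hypothetical, SHA-256 stands in for rsync's own checksums, and the blocks are fixed and aligned, whereas real rsync uses a rolling checksum to find matches at any offset.

```python
import hashlib
import zlib

BLOCK = 4096  # hypothetical block size for this sketch


def block_sums(data: bytes) -> set:
    """Receiver side: checksums over the UNCOMPRESSED file, one per block."""
    return {
        hashlib.sha256(data[i:i + BLOCK]).digest()
        for i in range(0, len(data), BLOCK)
    }


def make_delta(new: bytes, known: set) -> list:
    """Sender side: reference blocks the receiver has, compress the rest."""
    delta = []
    for i in range(0, len(new), BLOCK):
        block = new[i:i + BLOCK]
        if hashlib.sha256(block).digest() in known:
            # Unchanged block: send only a reference, nothing to compress.
            delta.append(("match", i))
        else:
            # Changed block: this is the only data that gets compressed.
            delta.append(("data", zlib.compress(block)))
    return delta
```

Under these assumptions a file with one changed block yields one compressed "data" entry and a "match" entry for every other block.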

Storing the files blockwise compressed on the rsync server
would be a major departure from the current design.  I can see how
it would be done but I certainly don't want it.  With
--link-dest I'm getting 24 backups in the space of 3.6 in a
near-worst case, the equivalent of 73% compression that
even compounds on file compression.

> > 
> > Except that you'll probably end up retransmitting the whole thing due
> > to the change in compressed output.  Since a compression function is
> > essentially a data randomizer (the better the compression the better
> > the randomization of the output), tiny changes in input can result in
> > huge changes in output.  That's the traditional problem of trying to
> > use an algorithm like rsync's with compressed file formats.
> > 
> > You really need to apply the rsync algorithm to the uncompressed files
> > if you hope to gain any real efficiencies in terms of reduction of
> > traffic transmitted.

Myth: compression randomizes data.
Reality: gzip and other compression systems are
entirely deterministic; the output depends solely on the input data.
Unlike encryption, the output is in no way randomized.

For gzip, bzip, pkzip and most other general purpose
compressors a change in the plaintext will only alter the
compresstext representing that point until the compression
algorithm resets.  If a file is altered near the end the
compresstext of the earlier blocks will remain the same.
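
Both claims can be checked with a short experiment. This is only a sketch using Python's zlib module (deflate, the same algorithm gzip uses); the sample data and the helper `common_prefix_len` are made up for illustration. It compresses a file twice, then compresses a copy with only the last byte altered, and counts how many leading bytes of compressed output the two versions share.

```python
import hashlib
import zlib

# Deterministic, poorly compressible sample text (hypothetical test data).
original = "\n".join(
    hashlib.sha256(str(i).encode()).hexdigest() for i in range(20000)
).encode()
modified = original[:-1] + b"X"  # alter only the final byte


def common_prefix_len(a: bytes, b: bytes) -> int:
    """Count the leading bytes that two byte strings have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


c1 = zlib.compress(original)
c2 = zlib.compress(modified)

same_again = zlib.compress(original) == c1  # same input, same output
shared = common_prefix_len(c1, c2)          # compressed bytes left untouched

print("deterministic:", same_again)
print("shared prefix:", shared, "of", len(c1), "compressed bytes")
```

With the change at the very end, most of the compressed output should be shared between the two versions.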

> 
> There is a patch available to gzip to add an option --rsyncable that's
> supposed to make it work better with rsync.  It's been put into the
> "patches" directory for the next release of rsync, or you can get it at
> 
>     http://rsync.samba.org/ftp/unpacked/rsync/patches/gzip-rsyncable.diff

I took a quick look at this patch and I think it does what I expected.
It resets the compression algorithm after each 4KB of
compresstext.  This means that if you change 1 byte early in
the file it might or might not affect the blocks later on.
The reason for the equivocation is that if the change alters
the compression ratio, the savings are gone.
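
The effect of resetting the compressor can be sketched with Python's zlib. This is not the patch's method: as an assumption made for simplicity, the reset here happens after every `CHUNK` bytes of input rather than being tied to the compressed output, and `CHUNK` and `compress_with_resets` are hypothetical names. `Z_FULL_FLUSH` byte-aligns the output and discards the compressor's history, so each segment depends only on its own chunk.

```python
import zlib

CHUNK = 4096  # hypothetical reset interval, measured in INPUT bytes


def compress_with_resets(data: bytes) -> list:
    """Return one compressed segment per CHUNK, resetting between chunks."""
    # Raw deflate (wbits=-15): no header or whole-file checksum trailer.
    comp = zlib.compressobj(level=6, wbits=-15)
    segments = []
    for i in range(0, len(data), CHUNK):
        is_last = i + CHUNK >= len(data)
        seg = comp.compress(data[i:i + CHUNK])
        # Z_FULL_FLUSH resets the compression state; Z_FINISH ends the stream.
        seg += comp.flush(zlib.Z_FINISH if is_last else zlib.Z_FULL_FLUSH)
        segments.append(seg)
    return segments
```

With fixed input boundaries, a same-length change in the first chunk leaves every later segment byte-identical.  When the reset points depend on how much compressed output has been produced, a change that alters the compression ratio moves them, which is the caveat above.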

-- 
________________________________________________________________
	J.W. Schultz            Pegasystems Technologies
	email address:		jw at pegasys.ws

		Remember Cernan and Schmitt



