Compressed backup

jw schultz jw at pegasys.ws
Fri May 31 17:29:01 EST 2002


On Fri, May 31, 2002 at 11:45:43AM +1000, Donovan Baarda wrote:
> On Thu, May 30, 2002 at 03:35:05PM -0700, jw schultz wrote:
> [...]
> > > There is a patch available to gzip to add an option --rsyncable that's
> > > supposed to make it work better with rsync.  It's been put into the
> > > "patches" directory for the next release of rsync, or you can get it at
> > > 
> > >     http://rsync.samba.org/ftp/unpacked/rsync/patches/gzip-rsyncable.diff
> > 
> > I took a quick look at this patch and I think it does what I expected.
> > It resets the compression algorithm after each 4KB of
> > compresstext.  This means that if you change 1 byte early in
> > the file it might or might not affect the blocks later on.
> > The reason for the equivocation is that if the change alters
> > the compression ratio the savings are gone.
> 
> If that is how it works, and I think you are right, then it would only work
> for the smallest of cases, rendering the gzip-rsyncable patch worse than
> useless for the vast majority of cases.
> 
> Regular resets hurt the compression ratio. Resets must occur at the same
> begin/end boundary points of an unchanged sequence of uncompresstext for the
> resultant compresstext to be unchanged. The only changes that will result in
> resets occurring at the same boundary points for any unchanged text following
> the change _must_ result in compresstext that is an exact multiple of 4KB.
> This means any insertion/deletion/replacement must not change the size of
> the resulting compresstext unless it is by an exact multiple of 4KB.
> 
> I would guess that the number of changes meeting these criteria would be
> almost non-existent. I suspect that the gzip-rsyncable patch does nearly
> nothing except produce worse compression. It _might_ slightly increase the
> rsyncability up to the point where the first change in the uncompresstext
> occurs, but the chance of it re-syncing after that point would be extremely
> low.
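
The boundary argument quoted above can be tried out with a
small sketch.  It is not the gzip-rsyncable patch: it assumes
resets at fixed plaintext offsets (the SEGMENT constant and
both function names are made up here) because that is easy
to express with zlib's Z_FULL_FLUSH, whereas the discussion
above is about 4KB of compresstext.  The effect is the same
in kind: a same-length replacement leaves later pieces
untouched, while a one-byte insertion shifts every boundary.

```python
import zlib

SEGMENT = 4096  # assumed reset interval, counted in plaintext bytes


def compress_with_resets(data: bytes, segment: int = SEGMENT) -> list:
    """Compress as raw deflate, resetting state every `segment` bytes.

    Returns one compressed piece per plaintext segment.
    """
    # wbits=-15 selects raw deflate, so no header or checksum trailer
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)
    pieces = []
    for start in range(0, len(data), segment):
        piece = comp.compress(data[start:start + segment])
        # Z_FULL_FLUSH ends the block and drops the match history
        piece += comp.flush(zlib.Z_FULL_FLUSH)
        pieces.append(piece)
    if pieces:
        pieces[-1] += comp.flush()  # terminate the deflate stream
    return pieces


def unchanged_pieces(old: list, new: list) -> int:
    """Count compressed pieces of `new` that also occur in `old`."""
    seen = set(old)
    return sum(1 for piece in new if piece in seen)
```

Comparing the pieces of an original with those of an edited
copy shows which kind of edit survives: replacing a byte in
the first segment should leave all later pieces reusable,
while inserting a byte there should leave none.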

Actually many file modifications do just fine.  The key is
to recognize that any plaintext modification will alter the
compresstext from that point to the end, and that most
content modifications alter the blocks nearest the end of
the file.  Think about how you edit text and word processor
documents.
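
A rough way to see this with plain zlib (no rsyncable patch
involved) is to flip one bit in the plaintext and measure
how much of the compresstext stays identical from the start.
The sample data, the vocabulary and all function names below
are made up for illustration.

```python
import random
import zlib


def sample_text(n_words: int = 150_000, seed: int = 1) -> bytes:
    """Build reproducible pseudo-prose from a small made-up vocabulary."""
    rng = random.Random(seed)
    vocab = ["".join(rng.choice("abcdefghijklmnopqrstuvwxyz")
                     for _ in range(rng.randint(3, 10)))
             for _ in range(500)]
    return " ".join(rng.choice(vocab) for _ in range(n_words)).encode("ascii")


def shared_prefix(a: bytes, b: bytes) -> int:
    """Number of leading bytes that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def surviving_prefix(data: bytes, offset: int) -> int:
    """Flip one bit at `offset`; return how many leading compressed bytes match."""
    edited = bytearray(data)
    edited[offset] ^= 0x01
    return shared_prefix(zlib.compress(data), zlib.compress(bytes(edited)))
```

Calling surviving_prefix(data, 100) and then
surviving_prefix(data, len(data) - 100) on the same data
should show the contrast: an edit near the start leaves
almost nothing for rsync to match, an edit near the end
leaves most of the file.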

What this does bring up in my mind is a trend I see in data
formats: the use of compressed XML.  StarOffice/OpenOffice,
KOffice and I think several others are going this route.
Maintaining volatile meta-data at the beginning of their
files will defeat rsync's rolling checksums.  I'm not sure
how, but perhaps we could encourage the developers to
isolate the volatile meta-data at the end of the file or in
a fixed-size block at the beginning.  Otherwise a user
opening a file and changing the view-mode or fixing a single
typo in the last paragraph would alter the entire binary
file.
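
As a sketch of the fixed-size block idea, and not of any real
office file format: the layout, the META_SLOT size and the
pack/unpack names below are all assumptions.  The metadata
sits uncompressed in a reserved slot, and the body is
compressed on its own, so a metadata change cannot disturb
the compresstext that follows.

```python
import zlib

META_SLOT = 512  # assumed fixed size reserved for volatile metadata


def pack(metadata: bytes, body: bytes, slot: int = META_SLOT) -> bytes:
    """Store metadata uncompressed in a fixed-size slot, then the compressed body."""
    if len(metadata) > slot:
        raise ValueError("metadata does not fit in the reserved slot")
    # NUL padding keeps the body at the same offset whatever the metadata says
    return metadata.ljust(slot, b"\0") + zlib.compress(body)


def unpack(blob: bytes, slot: int = META_SLOT):
    """Return (metadata, body); assumes the metadata never ends in NUL bytes."""
    return blob[:slot].rstrip(b"\0"), zlib.decompress(blob[slot:])
```

With this layout, two files that differ only in view-mode
differ only inside the first META_SLOT bytes, and everything
after that offset is byte-for-byte the same.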

This trend will also affect several other aspects of systems
and network administration.  We are rapidly approaching a
day when most application files stored in home directories
and shared work areas will be compressed.  This means that
those areas will not benefit from network or filesystem
compression.  And our so-called 200GB tape drives, a figure
that assumes 2:1 compression, will barely exceed 1:1 and
hold only 100GB of these types of files.  I expect
non-application files to remain uncompressed for the
foreseeable future, but we should recognize that the
character of the data stored is changing in ways that
disrupt the assumptions many of our tools are built upon.
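
The 1:1 figure is easy to reproduce in miniature.  The sketch
below uses zlib as a stand-in for tape or filesystem
compression, which is an assumption; real drives use other
algorithms, but already-compressed input gives any of them
little to work with.  The sample text generator and the
function names are made up.

```python
import random
import zlib


def sample_text(n_words: int = 50_000, seed: int = 1) -> bytes:
    """Build reproducible pseudo-prose from a small made-up vocabulary."""
    rng = random.Random(seed)
    vocab = ["".join(rng.choice("abcdefghijklmnopqrstuvwxyz")
                     for _ in range(rng.randint(3, 10)))
             for _ in range(500)]
    return " ".join(rng.choice(vocab) for _ in range(n_words)).encode("ascii")


def ratio(data: bytes) -> float:
    """Compressed size divided by input size (lower means better compression)."""
    return len(zlib.compress(data)) / len(data)
```

Comparing ratio(sample_text()) with
ratio(zlib.compress(sample_text())) should show the first
pass shrinking the text noticeably and the second pass
gaining essentially nothing.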

> I tried to think of a way of doing this so that it would eventually re-sync,
> with things like resets every <some-prime> bytes so that the reset window
> moves, but the problem is the source and target reset windows must move
> together for it to work, so any scheme that moves the reset window into sync
> will also move the window _out_ of sync. 
> 
> I don't think it is possible to come up with a scheme where the reset
> windows could re-sync after a change and then stay sync'ed until the next
> change, unless you dynamically alter the compression at sync time... you may
> as well rsync the decompressed files.

The only way to do it is to make a content-aware compressor
that compresses large chunks and then pads the compresstext
to an aligned offset.  That would be too much waste to be a
good compression system.
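
A minimal sketch of what such a compressor might look like,
to make the waste concrete.  Everything here is assumed: the
4096-byte ALIGN, the use of a blank line as the content
boundary, and the function names.  Each chunk is compressed
independently and zero-padded so the next chunk starts on an
aligned offset.

```python
import zlib

ALIGN = 4096         # assumed alignment, e.g. chosen to match the rsync block size
SEPARATOR = b"\n\n"  # hypothetical content boundary: a blank line


def pack_aligned(plaintext: bytes, align: int = ALIGN) -> bytes:
    """Compress each content chunk on its own and zero-pad it to `align`."""
    out = bytearray()
    for chunk in plaintext.split(SEPARATOR):
        out += zlib.compress(chunk)
        out += b"\0" * (-len(out) % align)  # pad up to the next aligned offset
    return bytes(out)


def unpack_aligned(blob: bytes, align: int = ALIGN) -> bytes:
    """Invert pack_aligned()."""
    chunks = []
    pos = 0
    while pos < len(blob):
        d = zlib.decompressobj()
        chunks.append(d.decompress(blob[pos:]))
        end = len(blob) - len(d.unused_data)  # where this zlib stream stopped
        pos = end + (-end % align)            # skip the padding after it
    return SEPARATOR.join(chunks)


def padding_overhead(plaintext: bytes, align: int = ALIGN) -> float:
    """Fraction of the packed output that is padding."""
    packed = len(pack_aligned(plaintext, align))
    useful = sum(len(zlib.compress(c)) for c in plaintext.split(SEPARATOR))
    return (packed - useful) / packed
```

An edit confined to one chunk changes only that chunk's
aligned slots, so the rest of the file still matches.  The
price is the padding: each chunk wastes somewhere between
zero and ALIGN - 1 bytes, which is severe for small chunks
and only becomes tolerable when chunks are much larger than
the alignment.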



-- 
________________________________________________________________
	J.W. Schultz            Pegasystems Technologies
	email address:		jw at pegasys.ws

		Remember Cernan and Schmitt



