Compressed backup

Donovan Baarda abo at minkirri.apana.org.au
Sat Jun 1 03:55:01 EST 2002


On Fri, May 31, 2002 at 05:25:15PM -0700, jw schultz wrote:
> On Fri, May 31, 2002 at 11:45:43AM +1000, Donovan Baarda wrote:
> > On Thu, May 30, 2002 at 03:35:05PM -0700, jw schultz wrote:
[...]
> > I would guess that the number of changes meeting this criterion would be
> > almost non-existent. I suspect that the gzip-rsyncable patch does nearly
> > nothing except produce worse compression. It _might_ slightly increase the
> > rsyncability up to the point where the first change in the uncompressedtext
> > occurs, but the chance of it re-syncing after that point would be extremely
> > low.
> 
> Actually many file modifications do just fine.  The key
> is to recognize that any plaintext modification will
> alter the compresstext from that point to the end.
> Most content modifications alter the blocks nearest the end
> of the file.  Think about how you edit text and word
> processor documents.

So it is not possible for rsync to get any matches on a gzip-rsyncable
compressed file after the first modification. Does the gzip-rsyncable patch
actually improve the rsyncability of compressed files at all? AFAICT, files
compressed normally should be pretty much rsyncable up to the same point.
Resetting the compression every 4K probably does allow you to rsync closer up
to that point, but only because the resets make the compression less
efficient... ie any savings from matching closer to the modification are
lost because of the overall larger file.
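
To make that tradeoff concrete, here's a rough Python sketch. Z_FULL_FLUSH
stands in for the patch's resets, and the fixed 4K interval is just assumed
from the discussion above; this is purely illustrative, not what
gzip-rsyncable actually does internally:

    import zlib

    BLOCK = 4096  # reset interval assumed above

    def compress_plain(data):
        # One continuous deflate stream: history spans the whole file.
        return zlib.compress(data)

    def compress_with_resets(data, block=BLOCK):
        # Z_FULL_FLUSH discards the compressor's history at each block
        # boundary, so compresstext after a reset depends only on the
        # plaintext after that reset.
        co = zlib.compressobj()
        out = bytearray()
        for i in range(0, len(data), block):
            out += co.compress(data[i:i + block])
            out += co.flush(zlib.Z_FULL_FLUSH)
        out += co.flush()
        return bytes(out)

    def common_prefix(a, b):
        n = 0
        while n < min(len(a), len(b)) and a[n] == b[n]:
            n += 1
        return n

    basis = b"the quick brown fox jumps over the lazy dog\n" * 2000
    target = basis[:100] + b"!" + basis[100:]  # one byte inserted near the start

    for name, fn in (("plain ", compress_plain), ("resets", compress_with_resets)):
        cb, ct = fn(basis), fn(target)
        print(name, "size:", len(cb), "matching prefix:", common_prefix(cb, ct))

Both compressed streams stop matching around the insertion, and the resets
version comes out larger; the difference is the compression you gave up.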

[...]
> This trend will also affect several other aspects of systems
> and network administration.  We are rapidly approaching a
> day when most application files stored in home directories
> and shared work areas will be compressed.  This means
> that those areas will not benefit from network or filesystem
> compression.  And our so-called 200GB tape drives will
> barely exceed 1:1 compression and only hold 100GB of these
> types of files.  I expect non-application files to remain
> uncompressed for the foreseeable future, but we should
> recognize that the character of the data stored is changing
> in ways that disrupt the assumptions many of our tools are
> built upon.

I think that the increased use of compressed files is going to require that
rsync-like tools become compression aware, and be smart enough to
decompress/recompress files when syncing them. I see no way around it, other
than throwing heaps of bandwidth at the problem :-). Needless to say, the
extra decompression work will make the CPU load on servers even worse.
However, server-side signature caching and client-side delta calculation
would probably end up making the overall load on servers even lower than it
currently is.
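
As a sketch of what "compression aware" might look like, assuming gzip files
and rsync-style per-block signatures (the block size, MD5, and all the helper
names here are made up for illustration): the server publishes signatures of
the uncompressed content, which it could compute once and cache, and the
client computes the delta against its own uncompressed copy:

    import gzip, hashlib

    BLOCK = 700  # rsync's traditional default block size

    def signatures(data, block=BLOCK):
        # Server side: per-block checksums of the *uncompressed* content.
        return [hashlib.md5(data[i:i + block]).digest()
                for i in range(0, len(data), block)]

    def delta(target, sigs, block=BLOCK):
        # Client side: emit a block reference where the target matches the
        # basis, literal data elsewhere.  (Real rsync uses a rolling weak
        # checksum to find matches at any offset; this only checks aligned
        # blocks, to keep the sketch short.)
        lookup = dict((sig, n) for n, sig in enumerate(sigs))
        ops = []
        for i in range(0, len(target), block):
            chunk = target[i:i + block]
            n = lookup.get(hashlib.md5(chunk).digest())
            ops.append(("copy", n) if n is not None else ("data", chunk))
        return ops

    def sync_compressed(basis_gz, target_gz):
        # The whole point: run the algorithm over the plaintext, not the
        # compresstext, and recompress on the receiving end.
        return delta(gzip.decompress(target_gz),
                     signatures(gzip.decompress(basis_gz)))

The catch is that recompressing on the far end won't reproduce the original
compresstext byte for byte unless both ends use identical compressor
settings, which matters if you want the synced copies to be bit-identical.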

[...]
> > I don't think it is possible to come up with a scheme where the reset
> > windows could re-sync after a change and then stay sync'ed until the next
> > change, unless you dynamically alter the compression at sync time... you may
> > as well rsync the decompressed files.
> 
> The only way to do it is to make a content-aware compressor
> that compresses large chunks and then pads the compresstext
> to an aligned offset.  That would be too much waste to be a
> good compression system.

Even this wouldn't do it... the large chunks would have to be split on
identical boundaries over unchanged uncompressedtext in the basis and the
target. The only way this could be achieved would be if the target was
compressed using resets on boundaries determined by analysing the changes
and the boundaries used when the basis was compressed. If the end that has
the target file has that degree of intimate knowledge of the other end's
basis file, then you can toss the whole rsync algorithm and revert to some
sort of compressed xdelta.
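
As a toy illustration of why fixed boundaries fail (chunk and slot sizes
invented for the sketch): compressing fixed plaintext chunks into padded
slots survives a same-length edit, but a single inserted byte shifts every
later chunk boundary, so every slot from the edit onward changes:

    import zlib

    CHUNK = 32768   # plaintext chunk size
    SLOT = 16384    # fixed slot each chunk's compresstext is padded to

    def padded_compress(data):
        # Compress each chunk independently and pad to a fixed slot, so
        # slot N always starts at byte N * SLOT no matter how well the
        # earlier chunks compressed.  The padding is the waste.
        out = bytearray()
        for i in range(0, len(data), CHUNK):
            c = zlib.compress(data[i:i + CHUNK])
            assert len(c) <= SLOT  # an incompressible chunk would overflow
            out += c.ljust(SLOT, b"\0")
        return bytes(out)

    def changed_slots(a, b):
        return sum(1 for i in range(0, min(len(a), len(b)), SLOT)
                   if a[i:i + SLOT] != b[i:i + SLOT])

    basis = b"some mostly unchanging file contents\n" * 5000

    subst = bytearray(basis)
    subst[50:53] = b"XXX"                    # same-length edit: boundaries hold
    insert = basis[:50] + b"X" + basis[50:]  # one inserted byte: later ones shift

    cb = padded_compress(basis)
    print("substitution changes", changed_slots(cb, padded_compress(bytes(subst))), "slot(s)")
    print("insertion changes", changed_slots(cb, padded_compress(insert)), "slot(s)")

And the padding is pure overhead on top of whatever the compressor achieves,
which is the waste referred to above.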

-- 
----------------------------------------------------------------------
ABO: finger abo at minkirri.apana.org.au for more info, including pgp key
----------------------------------------------------------------------