Compressed backup

Donovan Baarda abo at minkirri.apana.org.au
Sat Jun 1 22:09:01 EST 2002


On Sat, Jun 01, 2002 at 05:18:42PM -0700, jw schultz wrote:
> On Sat, Jun 01, 2002 at 11:46:37PM +1000, Donovan Baarda wrote:
> > On Sat, Jun 01, 2002 at 04:57:15AM -0700, jw schultz wrote:
> > > On Sat, Jun 01, 2002 at 08:51:26PM +1000, Donovan Baarda wrote:
> > > > On Fri, May 31, 2002 at 05:25:15PM -0700, jw schultz wrote:
[...]
> > putting content-aware compression resets at appropriate points will make
> > files (a bit) rsyncable with rsync as it already stands. Making rsync
> > compression-reset aware would only improve things a little bit. However, the
> > resets _must_ be content-aware, they must occur on "likely to match"
> > boundaries, not at every 4K of compressed text as the gzip-rsyncable patch
> > currently does.
> 
> chunk: section of file gzipped without reset.
> block: section of file that dest has checksummed.

Good terminology :-) Let me add:

basis : the old file on the destination that is being updated to the target.
target: the new file on the source.

> The buy-back of the rolling checksums is to allow subsequent
> blocks to match at different offsets when a length change
> occurs mid-file.
> 
> Without gzip-reset-aware variable block sizes rsync
> wouldn't gain anything from a gzip reset because the reset

Yes, it would...

> would occur in the middle of a block.  The rolling checksums
> would be droppable, however, because matching blocks within a
> gzip chunk would not be relocatable within that chunk.

Yes, it would, provided the block size is smaller than the compressed size of
a sequence of unchanged adjacent chunks. The first and last block fragments
of the matching sequence of chunks would be lost, but all the blocks in the
middle would match. The rolling checksum would re-sync at the first complete
matching block.

Provided the context-sensitive location of compression resets ensures they
occur at the same points in the target and basis around unchanged chunks,
rsync will re-sync and find matches after any change, as long as the
matching sequence of chunks is larger than the block size.
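
To make that re-sync concrete, here's a toy Python sketch (the block size
and helper names are made up for the demo; real rsync rolls its weak
checksum incrementally in O(1) per byte and confirms candidate matches
with a strong checksum, both of which this toy skips):

    import zlib

    BLOCK = 512  # hypothetical block size, chosen for the demo

    def basis_sigs(basis):
        # weak checksum of every whole block, at fixed offsets
        return {zlib.adler32(basis[i:i + BLOCK]): i
                for i in range(0, len(basis) - BLOCK + 1, BLOCK)}

    def match_blocks(target, sigs):
        # slide over the target a byte at a time; after an insertion
        # shifts the data, matching resumes at the first complete
        # block, losing only the fragments at the edges of the change
        i, matches = 0, []
        while i + BLOCK <= len(target):
            if zlib.adler32(target[i:i + BLOCK]) in sigs:
                matches.append(i)   # whole block matched at any offset
                i += BLOCK
            else:
                i += 1              # no match: roll forward one byte
        return matches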

> For any gain rsync blocks need to have block-aligned offsets
> within the gzip chunks.

There are two things block-aligned offsets in gzip chunks buy you: slightly
better sigs for the basis, and slightly fewer missed block fragments in the
delta. The reason you get better sigs is that any block with a reset in the
middle has a useless checksum in the sig unless the chunks on either side
of the reset are both unchanged. The reason you get a slightly better delta
is that you avoid missing the block fragments at the beginning and end of a
matching sequence of chunks.

Both of these effects are related, and negligible provided the block size is
small compared to the size of matching sequences of chunks. These problems
already apply to rsync on uncompressed files, and are the reason xdelta can
produce smaller deltas. Attempting to align compressor resets with block
boundaries is akin to tweaking block sizes on uncompressed data; you can
improve things a little, but the effort vs return is only really worth it
for very special cases. If you align resets to blocks at the expense of
locating them contextually, you will just make things worse.

> The first reason the destination zcats twice is that we might lack
> the disk space to store the uncompressed files.  The only
> reason to go to this much work for gzipped files is because
> there are many large ones.  Therefore, we dare not leave
> them around uncompressed even when rsync works on one
> directory at a time.  The other reason is that the scanning
> and generation processes should not alter the trees.

I don't think you can efficiently get away with just using zcat when merging
deltas. The reason is that the basis has to be seekable... because rsync can
find and use matches in files that have been re-ordered, you sometimes have
to seek backwards. The only way I can see to 'seek' without a full
decompression is to restart decompression every time you seek backwards. You
can sort of improve this by preserving the decompressor state at various
points so you can start seeking from there... I see a "seekable-zcat" class
in my mind already :-)
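
Something like this rough Python sketch is what I'm imagining (all names
are my own invention, it assumes a raw zlib stream rather than a real
gzip file with header and trailer, and it leans on being able to copy
the decompressor's state):

    import zlib

    class SeekableZcat:
        # Decompress with random access: save decompressor snapshots
        # at intervals, so a backward seek restarts from the nearest
        # checkpoint instead of from the start of the stream.
        INTERVAL = 1 << 20  # checkpoint every ~1MB of output

        def __init__(self, compressed):
            self.data = compressed
            # uncompressed offset -> (compressed offset, state copy)
            self.ckpts = {0: (0, zlib.decompressobj())}
            self.pos, self.cpos = 0, 0
            self.d = self.ckpts[0][1].copy()

        def seek(self, target):
            if target < self.pos:
                # restart from the nearest checkpoint at or before target
                base = max(off for off in self.ckpts if off <= target)
                self.cpos, state = self.ckpts[base]
                self.pos, self.d = base, state.copy()
            self.read(target - self.pos)  # decompress forward, discard

        def read(self, n):
            out = b""
            while len(out) < n and self.cpos < len(self.data):
                fed = self.data[self.cpos:self.cpos + 65536]
                chunk = self.d.decompress(fed, n - len(out))
                self.cpos += len(fed) - len(self.d.unconsumed_tail)
                self.pos += len(chunk)
                out += chunk
                if self.pos - max(self.ckpts) >= self.INTERVAL:
                    self.ckpts[self.pos] = (self.cpos, self.d.copy())
            return out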

Conclusion:

I don't think gzip-rsyncable helps much at all, certainly not enough to
warrant including it in the official gzip (figures that demonstrate
otherwise are welcome).

Context-sensitive location of compression resets, by applications that try
to place resets at the same points around unchanged chunks, will result in
rsyncable compressed files (Compressed-XML'ers take note: a compression
reset after the meta-data section and at significant points in the XML
could make a huge difference).
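
For a feel of what that looks like, here's a small Python sketch using
zlib full flushes as resets (the record boundaries are invented, and a
real writer would finish the stream with a final normal flush, which
I've left out):

    import zlib

    def compress_with_resets(records):
        # a full flush at each record boundary resets the compressor,
        # so an unchanged record always compresses to the same bytes
        co = zlib.compressobj()
        return [co.compress(r) + co.flush(zlib.Z_FULL_FLUSH)
                for r in records]

    old = compress_with_resets([b"<meta>v1</meta>", b"<body/>" * 500])
    new = compress_with_resets([b"<meta>v2</meta>", b"<body/>" * 500])

    assert old[0] != new[0]  # the changed record differs, as expected
    assert old[1] == new[1]  # the unchanged record is byte-identical,
                             # so it is matchable in the compressed stream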

Application-specific rsync-alike updates of compressed data with contextual
location of compression resets will get better results if they toss the
rsync rolling checksum and synchronise using sigs for 'chunks' rather than
rsync's 'blocks'.
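
A minimal sketch of that chunk-based approach (all names here are
hypothetical, and a whole-chunk strong hash stands in for rsync's
weak+strong checksum pair, since known chunk boundaries do the
synchronising for you):

    import hashlib

    def chunk_sigs(chunks):
        # signature = one strong hash per uncompressed chunk
        return [hashlib.sha1(c).digest() for c in chunks]

    def make_delta(new_chunks, basis_sigs):
        # reference chunks the basis already has, send the rest whole
        have = set(basis_sigs)
        delta = []
        for c in new_chunks:
            h = hashlib.sha1(c).digest()
            delta.append(("ref", h) if h in have else ("lit", c))
        return delta

    def apply_delta(delta, basis_chunks):
        by_hash = {hashlib.sha1(c).digest(): c for c in basis_chunks}
        return b"".join(by_hash[x] if op == "ref" else x
                        for op, x in delta)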

The effort vs return of accurately locating compression resets contextually
around unchanged 'chunks' depends heavily on the type of data. The best
generic solution is to rsync the uncompressed data.

-- 
----------------------------------------------------------------------
ABO: finger abo at minkirri.apana.org.au for more info, including pgp key
----------------------------------------------------------------------
