Compressed backup

Kevin Easton s3159795 at student.anu.edu.au
Tue Jun 4 00:48:42 EST 2002


> On Sat, Jun 01, 2002 at 05:18:42PM -0700, jw schultz wrote:
> > On Sat, Jun 01, 2002 at 11:46:37PM +1000, Donovan Baarda wrote:
> > > On Sat, Jun 01, 2002 at 04:57:15AM -0700, jw schultz wrote:
> > > > On Sat, Jun 01, 2002 at 08:51:26PM +1000, Donovan Baarda wrote:
> > > > > On Fri, May 31, 2002 at 05:25:15PM -0700, jw schultz wrote:
> [...]
> > > putting content-aware compression resets at appropriate points will make
> > > files (a bit) rsyncable with rsync as it already stands. Making rsync
> > > compression-reset aware would only improve things a little bit. However, the
> > > resets _must_ be content-aware, they must occur on "likely to match"
> > > boundaries, not at every 4K of compressedtext as the gzip-rsyncable patch
> > > currently does.
> > 
> > chunk: section of file gzipped without reset.
> > block: section of file that dest has checksummed.
> 
> good terminology :-) let me add;
> 
> basis : the oldfile on the destination that is being updated to the target
> target: the newfile on the source.
> 
> > The buy-back of the rolling checksums is to allow subsequent
> > blocks to match at different offsets when a length change
> > occurs mid-file.
> > 
> > Without gzip-reset-aware variable block sizes rsync
> > wouldn't gain anything from a gzip reset because the reset
> 
> yes it would...
> 
> > would occur in the middle of a block.  The rolling checksums
> > would be droppable, however, because matching blocks within a
> > gzip chunk would not be relocatable within that chunk.
> 
> yes it would, provided the block size is smaller than the compressed size of
> a sequence of unchanged adjacent chunks. The first and last block fragments
> of the matching sequence of chunks would be lost, but all the blocks in the
> middle would match. The rolling checksum would re-sync at the first complete
> matching block.
> 
> Provided the context-sensitive location of compression resets ensures they
> occur at the same points in the target and basis around unchanged chunks,
> rsync will re-sync and find matches after any change, provided the
> matching sequence of chunks is larger than the blocksize.
> 
> > For any gain rsync blocks need to have block-aligned offsets
> > within the gzip chunks.
> 
> There are two things block-aligned offsets in gzip chunks buy you: slightly
> better sigs for the basis, and slightly fewer missed block fragments in the
> delta. The reason you get better sigs is any block with a reset in the
> middle has a useless checksum in the sig unless both chunks on either side
> of the reset are unchanged. The reason you get a slightly better delta is
> you avoid missing the block fragment at the beginning and end of a matching
> sequence of chunks. 
> 
> Both of these are related and negligible provided the block size is small
> compared to the size of matching sequences of chunks. These problems already
> apply to rsync on uncompressed files, and are the reason xdelta can produce
> more optimal deltas. Attempting to align compressor resets with block
> boundaries is akin to tweaking block sizes on uncompressed data; you can
> improve things a little, but the effort vs return is only really worth it
> for very special cases. If you align resets to blocks at the expense of
> locating them contextually, you will just make things worse.
> 
> > The first reason the destination zcats twice is that we might lack
> > the disk space to store the uncompressed files.  The only
> reason to go to this much work for gzipped files is because
> > there are many large ones.  Therefore, we dare not leave
> > them around uncompressed even when rsync works on one
> > directory at a time.  The other reason is that the scanning
> > and generation processes should not alter the trees.
> 
> I don't think you can efficiently get away with just using zcat when merging
> deltas. The reason is the basis has to be seekable... because rsync can find
> and use matches in files that have been re-ordered, you sometimes have to
> seek backwards. The only way I can see to 'seek' without a full
> decompression is to restart decompression every time you seek backwards. You
> can sort of improve this by preserving the decompressor state at various
> points so you can start seeking from there... I see a "seekable-zcat" class
> in my mind already :-)
> 
> conclusion:
> 
> I don't think gzip-rsyncable helps much at all, certainly not enough to
> warrant including it in the official gzip (figures that demonstrate
> otherwise are welcome).
> 
> Context sensitive location of compression resets by applications that try to
> place resets at the same points around unchanged chunks will result in
> rsyncable compressed files (Compressed-XML'ers take note, a compression
> reset after the meta-data section and at significant points in the XML
> could make a huge difference).
> 
> Application specific rsync-alike updates of compressed data with contextual
> location of compression resets will get better results if they toss the
> rsync rolling checksum and synchronise using sigs for 'chunks' rather than
> rsync's 'blocks'.
> 
> The effort vs return of accurately locating compression resets contextually
> around unchanged 'chunks' depends heavily on the type of data. The best
> generic solution is to rsync the uncompressed data.
> 
> -- 
> ----------------------------------------------------------------------
> ABO: finger abo at minkirri.apana.org.au for more info, including pgp key
> ----------------------------------------------------------------------
> 

If you'll indulge me, I'll just restate the problem (as I see it, anyway) 
before chiming in with my idea...

Basically, if any change is made to a source file which is then gzipped, the 
gzipped version (the compressedtext) will have completely changed (relative to
the gzipped version of the original) after the point where the change in the
source file is encountered.  This means rsync will never find a matching block
after that point.
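
You can see the effect for yourself with a couple of lines of Python and
zlib (just an illustrative sketch of the problem, not anything from gzip
itself; the exact numbers will vary with the data and compression level):

    import zlib

    original = b"header line\n" + b"the rest of the file is unchanged\n" * 1000
    modified = b"HEADER LINE\n" + b"the rest of the file is unchanged\n" * 1000

    a = zlib.compress(original)
    b = zlib.compress(modified)

    # Length of the identical prefix of the two compressed streams.
    prefix = next((i for i, (x, y) in enumerate(zip(a, b)) if x != y),
                  min(len(a), len(b)))
    print(len(a), len(b), prefix)
    # The shared prefix is tiny -- essentially just the stream header.
    # Past the change the two compressed streams have nothing in common,
    # so rsync finds no matching blocks.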

One idea is to reset the Huffman tree occasionally while gzipping, because if
the source text is the same after such a reset, the compressed text will be
as well.  Clearly, these resets must happen in the same place relative to the
unchanged text when compressing both the original and the modified source 
files, and as we do not have access to the original source file when we are 
compressing the modified source file, this presents somewhat of a challenge.
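
As a sanity check that a reset really does make the compressed text line up
again, here is a small Python/zlib sketch (zlib's Z_FULL_FLUSH ends the
current Huffman-coded block and empties the history window, which is the
kind of reset being discussed; raw deflate is used so the stream-wide
checksum doesn't get in the way of the comparison):

    import zlib

    def compress_parts_with_resets(parts):
        # Compress each part, doing a full compression reset at each part
        # boundary; returns one compressed chunk per part.
        c = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)  # raw deflate
        chunks = []
        for part in parts:
            chunks.append(c.compress(part) + c.flush(zlib.Z_FULL_FLUSH))
        c.flush()   # terminate the stream (output not needed here)
        return chunks

    tail = b"this part of the file is unchanged\n" * 500
    a = compress_parts_with_resets([b"old header\n", tail])
    b = compress_parts_with_resets([b"completely different new header\n", tail])

    assert a[0] != b[0]  # the changed part still compresses differently
    assert a[1] == b[1]  # after the reset, identical input gives identical bytes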

Performing such a compression reset every N bytes of compressed text will
clearly not work very well at all, because a change in the source text will 
usually change the number of bytes of compressed text output, and therefore the
compression resets when compressing the original and modified files will not
happen in the same place relative to the unchanged text.

(This is what JW stated the "gzip-rsyncable" patch does in an earlier post
in this thread - however, this *isn't* what the patch actually appears to do).

Performing such a compression reset every N bytes of *source* text will work
a little better - if only modifications, rather than additions or deletions,
are made to the file, then the resets will remain synchronised with respect to
the unchanged plaintext.  However, this still only represents a minority of the
changes that rsync might be expected to encounter, so it's also not a great 
solution.

(This isn't what the "gzip-rsyncable" patch does either).

Clearly, as the compression resets need to be performed in the same place 
relative to the unchanged data in the source file, some method of deciding when
to reset based on the source data should be used.  This is what you guys were
getting at - using some knowledge of the source data format to perform 
compression resets at data-relative boundaries (like the end of a header, or
the start of a section).  However, it should be possible to make this decision
in a manner that is independent of the actual format of the data.

My idea was that you could choose some heuristic that will be true for 
approximately 1 in N of the bytes in any file, and is calculated from a small 
window of the data in the file (so that it is position-independent, but 
data-dependent).  You simply perform a compression reset whenever this 
heuristic is true (and N should be a number that's large enough so that you
don't reset too often, yet small enough so that rsync can find the most
matching data).  I was thinking a heuristic along the lines of "the least 
significant bits of the previous 12 bytes are 101111010101" (a number derived
from my initials), giving an N of 4096.  Because the heuristic looks at
nothing older than the previous 12 bytes, the compression resets would sync
up again once 12 consecutive unchanged bytes follow the change in the
source text.
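
To make that concrete, the trigger could look roughly like this (purely a
sketch; the function name and the bit ordering of the pattern are arbitrary
choices of mine):

    def reset_offsets(data, pattern=0b101111010101, window=12):
        # Trigger a compression reset wherever the low bit of each of the
        # previous `window` bytes, read as a `window`-bit number, equals
        # `pattern`.  On data with roughly random low bits this fires about
        # once per 2**window (4096) bytes, and since it depends only on the
        # last `window` bytes, two versions of a file fire at the same
        # content positions once `window` unchanged bytes follow a change.
        mask = (1 << window) - 1
        state = 0
        offsets = []
        for i, byte in enumerate(data):
            state = ((state << 1) | (byte & 1)) & mask
            if i + 1 >= window and state == pattern:
                offsets.append(i + 1)   # reset after this byte
        return offsets

A compressor would then do a full flush (as in the zlib sketch above) after
each returned offset.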

When I finally took the time to properly read Rusty's "gzip-rsyncable" patch[1]
while writing this mail, I discovered that it appears to use this same general
technique, although the heuristic he has chosen is "the sum of the previous
4096 bytes of source text is divisible by 4096".  I think my heuristic could 
allow the gzips to sync up faster (after 12 identical bytes of source text 
following a change, compared to 4096), whilst still having the same compression
ratio hit (resets every 4096 bytes, on average), but I've only come up with
this today so I haven't done as much thinking on the subject as Rusty - 
likely there is some "gotcha" that I haven't seen.
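
For comparison, the heuristic as I've described it can be sketched as a
rolling sum over a 4096-byte window (again, just an illustration of the
idea, not the patch's actual code):

    def rolling_sum_reset_offsets(data, window=4096):
        # Fire wherever the sum of the previous `window` bytes is divisible
        # by `window`; maintained as a rolling sum so each byte costs O(1).
        total = 0
        offsets = []
        for i, byte in enumerate(data):
            total += byte
            if i >= window:
                total -= data[i - window]   # drop the byte leaving the window
            if i + 1 >= window and total % window == 0:
                offsets.append(i + 1)
        return offsets

Because this window is 4096 bytes rather than 12, the two compressions only
start firing at the same positions once 4096 unchanged bytes have gone past
a change, which is the difference I mean above.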

What do you think?

	- Kevin Easton.

[1] http://rsync.samba.org/ftp/unpacked/rsync/patches/gzip-rsyncable.diff





