[PATCH] Compressed output files

Joel Votaw jovotaw at cs.nmsu.edu
Wed Jul 3 10:02:02 EST 2002


On Wed, 3 Jul 2002, Ph. Marek wrote:

> The GZIP Standard http://www.faqs.org/rfcs/rfc1952.html defines
> the field ISIZE:
> 	This contains the size of the original (uncompressed) input
> 	data modulo 2^32.
> I'd expect that zlib sets that data and has a way to read this?

[...]

> BTW: we might even save the MD4 checksum of the original file
> in a gzip field. See the RFC:
> 	XLEN (eXtra LENgth)
> 		If FLG.FEXTRA is set, this gives the length of the optional
> 		extra field.  See below for details.

I used the quick-and-dirty gzopen(), gzwrite(), gzread() and gzclose(),
which don't have these features.  However, there may be an additional
function you can call to set this kind of data, or the more full-featured
zlib functions may support it.  I'll take a look.
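For what it's worth, ISIZE is easy to get at even without zlib's help:
RFC 1952 puts it in the last four bytes of the member, least significant
byte first, right after the CRC32.  Assuming the file holds a single gzip
member (concatenated members would report only the last one), a rough
sketch:

    #include <stdio.h>
    #include <stdint.h>

    /* Read ISIZE (uncompressed length mod 2^32) from the gzip
     * trailer: ..., CRC32 (4 bytes), ISIZE (4 bytes, LSB first). */
    static uint32_t gz_isize(const char *path)
    {
        FILE *f = fopen(path, "rb");
        unsigned char t[4];
        uint32_t n = 0;

        if (f == NULL)
            return 0;
        if (fseek(f, -4L, SEEK_END) == 0 && fread(t, 1, 4, f) == 4)
            n = (uint32_t)t[0] | ((uint32_t)t[1] << 8)
              | ((uint32_t)t[2] << 16) | ((uint32_t)t[3] << 24);
        fclose(f);
        return n;
    }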

There are several extra attributes it would be nice to store for our
gzip'd files: the original uncompressed file size (without the 2^32
limit), the whole-file md4sum, md4sums for different chunks of the file,
owner, permissions, modification time, etc.  In short, everything we need
to know 1. whether the file has changed, 2. how to update it most
efficiently if it has, and 3. how to restore it to its original state.
These attributes would be useful if, say, you wanted to create an archive
that preserved ownership but couldn't use tar and didn't have root
permissions on the machine where the archive was stored.
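Concretely, the record for each file might look something like this
(field list taken straight from the paragraph above; the layout and
names are just a guess):

    #include <stdint.h>
    #include <sys/types.h>
    #include <time.h>

    /* Illustrative only: per-file metadata we'd want to keep
     * alongside (or inside) the compressed file. */
    struct file_attrs {
        uint64_t      orig_size;    /* true size, no 2^32 wrap */
        unsigned char md4[16];      /* whole-file md4sum */
        /* per-chunk md4sums would go here, tagged with their
         * blocksize, since that can vary per invocation */
        uid_t         owner;
        gid_t         group;
        mode_t        permissions;
        time_t        mtime;
    };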

I see three ways of doing this:

1. Add extra data to the compressed file using FLG.FEXTRA (see the sketch
after this list).  We might contribute code to read/write this extra data
back to the zlib project in case other people have similar needs.

2. Extend the gzip standard... hrmmm...

3. Have an "adjunct" file of some sort that contains the meta-information
for all files that rsync has copied.  Rsync could also use this file
instead of scanning the directory structure when starting up, and there
could be rsync options to "generate adjunct file from what's on disk" and
"apply permissions etc. from adjunct file to what's on disk".


> Maybe even the checksums of the individual blocks - but that depends on
> the blocksize (which can vary with every invocation), and needs some space.
> If there were a way to decompress only from the middle of the file it
> could make sense, as we wouldn't have to unzip the complete file just to
> send the checksums over the wire ...

I agree this would be nice to have, but I can't think of an easy way to do
it.  There is a function to seek through a gzFile (gzseek(), in gzio.c),
but I believe it just uncompresses the file behind your back.
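Until something better exists, the only option is to stream the whole
file through gzread() and checksum fixed-size blocks of the uncompressed
data as they come out; checksum_block() below is a stand-in for rsync's
real rolling-sum/MD4 code:

    #include <zlib.h>

    #define BLOCK_LEN 700   /* picked arbitrarily for the sketch */

    /* Hypothetical hook; rsync would compute its block sums here. */
    extern void checksum_block(const unsigned char *p, int n, int idx);

    /* Checksum each BLOCK_LEN bytes of *uncompressed* data in a
     * gzip'd file.  The final block may be short. */
    static int checksum_gz(const char *path)
    {
        unsigned char buf[BLOCK_LEN];
        gzFile f = gzopen(path, "rb");
        int n, idx = 0;

        if (f == NULL)
            return -1;
        while ((n = gzread(f, buf, BLOCK_LEN)) > 0)
            checksum_block(buf, n, idx++);
        gzclose(f);
        return n;   /* 0 on clean EOF, -1 on read error */
    }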

Really this may require an extension to the gzip standard ... something
like a table of contents that lists the points in the file where you can
seek and begin decompressing, giving both the offset in the uncompressed
data and the offset in the compressed file for each such point.  (I should
read the gzip standard before throwing out any more ideas ...)
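If someone does want to experiment, the table itself could be trivial;
the real work is on the compression side, where each seek point has to
be preceded by a Z_FULL_FLUSH so that inflate() can restart there
without the earlier history.  An entirely hypothetical layout:

    #include <stdint.h>

    /* One entry per restartable point (i.e. per Z_FULL_FLUSH).
     * Could live in a FEXTRA subfield or an adjunct file. */
    struct gz_seek_point {
        uint64_t uncomp_off;   /* offset in the original data */
        uint64_t comp_off;     /* offset in the compressed file */
    };

    struct gz_seek_index {
        uint32_t count;                /* number of entries */
        struct gz_seek_point point[];  /* 'count' entries */
    };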


> It's amazing. I'll have a use for that if the problem with the sizes
> is solved - completely unzipping and checksumming the file doesn't make
> sense for local file systems.

Thanks!  Let's see what we can do to get the file size thing working; I
agree that is an important feature to have.

BTW, checking time stamps should still work, so unless a file was
modified without its mtime being updated, you may be able to get by with
ignoring file sizes and checksums on (say) your daily runs and using
--checksum once a week.  I've seen a few cases in the field where a
file's contents changed without its size or mtime changing, so I like to
run with --checksum periodically anyway.
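Concretely, that schedule is just something like (paths made up):

    # daily: quick check based on size/mtime
    rsync -a /src/ /dest/

    # weekly: read everything and compare checksums
    rsync -a --checksum /src/ /dest/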

	-Joel




