[clug] CFile testers needed :-) Now reading and writing XZ files!

Fri Nov 22 22:48:37 MST 2013

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi all,

After my success copying the XZ program's mechanism for determining the
uncompressed size of a xz-compressed file, I thought I'd have a look at my
other problem implementation: bzip2.

Now, gzip puts the uncompressed size of the file as a little-endian 32-bit
unsigned integer in the last four bytes of the file.  Bzip2 (unfortunately)
didn't do anything this simple; according to an email I had from the author
of bzip2 (Julian Seward), there isn't an uncompressed length stored in the
file.  At the moment I do an awful hack: I open a pipe to a shell running
'bzcat filename | wc -c', read the result in as an ascii string, convert it
to an integer, store that if possible in an extended attribute so we don't
have to do this again, and return the size.  But I got to thinking...

Bzip2 is basically blocks of uncompressed file; each one is fed through the
Burrows-Wheeler Transform (a wonderful thing to read up on if you're
interested in clever algorithms) and then compressed using a dynamically
generated Huffman tree.  Each block is a multiple of 100kb, with larger
blocks giving better but slower compression (more data to transform and
compress); maximum (9) compression uses 900kb blocks.  Each fixed-length
uncompressed block ends up as a variable-length compressed block.

Now, my thought was: if there's something in the header of each block that
indicates its compressed length, then I can find the next header easily.  I
know (from the fourth byte of the file) what the uncompressed size of the
blocks are.  The last block has to also have some kind of end symbol or
length, otherwise all output would be in multiples of x*100K.  So presumably
there's some way for the decompressor to know when its hit the end of the block.

- From reading bzip2's decompress.c, I've been able to work out the following.
 The file starts with "BZh" then the numeral 1-9 for the size of the BWT
blocks in 100k increments.  Then it has bytes in hex 31 41 59 26 53 59
(which interestingly spell out 1AY&SY but are much more likely just to be
the first twelve digits of Pi recorded in nibbles), presumably as a 'start
of compressed data' header that's unlikely to occur in regular data.  Then
there's the 32-bit CRC and a 32-bit 'origPtr' ('for doing the block
sorting', whatever that means), then the meat of the actual compression
information starts: a 16-bit mapping table, a 3-bit number of 'groups' ,
15-bit number of 'selectors', and so forth into the technical details.

That's as far as I've got.  There certainly isn't an obvious 'compressed
block size', nor an obvious 'single compression block header' to look for.
The whole thing is driven by an ingenious state machine that also looks like
linear coding and makes my head hurt.  But maybe I'm missing something, or
someone else that's into these things that reads this list will be able to
make sense of what's going on in there better than I can.

It may be that the best I can do without a pipe open is to open the file
using the bzip2 routines, read through the entire thing without any
processing of data, count the bytes read, and then use that.  I may have to
do some speed trials.

Anyway, I thought I'd leave this here as food for thought.  And advice
welcome :-)

Have fun,

Paul
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.15 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iEYEARECAAYFAlKQQbUACgkQu7W0U8VsXYLZkgCfTo6la8ZftuGxl6TsnbxwHKWd
zHwAoLdeKPN5uF5wZIJFT5gBjaQYFY6c
=N/Qz
-----END PGP SIGNATURE-----