[clug] Anyone using 'snappy', Google's fast compression?

Paul Wayper paulway at mabula.net
Sun May 15 06:46:58 MDT 2011


On 05/12/2011 05:41 PM, Michael Cohen wrote:
> This is basically the same as compressing the input anyway. Most
> LZW-like algorithms do this internally - compare a block against the
> dictionary and if it's not there they emit a Huffman code for
> "literal" and then literally append the output. Your overall size
> will still be enlarged by this code. If you take a file already
> compressed by gzip and recompress it - its size will be enlarged by a
> constant - which is the file header and essentially this literal
> escape code.
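
You can see that overhead directly - a quick Python sketch (the exact
sizes will vary a little with the data and the gzip version):

    import gzip
    import os

    # 1 MiB of random data is effectively incompressible, so the first
    # gzip pass already can't shrink it.
    data = os.urandom(1 << 20)
    once = gzip.compress(data)
    twice = gzip.compress(once)

    # Each pass adds only a small, bounded overhead: the gzip header
    # plus the framing for "stored"/literal deflate blocks.  The second
    # number comes out slightly larger than the first, and the third
    # slightly larger again.
    print(len(data), len(once), len(twice))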

The question I've had in my brain for a Master's thesis some day is:

What are the ways of reversibly transforming input data to make it more
compressible?

For example, bzip2 does a Burrows-Wheeler Transform on the input data before
applying a fairly basic compression to it (move-to-front and Huffman coding, I
think).  Without going into any details at all, the BWT groups similar areas
of the input together, so that localised compression on each section works
better (because back-references are more local and therefore smaller).
Uncompress and then reverse the BWT to get the original data back.
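
A toy version of both directions in Python - the naive sort-all-rotations
approach, nothing like bzip2's actual implementation, but it shows the
clustering:

    def bwt(s, sentinel="\0"):
        # Append a unique end marker so the transform is invertible,
        # then take the last column of the sorted rotation matrix.
        s += sentinel
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(rot[-1] for rot in rotations)

    def ibwt(r, sentinel="\0"):
        # Repeatedly prepend the transformed string as a new column and
        # re-sort; after len(r) rounds the table holds every rotation,
        # and the row ending in the sentinel is the original string.
        table = [""] * len(r)
        for _ in range(len(r)):
            table = sorted(r[i] + table[i] for i in range(len(r)))
        return next(row for row in table if row.endswith(sentinel)).rstrip(sentinel)

    print(bwt("banana"))        # 'annb\x00aa' - the a's and n's cluster
    print(ibwt(bwt("banana")))  # 'banana'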

What if there were a reversible algorithm that would slightly *expand* the
data but, in the process, make the compression work better than on the
original block?  If you've ever stepped through some of these algorithms
(and, let's face it, who hasn't done that for fun every once in a while? ;-) )
then there are definite points where you can say "the way the lookup table is
organised right now, if we received $LONG_INPUT_BLOCK it would compress into
one output symbol!"

And a lot of algorithms - Huffman, arithmetic coding, and even LZW at a
stretch - can benefit from precalculating the symbol lookup table so that the
more common inputs get smaller outputs.  Most of them are run in adaptive
mode, where you build the symbol table up as you read input bytes, because
this means you don't have to do two passes over the input.  Yet that's really
only necessary when you can't backtrack, and with today's memory sizes and
processor speeds you can easily hold a megabyte of buffer, calculate your
lookup tables over it, and then compress it before moving on to the next
block.
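
A sketch of that two-pass idea (no particular codec's format - just count
frequencies over a buffered block, build a static Huffman table, then
encode with it):

    import heapq
    from collections import Counter

    def huffman_table(block):
        # First pass: count frequencies, then repeatedly merge the two
        # lightest subtrees; common bytes end up with short codes.
        freq = Counter(block)
        if len(freq) == 1:
            return {next(iter(freq)): "0"}   # degenerate one-symbol block
        heap = [(n, i, {sym: ""}) for i, (sym, n) in enumerate(freq.items())]
        heapq.heapify(heap)
        tiebreak = len(heap)
        while len(heap) > 1:
            w1, _, left = heapq.heappop(heap)
            w2, _, right = heapq.heappop(heap)
            merged = {s: "0" + c for s, c in left.items()}
            merged.update({s: "1" + c for s, c in right.items()})
            heapq.heappush(heap, (w1 + w2, tiebreak, merged))
            tiebreak += 1    # keeps tuple comparison away from the dicts
        return heap[0][2]

    block = b"abracadabra" * 100              # one buffered block
    table = huffman_table(block)
    bits = sum(len(table[b]) for b in block)
    print(bits, "bits vs", 8 * len(block))    # well under 8 bits/byte here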

And I still need to write up my Zeckendorf representation library properly -
I've got Perl and Python versions and I'm sure C wouldn't be too hard - maybe
that's a project for CodeCave 2011.
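
The greedy decomposition at the heart of it is tiny; a Python sketch (not
the library itself):

    def zeckendorf(n):
        # Greedy decomposition: every positive integer is a unique sum
        # of non-consecutive Fibonacci numbers (Zeckendorf's theorem),
        # which is what makes Fibonacci coding work.
        if n < 1:
            raise ValueError("n must be a positive integer")
        fibs = [1, 2]
        while fibs[-1] < n:
            fibs.append(fibs[-1] + fibs[-2])
        parts = []
        for f in reversed(fibs):
            if f <= n:
                parts.append(f)
                n -= f
        return parts

    print(zeckendorf(100))  # [89, 8, 3]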

If only I had time to do a Master's...

Have fun,

Paul

P.S. I still want to look into the details for snappy as soon as I get some
free time.