[clug] Anyone using 'snappy', Google's fast compression?

Arjen Lentz arjen at lentz.com.au
Sun May 15 14:04:14 MDT 2011


Hi Paul,

----- Original Message -----
> What if there was a reversible algorithm that would slightly *expand*
> the data but, in the process, make the compression work better than on
> the original block. If you've ever stepped through some of the
> algorithms (and, let's face it, who hasn't done that for fun every once
> in a while? ;-) ) then there are definite points when you can say "the
> way the lookup table is organised right now, if we received
> $LONG_INPUT_BLOCK it would compress into one output symbol!"
> 
> And a lot of algorithms - Huffman, Arithmetic and even LZW with a
> stretch - can benefit from precalculating the symbol lookup table so
> that the more common inputs have smaller outputs. Most of them are
> run in adaptive mode, where you build the symbol table up as you read
> input bytes, because this means you don't have to do two passes over
> the input. [...]

Original Huffman did 2-pass. If you chunk that (block-wise precalc rather than adaptive), you'd get a decent balance.
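
Just to make the two-pass idea concrete, here's a toy Python sketch -- everything in it (the function names, the demo block) is mine and purely illustrative, not anyone's production codec. Pass 1 precalculates the static code table from symbol frequencies, pass 2 encodes with that table; run it per chunk and you have the block-wise variant.

import heapq
from collections import Counter

def huffman_code_table(data):
    """Pass 1: build a static code table from the symbol frequencies."""
    freq = Counter(data)
    if len(freq) == 1:                        # degenerate single-symbol block
        return {next(iter(freq)): "0"}
    # Heap items: (frequency, tie-breaker, {symbol: code-so-far}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

def encode_block(data):
    """Pass 2: encode the same block with the precalculated table."""
    table = huffman_code_table(data)
    return "".join(table[b] for b in data)

if __name__ == "__main__":
    block = b"just a small demonstration block, nothing fancy"
    bits = encode_block(block)
    print(len(block) * 8, "bits in ->", len(bits), "bits out (table not counted)")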

LZ is adaptive by nature, but originally it would toss away its entire table once it filled up. What Welch did was toss away only the leaf nodes of the lookup tree. Simple tweak, very good results.
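
For reference, this is roughly what that lookup table looks like in code -- a throwaway Python sketch of an LZW-style compressor only (my own names, and no table-reset logic at all, so it just keeps growing; the toss-everything vs. prune-the-leaves question above is exactly about what to do when it fills up):

def lzw_compress(data):
    # Seed the table with every single-byte string.
    table = {bytes([i]): i for i in range(256)}
    next_code = 256
    out = []
    phrase = b""
    for byte in data:
        candidate = phrase + bytes([byte])
        if candidate in table:
            phrase = candidate             # keep extending the current match
        else:
            out.append(table[phrase])      # emit code for longest known phrase
            table[candidate] = next_code   # grow the lookup table
            next_code += 1
            phrase = bytes([byte])
    if phrase:
        out.append(table[phrase])
    return out

if __name__ == "__main__":
    sample = b"abababababababab"
    print(len(sample), "bytes ->", lzw_compress(sample))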

I was very impressed with arithmetic encoding/compression when it first appeared in OSS code (late 80s I believe, though the algorithm stems from the late 70s). At the time the CPU power wasn't there, so in practice it was soon swapped for adaptive Huffman (as in LZHUF, the Huffman-based counterpart of the arithmetic-coding LZARI that the LZH family built on).
(on a side note, Huffman is essentially a specialised form of arithmetic coding)
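
If anyone wants to see why the ratios can be so good, here's a toy float-based arithmetic encoder in Python (illustrative only -- real coders use integer renormalisation, and floats fall over on anything but very short inputs). The whole message ends up as a single number in [0,1), with more probable symbols claiming wider slices of the interval; Huffman is, roughly, the special case where every slice is forced to a width of 1/2, 1/4, 1/8, ...

from collections import Counter

def arithmetic_encode(msg):
    # Static model: cumulative probability ranges taken from the message itself.
    freq = Counter(msg)
    total = len(msg)
    ranges, cum = {}, 0.0
    for sym in sorted(freq):
        p = freq[sym] / total
        ranges[sym] = (cum, cum + p)
        cum += p
    low, high = 0.0, 1.0
    for sym in msg:
        lo, hi = ranges[sym]
        span = high - low
        high = low + span * hi    # narrow the interval to this
        low = low + span * lo     # symbol's slice of it
    return (low + high) / 2       # any number inside [low, high) will do

if __name__ == "__main__":
    print(arithmetic_encode("ABBBBBBBBA"))   # heavily skewed toward 'B'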

I understand that bzip used arithmetic coding but bzip2 changed to Huffman coding due to patent concerns; ref http://en.wikipedia.org/wiki/Arithmetic_coding. Obviously most of those patents have now expired, and the ones remaining are probably unenforceable due to public prior art (and the earlier patents).




I think your idea is interesting, but perhaps mainly for research (your Masters) and indulgent purposes. In the real world, compression algorithms compromise for the data source and storage/transport medium. Static files can be handled differently from streams, but most tools optimise for a stream (sometimes chunking it). They also optimise for "cheap decompression" (less CPU power required on the decoding end).
CPU power is generally abundant now, storage is cheap, and transport in many cases is also cheap. Of course there are exceptions... on a mobile device you've got lower transport speed (or at least higher latency, which makes things annoying), relatively less storage, and CPU power is a nasty compromise against power consumption. So anything storing info that gets read on/by a mobile device would still want "cheap decompression", but could definitely benefit from optimisations.

Since CPU on the compression side is generally not a hindrance any more, I'd like to see new pure or modified/combined implementations of arithmetic encoding, as the compression ratios were pretty amazing. Actually, googling for "arithmetic compression" and "arithmetic coding" turns up plenty of papers and also code (mostly research afaics), so there should be plenty of experimentation to learn from.

I don't know enough about the legal status of arithmetic encoding (or its perceived status, which may be just as important) to say whether a FLOSS implementation could now be accepted for use/distribution within the community for real use. Heck, JPEG could start using it again, it's in the specs ;-) we could revisit original bzip, etc. Might be worth a look.


Cheers,
Arjen.
-- 
Arjen Lentz, Exec.Director @ Open Query (http://openquery.com)
Remote expertise & maintenance for MySQL/MariaDB server environments.

Follow us at http://openquery.com/blog/ & http://twitter.com/openquery

