[clug] Compressing similar text files

Sun Sep 9 04:19:57 MDT 2012

On 9 September 2012 20:03, steve jenkin <sjenkin at canb.auug.org.au> wrote:

> For a project, I've downloaded ~5,000 files (3.25M lines) taking up
> around 200 MB.
>
> They compress with gzip to 58 MB, a ratio of about 3.5:1.
> bzip2 is very slightly better with default parameters.
>
> Whilst the following test destroys information, it indicates the amount
> of redundancy:
> "sort -u *.html" produces a 19 MB file (192161 lines, nearly 17:1 by
> line count), which gzip reduces to 1.8 MB [bzip2: 1.3 MB].
>
> By manually replacing common strings with 2-character codes, I reduced
> the file to 7.5 MB, but gzip only compressed that to 1.6 MB
> [bzip2: 1.3 MB].
>
> I'm very surprised and pleased that gzip & bzip2 do so well on the
> sorted/uniquified file.
> bzip2 seems to notice all those long common prefixes/suffixes.
>
> ==> Question.
>
> Does anyone know:
>  - if there exists a package to encode (text) files according to a
> keyfile?  The simplest key set for highly similar files is 'sort -u'.
>
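
One related mechanism, offered as a sketch rather than a ready-made
package: zlib supports a preset dictionary, which behaves a bit like a
keyfile shared by the compressor and the decompressor. The example below
uses Python's zlib module. The function names and the sample key are made
up for illustration, and the dictionary only helps within deflate's 32 KB
window, so a 19 MB 'sort -u' output would have to be trimmed to its most
common lines first.

```python
import zlib


def compress_with_key(data, key):
    """Deflate `data`, priming the compressor with `key` as a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=key)
    return comp.compress(data) + comp.flush()


def decompress_with_key(blob, key):
    """Inverse of compress_with_key; requires the identical `key` bytes."""
    decomp = zlib.decompressobj(zdict=key)
    return decomp.decompress(blob) + decomp.flush()


if __name__ == "__main__":
    # Hypothetical key: boilerplate that many of the files are assumed to share.
    key = (b"<html><head><title>Example</title></head><body>"
           b"<div class='nav'>Home | About | Contact</div>\n")
    page = key + b"<p>unique content 42</p>\n"

    plain = zlib.compress(page, 9)
    keyed = compress_with_key(page, key)
    assert decompress_with_key(keyed, key) == page
    print("without key:", len(plain), "bytes; with key:", len(keyed), "bytes")
```

The key has to be stored once alongside the compressed files, since
decompression fails without it.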

Would tar + (b|g)zip work for your application?
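
The idea, roughly: compressing each file on its own cannot exploit text
shared between files, while a single compressed stream over the whole tar
archive can. A minimal Python sketch of that comparison follows. The
helper names and the sample pages are hypothetical stand-ins, not your
data.

```python
import bz2
import gzip
import io
import tarfile


def make_tar(files):
    """Pack a {name: bytes} mapping into an uncompressed in-memory tar archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def compare(files):
    """Return byte counts for per-file gzip versus one compressed tar stream."""
    tar_bytes = make_tar(files)
    return {
        "raw": sum(len(d) for d in files.values()),
        "gzip_each": sum(len(gzip.compress(d)) for d in files.values()),
        "tar_gz": len(gzip.compress(tar_bytes)),
        "tar_bz2": len(bz2.compress(tar_bytes)),
    }


if __name__ == "__main__":
    # Made-up sample: 200 small pages that differ only in one number.
    page = ("<html><head><title>Item {n}</title></head>"
            "<body><p>Shared footer text</p></body></html>\n")
    files = {f"page{n:04d}.html": page.format(n=n).encode() for n in range(200)}
    for label, size in compare(files).items():
        print(f"{label:10s} {size:8d} bytes")
```

One caveat: gzip only looks back 32 KB, so with files averaging around
40 KB (200 MB over 5,000 files), bzip2, with its block size of up to
900 kB, may benefit more from the tar approach than gzip does.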

David