[clug] Compressing similar text files
david at d-austin.net
Sun Sep 9 04:19:57 MDT 2012
On 9 September 2012 20:03, steve jenkin <sjenkin at canb.auug.org.au> wrote:
> For a project, I've downloaded ~5,000 files (3.25M lines) taking around
> They compress with gzip to 58 MB, roughly 4:1.
> bzip2 is very slightly better with default parameters.
> Whilst the following test destroys information, it indicates the amount
> of redundancy
> "sort -u *.html" produces a 19Mb file (192161 lines. nearly 17:1),
> which gzip reduces to 1.8Mb [bzip2 1.3Mb]
> By manually replacing common strings with 2-char groups, the file shrank
> to 7.5 MB, but gzip only compressed it to 1.6 MB [1.3 MB with bzip2].
> I'm very surprised and pleased that gzip & bzip2 do so well on the
> sorted/uniquified file.
> bzip2 seems to notice all those long common prefixes/suffixes.
> ==> Question.
> Does anyone know:
> - if there exists a package to code (text) files according to a
> keyfile? The simplest key set for highly similar files is 'sort -u'.
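The "sort -u as key set" measurement quoted above is easy to reproduce on throwaway data. This is a minimal sketch, not the original poster's script: the directory name, file names, and contents below are made up for illustration.

```shell
#!/bin/sh
# Build a few similar files, then measure line-level redundancy
# the same way as above: sort -u across all of them.
mkdir -p demo
for i in 1 2 3; do
  printf '<html><head><title>page %s</title></head>\n<body>common boilerplate line</body></html>\n' "$i" > "demo/page$i.html"
done

# Deduplicate lines across every file. This destroys per-file structure,
# but shows how much of the corpus is repeated whole lines.
sort -u demo/*.html > demo/key.txt

# Compare compressed sizes of the raw concatenation vs. the key file.
cat demo/*.html | gzip -9 > demo/all.gz
gzip -9 < demo/key.txt > demo/key.gz
ls -l demo/all.gz demo/key.gz
```

On a real corpus of ~5,000 similar HTML files the key file is much smaller than the concatenation, as the 17:1 figure above suggests; on this toy data the effect is small but still visible in the line counts.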
Would tar + (b|g)zip work for your application?
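The tar suggestion is worth trying because the compressor then sees all files as one stream instead of compressing each file independently. A minimal sketch, with made-up file names and contents:

```shell
#!/bin/sh
# Create a small set of similar files, then tar + compress them.
mkdir -p demo2
for i in 1 2 3; do
  printf 'shared header line\nunique payload %s\nshared footer line\n' "$i" > "demo2/f$i.txt"
done

# Compress the whole archive as one stream with each tool.
tar -cf - demo2 | gzip -9  > demo2.tar.gz
tar -cf - demo2 | bzip2 -9 > demo2.tar.bz2
ls -l demo2.tar.gz demo2.tar.bz2
```

One caveat: gzip's DEFLATE window is only 32 KB, so redundancy between files further apart than that in the archive goes unexploited; bzip2 works on blocks of up to 900 KB, so it catches more cross-file repetition, which may explain why it "notices" the long common prefixes/suffixes mentioned above.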