[clug] Compressing similar text files

Alastair D'Silva alastair at d-silva.org
Sun Sep 9 04:38:22 MDT 2012


> -----Original Message-----
> From: linux-bounces at lists.samba.org [mailto:linux-bounces at lists.samba.org] On Behalf Of steve jenkin
> Sent: Sunday, 9 September 2012 8:03 PM
> To: CLUG List
> Subject: [clug] Compressing similar text files
> 
> For a project, I've downloaded ~5,000 files (3.25M lines) taking around
> 200Mb.
> 
> They compress with gzip to 58Mb, a ratio of around 4:1.
> bzip2 does very slightly better with default parameters.
> 
> Whilst the following test destroys information, it indicates the amount
> of redundancy: "sort -u *.html" produces a 19Mb file (192,161 lines,
> nearly 17:1), which gzip reduces to 1.8Mb [bzip2: 1.3Mb].
> 
> By manually replacing common strings with 2-char groups, the file
> shrank to 7.5Mb, but gzip only compressed it to 1.6Mb [bzip2: 1.3Mb].
> 
> I'm very surprised and pleased that gzip & bzip2 do so well on the
> sorted/uniquified file.
> bzip2 seems to notice all those long common prefixes/suffixes.
> 
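
For reference, the quoted measurements boil down to something like the
following (file names are illustrative; gzip and bzip2 at default
settings):

  # Raw size of the corpus (~200Mb over ~5,000 files)
  du -ch *.html | tail -1

  # Whole-corpus compression with gzip and bzip2
  tar cf corpus.tar *.html
  gzip  -c corpus.tar > corpus.tar.gz
  bzip2 -c corpus.tar > corpus.tar.bz2
  du -h corpus.tar.gz corpus.tar.bz2

  # The lossy redundancy estimate: unique lines only, then compress
  sort -u *.html > unique.txt
  wc -lc unique.txt
  gzip  -c unique.txt | wc -c
  bzip2 -c unique.txt | wc -c

  # Hand substitution of common strings, e.g. (pattern is hypothetical):
  sed 's/a common string/@1/g' unique.txt > substituted.txt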

Before trying to mangle the data by hand, take a look at lrzip, which
first runs an rzip-style long-range redundancy pass over the data, then
compresses the result with LZMA:
http://ck.kolivas.org/apps/lrzip/
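
lrzip compresses a single file, so tar the corpus up first. A minimal
sketch (flag meanings per the lrzip man page; -U lifts the compression
window limit so matches between distant files can still be found):

  tar cf corpus.tar *.html

  # rzip-style long-range pass, then LZMA (lrzip's default backend)
  lrzip -U corpus.tar            # writes corpus.tar.lrz
  du -h corpus.tar.lrz

  # Round-trip check (-d decompresses; -f overwrites an existing file)
  lrzip -d -f corpus.tar.lrz

lrzip also ships an lrztar wrapper that folds the tar step in.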

-- 
Alastair D'Silva           mob: 0423 762 819
twitter: evildeece   msn: alastair at d-silva.org
blog: http://alastair.d-silva.org
