[clug] Compressing similar text files
alastair at d-silva.org
Sun Sep 9 04:38:22 MDT 2012
> -----Original Message-----
> From: linux-bounces at lists.samba.org [mailto:linux-
> bounces at lists.samba.org] On Behalf Of steve jenkin
> Sent: Sunday, 9 September 2012 8:03 PM
> To: CLUG List
> Subject: [clug] Compressing similar text files
> For a project, I've downloaded ~5,000 files (3.25M lines) taking around
> They compress with gzip to 58Mb, around 4 times.
> bzip2 is very slightly better with default parameters.
> Whilst the following test destroys information, it indicates the amount of
> redundancy: "sort -u *.html" produces a 19Mb file (192,161 lines),
> which gzip reduces to 1.8Mb [bzip2: 1.3Mb].
> By manually replacing common strings with 2-character groups, the file shrank
> to 7.5Mb, but gzip only compressed it to 1.6Mb [bzip2: 1.3Mb].
> I'm very surprised and pleased that gzip & bzip2 do so well on the
> sorted/uniquified file.
> bzip2 seems to notice all those long common prefixes/suffixes.
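The sort/uniq effect described above can be reproduced with a small sketch. This uses synthetic log-like lines (an assumption, not the poster's HTML corpus) and Python's gzip/bz2 modules standing in for the command-line tools:

```python
# Sketch of the "sort -u before compressing" experiment from the post.
# Sizes here are illustrative only, not the poster's measurements.
import bz2
import gzip

# Synthetic corpus: many exact-duplicate lines (hypothetical data).
lines = [f"GET /page/{i % 500} HTTP/1.1 200\n" for i in range(100_000)]
raw = "".join(lines).encode()

# Equivalent of "sort -u": unique lines only, in sorted order.
deduped = "".join(sorted(set(lines))).encode()

for name, data in (("raw", raw), ("sort -u", deduped)):
    print(f"{name:8s} in={len(data)} "
          f"gzip={len(gzip.compress(data))} "
          f"bzip2={len(bz2.compress(data))}")
```

As in the post, most of the win comes from discarding duplicate lines outright; sorting then groups the shared prefixes so the back-end compressors find long runs of near-identical context.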
Before trying to mangle the data, take a look at lrzip, which first removes
long-range redundancy (an rzip-style preprocessing pass), then compresses
the result with LZMA.
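A tiny sketch of why the long-range stage matters: gzip's DEFLATE window is only 32KB, so a duplicate further away than that is invisible to it, while LZMA's much larger dictionary (standing in here for lrzip's back end) folds the repeat. Synthetic data, not the HTML corpus:

```python
# Demonstrates long-range redundancy that gzip cannot exploit but
# LZMA can. The 64KB repeat distance exceeds gzip's 32KB window.
import gzip
import lzma
import os

chunk = os.urandom(64 * 1024)   # 64KB of incompressible random bytes
data = chunk + chunk            # the same 64KB repeated back-to-back

gz = len(gzip.compress(data))
xz = len(lzma.compress(data))
print(f"input={len(data)} gzip={gz} lzma={xz}")
# gzip barely shrinks the input (the repeat is out of window);
# lzma codes the second copy as one long match, near half the size.
```

lrzip generalises this: its rzip pass can match redundancy gigabytes apart, which is exactly the situation with thousands of structurally similar downloaded files.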
Alastair D'Silva mob: 0423 762 819
twitter: evildeece msn: alastair at d-silva.org