[clug] Compressing similar text files

Brad Hards bradh at frogmouth.net
Sun Sep 9 04:20:05 MDT 2012


On Sunday 09 September 2012 20:03:11 steve jenkin wrote:
> For a project, I've downloaded ~5,000 files (3.25M lines) taking around
> 200MB.
> 
> They compress with gzip to 58MB, around 4 times.
> bzip2 is very slightly better with default parameters.
[Stuff relating to the real question, which I'm not addressing at all, removed]

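So the limiting factor here is probably the window that each compressor 
"looks over" to find redundancy: gzip's DEFLATE only searches the previous 
32 KB, and bzip2 compresses independent blocks of at most 900 kB, so 
neither can exploit similarity between files that sit further apart than 
that in the input.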

Something with a larger window (ideally larger than the input data) will 
presumably find more commonality.
Can you try it with rzip (http://rzip.samba.org/) and lrzip (see 
http://ck.kolivas.org/apps/lrzip/README)?
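
To illustrate the window effect, here is a rough sketch in Python using only 
the standard library. It is not rzip or lrzip: the lzma module (default 
preset, which uses a multi-megabyte dictionary) merely stands in for "a 
compressor with a larger window". The input is synthetic, a 1 MiB block of 
pseudo-random bytes repeated twice, so the only redundancy is 1 MiB away. 
The function names are made up for this example. Under those assumptions, 
zlib and bz2 should come out close to the input size, while lzma should 
come out at roughly half.

```python
import bz2
import lzma
import random
import zlib


def make_repeated_input(block_size: int = 1 << 20, seed: int = 0) -> bytes:
    """Return one pseudo-random block followed by an exact copy of itself.

    The block itself is incompressible, so the only redundancy a compressor
    can find is the second copy, which starts block_size bytes later.
    """
    rng = random.Random(seed)
    block = rng.getrandbits(8 * block_size).to_bytes(block_size, "little")
    return block + block


def compressed_sizes(data: bytes) -> dict:
    """Compress data with three stdlib codecs and return the output sizes."""
    return {
        # DEFLATE: matches are limited to the previous 32 KB.
        "zlib": len(zlib.compress(data, 9)),
        # bzip2: independent blocks of at most 900 kB at level 9.
        "bz2": len(bz2.compress(data, 9)),
        # LZMA (xz container, default preset): dictionary larger than 1 MiB.
        "lzma": len(lzma.compress(data)),
    }


if __name__ == "__main__":
    data = make_repeated_input()
    print(f"input {len(data)} bytes")
    for name, size in compressed_sizes(data).items():
        print(f"{name:5s} {size} bytes")
```

Real text files will of course compress far better than random bytes with 
all three; the point is only that the repeat is invisible to a compressor 
whose window is smaller than the distance to it.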

Brad


