[clug] Compressing similar text files

steve jenkin sjenkin at canb.auug.org.au
Sun Sep 9 04:03:11 MDT 2012


For a project, I've downloaded ~5,000 files (3.25M lines) taking around
200MB.

They compress with gzip to 58MB, a ratio of about 3.5:1.
bzip2 is very slightly better with default parameters.

Whilst the following test destroys information, it indicates the amount
of redundancy:
"sort -u *.html" produces a 19MB file (192,161 lines, nearly 17:1 by
line count), which gzip reduces to 1.8MB [bzip2 1.3MB].
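
The same measurement can be sketched in Python. This is only an
illustration of the "sort -u" test above, not what I actually ran; the
function name redundancy_report is made up for the example, and it
assumes the lines have already been read in with their newlines
stripped.

```python
import bz2
import gzip


def redundancy_report(lines):
    """Compare a stream of lines with its sorted, de-duplicated form.

    `lines` is a list of strings without trailing newlines, e.g. the
    concatenation of every downloaded file split into lines.
    """
    unique = sorted(set(lines))              # rough equivalent of `sort -u`
    full = "\n".join(lines).encode("utf-8")
    uniq = "\n".join(unique).encode("utf-8")
    return {
        "total_lines": len(lines),
        "unique_lines": len(unique),
        "raw_bytes": len(full),
        "unique_bytes": len(uniq),
        "unique_gzip_bytes": len(gzip.compress(uniq)),
        "unique_bzip2_bytes": len(bz2.compress(uniq)),
    }
```

Note that sorted() orders by code point, so the line order may differ
from a locale-aware sort, but the line counts are unaffected.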

By manually replacing common strings with 2-character codes, the file
shrank to 7.5MB, but gzip only compressed that to 1.6MB [bzip2 1.3MB].

I'm very surprised and pleased that gzip & bzip2 do so well on the
sorted/uniquified file.
bzip2 seems to notice all those long common prefixes/suffixes.

==> Question.

Does anyone know either:
 - whether a package exists to encode (text) files against a keyfile?
   The simplest key set for highly similar files is the output of
   'sort -u'.

 - if there isn't a package, how I might construct a self-extracting
   Perl script along these lines:
   - a Big Array with each of the unique lines as constants
   - an Ordered Data array giving the sequence of lines
   - the script can then be bzip2'ed and, because of that Big Array,
     should compress well.
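
To make the idea concrete, here is a minimal sketch of the scheme in
Python rather than Perl. It is not self-extracting, and the names
(encode, decode, pack, unpack) and the JSON container are assumptions
chosen for the example, not an existing package.

```python
import bz2
import json


def encode(files):
    """files: dict mapping file name -> list of lines (newlines stripped).

    Returns (table, sequences): `table` is the sorted list of unique
    lines (the "Big Array"), and `sequences` maps each file name to a
    list of indexes into `table` (the "Ordered Data").
    """
    table = sorted({line for lines in files.values() for line in lines})
    index = {line: i for i, line in enumerate(table)}
    sequences = {name: [index[line] for line in lines]
                 for name, lines in files.items()}
    return table, sequences


def decode(table, sequences):
    """Rebuild every file by looking its indexes up in the table."""
    return {name: [table[i] for i in seq] for name, seq in sequences.items()}


def pack(files):
    """Serialise the table and sequences together, then bzip2 the result."""
    table, sequences = encode(files)
    payload = json.dumps({"table": table, "sequences": sequences})
    return bz2.compress(payload.encode("utf-8"))


def unpack(blob):
    """Inverse of pack(): returns the original name -> lines mapping."""
    data = json.loads(bz2.decompress(blob).decode("utf-8"))
    return decode(data["table"], data["sequences"])
```

Unlike plain "sort -u", this keeps the per-file line order, so
unpack(pack(files)) gives back the original line lists.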


Many thanks in advance.

Cheers
steve

PS: having gone to the trouble of describing this problem, it occurred
to me that with the roughly 17:1 duplication, a versioning or diff
system would work well. And those files could be bzip2'd.

==> Anyone done stuff like that?

Proof of concept.

Telstra_RIM_exchanges steve$ ls -l rimexchangejs_RWTE.html rimexchangejs_RYAN.html
-rw-r--r--  1 steve  steve  20325  9 Sep 12:05 rimexchangejs_RWTE.html
-rw-r--r--  1 steve  steve  10733  9 Sep 12:05 rimexchangejs_RYAN.html

Telstra_RIM_exchanges steve$ diff -e rimexchangejs_RWTE.html rimexchangejs_RYAN.html | wc -c
     706
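
The diff approach can be sketched like this: store one file in full as
a base and every other file as a delta against it. This uses Python's
difflib rather than the ed-script format that "diff -e" emits, and
make_delta / apply_delta are hypothetical names for the example.

```python
import difflib


def make_delta(base, target):
    """Describe `target` as copies from `base` plus literal new lines.

    Both arguments are lists of lines. The delta is a list of
    ("copy", start, end) and ("insert", [lines]) operations.
    """
    matcher = difflib.SequenceMatcher(a=base, b=target, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            delta.append(("insert", target[j1:j2]))
        # "delete" needs no entry: those base lines are simply not copied.
    return delta


def apply_delta(base, delta):
    """Rebuild the target file from the base file and a delta."""
    out = []
    for op in delta:
        if op[0] == "copy":
            out.extend(base[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out
```

Only the "insert" operations carry text, so for near-identical files
the delta stays small, and the deltas can then be bzip2'd together.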

====================================


Background:

These are the two pages I've been playing with:

The map, from which I want to save "RIM Information"
<http://www.adsl2exchanges.com.au/viewexchange.php?Exchange=SCLN>

The URL that I save. It returns JavaScript to set up the map
overlay/co-ords, is called by a script attached to the "RIM Information"
button, and uses a session cookie with no parameters.
<http://www.adsl2exchanges.com.au/rimexchangejs.php>


-- 
Steve Jenkin, Info Tech, Systems and Design Specialist.
0412 786 915 (+61 412 786 915)
PO Box 48, Kippax ACT 2615, AUSTRALIA

sjenkin at canb.auug.org.au http://members.tip.net.au/~sjenkin

