[clug] Compressing similar text files
sjenkin at canb.auug.org.au
Sun Sep 9 04:03:11 MDT 2012
For a project, I've downloaded ~5,000 files (3.25M lines) taking around
They compress with gzip to 58MB, a ratio of around 4:1.
bzip2 is very slightly better with default parameters.
Whilst the following test destroys information, it indicates the amount
"sort -u *.html" produces a 19MB file (192,161 lines, nearly 17:1),
which gzip reduces to 1.8MB [bzip2: 1.3MB].
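
The "sort -u" measurement can be repeated without the shell. What follows is a rough Python sketch, not the commands actually used for the figures above: the function name unique_line_stats is made up for illustration, and the sort order is by code point rather than by locale as sort(1) would do, so byte counts may differ slightly.

```python
import bz2
import gzip
from pathlib import Path


def unique_line_stats(texts):
    """Count total and unique lines, then compress the sorted unique lines."""
    all_lines = [line for text in texts for line in text.splitlines()]
    unique = sorted(set(all_lines))          # rough equivalent of `sort -u`
    blob = "\n".join(unique).encode("utf-8")
    return {
        "total_lines": len(all_lines),
        "unique_lines": len(unique),
        "raw_bytes": len(blob),
        "gzip_bytes": len(gzip.compress(blob)),
        "bzip2_bytes": len(bz2.compress(blob)),
    }


if __name__ == "__main__":
    # Assumes the downloaded pages sit in the current directory as *.html.
    pages = [p.read_text(encoding="utf-8", errors="replace")
             for p in Path(".").glob("*.html")]
    print(unique_line_stats(pages))
```

Dividing total_lines by unique_lines gives the duplication ratio quoted above.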
By manually replacing common strings with 2-character codes, the file
shrank to 7.5MB, but gzip only compressed that to 1.6MB [bzip2: 1.3MB].
I'm very surprised and pleased that gzip & bzip2 do so well on the
bzip2 seems to notice all those long common prefixes/suffixes.
Does anyone know either:
- if there exists a package to code (text) files according to a
  keyfile? The simplest key set for highly similar files is 'sort -u'.
- if there isn't a package, how I might construct a self-extracting
  Perl script along these lines:
  - a Big Array with each of the unique lines as constants
  - an Ordered Data array with the sequence
  - The script can be bzip'ed, and because of that Big Array, will
Many thanks in advance.
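
To make the "Big Array plus Ordered Data array" idea concrete, here is a minimal sketch. It is written in Python rather than Perl and is not self-extracting, so treat it only as an illustration of the data layout. The names encode and decode and the JSON container are assumptions of this sketch, not an existing package.

```python
import bz2
import json


def encode(files):
    """Pack {filename: text} as a unique-line table plus index sequences."""
    # The "Big Array": every distinct line, stored once.
    table = sorted({line for text in files.values() for line in text.split("\n")})
    index = {line: i for i, line in enumerate(table)}
    # The "Ordered Data array": for each file, the table indices in order.
    sequences = {name: [index[line] for line in text.split("\n")]
                 for name, text in files.items()}
    payload = json.dumps({"lines": table, "files": sequences})
    return bz2.compress(payload.encode("utf-8"))


def decode(blob):
    """Rebuild {filename: text} from the output of encode()."""
    data = json.loads(bz2.decompress(blob).decode("utf-8"))
    table = data["lines"]
    return {name: "\n".join(table[i] for i in seq)
            for name, seq in data["files"].items()}
```

Splitting on "\n" and joining with "\n" reproduces each file exactly, including a trailing newline. Whether this beats running bzip2 over a plain tar of the files would have to be measured.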
PS: having gone to the trouble to describe this problem, it occurred to
me that with the 15:1 duplication, a versioning or diff system would
work well. And those files could be bzip2'd.
==> Anyone done stuff like that?
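
One way the diff idea might look, sketched in Python with the standard difflib module instead of diff(1): each page is stored as "copy these line ranges from a reference page, plus these new lines", and the result is bzip2'd. The function names and the delta format are invented for this sketch. It is not the ed-script format that "diff -e" emits.

```python
import bz2
import difflib
import json


def make_delta(base_lines, target_lines):
    """Describe target_lines as copies from base_lines plus literal new lines."""
    matcher = difflib.SequenceMatcher(a=base_lines, b=target_lines, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(["copy", i1, i2])            # reuse base_lines[i1:i2]
        elif tag in ("replace", "insert"):
            delta.append(["add", target_lines[j1:j2]])  # lines only in target
        # "delete" needs no entry: those base lines are simply never copied
    return delta


def apply_delta(base_lines, delta):
    """Rebuild the target lines from the reference lines and a delta."""
    out = []
    for op in delta:
        if op[0] == "copy":
            out.extend(base_lines[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out


def pack_delta(delta):
    """Serialise a delta as JSON and compress it with bzip2."""
    return bz2.compress(json.dumps(delta).encode("utf-8"))
```

Usage would be along the lines of choosing one page as the reference, calling make_delta(reference.split("\n"), page.split("\n")) for every other page, and keeping only the reference plus the packed deltas.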
Proof of concept.
Telstra_RIM_exchanges steve$ ls -l rimexchangejs_RWTE.html
-rw-r--r-- 1 steve steve 20325 9 Sep 12:05 rimexchangejs_RWTE.html
-rw-r--r-- 1 steve steve 10733 9 Sep 12:05 rimexchangejs_RYAN.html
Telstra_RIM_exchanges steve$ diff -e rimexchangejs_RWTE.html
These are the two pages I've been playing with:
The map, from which I want to save "RIM Information"
Called by script attached to the "RIM Information" button.
Uses session cookie, no params.
Steve Jenkin, Info Tech, Systems and Design Specialist.
0412 786 915 (+61 412 786 915)
PO Box 48, Kippax ACT 2615, AUSTRALIA
sjenkin at canb.auug.org.au http://members.tip.net.au/~sjenkin