[clug] Help explaining a sed command to delete last 10 lines from a file

Bob Edwards Robert.Edwards at anu.edu.au
Tue Jan 14 17:42:21 MST 2014


On 14/01/14 18:03, steve jenkin wrote:
...
> I've ~250,000 email messages I want to trim off some XML added to the end of each message, then compute the hash sum of, so I can identify duplicates... "Just Works" is good. Not horrendously slow is better.
>
> Alternatively, I could just look for the Message ID header :)
> Those should be unique.
>
> cheers
> steve
>

Hi Steve,

I have little to no sed foo (I know it exists...), but was thinking:

isn't stripping the last 10 lines from each e-mail a little
"brute-force"?

If you know the added material to trim is proper XML, can't you just
use that knowledge to find and remove it (or exclude it from the hash
sum)? May be a little slower, but surely "more correct"?
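Something like the following sketch could do that: strip a trailing appended block before hashing. The tag name `<x-meta` here is purely hypothetical; substitute whatever opening tag actually begins the XML that got appended to each message.

```python
import hashlib

def hash_without_trailer(text, trailer_start="<x-meta"):
    # "<x-meta" is a placeholder; use the real opening tag of the
    # appended XML block. We search from the end so quoted copies of
    # the tag earlier in the body are left alone.
    idx = text.rfind(trailer_start)
    body = text[:idx] if idx != -1 else text
    return hashlib.sha256(body.encode("utf-8", "replace")).hexdigest()
```

Two copies of a message that differ only in the appended XML would then hash identically, which is exactly what you want for duplicate detection.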

As for the Message-IDs, they should be unique to each message. You
have 250,000 messages, so you should be able to run through them,
compile a list and check it for duplicates?
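A rough sketch of that check, using Python's standard email parser (the `Maildir/cur/*` glob in the usage comment is just an assumed layout, one file per message):

```python
import glob
from collections import Counter
from email import policy
from email.parser import BytesParser

def message_id_counts(paths):
    # Parse only the headers of each file and tally Message-IDs.
    counts = Counter()
    for path in paths:
        with open(path, "rb") as f:
            msg = BytesParser(policy=policy.default).parse(f, headersonly=True)
        counts[msg.get("Message-ID", "<missing>")] += 1
    return counts

# Anything counted more than once is a candidate duplicate:
# dupes = {m: n for m, n in
#          message_id_counts(glob.glob("Maildir/cur/*")).items() if n > 1}
```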

Maybe a "two pass" approach: hash parts of the header (date, subject
etc.) to find likely candidates for further uniqueness testing, then
go through those and check the bodies for complete uniqueness?
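As a sketch of the two-pass idea (assuming messages come in as header-dict/body pairs, and that Date + Subject is a good enough first-pass key; both are assumptions you'd tune):

```python
import hashlib
from collections import defaultdict

def find_duplicates(messages):
    # messages: iterable of (headers_dict, body_text) pairs.
    # Pass 1: cheap hash over a few headers to group candidates.
    candidates = defaultdict(list)
    for headers, body in messages:
        key = hashlib.sha256(
            (headers.get("Date", "") + headers.get("Subject", "")).encode()
        ).hexdigest()
        candidates[key].append(body)
    # Pass 2: within each candidate group, compare full body hashes.
    dupes = []
    for group in candidates.values():
        seen = set()
        for body in group:
            h = hashlib.sha256(body.encode()).hexdigest()
            if h in seen:
                dupes.append(body)
            else:
                seen.add(h)
    return dupes
```

The point of the first pass is that only messages sharing the cheap header key ever get their full bodies hashed and compared.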

Depending upon what you are trying to achieve (remove duplicates,
reduce disk space etc.) you may want to consider loading the whole
thing into an SQL (or other) database and then letting the database
engine do all the work of looking for the various types of uniqueness
you are interested in.
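For instance, with SQLite (in-memory here just for illustration; you'd use an on-disk file for 250,000 messages), a single GROUP BY does the duplicate-finding:

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for real data
conn.execute(
    "CREATE TABLE messages (path TEXT, message_id TEXT, body_hash TEXT)"
)

def add_message(path, message_id, body):
    # Store a hash of the body rather than the body itself.
    conn.execute(
        "INSERT INTO messages VALUES (?, ?, ?)",
        (path, message_id, hashlib.sha256(body.encode()).hexdigest()),
    )

# The engine then does the uniqueness work:
# SELECT body_hash, COUNT(*) FROM messages
#   GROUP BY body_hash HAVING COUNT(*) > 1;
```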

I'm also thinking that a primary source of duplication in e-mails is
the constant quoting of previous e-mails. Is it possible to encode the
fact that some part of some other message is being quoted, without
repeating it? Can you encode the depths and styles of the quoting etc.
in such a way as to be able to losslessly reproduce the subordinate
message from its predecessors? It's beginning to sound a bit like git
and what it does with all its little hashes of code fragments.
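The git-style idea, very roughly, is a content-addressed store: a quoted fragment is keyed by its hash, so identical quoted text is stored once no matter how many replies carry it. A toy sketch (an in-memory dict standing in for real on-disk storage):

```python
import hashlib

store = {}  # hash -> fragment text (content-addressed, git-style)

def put(fragment):
    # Identical fragments hash to the same key, so they are stored once;
    # a message can then reference fragments by hash instead of quoting.
    h = hashlib.sha1(fragment.encode()).hexdigest()
    store[h] = fragment
    return h

def get(h):
    return store[h]
```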

Just some thoughts.

Cheers,

Bob Edwards.
