[clug] Help explaining a sed command to delete last 10 lines from a file

steve jenkin sjenkin at canb.auug.org.au
Tue Jan 14 00:03:17 MST 2014


We have gurus of many types on-list and I was hoping someone could explain just how this bit of sed magic works. I've confirmed it works as advertised, but would rather understand how it works before relying on it.
<http://www.unix.com/unix-advanced-expert-users/59631-delete-lines.html>

> sed -e ':a' -e '$d;N;2,10ba' -e 'P;D' filename
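
A quick way to check the behaviour on throwaway input, as a sketch only: the function name drop_last_10 is made up for illustration, and 'seq' is assumed to be available.

```shell
# Hypothetical wrapper around the one-liner above; reads the named files,
# or stdin when no file is given.
drop_last_10() {
    sed -e ':a' -e '$d;N;2,10ba' -e 'P;D' "$@"
}

# 15 lines in: only the first 5 should come out.
seq 1 15 | drop_last_10

# 8 lines in (fewer than 10): nothing should come out.
seq 1 8 | drop_last_10
```

The second call matters for my use: a file shorter than 10 lines comes out empty rather than raising an error.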

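There was also a version using 'tac' (reverse 'cat'); I have no idea whether it buffers in memory or in /tmp. As posted it used 'tail -n +10', which drops only the last 9 lines, so I've changed it to +11:

> tac filename | tail -n +11 | tac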
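
A quick check of the offset, as a sketch assuming GNU 'tac' and 'seq' are installed: 'tail -n +K' starts output at line K, so it skips K-1 lines of the reversed file. The helper name drop_last_tac is hypothetical.

```shell
# drop_last_tac N: remove the last N lines of stdin by reversing the input,
# skipping the first N lines, and reversing back.
drop_last_tac() {
    tac | tail -n +"$(( $1 + 1 ))" | tac
}

# Drops 10 lines: prints 1 through 5.
seq 1 15 | drop_last_tac 10

# For comparison, +10 skips only 9 lines: prints 1 through 6.
seq 1 15 | tac | tail -n +10 | tac
```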

I saw some interesting perl as well. My guess is that it reads the whole file into memory (the -0777 switch puts perl in "slurp" mode):

> perl -0777pe's/(?:.*\n){10}\z//'  filename

I haven't gone looking for a SED tutorial page - it's so ancient there are probably some great free resources around that'd tell me everything I wanted to know and more :(

As always, _better_ and/or alternative means, scripted or not, more than welcome.

I have ~250,000 email messages. I want to trim off some XML that has been added to the end of each message, then compute a hash of what remains so I can identify duplicates... "Just Works" is good. Not horrendously slow is better.
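
Roughly what I have in mind, as a sketch under assumptions: one message per file, the XML trailer is exactly the last 10 lines, 'sha256sum' is installed, and file names contain no whitespace. The function names hash_trimmed and find_dupes are made up for illustration.

```shell
# Print "<sha256>  <file>" for each file named on the command line,
# hashing everything except the last 10 lines.
hash_trimmed() {
    for f in "$@"; do
        h=$(sed -e ':a' -e '$d;N;2,10ba' -e 'P;D' "$f" | sha256sum | cut -d' ' -f1)
        printf '%s  %s\n' "$h" "$f"
    done
}

# Print every file whose trimmed content matches an earlier file's.
find_dupes() {
    hash_trimmed "$@" | awk 'seen[$1]++ { print $2 }'
}
```

For 250,000 files the names would have to be fed in batches (for example from 'find' via 'xargs') rather than as one argument list.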

Alternatively, I could just look for the Message ID header :)
Those should be unique.
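
A minimal sketch of that alternative, assuming the header sits on a single line (a folded header, with the value on a continuation line, is not handled). The function name message_id is hypothetical.

```shell
# Print the value of the first Message-ID header in the file.
# Only the header block is searched: scanning stops at the first blank line.
message_id() {
    awk '
        /^$/ { exit }
        tolower($0) ~ /^message-id:/ {
            sub(/^[^:]*:[ \t]*/, "")   # strip the header name and leading blanks
            print
            exit
        }
    ' "$1"
}
```

Sorting the collected IDs and running them through 'uniq -d' would then list the ones that occur more than once.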

cheers
steve

From the sed manpage:

The form of a sed command is as follows:

  [address[,address]]function[arguments]

-> 'Cycles'
Normally, sed cyclically copies a line of input, not including its terminating newline character, into a pattern space, (unless there is something left after a ``D'' function), applies all of the commands with addresses that select that pattern space, copies the pattern space to the standard output, appending a newline, and deletes the pattern space.

[2addr]d
Delete the pattern space and start the next cycle.

--> -e '$d' =>  delete the pattern space when the current line is the last line of input (addr = $)


[2addr]b[label]
Branch to the ``:'' function with the specified label.  If the label is not specified, branch to the end of the script.

-> the "2,10ba" fragment in the 2nd command string loops back to the 1st command, "-e ':a'", while the current line number is between 2 and 10

[2addr]N
Append the next line of input to the pattern space, using an embedded newline character to separate the appended material from the original contents.  Note that the current line number changes.

[2addr]P
Write the pattern space, up to the first newline character to the standard output.

[2addr]D
Delete the initial segment of the pattern space through the first newline character and start the next cycle.

--> -e 'P;D'  command. Write then delete and start next cycle. Cycles???? Is that just 'next line'?

=> I'm not seeing how it gets to the last line of the file and then backtracks...
   _or_ is it running the loop on every line in the file and somehow buffering?
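
One way to read the script, offered as a sketch rather than a definitive account: it keeps a sliding window of up to 11 lines in the pattern space, prints the oldest one each time a new line arrives, and throws away the 10 that remain once the last line has been read. This relies on the manpage text quoted above, where 'D' leaves the rest of the pattern space in place for the next cycle. The function name is made up for illustration.

```shell
# The same one-liner, spread over several lines with sed comments.
drop_last_10_annotated() {
    sed '
# Label "a": the target of the branch below.
:a
# If the line most recently read is the last line of input, delete the
# whole pattern space without printing it. No input is left, so sed ends.
$d
# Otherwise append the next input line to the pattern space.
N
# While the current line number is 2 through 10, jump back to "a".
# The pattern space therefore grows to 11 lines before anything prints.
2,10ba
# Print the oldest buffered line (everything up to the first newline) ...
P
# ... then remove it and restart the script with what is left, WITHOUT
# reading a new line, because the pattern space is not empty.
D
' "$@"
}

seq 1 15 | drop_last_10_annotated
```

So there is no backtracking: each line is printed 10 lines late, and the last 10 are still in the buffer, unprinted, when '$d' fires.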


steve jenkin
sjenkin at canb.auug.org.au




