[clug] awk or Perl regex question

fj.whittle at gmail.com
Sun Jul 21 03:29:02 UTC 2019


In Perl I'd be more tempted to extract the part needed and print it.
I'm assuming Unicode is needed to match the ’ in the O’SHEA example.
Should that be O'SHEA, with a plain ASCII apostrophe?  Steve did say
single quote...  Did the data come from MS Office?

This will do it in one regex, for more than just ASCII:

perl -CS -nE '/(?=\S) \b (?<surname> [\p{Lu}\x{2019}\s]+) (?<=\S) \s*
$/x and say $+{surname}' < names.txt

\p{Lu} matches any character with the Unicode Uppercase_Letter
general category, and \x{2019} is ’ (RIGHT SINGLE QUOTATION MARK).
-CS turns on UTF-8 encoding for the standard I/O streams (STDIN,
STDOUT and STDERR).

Of course this will still only work for writing systems where uppercase
is even a thing.  If any of your names are in e.g. Chinese (and not
pinyin) you're out of luck: nothing in them is uppercase, so the regex
won't match and those lines are skipped.

Or in Perl6:

perl6 -ne '/« $<surname> = <:Lu + [’\s]>+ » \s* $/ and put
$<surname>' < names.txt

(« is a start of word assertion, » end of word)

– Francis

Overcomplicating things since forever ago

On Sun, 2019-07-21 at 07:23 +1000, Kim Holburn via linux wrote:
> Does this do what you want?
> 
>  perl -p -e 's/\b[A-Z][a-z]+\b//g;s#^[/\s]*##;' < names.txt
> 
> I sent this but it never seemed to have arrived.  Perhaps filtered by
> AV?
> 
> > On 2019/Jul/20, at 6:08 pm, steve jenkin via linux
> > <linux at lists.samba.org> wrote:
> > 
> > In awk, I’m trying to remove First Names from Full Name strings.
> > There might be multiple first names, and alternatives separated by a
> > ‘/’.
> > 
> > Surnames are UPPERCASE and occur at the end of the string, and may
> > contain a single quote (O’SHEA) or a blank (DE SMETS).
> > 
> > Currently I’ve got a working version doing two different subs, the
> > first is unanchored, the second is anchored to the start of the
> > string (^)
> > 
> > 	sub(/Mc[A-Z][a-z]* /, "", A[1]); 
> > 	sub(/^([A-Z][a-z\047]*[ /])+/, "", A[1]);
> > 
> > I’ve tried this regex, unanchored and not, with ‘?’ for 0 or 1
> > repeats of the group or ‘*’ for 0 or more repeats.
> > 
> > 	(Mc)?([A-Z][a-z\047]*[ /])+
> > 
> > Any suggestions for other things to try?
> > 
> > --
> > Steve Jenkin, IT Systems and Design 
> > 0412 786 915 (+61 412 786 915)
> > PO Box 38, Kippax ACT 2615, AUSTRALIA
> > 
> > mailto:sjenkin at canb.auug.org.au
> > http://members.tip.net.au/~sjenkin
> > 
> > 
> > -- 
> > linux mailing list
> > linux at lists.samba.org
> > https://lists.samba.org/mailman/listinfo/linux
> 
> -- 
> Kim Holburn
> IT Network & Security Consultant
> T: +61 2 61402408  M: +61 404072753
> mailto:kim at holburn.net  aim://kimholburn
> skype://kholburn - PGP Public Key on request
> 
> 
> 