Non greedy pattern match in sed

Kim Holburn kim.holburn at anu.edu.au
Tue Aug 13 22:03:01 EST 2002


At 9:35 PM +1000 2002/08/13, Paul Bryan wrote:
>-----BEGIN PGP SIGNED MESSAGE-----
>Hash: SHA1
>
>On Tue, 13 Aug 2002 21:15, Kim Holburn wrote:
>
>> If you use perl you can do this. From the perl manual (man perlre)
>>
>>        By default, a quantified subpattern is "greedy", that is,
>>        it will match as many times as possible (given a particu-
>>        lar starting location) while still allowing the rest of
>>        the pattern to match.  If you want it to match the minimum
>>        number of times possible, follow the quantifier with a
>>        "?".  Note that the meanings don't change, just the
>>        "greediness":
>>
>>            *?     Match 0 or more times
>>            +?     Match 1 or more times
>>            ??     Match 0 or 1 time
>>            {n}?   Match exactly n times
>>            {n,}?  Match at least n times
>>            {n,m}? Match at least n but not more than m times
>>
>>  .... | perl -p -e 's/\/\*.*?\*\//g'
>>
>
>I don't think this is quite right because the regex isn't matching mulitple
>instances of ".*"

Well,
1) it works
2) it is matching multiple instances of "." -  "*" means multiple instances of (whatever came first in this case ".") .  .*? means the smallest possible multiple instance of "." .   

Here is what it says in my perl book:

If you say /.*foo/, for example, it will try to match the maximal number of "any" characters (represented by the dot) clear out to the end of the line before it ever tries looking for "foo", and then when the "foo" doesn't match there (and it can't, because there's not enough room for it at the end of the string), the Engine will back off one character at a time until it finds a "foo".  If there is more than one "foo" in the line, it'll stop on the last one, and throw away all the shorter choices it could have made.

By placing a question mark after any of the greedy quantifiers, they can be made to choose the smallest quantity for the first try.  So if you say /.*?foo/, the .*? first tries to match 0 characters, then 1 character, then 2, and so on until it can match the "foo".  Instead of backtracking backward, it backtracks forward, so to speak, and ends up finding the first "foo" on the line instead of the last.


>
>What it's doing is looking for the last occurance of "*/". So using a
>quantifier like "?" isn't going to make any difference because it's the last
>"*/" that's causing the problem.
>
>In other words in the command:
>
>echo "/* This is a */ test /* line */" | sed 's/\/\*.*\*\//g'
>
>it finds "*/" before "test", but doesn't stop there! This is where the greedy
>part comes in to it. It keeps on looking to see if there is another occurance
>of "*/", which it finds at the end of the line. Of course ".*" still matches
>everything in between including the first occurance of "*/".
>
>Just thought I should try and clear that one up.
>
>Cheers,
>Paul.
>
>p.s.
>
>PHP has the "U" pattern modifier which specifies ungreedy matches. This stops
>looking once it finds the first "*/". The only problem there is that the ".*"
>part of the pattern will match the rest of the line including the second
>comment.
>- --
>Paul Bryan
>E-Mail: pa_bryan at yahoo.co.uk
>
>PGP Key
>http://www.keyserver.net:11371/pks/lookup?op=get&search=0xB1D405DA
>
>It's always darkest just before it gets pitch black.
>-----BEGIN PGP SIGNATURE-----
>Version: GnuPG v1.0.6 (GNU/Linux)
>Comment: For info see http://www.gnupg.org
>
>iD8DBQE9WO703qGyTLHUBdoRAuWRAKCJPz/Z5yl42Ay8X6xHnf/ZO0G7FACgqvZR
>RQ56SE0Tt8zpWxypoaNai0E=
>=4loX
>-----END PGP SIGNATURE-----


-- 
--
Kim Holburn  Network Consultant  P: +61 2 61258620 M: +61 0417820641
Email: kim.holburn at anu.edu.au - PGP Public Key on request

Life is complex - It has real and imaginary parts.
     Andrea Leistra (rec.arts.sf.written.Robert-jordan)



More information about the linux mailing list