[clug] OT meta-regex question

Paul Bryan paul at pabryan.mine.nu
Tue Feb 3 05:30:35 GMT 2004


On Tue, Feb 03, 2004 at 02:43:39PM +1100, Paul Bryan wrote:
> On Tue, Feb 03, 2004 at 02:24:22PM +1100, Gough, Chris wrote:
> > For example:
> >  A = '^.*$'
> >  B = '^foo$'
> 
> C = '(A|B)'
> 
> or more clearly:
> 
> C = '(^.*$|^foo$)'
> 
> aPatternExists(C)

Sorry, that was complety brain-dead. It doens't even come close to your 
question. I think it's time for a siesta.

Back to the question at hand. I think it would be possible to do what your
talking about. Unfortunately, I don't know of a solution. So, grab the source
for libregex and off we go. Does the thought of this make your brain feel
any better? It doesn't help my brain any.

Seriously though, I think that's where you might have to look. What I'm 
thinking is, regex parsers normally walks through the regex and the string to
be matched together (I think that's right from memory). What you want is for it
to walk through both regex'es together and identify the set of possible matches
for each regex as it goes along. As soon as the two sets no longer coincide, 
you know they won't match.

This is almost coceptually simple, but given regex syntax and the joy of 
parsing it against a string, not so simple. I think the main problem is that 
regexes aren't normally parsed it one step. I believe the process often 
involves walking back through the string and/or regex itself a number of times
depending on the regex. How does this affect you? 

Well, some sets of matching strings are pretty straight forward, e.g. ".*" - no
problem there. Some sets are fairly self-contained. If you had "^abc.*", then 
the set of matching strings begins with "abc", after that, it doesn't care. 
You could match this entire set directly against the second regex no problem.
But matching more complex sets against the second regex is where you come 
unstuck.

I think the trick is finding a way to represent the entire set with a string
that you could then match against the second regex. Rember, regex'es are fairly
linear so once a portion is matched, you can move on. Of course, there is the
possiblity for branching with the "|" character. I think your in the realm of 
some pretty hard-core recursion here. This is where walking back up and down
the regex will be vital.

I haven't even mentioned greediness modifiers yet ;)

How's your brain feeling now?

As I said I haven't heard of anything that already does this, but I do think
it could be done. In what time frame and by whom, I can't say :(

Cheers,
Paul.

p.s. sorry for going on so much. Just wanted to try and make up for the dodgy
first post!


More information about the linux mailing list