[clug] OT meta-regex question
Paul Bryan
paul at pabryan.mine.nu
Tue Feb 3 05:30:35 GMT 2004
On Tue, Feb 03, 2004 at 02:43:39PM +1100, Paul Bryan wrote:
> On Tue, Feb 03, 2004 at 02:24:22PM +1100, Gough, Chris wrote:
> > For example:
> > A = '^.*$'
> > B = '^foo$'
>
> C = '(A|B)'
>
> or more clearly:
>
> C = '(^.*$|^foo$)'
>
> aPatternExists(C)
Sorry, that was complety brain-dead. It doens't even come close to your
question. I think it's time for a siesta.
Back to the question at hand. I think it would be possible to do what your
talking about. Unfortunately, I don't know of a solution. So, grab the source
for libregex and off we go. Does the thought of this make your brain feel
any better? It doesn't help my brain any.
Seriously though, I think that's where you might have to look. What I'm
thinking is, regex parsers normally walks through the regex and the string to
be matched together (I think that's right from memory). What you want is for it
to walk through both regex'es together and identify the set of possible matches
for each regex as it goes along. As soon as the two sets no longer coincide,
you know they won't match.
This is almost coceptually simple, but given regex syntax and the joy of
parsing it against a string, not so simple. I think the main problem is that
regexes aren't normally parsed it one step. I believe the process often
involves walking back through the string and/or regex itself a number of times
depending on the regex. How does this affect you?
Well, some sets of matching strings are pretty straight forward, e.g. ".*" - no
problem there. Some sets are fairly self-contained. If you had "^abc.*", then
the set of matching strings begins with "abc", after that, it doesn't care.
You could match this entire set directly against the second regex no problem.
But matching more complex sets against the second regex is where you come
unstuck.
I think the trick is finding a way to represent the entire set with a string
that you could then match against the second regex. Rember, regex'es are fairly
linear so once a portion is matched, you can move on. Of course, there is the
possiblity for branching with the "|" character. I think your in the realm of
some pretty hard-core recursion here. This is where walking back up and down
the regex will be vital.
I haven't even mentioned greediness modifiers yet ;)
How's your brain feeling now?
As I said I haven't heard of anything that already does this, but I do think
it could be done. In what time frame and by whom, I can't say :(
Cheers,
Paul.
p.s. sorry for going on so much. Just wanted to try and make up for the dodgy
first post!
More information about the linux
mailing list