Regular Expression Matching Can Be Simple And Fast

With Peter's observation that everything good in Computer Science happened during the "Golden Age" freshly in mind, I found Russ Cox's recent article on regular expressions to be enjoyable reading.

This is a tale of two approaches to regular expression matching. One of them is in widespread use in the standard interpreters for many languages, including Perl. The other is used only in a few places, notably most implementations of awk and grep. The two approaches have wildly different performance characteristics ... The trends shown in the graph continue: the Thompson NFA handles a 100-character string in under 200 microseconds, while Perl would require over 10^15 years.

Combining implementation details, finite automata, and a foray into decades-old theory, this article shows how most of our favorite little languages have an enormous performance bottleneck for certain categories of string comparisons.

An additional data point: The Shootout benchmarks have a large string comparison test. It's interesting that Tcl is at the top of the heap for performance. Guess which one is using the Thompson NFA algorithm for regular expressions?


...and occasionally buggy

Chris Kuklewicz has been doing interesting work with regular expressions in Haskell recently, motivated by the standard Haskell regex module, which is terribly slow, and strict to boot. He found a few moderately serious problems with the popular TRE, which has a name for being robust and very fast.

Details are around the middle of that page. It's not clear to me how he found these problems, alas. If he used the fabulous QuickCheck, that would be interesting to know.

Summoned

This seems like the best moment to stop lurking and post.

Short answer: QuickCheck caught errors in TRE and OS X's regex.h

Answer to a different question: I had written a response to Russ Cox's page as part of the Haskell wiki, and Russ Cox and I have been slowly discussing this via e-mail.

Longer answer:

After wrapping Posix regex.h and libpcre in Haskell, I got a tip on the mailing list to do the same for TRE, which had a very similar API, so that was quickly done. That gave me several C backends.

Later I started replacing the impossibly slow regex.h backend library that comes with the GHC compiler, and I naturally wanted to use TRE's fast methods. Using just the thesis and papers, I built regex-tdfa in pure Haskell. While debugging my code I wrote QuickCheck code to generate test regexps and test strings and to compare the results of the different backends (there was no canonical right answer for the tests). This quickly killed TRE, which has bugs relating to empty alternatives as well as anchors (^ and $). TRE also has a Posix design issue with respect to ambiguous matching with * operators.

And QuickCheck also found a bug in the regex.h library that comes with OS X (latest 10.4.8) that in turn affects programs like "sed". This bug can be triggered by "((a)|(b)){2,}" against "ab".
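For anyone who wants to poke at this themselves, here is a minimal C harness (nothing OS X specific, just POSIX regex.h) that feeds that trigger to whatever regexec() the system provides and prints the submatch offsets it reports. Which offsets are the *correct* ones is the contested part; comparing this output across libcs is the point.

#include <regex.h>
#include <stdio.h>

int main(void)
{
    regex_t re;
    regmatch_t m[4];
    const char *pattern = "((a)|(b)){2,}";
    const char *subject = "ab";

    if (regcomp(&re, pattern, REG_EXTENDED) != 0) {
        fprintf(stderr, "regcomp failed\n");
        return 1;
    }
    if (regexec(&re, subject, 4, m, 0) == 0) {
        /* m[0] is the whole match, m[1..3] the parenthesised groups */
        for (int i = 0; i < 4; i++)
            printf("group %d: (%d,%d)\n", i, (int)m[i].rm_so, (int)m[i].rm_eo);
    } else {
        printf("no match\n");
    }
    regfree(&re);
    return 0;
}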

Currently, my regex-tdfa code works like TRE without (known) bugs. In the future the discussion with Russ Cox may lead to a DFA that efficiently employs the left-biased rule of Perl matching.

Henry Spencer wrote an article

stating that one reason he chose his particular implementation method was that backreferences became trivial to handle, and this article basically says the same thing:

"if your regex has no backreferences, your engine should switch to a DFA", or some such.

I thought the TCL 8 regex implementation was also by Spencer, and it's a DFA that allows some backreferencing.

Link?

Do you happen to have a pointer to the Henry Spencer article you mentioned? Thanks.

the article I'm thinking of was in a book, not electronic

edited, IIRC, by Doug Gwyn; a bunch of articles explaining some well-known pieces of C code.

I've been trying to remember its name for a while ...

Software Solutions in C, ed Dale Schumacher,

Academic Press, 1998

I lost the floppy, then I lost the book.

Gwyn had an article in it as well; he wasn't the editor, as I had misremembered. That's probably why I couldn't find it earlier.

Left-bias

The Perl matching semantics are to choose the left-biased match. Rather than poorly reinventing the wheel: does anyone know of an NFA- or DFA-based algorithm that can be employed to find the left-biased match?

Biased regex matching

The only algorithm I know of involves running an NFA right-to-left over the input; for every symbol of the input, you store the set of NFA states for that symbol. Then you run the NFA again left-to-right in a deterministic mode; you can use the right-to-left symbol states to determine which transitions would lead to success in your left-to-right pass. (The biggest annoyance is if your NFA has loops with only epsilon transitions; you need to ignore such loops in your left-to-right pass. The Frisch and Cardelli paper has one solution for ignoring such loops, involving duplicating every state in your NFA; much simpler techniques can also be used.)

I know of three very different presentations of this basic idea: Greedy Regular Expression Matching (by Frisch and Cardelli), Automatic construction of parse trees for lexemes (by Dubé and Kadiri), and Extending Regular Expressions with Context Operators and Parse Extraction (by Kearns).
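To make the two-pass idea concrete, here is a toy C sketch (mine, not code from any of the papers above). It hard-codes a small epsilon-free NFA for the whole-string match of (aa|a)(a|aa), so the epsilon-loop annoyance never arises; the backward pass records which states can still reach acceptance from each input position, and the forward pass then simply takes the highest-priority edge those sets allow.

#include <stdio.h>
#include <string.h>

enum { NSTATES = 5, ACCEPT = 4, MAXEDGE = 2, MAXLEN = 63 };

struct edge { char ch; int to; };

/* Epsilon-free NFA for the whole-string match of (aa|a)(a|aa).
 * Edges out of each state are listed in priority order (left-biased):
 *   group 1:  0 -a-> 1 -a-> 2  (preferred "aa")     0 -a-> 2  ("a")
 *   group 2:  2 -a-> 4         (preferred "a")      2 -a-> 3 -a-> 4  ("aa")
 * State 4 accepts.                                                         */
static const struct edge edges[NSTATES][MAXEDGE] = {
    [0] = { {'a', 1}, {'a', 2} },
    [1] = { {'a', 2} },
    [2] = { {'a', 4}, {'a', 3} },
    [3] = { {'a', 4} },
};

static void match(const char *s)
{
    int n = (int)strlen(s);
    int alive[MAXLEN + 1][NSTATES] = {{0}};
    int q = 0;

    printf("\"%s\": ", s);
    if (n > MAXLEN) { printf("too long for this toy\n"); return; }

    /* Pass 1 (right to left): alive[i][q] means "from state q, the suffix
     * s[i..] can still reach the accepting state".                         */
    alive[n][ACCEPT] = 1;
    for (int i = n - 1; i >= 0; i--)
        for (int p = 0; p < NSTATES; p++)
            for (int k = 0; k < MAXEDGE; k++)
                if (edges[p][k].ch == s[i] && alive[i + 1][edges[p][k].to])
                    alive[i][p] = 1;

    if (!alive[0][0]) { printf("no match\n"); return; }

    /* Pass 2 (left to right): now deterministic; take the first (highest
     * priority) edge whose target can still finish the match.              */
    for (int i = 0; i < n; i++)
        for (int k = 0; k < MAXEDGE; k++)
            if (edges[q][k].ch == s[i] && alive[i + 1][edges[q][k].to]) {
                printf("%d->%d ", q, edges[q][k].to);
                q = edges[q][k].to;
                break;
            }
    printf("accepted\n");
}

int main(void)
{
    match("aaa");   /* two parses exist; priority picks "aa" then "a"       */
    match("aa");    /* the greedy "aa" for group 1 is pruned by pass 1      */
    return 0;
}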

Ignoring loops

Since the 2-pass algorithm requires storing sets of states for each symbol of the input, it's possible to replace the first pass with running the NFA forward (left-to-right) with backtracking and memoization, on demand (as required by the second pass to drive its choices). Also, looking ahead a few symbols might be enough to resolve most choices.

Note that the three papers you mention are interested in extracting complete parse trees. Regexp packages are usually interested in much weaker semantics that only extract a finite number of fragments from the input string.

I'm curious about the simpler techniques you describe for dealing with loops of epsilon transitions (and whether they preserve the greedy semantics).

Ignoring loops

I've been meaning to do a proper writeup of this ever since I talked to you at ICFP, but haven't gotten around to it... So, quickly:

We translate a regular expression to an NFA. For every state with multiple outgoing epsilon transitions, we label the transitions 1,...,n such that lower numbers are preferred (for (A|B), the transition to A is preferred over the transition to B, for A* the transition from the end of A to the beginning of A is preferred, and for A*? the transition from the end of A to the successor is preferred).

Now we restrict ourselves to maximal subgraphs of the NFA containing only epsilon transitions. In the second, left-to-right pass, we will have one of these subgraphs, a start node (the current NFA state), and a set of end nodes (the NFA states which will allow us to finish the match, as recorded by the right-to-left pass).

We would like to find the lexicographically smallest path from the start node to one of the end nodes, where edges are compared according to their labels. Unfortunately, in the presence of cycles, it is possible that every path will have a lexicographically smaller path; for instance, in the regular expression (a*?)*, the "preferred" choice would be to match the inner expression against epsilon infinitely often.

Instead, we revise our goal to finding the lexicographically smallest cycle-free path from the start node to one of the end nodes. I think it's likely that there's a polynomial-time algorithm for this problem, but my implementation actually uses a stupid, exponential-time algorithm: find all cycle-free paths from the start node to one of the end nodes using depth-first search (there are at most exponentially many), and then pick the lexicographically smallest.
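For what it's worth, here is roughly what that brute-force search looks like in C, on a made-up epsilon-only subgraph with one start node, one end node, and a label-1 edge that closes a cycle: enumerate the cycle-free paths by depth-first search and keep the lexicographically smallest label sequence.

#include <stdio.h>

enum { NNODES = 4, MAXE = 2, START = 0, END = 3 };

struct edge { int label, to; };

/* edges out of each node; label 1 is preferred over label 2                */
static const struct edge g[NNODES][MAXE] = {
    [0] = { {1, 1}, {2, 3} },
    [1] = { {1, 0}, {2, 3} },   /* the label-1 edge closes the cycle 0->1->0 */
};

static int visited[NNODES];
static int cur[NNODES + 1], best[NNODES + 1];
static int curlen, bestlen = -1;

static int less(const int *a, int alen, const int *b, int blen)
{
    for (int i = 0; i < alen && i < blen; i++)
        if (a[i] != b[i])
            return a[i] < b[i];
    return alen < blen;
}

static void dfs(int node)
{
    if (node == END) {
        if (bestlen < 0 || less(cur, curlen, best, bestlen)) {
            for (int i = 0; i < curlen; i++) best[i] = cur[i];
            bestlen = curlen;
        }
        return;
    }
    visited[node] = 1;
    for (int k = 0; k < MAXE; k++) {
        const struct edge *e = &g[node][k];
        if (e->label == 0 || visited[e->to])
            continue;           /* empty slot, or would repeat a node       */
        cur[curlen++] = e->label;
        dfs(e->to);
        curlen--;
    }
    visited[node] = 0;
}

int main(void)
{
    dfs(START);
    printf("smallest cycle-free path takes labels:");
    for (int i = 0; i < bestlen; i++) printf(" %d", best[i]);
    printf("\n");               /* here: 1 2 (via node 1), not the lone 2   */
    return 0;
}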

I have not been able to prove that this always gives the same result as the algorithm in your paper (especially in the presence of non-greedy regexp operators), but I believe that if they differ in some cases, my algorithm actually gives preferable results (because it will be easier to document).

This is a very rushed presentation; let me know if it was insufficient.

NFA construction

Hmmm... "(a*?)*" is equivalent to "()" according to the pcre library.

The problem with ((a)*?)* is much like that with a naive ((a)*)*, where you might create an epsilon loop. Epsilon transitions between states can be avoided when building the NFA.

For instance: the Posix NFA that my regex-tdfa Haskell package constructs avoids this cycle problem. The only epsilon-like transitions are from a normal state to the winning state (with no further transitions). This is equivalent to marking some states as final states.

I did not get my algorithm from a reference, so I assume I reinvented someone's wheel. It is definitely not Thompson's NFA -- it looks more like a Glushkov NFA. But my construction can have fewer states, something I had not come across elsewhere until today, when I found "Constructing NFAs by optimal use of positions in regular expressions" and the follow-on paper, also by Ilie and Yu. My algorithm is not as powerful as the second paper's, but seems equivalent to the pictures in the first one.

The trick with "((a)*)*" and the similar "((a)*?)*" is that there are two ways (not infinitely many) to go from having accepted "a" to having accepted "aa": you can iterate the inner or the outer *-operator. To keep track of the tags needed for submatch capture, my NFA construction creates both transitions explicitly. I select between them after they are constructed by using an ordering function on the tag information that gives Posix semantics (I do this when building the DFA). There should also be an ordering function that produces left-biased semantics (for properly decorated transitions).

I don't have a formal analysis to go with my code, but I wrote it as three passes over the parse tree of the regular expression (simplification, analysis, and NFA creation). For a regexp with N literals it creates at most N+1 NFA states, and often fewer.

New location for

New location for "Constructing NFAs by optimal use of positions in regular expressions" -- the original URL doesn't work, but it's still available from the web archive.

For what it's worth...

I don't believe that a Laurikari-style tagged (N)DFA algorithm can actually deliver Posix semantics, although I think Laurikari made a significant contribution to the field. In my opinion, this should lead us to question Posix semantics rather than the value of Laurikari's algorithm.

In essence, Posix semantics ask that we (conceptually, at least), first find the correct substring of the target string -- that is, the longest prefix of the earliest suffix which matches the regular expression -- and then recursively apply a "best-match" procedure which takes a node in the AST for the regex and a substring, and includes at least the following rules:


best-match[ ( R ), X ]
  Set capture[i] to the start and end index of X
  best-match[R, X]

best-match[ R S, X ]
  Find all X', X'' such that X = X' X'', R matches X' and S matches X''
  Of those select the longest X' (or equivalently the last X'')
  best_match[R, X']
  best_match[S, X'']

best-match[ R | S, X]
  if R matches X then best-match[R, X]
  if R does not match X then best-match[S, X]

For the purposes of finding the best-match, we must treat concatenation as associating to the right; so in the case of R S T, we first find the longest X' which matches R and then the longest X'' which matches S, as opposed to first finding the longest X' which matches R S and then the longest X'' which matches R.

The rules are harder to interpret in the case of R+ and R* but it appears that we're supposed to treat these as a sort of concatenation; that is, we treat it as though it were R' R'... R for some number, possibly 0, of R', where R' is R with all tagging deleted, with the provision that R can only match the empty string if there are no R' in the concatenation.

If that's the case, then R+ must be right-associative as well, in which case the correct semantics should be:
best-match[ R+, X] -->
  if R matches X then best-match[R, X]
  if R does not match X then:
    find the longest prefix X' of X such that R matches X'
    and the corresponding suffix X'' such that X = X' X''
    best-match[R+, X'']

However, the Laurikari algorithm cannot make a decision based on the first match in a repetition; it can only choose on the basis of the last match: either the longest (earliest) last match, or the shortest (latest) last match. The T(N)DFA explicitly deletes tag information when it takes a transition corresponding to a repetition.

Now, consider the regex: (<|<a|<ab|<aba|abab|baba|b>|>)* matched against the target string <ababab...> If the target string has an even number of repetitions of ab then the possible parses are: <|abab+|> or <aba|baba+|b>. If the target string has an odd number of repetitions of ab then the possible parses are <a|baba+|b> and <ab|abab+|>. In the first case, opting for the shortest last match implies selecting the shortest first match; in the second case, selecting the shortest last match implies selecting the longest first match.

(I shared this example with Ville some years ago. Since then, Glenn Fowler has added a test to the testregex suite; see REP_LONGEST on the categorize.dat test.)

Now, does this matter? In other words, how much should we care about Posix semantics?

First of all, the testregex page demonstrates that well-respected regex implementations differ in corner cases such as match priority inside repetitions. So in practice, there does not seem to have been much demand for strict interoperability on the basis of Posix semantics. I suspect that this is largely because it is very rare to actually make use of a capture inside a repetition, and even rarer to make use of such a capture when the regex is ambiguous. Most users of regex libraries probably wouldn't notice if captures inside repetitions were completely random (and that's why the bugs in libtre which were recently caught by Chris Kuklewicz went undetected for so long.)

Secondly, it seems unlikely that there is an algorithm which can comply with strict Posix requirements in bounded storage. If there were some practical reason to prefer the Posix specification over the "longest-last match" (or "shortest-non-empty last match") rules, then that wouldn't matter, but since there doesn't seem to be a clear practical advantage one way or the other, the requirements ought to allow for an efficient implementation, such as tagged (N)DFAs.

Posix quite reasonably removed back-references from EREs, presumably to allow for efficient implementations, and despite the fact that there are practical uses for back-references. (Indeed, most regex libraries implement them anyway, which is arguably a design flaw since it means that an untrusted regex could be a DoS attack.)

Ambiguous * matching

I don't believe that a Laurikari-style tagged (N)DFA algorithm can actually deliver Posix semantics

You are correct: Ville's TRE does not handle Posix matching of ambiguous * captures. But with a small extension (that I have in my regex-tdfa) it does deliver Posix semantics.

The categorization page at Glenn Fowler's site has this bug in the REP_LONGEST column, as you said. All but 4 of the 22 implementations do manage the correct "first is longest" semantics. As does my implementation, by augmenting Ville's tags with additional bookkeeping. Currently my code needs O(length of searched string) storage in the worst case, but I know how to reduce that to O(number of *-operators) by performing pre-comparisons. So strict Posix requirements do not appear to require horrendous storage.

I designed my Posix bookkeeping extension for ambiguous *-operator subpatterns without checking other implementations, so there may be an even more clever scheme out there.

Personally, I will also have an option in regex-tdfa to match the last-iteration-is-longest semantics that are easier (and faster) to do with tags (and perhaps the equally easy and symmetric last-is-shortest mode).

But I claim that people expect the Posix first-is-longest mode because they expect the matching to be greedy. The Posix first-is-longest also means that the *-operator could be replaced by a finite repetition of the sub-pattern without changing the submatching of the last iteration. This is a useful invariant.

I am still performance tuning my code (to get memory usage down) but I am too busy with real work this week to finish it yet.

I'm anxious to see the bounded storage implementation

The beauty of Ville's implementation is that the memory requirements of a match operation are known in advance. Adding O(length of searched string) -- I assume that it's actually O(number of matched repeats), which is effectively the same -- makes it impossible to make the guarantee that a match operation will not allocate memory, which is a useful guarantee in some environments.

But I claim that people expect the Posix first-is-longest mode because they expect the matching to be greedy.

Perhaps. I claim that, by and large, people don't expect anything of repeated submatches because they don't use them; that is, they are forced to put the () in the regex because Posix conflates precedence grouping with tagging, but they don't actually use the capture. In my opinion, that's the worst flaw in the Posix design.

I could be wrong, of course. But I think I could count on the fingers of one hand the number of times I've seen a program use such a capture.

The Posix first-is-longest also means that the *-operator could be replaced by a finite repetition of the sub-pattern without changing the submatching of the last iteration. This is a useful invariant.

Only in the case of a particular string for which you know the finite number of repetitions. Otherwise, there are all sorts of breakages of that invariant: for example, if the repetition can match the empty string, you can get an empty match in the case that there are not enough repetitions available to make up the finite count. Also, restricting the match to a particular finite count can eliminate the match which happened to work on fewer matches; for example: (aaaaaaaaaaaa|aa|a)+.

The shortest last-match rule allows you to replace (R)+ with R* (R), which I would argue is a more useful invariant. Similarly, longest last-match would replace (R)+ with R*? (R). One assumes that at least some people have that expectation of repeated captures; otherwise, there would have been more complaints about libtre's lack of Posix compatibility.

pre-comparisons

Looks like it's time for me too to register and stop lurking.

Like Rici, I'm anxious to see the version of your algorithm which fully complies to POSIX and does not require unbounded memory allocation during matching.

At one point I spent a lot of time trying to figure out how it could be done. I have no proof (or even notes that I could find :-) but I remember that I convinced myself that it's not possible. It would be great if you can show that my intuition was wrong.

[Retracted]

[Retracted]

almost there!

In fact there is a linear-time version of what you described that behaves almost exactly like traditional backtracking (Perl-like).

The final observation is that, just like in the regular NFA/DFA simulation, you can cut off a path if it leads you to a state you've been in before. Then you just need to find the first match on the highest-priority (lexicographically smallest) path. So at any step there are at most k paths (k states in the NFA) active at any given time. If you keep them in a list ordered lexicographically, then the step() and addstate() functions in my article will preserve the lexicographic ordering already, without any need to track it explicitly.

When you find a match during step(), you can stop processing the rest of the current list -- they'll all be lower-priority matches if they ever match -- and thus you should record the match as the best so far. Once the list has dwindled to nothing, stop processing and return the most recently found match.

This basically mimics the traditional backtracking except that it runs all the choices in parallel instead of sequentially w/ backtracking. Running everything in parallel means that one can recognize when two different paths have converged and need not be explored separately anymore. This is harder to do with backtracking -- one must allocate a larger amount of storage to memoize with, like Perl is supposed to do these days but doesn't quite.

By changing the "do I record this match as best?" logic slightly, one could choose to record the leftmost-longest match instead of the left-biased, but still following the traditional backtracking rules for deciding submatch locations. This is what the Plan 9 regexp library turns out to do, and is an interesting compromise.

It turns out that this is not quite exactly like Perl in one particular case: if re can match the empty string and also other strings, then when matching re*, Perl always adds one final iteration during the * to match a trailing empty string. This is non-minimal and complicates the simulation a little, but I think it can be accommodated by allowing nodes to be visited at most twice during addstate() instead of at most once. So it's not quite cycle-free; it's almost cycle-free. This strikes me as pretty ugly, but hey!, it's Perl. ;-)
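Before the full source linked below, a stripped-down toy sketch of just the ordered-list idea from above (ignoring the empty-string wrinkle) may help. The NFA is hard-wired for a|ab; threads are kept in priority order, and reaching the Match state while scanning the list records the match and discards the lower-priority threads, so "ab" matches only "a", as Perl would.

#include <stdio.h>
#include <string.h>

enum kind { Char, Split, Match };

struct state {
    enum kind k;
    int c;              /* for Char states                                  */
    int out, out1;      /* successor indices; out is the preferred branch   */
    int lastlist;       /* generation mark used by addstate()               */
};

/* regex a|ab:  0 = Split(1,2)  1 = 'a'->4  2 = 'a'->3  3 = 'b'->4  4 = Match */
static struct state st[] = {
    {Split, 0, 1, 2, 0},
    {Char, 'a', 4, 0, 0},
    {Char, 'a', 3, 0, 0},
    {Char, 'b', 4, 0, 0},
    {Match, 0, 0, 0, 0},
};

static int listid;

/* Append state n (following Split edges, preferred branch first) unless it
 * is already on this generation's list.                                     */
static void addstate(int *list, int *nl, int n)
{
    if (st[n].lastlist == listid)
        return;
    st[n].lastlist = listid;
    if (st[n].k == Split) {
        addstate(list, nl, st[n].out);
        addstate(list, nl, st[n].out1);
        return;
    }
    list[(*nl)++] = n;
}

/* Return the end offset of the left-biased match starting at text[0], -1 if none. */
static int run(const char *text)
{
    int clist[8], nlist[8], nc = 0, nn, matchend = -1;
    int n = (int)strlen(text);

    listid++;
    addstate(clist, &nc, 0);
    for (int i = 0; i <= n && nc > 0; i++) {
        listid++;
        nn = 0;
        for (int j = 0; j < nc; j++) {
            struct state *s = &st[clist[j]];
            if (s->k == Match) {        /* highest-priority thread matched:  */
                matchend = i;           /* record it and drop the rest       */
                break;
            }
            if (i < n && s->k == Char && s->c == text[i])
                addstate(nlist, &nn, s->out);
        }
        memcpy(clist, nlist, nn * sizeof nlist[0]);
        nc = nn;
    }
    return matchend;
}

int main(void)
{
    printf("a|ab on \"ab\": match ends at offset %d (it matches \"a\", as Perl does)\n",
           run("ab"));
    return 0;
}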

Here is simple Yacc/C source code implementing all of this. It is instructive to compare it against the original NFA implementation from the article. (Boy do I feel dirty posting C code to LTU!)

Thanks!

Thank you!

I have not yet fully understood your code, but it does look nicer than what I had been doing.

Perl regexes have always been a mystery to me

But I don't think that your code gets the longest match case right:

rlake@freeb:~/src/rscre$ ./a.out  "(a|ab)(a|ab)" "abab"
abab: (0,3)(0,2)(2,3)
rlake@freeb:~/src/rscre$ ./a.out -l "(a|ab)(a|ab)" "abab"
abab: (0,3)(0,2)(2,3)
rlake@freeb:~/src/rscre$ ./a.out -l "(a|ab)(ab|a)" "abab"
abab: (0,4)(0,2)(2,4)

silly option parsing bug, sorry

Sorry, that's what I get for cleaning up the code quickly before posting. The -l flag wasn't actually making it do "longest" matches:

--- nfa-perl.y.old	2007-02-21 08:21:32.000000000 -0500
+++ nfa-perl.y	2007-02-23 12:08:20.000000000 -0500
@@ -578,7 +578,7 @@
 			argv[1] = argv[0]; argc--; argv++;
 		}
 		if(argc > 1 && strcmp(argv[1], "-l") == 0){
-			matchtype = LeftmostBiased;
+			matchtype = LeftmostLongest;
 			argv[1] = argv[0]; argc--; argv++;
 		}
 		else if(argc > 1 && strcmp(argv[1], "-p") == 0){

Interesting...

The first paper looks *very* interesting, as it tackles ambiguity, and the rules for picking the best result seem to be the left-biased greedy rules I am looking for!

The second paper has this line: "As we see below, a single pair of a regular expression and a word may lead to more than one parse tree. In such a case, the lexical analyzer is free to return any of these." This makes it less interesting.

The third paper punts on ambiguity: "Foster describes an algorithm for rewriting p1+ that can be adapted to solve this problem."

I have not finished studying it, but the first paper does describe an initial right-to-left pass over the string to be searched. This is not practical in the general case of writing a regexp implementation, since one is looking for the leftmost match, and scanning the whole input to find it is inefficient for large inputs.

But I think I can mine the paper for some ideas.

Confused

I am confused as to what the conclusions of the article are. OK, the languages are all using regex-matching algorithms of differing efficiency. But the way to speed things up is pretty obvious, I think: DFAs are easier to handle, and if you do not need the full DFA, perform on-the-fly power-set construction. So, what am I missing?

...

What you're missing (me too) is why no one knows or does this!

That's not entirely fair. I do know why many don't -- you'd still need a backtracking implementation for backreferences, so you'd be carrying around two implementations to debug. However, the number of people who actually understood this seemed to be vanishingly small: there are web sites and book shelves filled with discussion of regular expressions as "pathological", with no mention that this is really just an artifact of the implementation choice, not something fundamental. Even implementations that use a real Thompson-style NFA (aka DFA) still go quadratic trying to find out where the subexpressions matched, but that's a topic for another time.

I did consider titling the article "Regular Expressions Cannot Be Pathological".

Slow is in the eye of the user

What you're missing (me too) is why no one knows or does this!

I'm not so sure you're missing the reason, since you wrote in the article:

It may be hard to believe the graphs: perhaps you've used Perl, and it never seemed like regular expression matching was particularly slow. Most of the time, in fact, regular expression matching in Perl is fast enough.

So if it's fast enough most of the time, what's the motivation for changing it? For example, if the Perl world were awash in complaints about slow regexes, you can bet people would be examining and discussing the problem, and I suspect awareness of the Thompson-style NFA would quickly be raised. As the saying almost goes, necessity is the mother of rediscovery of invention.

Bounded Space and Posix

I have had Ville look this over, and now I am posting my proposed bounded-space Posix algorithm in more detail. I put it up on the Haskell wiki at RegexpDesign.

The text file was not the best wiki formatting, so I will clean up the formatting when I get more time.

Nice

For those following along at home, I posted some longer comments on the libtre mailing list, in the thread Repeat histories.

Thompson's algorithm was patented

The full text of the patent is here.
Dennis Ritchie's QED notes are here.
edit:
Russ,
I wasn't disputing your conclusion, just linking.
I'm not sure whether the algorithm was widely known. I just skimmed 17 compiler books; 7 cite Thompson's paper, and only Compiler Design in C explains the algorithm.

... and the patent (granted 1971) expired in 1988

That's not why everyone is using slow algorithms.

That plus historical force

That plus historical force conceivably can account for a lot, though.

What about the broader case?

First off, let's settle on a single implementation to talk about. As the most widely used, I think the Perl Compatible Regular Expression (PCRE) library is a decent choice. It's not what Perl uses, but it's compatible to a degree that nothing else I'm aware of is, and it's used by several languages and a broad selection of tools.

So, the question is: can all of PCRE (including back-references, lookahead/behind assertions, recursive subexpressions, call-outs, etc.) be implemented more efficiently for the general case, or would such optimizations speed up only pathological cases like those in the paper, while rendering the more arcane but oft-used features of PCRE less optimal? I'm not going to be shocked if there's a better way to match, but the PCRE and Perl communities have been struggling with the real-world gap between the theory of regular languages and the practice of programmers trying to implement the kitchen sink using regexen for quite a few years now. I would imagine that there has been more than one attempt to improve performance, some successful and some not.

one of my few area of expertise

I am the author of a free software regexp matcher called "Rx". The most up-to-date version can be found deeply embedded in the source code for "GNU Arch" (a revision control system). Be cautious of certain widely distributed earlier versions (most commonly known as "GNU Rx") because they have some known, severe bugs.

My implementation is, for most purposes, among the top 3 performers among widely distributed regexp engines. Last I checked, anyway. None of the top 3 is a clear winner, performance-wise (it depends on the pattern and the string being tested).

I spent over a decade on that piece of software, with plenty of mistakes along the way. Naturally, I therefore regard it as a fascinating and widely under-appreciated and under-documented area of systems software. I am capable of, yet manage to mostly refrain from, boring fellow hackers for hours on end with observations about subtle details.

A few words about some highlights:

Careful distinction must be drawn, as comments above note, between true regular expressions (always equivalent to a DFA) and "regexps" as commonly found in the wild. "Regexps" have features like back-referencing and so forth that take them out of the regular language category, so simple conversion to a DFA is not an option.

Conversion of a true regular expression to a DFA is problematic for on-line algorithms, anyway, since certain patterns result in an exponentially larger number of DFA states. (This is why offline tools like unix's "lex" spend so much effort on state-reduction.)

The famous "dragon book" (compiler textbook) has a couple of paragraphs describing the lazy, on-the-fly construction of a DFA in a system that merely caches DFA states and transitions within a bounded amount of memory. Private communication from some of the old unix-lab Bell Labs folks tells me that is because they experimented with that approach - and ultimately gave up on it as "too complicated". Nevertheless, that technique works very nicely - mostly giving DFA performance and degrading to the performance of a back-tracking matcher when the DFA is too large and DFA state access patterns too irregular.

Extending a caching-lazy-DFA for true regular expressions to handle such things as back-referencing is tricky. I learned how via personal communication from Henry Spencer who described to me his invention of an algorithm he dubbed "recursive decomposition". What you can find in my Rx implementation is recursive decomposition augmented with a bunch of heuristics to speed it up in common situations.

Recursive decomposition implements the search needed for such things as back-tracking and the left-to-right bias of Posix regexps, but uses the lazy-DFA technique wherever it can. When recursive decomposition "blows up" performance-wise, it tends to blow up on that back-tracking, in time.

The algorithm currently used in GNU libc is an interesting variation: it does its back-tracking in parallel. When it "blows up" it blows up in space.

The linked article disappoints me a bit because it rehearses the Thompson construction, swell, but only mentions in passing how the non-regular features people want (like back-references) make everything harder.

There is a whole class of optimization techniques for these "more than regular, less than turing" languages that are strictly black arts in two senses: (1) How to implement them well is poorly documented; (2) How to use them while avoiding performance "blow-ups" is poorly documented.

Regexps (true regular expressions and the extended varieties) are under-utilized, imo. With a good regexp engine, you can really push the limits of these things and all kinds of crazy applications show up (think about adaptive real-time protocol-analyzers, for example). It'd be nice to see the field better documented but all we ever get are these starter articles that retrace the initial steps taken about 30 years ago.

-t

Pray continue...

I am capable of, yet manage to mostly refrain from, boring fellow hackers for hours on end with observations about subtle details.

The nice thing about forum software is it has infinite patience... and those who aren't interested can skip and those who are (like me) can delve.

One of my favourite tricks is to map arrays of completely different things onto chars and then feed the resulting string to a regexp engine.

I have played this trick with vectors bounding the edges of pixels, with tokens, and with non-terminal symbols in a grammar, and... let your imagination run riot.
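A tiny, self-contained example of the trick (the token kinds and their one-letter encoding here are invented for the illustration): map a token stream onto single characters and let plain POSIX regex.h search it.

#include <regex.h>
#include <stdio.h>

int main(void)
{
    /* token stream for: 1 + ( 2 * 3 ) - 4, one letter per token:
     * N = number, O = operator, P = open paren, Q = close paren            */
    const char *tokens = "NOPNONQON";
    regex_t re;
    regmatch_t m;

    /* "an open paren, then numbers separated by operators, then a close"   */
    if (regcomp(&re, "P(N(ON)*)Q", REG_EXTENDED) != 0)
        return 1;
    if (regexec(&re, tokens, 1, &m, 0) == 0)
        printf("parenthesised group spans tokens %d..%d\n",
               (int)m.rm_so, (int)m.rm_eo - 1);
    regfree(&re);
    return 0;
}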

The advent of Unicode (and other encodings) has broadened the atoms of regexps considerably beyond 7-bit ASCII.

I.e., the scope for this class of trick has grown massively.

Perl, and now Perl 6, have expanded regexps way beyond what you call "true regular expressions".

Packrat parsers and PEGs also open up new opportunities.

One of the Doh! moments everyone has when first encountering Unicode in regexps is that the [^abc] construct suddenly results in a vast set. Then you ask yourself: does [a-f] still mean what you think it means? (Yes, but only by the "accident" that Unicode was designed to include 7-bit ASCII as a subset.) How do I match all "word" characters? What does '.' (match any single character) mean in a variable-length encoding?

Now imagine if instead of eating characters (or codepoints) regex engines ate sequences of objects on which you could invoke predicates like "vowel?" "word?" "lower_case?" "white_space?"

Now imagine if we freed ourselves from limiting the predicates to talking just about characteristics of characters, but were _any_ predicate method...

Now imagine if we freed ourselves from limiting our interpretation of "adjacent" as "next to each other on a page" to meaning "partially ordered"...

Regexps engines are powerful beasts indeed... start your engines folks... the race is just beginning.

This approach is similar to

This approach is similar to the APL/J approach of formulating computational problems in such a way that they are solvable using matrix operators without explicit loops. I love the puzzle-solving flavor of these approaches...

I was kinda expecting you to prod Thomas to write more about implementing REs. I would sure love to read more (though LtU is probably not the best place for this; how about Software: Practice and Experience?).

flattery will get us.... where?

How about Software: Practice and Experience?

There are few things I would enjoy better as an expression of my being as a hacker than to take many hours over a few months for such a paper. Two big things are missing: One is money - I lack the financial freedom for such a thing by far. I worry (lately with increasing urgency) about my (relative to market) rent and day-to-day living expenses, as things stand. It's pathetic, I know. Color me blushing with embarrassment over that. The other is a need for a partner / writing mentor for that kind of project. I handle English fairly well, I think, and can from time to time knock out a really swell essay but I have, in spite of my advanced age of 43, precious little experience in doing good academic writing (actually, zero experience) and I would need kind help. I am largely socially disconnected from the community of hackers other than via forums like this so I have nobody to call upon for such a thing. That's slightly exaggerated in that I live near UC Berkeley and I know of one or maybe two faculty members whose doors I could (for good cause) knock upon and ask for some introductions / help. I'm not anxious to abuse those potential connections until there is a really convincingly good reason.

-t

Just do it

Technical writing isn't that hard. Make an outline that covers your results, flesh it out, add an abstract and references, and submit. Referees and editors will tell you what you need to fix. Ehud's right: S-P&E loves to publish this sort of stuff.

Berkeley, eh? Are you the Thomas Lord on Dohr Street? That's about a mile from here. (I'm up the hill, near Whole Foods.)

+ I am sure that if you have

+ I am sure that if you have a draft you want comments on, people here will be willing to help.

nike

I will add it to my list. Being (well, soon to be) between gigs, maybe it's worth a shot. Thank you for the encouragement.

Yes, I'm Thomas Lord of Dohr St. I guess that you are Duff of Eponymous Device? Small world. Maybe some day I can tell you funny stories about my first boss (Tilbrook). Quite a harsh taskmaster: when he found out peers and I were arguing Emacs v. VI, naturally it became a condition of employment that I learn QED :-)

-t

continuing, but barely

Regexp engines are, I agree, powerful beasts indeed.

One little "practice an experience" tid-bit I'll pass along, since you mentioned Unicode, is an element of my extending Rx from an 8-bit ASCII (say, Latin-1) matcher to a Unicode matcher. One thing that became interesting in that area was to compress various tables (e.g., of Unicode codepoint properties) into sparse arrays. Look at the code (again, found in GNU Arch).

-t

The linked article

The linked article disappoints me a bit because it rehearses the Thompson construction, swell, but only mentions in passing how the non-regular features people want (like back-references) make everything harder.

But he also mentions that they are hard in general: matching with back-references is NP-complete. I assume that the nature of this manifests in the Thompson construction as a doubling of the number of paths which have to be followed simultaneously at each step.

This isn't an implementation problem, though. I suppose the Thompson algorithm will only ever match truly regular regexps, but this isn't much of a big deal, because back-references, for example, can be implemented as additional constraints which eliminate paths to be followed. The Thompson algorithm creates a somewhat bigger set of admissible solutions which then has to be minimized. I don't see a reason why this isn't feasible.

complexity of regular language recognition

(edit: Nuts. I mis-threaded this. It is in response to Kay Schluehr, above.)

No. Thompson's construction deals with regular languages, and those are not NP-complete. It's the non-regular addition of back-references and sub-pattern position reporting to the pattern language that I believe Spencer showed was NP-complete (although I am not, blush, familiar with his proof).

For a handy upper bound on the complexity of matching a regular expression r against a string s, you can consider a simple algorithm that in O(|r|) steps converts r to an NFA, and a simple NFA algorithm that then takes O(|r| * |s|) steps to test for a match.

If you batch-convert r into a DFA in O(2^|r|) steps, giving Dn states, then you can match using that DFA in O(|s|) steps (or O(|s|*log(Dn)) if you are a stickler for the true cost of memory access).

If you do the "lazy, caching DFA construction" that I was talking about you wind up with an algorithm that floats between O(|s|) and O(|s|*|r|). You can get very close to O(|s|) in a lot of common cases without paying the O(2^|r|) worst case for building the DFA.

If you add in sub-expression position reporting and back-references, then (per Spencer) you are in NP-complete territory. The new complexity arises because you have to do a search. In the general case there can be more than one way to match a given parenthesized sub-expression. Among all the possible matches, you have to find the one that best satisfies a rule like "give preference to subexpressions on the left - give them the longest match that works - but if a non-longest match of them is necessary, use that". The added complexity doesn't have much of anything to do with the exponential size of a DFA for a true regular expression - it has to do with the size of the search space for aligning sub-pattern matches and handling back-references.

-t

No. Thompson's construction

No. Thompson's construction deals with regular languages and those are not NP complete.

There is some subtlety here. I do not assume that the number of states in the original NFA is somehow reduced to yield optimal runtime behavior of the regexp matcher. The Thompson NFA alone still only deals with a regular language but without state reductions in a grossly suboptimal way (exponential complexity by default). However, we permit this because the algorithm is augmented by predicates reflecting context-sensitive decisions, e.g. those of back-reference matching, which lead to a reduction of states at runtime - or do not, in the worst case. The latter means that we remain in NP territory.

In order to illustrate this, just consider the regexp example given here, provided along with the NP-completeness proof of regexp matching with back-references.

$regex = '^(x?)(x?)(x?).*;
          (?:\1|\2|\3x),
          (?:\1|\2x|\3),
          (?:\1x|\2x|\3),
          (?:\1x|\2x|\3x),'

When you replace each back-reference \1, \2, and \3 with the referred-to expression in parentheses, one would expect major reductions to be possible: for example, (?:\1|\2|\3x) just becomes (?:x?|x?|x?x), which can be reduced to (?:x?x?).

Without performing the reductions, all paths are taken. Those paths which are not permitted by back-reference matching are ruled out using predicates which are supplementary and correspond to non-regular extensions of regular expressions.

I hope the idea of how to deal with back-references when using Thompson NFAs has become a little clearer.

Complexity

The Thompson NFA alone still only deals with a regular language but without state reductions in a grossly suboptimal way (exponential complexity by default).

I don't understand what you're trying to say here. Thompson's algorithm doesn't involve any exponential behavior. Its complexity is bounded by the product of the length of the text and the length of the pattern.

implementing backreferences

A short answer here is that Duff is right. Thompson's algorithm does not handle patterns like the one you are asking about. Given a true regular expression, r, in O(|r|) steps it builds an NFA whose size is O(|r|). It matches a string s in O(|r| * |s|) steps. End of (that) story.

I see that you have some idea in mind for implementing back-references. In particular, you want to transform expressions by replacing backreferences, like "\1", with the pattern subexpression to which they refer, like "x?".

The problem is that that transformation is not, in the general case, valid. For example:

(x?)\1   is not equivalent to   (x?)x?

The first pattern will not match the string "x" but the second pattern will. A backreference does not mean "must match the same pattern as that earlier expression"; it means "must match the same string as was matched by that earlier pattern". Your proposed transform breaks that, as the example I gave shows.
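You can watch this happen with nothing more than POSIX regex.h, using BRE syntax (where "?" is spelled \{0,1\} and backreferences are allowed); on a conforming implementation the first pattern rejects "x" and the "substituted" one accepts it.

#include <regex.h>
#include <stdio.h>

static void try(const char *pattern)
{
    regex_t re;
    if (regcomp(&re, pattern, 0) != 0) {    /* cflags 0 = basic regex (BRE) */
        printf("%-28s failed to compile\n", pattern);
        return;
    }
    printf("%-28s %s \"x\"\n", pattern,
           regexec(&re, "x", 0, NULL, 0) == 0 ? "matches" : "does not match");
    regfree(&re);
}

int main(void)
{
    try("^\\(x\\{0,1\\}\\)\\1$");           /* (x?)\1   -- uses the backref  */
    try("^\\(x\\{0,1\\}\\)x\\{0,1\\}$");    /* (x?)x?   -- the "substitute"  */
    return 0;
}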

Here is a pattern (a "regexp" not a true regular expression) that is hard to match efficiently:

([ab]*)b*\1

That pattern matches a string of the form XYX where X is a possibly empty string of "a" and "b", and Y is a possibly empty string of "b". Note that the substring X must occur twice for the overall string to match.

Consider trying to match that against:

abbbbbabbb

A matcher, having seen the first "a", knows that the first parenthesized sub-expression matches at least one character. Upon encountering the first "b", there is a question: should that be matched by the "([ab]*)" part or by the "b*" part? The same question recurs for each of the first five "b" characters. Any algorithm that solves this must, in effect, do some searching to discover that the "([ab]*)" must match three "b" characters in this case. (And note that if the string were, instead, "abbbbbabb", then "([ab]*)" would have to match only two "b" characters.)

You might think that gnarly heuristics could find a clever way out of that trap. In the example pattern I just gave, a matcher might notice right away that by counting the number of "b" characters at the end it can quickly find out how many "b" characters the "([ab]*)" must match. If you or I were to hand-compile our example regexp to a C algorithm, we would have little trouble making a matcher that ran in O(|s|) time (for this one pattern).

But the problem quickly gets harder and harder. For example, consider:

(([ab]*)b*\2)*[ab]*\1b*

That's a fairly devilish one.

It is not so hard to write a "backtracking matcher" that will always give correct answers for expressions with backreferences. Unfortunately, such matchers have exponential performance in the length of the string being matched: O(n^|s|) for some n.

Spencer proved (it is said; again, I don't know his proof) that the general problem of doing better than backtracking - getting a polynomial rather than exponential matcher - is NP-complete. So if you come up with a polynomial algorithm then, by gosh, you will be a good candidate for a Turing Award or some such. There is probably challenge prize money out there, too. And, on the other hand: it is very, very, very unlikely you'll have such a breakthrough, and it may very well be impossible (but nobody has proved that). Conversely, if you prove that no polynomial algorithm is possible, you will again be rich and famous -- but again, best of luck on that unlikely endeavor.

Some History

This is some "just so" history but I think there is some literal truth to it.

The simple, naive way to implement true regular expressions in a language like C is with a backtracking search algorithm. That was the state of the art when Thompson wrote his paper. Even after Thompson's paper it remained common practice. Thompson's breakthrough was showing a simple technique to achieve O(|r|*|s|) performance rather than O(n^|s|).
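For concreteness, the naive approach looks roughly like this - a recursive backtracking matcher for just literals, '.', '*', '^' and '$', in the spirit of the well-known one in Kernighan and Pike's The Practice of Programming. Every '*' is settled by trial and error, re-running the rest of the match at each possible split point; that search is what blows up on hard patterns, and it is exactly what Thompson's construction avoids.

#include <stdio.h>

static int matchhere(const char *re, const char *text);

/* matchstar: match c*re at the beginning of text, trying longer and longer
 * runs of c (backtracking happens each time matchhere fails).              */
static int matchstar(int c, const char *re, const char *text)
{
    do {
        if (matchhere(re, text))
            return 1;
    } while (*text != '\0' && (*text++ == c || c == '.'));
    return 0;
}

/* matchhere: match re at the beginning of text */
static int matchhere(const char *re, const char *text)
{
    if (re[0] == '\0')
        return 1;
    if (re[1] == '*')
        return matchstar(re[0], re + 2, text);
    if (re[0] == '$' && re[1] == '\0')
        return *text == '\0';
    if (*text != '\0' && (re[0] == '.' || re[0] == *text))
        return matchhere(re + 1, text + 1);
    return 0;
}

/* match: search for re anywhere in text */
static int match(const char *re, const char *text)
{
    if (re[0] == '^')
        return matchhere(re + 1, text);
    do {
        if (matchhere(re, text))
            return 1;
    } while (*text++ != '\0');
    return 0;
}

int main(void)
{
    printf("%d\n", match("ab*c$", "abbbc"));   /* 1 */
    printf("%d\n", match("^a*b", "aaac"));     /* 0 */
    return 0;
}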

The theoretical complexity of regular expression matching is O(2^|r| + |s|). The O(2^|r|) part can be done in "batch" and, if you pay that cost just once for a given expression r, then thereafter you can match strings in O(|s|). That's what lex is for.

People had these exponential, backtracking-search regular expression algorithms around, and they noticed that it was easy to add certain non-regular features - such as backreferences and subexpression position reporting. You still have an exponential algorithm, but it can do more tricks. For simple patterns and short strings the poor worst-case performance doesn't matter much - so people were happy with those features.

Those features got codified in various contexts like POSIX, perl, and GNU Emacs. We're stuck with them for a long time to come.

As software grew more complex and computers grew larger, the problem of more complicated patterns and longer strings started to matter. So, several of us went off to try to find fast ways of matching those - e.g., to try to avoid exponential behavior.

Spencer then proved that we probably won't succeed at that and we probably won't anytime soon prove that we can't succeed.

Just so. The features that put us in the NP-complete category are features that were added because, even though they aren't true regular expression features, it was easy to hack them into early backtracking matchers.

Main Research Activities

The main implementation action in recent years has been concerned with four different problems:

1. Efficient batch-oriented DFA construction and state minimization. Some people work on making things like lex work better.

2. Efficient on-line true regular expression algorithms. Some work on tuning true regular expression algorithms with O(|r|*|s|) worst-case but O(|s|) common-case performance.

3. Tuning and extending the functionality of backtracking matchers. While these seem doomed to always have exponential performance people work on different space/time trade-offs and heuristics to make them fast in common cases. People work on adding new non-regular features to regexp languages.

4. Handling large character sets, Unicode in particular. Regexp languages typically require implementations to manipulate sets of characters. When the character set is ASCII or an 8-bit character set, there is no big deal. When the set of all characters is as large as Unicode (and when the encoding of characters in strings is correspondingly complicated) there are interesting performance challenges.

-t

Oh, yeah. Thanks a lot,

Oh, yeah. Thanks a lot, Thomas, for the corrections. Now I can see more clearly what I missed and that my assumptions were too simplistic.

In the general case of "... (P) ... \1", the back-reference \1 doesn't just depend functionally on a known string S being matched by P; P's matching of S also depends on what \1 will match.

I wonder if there are characteristics of a regexp that suppress those global interdependencies? Here is a first attempt:

In an expression:

... (P) ... \1

the back-reference \1 can be separated from P when the following holds:

1. P doesn't contain back-references, which means that it is self-contained.

2. It is possible to write the expression in the way

... L(P)R ... \1

where L and R are left and right delimiters of P, which means P has no characters in common with L and R. L can be empty when (P) is at the start of the expression.

The no back-reference condition can be checked syntactically.

In practice L and R might be patterns, and the delimiter condition is satisfied when the following is true:

LAST-SET(L) /\ FIRST-SET(P) = {}

LAST-SET(P) /\ FIRST-SET(R) = {}

This means that FIRST-NEXT ambiguities as we observe them in terms like "([ab]*)b*\1" are not permitted by the rules.

When, additionally, the following holds:

FIRST-SET(P) /\ LAST-SET(P) = {}

R is allowed to be empty as well and "... (P)\1 ..." is permitted.

I wonder if we lose much in terms of practicality when certain regexps are rejected at compile time? True, it isn't possible to solve graph coloring problems using regexps but maybe this isn't a big loss?

last word (?) in this subthread

I also, when time permits, contemplate pattern languages that are bigger than true regular expressions yet more constrained than, say, POSIX regexps - hoping to discover one that is practical by many metrics yet which also admits an efficient implementation. So, you are asking some kinda sane questions there, but I'm all out of answers for you. By policy, I understand LtU not to be quite a brainstorming forum and, in general, I don't have time for brainstorming on this particular topic at this particular time. So... go off and have fun. Read the version of Rx in GNU Arch. Read the crazy-ass space v. time reversal algorithm in GNU libc. Read Spencer's Tcl matcher. Have fun. Best of luck.

-t