Regular Expression Matching Can Be Simple And Fast

With Peter's observation that everything good in Computer Science happened during the "Golden Age" freshly in mind, I found Russ Cox's recent article on regular expressions to be enjoyable reading.

This is a tale of two approaches to regular expression matching. One of them is in widespread use in the standard interpreters for many languages, including Perl. The other is used only in a few places, notably most implementations of awk and grep. The two approaches have wildly different performance characteristics ... The trends shown in the graph continue: the Thompson NFA handles a 100-character string in under 200 microseconds, while Perl would require over 10^15 years.

Combining implementation details, finite automata, and a foray into decades-old theory, this article shows how most of our favorite little languages have an enormous performance bottlenecks for certain categories of string comparisons.

An additional data point: The Shootout benchmarks have a large string comparison test. It's interesting that Tcl is at the top of the heap for performance. Guess which one is using the Thompson NFA algorithm for regular expressions?

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

...and occasionally buggy

Chris Kuklewicz has been doing interesting work with regular expressions in Haskell recently, motivated by the standard Haskell regex module, which is terribly slow, and strict to boot. He found a few moderately serious problems with the popular TRE, which has a name for being robust and very fast.

Details around the middle. It's not clear to me how he found these problems, alas. If he used the fabulous QuickCheck, that would be interesting to know.

Summoned

This seems like the best moment to stop lurking and post.

Short answer: QuickCheck caught errors in TRE and OS X's regex.h

Answer to different question: I had written a response to Russ Cox's page as part of the Haskell wiki.
and Russ Cox and I have been slowly discussing this via e-mail.

Longer answer:

After wrapping posix regex.h and libpcre in Haskell, I got a tip on the mailing list to do the same for TRE. Which had a very similar API, so that was quickly done. So I had several C backends.

Later I started replacing the impossibly slow regex.h backend library that comes with the GHC compiler. So I naturally wanted to use the fast methods of TRE. Using just the thesis and papers, I built regex-tdfa in pure Haskell. In debugging my code I wrote QuickCheck code to generate test regexps and test strings and compare the results of the different backends (there was no canonical right answer for the tests). And this quickly killed TRE, which has bugs relating to empty alternatives as well as anchors (^ and $). And TRE has a Posix design issue with respect to ambiguous matching with * operators.

And QuickCheck also found a bug in the regex.h library that comes with OS X (latest 10.4.8) that in turn affects programs like "sed". This bug can be triggered by "((a)|(b)){2,}" against "ab".

Currently, my regex-tdfa code works like TRE without (known) bugs. In the future the discussion with Russ Cox may lead to a DFA that efficiently employs the left-biased rule of Perl matching.

Henry Spencer wrote an article

stating that one reason he chose his particular implementation method was because backreferences became trivial to handle, and this article basically says the same thing -

"if your regex has no backreferences, your engine should switch to a DFA" , or some such -

I thought the TCL 8 regex implementation was also by Spencer, and it's a DFA that allows some backreferencing.

Link?

Do you happen to have a pointer to the Henry Spencer article you mentioned? Thanks.

the article I'm thinking of was in a book, not electronic

edited IIRC by Doug Gwyn, bunch of articles explaining some well known pieces of C code.

I've been trying to remember its name for a while ...

Software Solutions in C, ed Dale Schumacher,

Academic Press, 1998

I lost the floppy then I lost the book.

Gwyn had an article in it as well, he wasn't the editor as I had mis-remembered. That's probably why I couldn't find it earlier.

Left-bias

The Perl matching semantics are to choose the left-biased match. Rather than poorly re-designing the wheel, does anyone know an NFA or DFA using algorithm that can be employed to find the left-biased match?

Biased regex matching

The only algorithm I know of involves running an NFA right-to-left over the input; for every symbol of the input, you store the set of NFA states for that symbol. Then you run the NFA again left-to-right in a deterministic mode; you can use the right-to-left symbol states to determine which transitions would lead to success in your left-to-right pass. (The biggest annoyance is if your NFA has loops with only epsilon transitions; you need to ignore such loops in your left-to-right pass. The Frisch and Cardelli paper has one solution for ignoring such loops, involving duplicating every state in your NFA; much simpler techniques can also be used.)

I know of three very different presentations of this basic idea: Greedy Regular Expression Matching (by Frisch and Cardelli), Automatic construction of parse trees for lexemes (by Dubé and Kadiri), and Extending Regular Expressions with Context Operators and Parse Extraction (by Kearns).

Ignoring loops

Since the 2-pass algorithm requires to store sets of states for each symbol of the input, it's possible to replace the first pass with running the NFA forward (left-to-right) with backtracking and memoization, on demand (as required by the second pass to drive its choices). Also, looking ahead for a few symbols might be enough to resolve most choices.

Note that the three papers you mention are interesting in extracting complete parse trees. The regexp packages are usually interested in much weaker semantics that can only extract a finite number of fragments from the input string.

I'm curious about the simpler techniques you describe for dealing with loops of epsilon transitions (and whether they preserve the greedy semantics).

Ignoring loops

I've been meaning to do a proper writeup of this ever since I talked to you at ICFP, but haven't gotten around to it... So, quickly:

We translate a regular expression to an NFA. For every state with multiple outgoing epsilon transitions, we label the transitions 1,...,n such that lower numbers are preferred (for (A|B), the transition to A is preferred over the transition to B, for A* the transition from the end of A to the beginning of A is preferred, and for A*? the transition from the end of A to the successor is preferred).

Now we restrict ourselves to maximal subgraphs of the NFA containing only epsilon transitions. In the second, left-to-right pass, we will have one of these subgraphs, a start node (the current NFA state), and a set of end nodes (the NFA states which will allow us to finish the match, as recorded by the right-to-left pass).

We would like to find the lexicographically smallest path from the start node to one of the end nodes, where edges are compared according to their labels. Unfortunately, in the presence of cycles, it is possible that every path will have a lexicographically smaller path; for instance, in the regular expression (a*?)*, the "preferred" choice would be to match the inner expression against epsilon infinitely often.

Instead, we revise our goal to finding the lexicographically smallest cycle-free path from the start node to one of the end nodes. I think it's likely that there's a polynomial-time algorithm for this problem, but my implementation actually uses a stupid, exponential-time algorithm: find all cycle-free paths from the start node to one of the end nodes using depth-first search (there are at most exponentially many), and then pick the lexicographically smallest.

I have not been able to prove that this always gives the same result as the algorithm in your paper (especially in the presence of non-greedy regexp operators), but I believe that if they differ in some cases, my algorithm actually gives preferable results (because it will be easier to document).

This is a very rushed presentation; let me know if it was insufficient.

NFA construction

Hmmm... "(a*?)*" is equivalent to "()" according to the pcre library.

The problem with ((a)*?)* is much like with a naive ((a)*)* where you might create an epsilon loop. Epsilon transition between states can be avoided when building the NFA.

For instance: the Posix NFA that my regex-tdfa Haskell package constructs avoids this cycle problem. The only epsilon-like transitions are from a normal state to the winning state (with no further transitions). This is equivalent to marking some states as final states.

I did not get my algorithm from a reference, so I assume I reinvented someone's wheel. It is definitely not Thompson's NFA -- it looks more like a Glushkov NFA. But I can have fewer states, which was harder for me to come across elsewhere until today when I found "Constructing NFAs by optimal use of positions in regular expressions" and the follow on, also by Ilie and Yu. My algorithm is not as powerful as the second paper, but seems equivalent to the pictures in the first one.

The trick with "((a)*)*" and similar "((a)*?)*" is that there are two ways (not infinite ways) to transition after accepting "a" to accepting "aa": you can iterate the inner or the outer *-operator. To keep track of tags needed for submatch capture, my NFA construction creates both transitions explicitly. I select between them after they are constructed by using a ordering function on the tag information that gives Posix semantics (I do this when building the DFA). There should also be an ordering function that produces left-biased semantics (for properly decorated transitions).

I don't have a formal analysis to go with my code, but it wrote it as three passes over the parse tree of the regular expression (simplification, analysis, and NFA creation). For a regexp with N literals it creates at most N+1 NFA states, and often fewer.

For what it's worth...

I don't believe that a Laurikari-style tagged (N)DFA algorithm can actually deliver Posix semantics, although I think the Laurikari made a significant contribution to the field. In my opinion, this should lead us to question Posix semantics rather than the value of Laurikari's algorithm.

In essence, Posix semantics ask that we (conceptually, at least), first find the correct substring of the target string -- that is, the longest prefix of the earliest suffix which matches the regular expression -- and then recursively apply a "best-match" procedure which takes a node in the AST for the regex and a substring, and includes at least the following rules:


best-match[ ( R ), X ]
  Set capture[i] to the start and end index of X
  best-match[R, X]

best-match[ R S, X ]
  Find all X', X'' such that X = X' X'', R matches X' and S matches X''
  Of those select the longest X' (or equivalently the last X'')
  best_match[R, X']
  best_match[S, X'']

best-match[ R | S, X]
  if R matches X then best-match[R, X]
  if R does not match T then best-match[S, X]

For the purposes of finding the best-match, we must treat concatenation as associating to the right; so in the case of R S T, we first find the longest X' which matches R and then the longest X'' which matches S, as opposed to first finding the longest X' which matches R S and then the longest X'' which matches R.

The rules are harder to interpret in the case of R+ and R* but it appears that we're supposed to treat these as a sort of concatenation; that is, we treat it as though it were R' R'... R for some number, possibly 0, of R', where R' is R with all tagging deleted, with the provision that R can only match the empty string if there are no R' in the concatenation.

If that's the case, then R+ must be right-associative as well, in which case the correct semantics should be:
best-match[ R+, X] -->
  if R matches X then best-match[R, X]
  if R does not match X then:
    find the longest prefix X' of X such that R matches X'
    and the corresponding suffix X'' such that X = X' X''
    best-match[R+, X'']

However, the Laurikari algorithm cannot make a decision based on the first match in a repetition; it can only choose on the basis of the last match: either the longest (earliest) last match, or the shortest (latest) last match. The T(N)DFA explicitly deletes tag information when it takes a transition corresponding to a repetition.

Now, consider the regex: (<|<a|<ab|<aba|abab|baba|b>|>)* matched against the target string <ababab...> If the target string has an even number of repetitions of ab then the possible parses are: <|abab+|> or <aba|baba+|b>. If the target string has an odd number of repetitions of ab then the possible parses are <a|baba+|b> and <ab|abab+|>. In the first case, opting for the shortest last match implies selecting the shortest first match; in the second case, selecting the shortest last match implies selecting the longest first match.

(I shared this example with Ville some years ago. Since then, Glenn Fowler has added a test to the testregex suite; see REP_LONGEST on the categorize.dat test.)

Now, does this matter? In other words, how much should we care about Posix semantics?

First of all, the testregex page demonstrates that well-respected regex implementations differ in corner cases such as match priority inside repetitions. So in practice, there does not seem to have been much demand for strict interoperability on the basis of Posix semantics. I suspect that this is largely because it is very rare to actually make use of a capture inside a repetition, and even rarer to make use of such a capture when the regex is ambiguous. Most users of regex libraries probably wouldn't notice if captures inside repetitions were completely random (and that's why the bugs in libtre which were recently caught by Chris Kuklewicz went undetected for so long.)

Secondly, it seems unlikely that there is an algorithm which can comply with strict Posix requirements in unbounded storage. If there were some practical reason to prefer the Posix specification over the "longest-last match" (or "shortest-non-empty last match") rules, then that wouldn't matter, but since there doesn't seem to be a clear practical advantage one way or the other, the requirements ought to allow for an efficient implementation, such as tagged (N)DFAs.

Posix quite reasonably removed back-references from EREs, presumably to allow for efficient implementations, and despite the fact that there are practical uses for back-references. (Indeed, most regex libraries implement them anyway, which is arguably a design flaw since it means that an untrusted regex could be a DoS attack.)

Ambiguous * matching

I don't believe that a Laurikari-style tagged (N)DFA algorithm can actually deliver Posix semantics

You are correct: Ville's TRE does not handle Posix matching of ambiguous * captures. But with a small extension (that I have in my regex-tdfa) it does deliver Posix semantics.

The categorization page at Glen Fowler's site has this bug in the REP_LONGEST column as you said. All but 4 of the 22 implementations do manage the correct "first" longest semantics. As does my implementation, by augmenting Ville's tags with additional bookkeeping. Currently my code needs O(length of searched string) storage in the worst case, but I know how to reduce that to O(number of *-operators) by performing pre-comparisons. So strict Posix requirements do not appear to require horrendous storage.

I designed my Posix bookkeeping extension for ambiguous *-operator subpatterns without checking other implementations, so there may be an even more clever scheme out there.

Personally, I will also have an option in regex-tdfa to match the last-iteration is longest semantics that are easier (and faster) to do with tags (and perhaps the equally easy and symmetric last-is-shortest mode).

But I claim that people expect the Posix first-is-longest mode because they expect the matching to be greedy. The Posix first-is-longest also means that the *-operator could be replaced by a finite repetition of the sub-pattern without changing the submatching of the last iteration. This is a useful invariant.

I am still performance tuning my code (to get memory usage down) but I am too busy with real work this week to finish it yet.

I'm anxious to see the bounded storage implementation

The beauty of Ville's implementation is that the memory requirements of a match operation are known in advance. Adding O(length of searched string) -- I assume that it's actually O(number of matched repeats), which is effectively the same -- makes it impossible to make the guarantee that a match operation will not allocate memory, which is a useful guarantee in some environments.

But I claim that people expect the Posix first-is-longest mode because they expect the matching to be greedy.

Perhaps. I claim that, by and large, people don't expect anything of repeated submatches because they don't use them; that is, they are forced to put the () in the regex because Posix conflates precedence grouping with tagging, but they don't actually use the capture. In my opinion, that's the worst flaw in the Posix design.

I could be wrong, of course. But I think I could count on the fingers of one hand the number of times I've seen a program use such a capture.

The Posix first-is-longest also means that the *-operator could be replaced by a finite repetition of the sub-pattern without changing the submatching of the last iteration. This is a useful invariant.

Only in the case of a particular string for which you know the finite number of repetitions. Otherwise, there are all sorts of breakages of that invariant: for example, if the repetition can match the empty string, you can get an empty match in the case that there are not enough repetitions available to make up the finite count. Also, restricting the match to a particular finite count can eliminate the match which happened to work on fewer matches; for example: (aaaaaaaaaaaa|aa|a)+.

The shortest last-match rule allows you to replace (R)+ with R* (R), which I would argue is a more useful invariant. Similarly, longest last-match would replace (R)+ with R*? (R). One assumes that at least some people have that expectation of repeated captures; otherwise, there would have been more complaints about libtre's lack of Posix compatibility.

pre-comparisons

Looks like it's time for me too to register and stop lurking.

Like Rici, I'm anxious to see the version of your algorithm which fully complies to POSIX and does not require unbounded memory allocation during matching.

At one point I used a lot of time trying to figure out how it could be done. I have no proof (or even notes that I could find :-) but I remember that I convinced myself that it's not possible. It would be great if you can show that my intuition was wrong.

[Retracted]

[Retracted]

almost there!

In fact there is a linear-time version of what you described that behaves almost exactly like traditional backtracking (Perl-like).

The final observation is that, just like in the regular NFA/DFA simulation, you can cut off a path if it leads you to a state you've been in before. Then you just need to find the first match on the highest-priority (lexicographically smallest) path. So at any step there are at most k paths (k states in the NFA) active at any given time. If you keep them in a list ordered lexicographically, then the step() and addstate() functions in my article will preserve the lexicographic ordering already, without any need to track it explicitly.

When you find a match during step(), you can stop processing the rest of the current list -- they'll all be lower-priority matches if they ever match -- and thus you should record the match as the best so far. Once the list has dwindled to nothing, stop processing and return the most recently found match.

This basically mimics the traditional backtracking except that it runs all the choices in parallel instead of sequentially w/ backtracking. Running everything in parallel means that one can recognize when two different paths have converged and need not be explored separately anymore. This is harder to do with backtracking -- one must allocate a larger amount of storage to memoize with, like Perl is supposed to do these days but doesn't quite.

By changing the "do I record this match as best?" logic slightly, one could choose to record the leftmost-longest match instead of the left-biased, but still following the traditional backtracking rules for deciding submatch locations. This is what the Plan 9 regexp library turns out to do, and is an interesting compromise.

It turns out that this is not quite exactly like Perl in one particular case: if re can match the empty string and also other strings, then when matching re*, Perl always adds one final iteration during the * to match a trailing empty string. This is non-minimal and complicates the simulation a little, but I think it can be accomodated by allowing nodes to be visited at most twice during addstate() instead of at most once. So it's not quite cycle-free as it almost-cycle-free. This strikes me as pretty ugly, but hey!, it's Perl. ;-)

Here is simple Yacc/C source code implementing all of this. It is instructive to compare it against the original NFA implementation from the article. (Boy do I feel dirty posting C code to LTU!)

Thanks!

Thank you!

I have not yet fully understood your code, but it does look nicer than what I had been doing.

Perl regexes have always been a mystery to me

But I don't think that your code gets the longest match case right:

rlake@freeb:~/src/rscre$ ./a.out  "(a|ab)(a|ab)" "abab"
abab: (0,3)(0,2)(2,3)
rlake@freeb:~/src/rscre$ ./a.out -l "(a|ab)(a|ab)" "abab"
abab: (0,3)(0,2)(2,3)
rlake@freeb:~/src/rscre$ ./a.out -l "(a|ab)(ab|a)" "abab"
abab: (0,4)(0,2)(2,4)

silly option parsing bug, sorry

Sorry, that's what I get for cleaning up the code quickly before posting. The -l flag wasn't actually making it do "longest" matches:

--- nfa-perl.y.old	2007-02-21 08:21:32.000000000 -0500
+++ nfa-perl.y	2007-02-23 12:08:20.000000000 -0500
@@ -578,7 +578,7 @@
 			argv[1] = argv[0]; argc--; argv++;
 		}
 		if(argc > 1 && strcmp(argv[1], "-l") == 0){
-			matchtype = LeftmostBiased;
+			matchtype = LeftmostLongest;
 			argv[1] = argv[0]; argc--; argv++;
 		}
 		else if(argc > 1 && strcmp(argv[1], "-p") == 0){

Interesting...

The first paper looks *very* interesting, as it tackles ambiguity and the rules for picking the best result seems to be the left-biased greedy rules I am looking for!

The second paper has this line: "As we see
below, a single pair of a regular expression and a word may lead to
more than one parse tree. In such a case, the lexical analyzer is free
to return any of these." This makes it less interesting.

The third paper punts ambiguity : "Foster describes an algorithm for rewriting p1+ that can be adapted to solve this problem. "

I have not finished studying it, but the first paper does describe an initial right-to-left pass over the string to be searched. This is not practical in the general case of writing a regexp implementation, since one is looking for the leftmost match and scanning the whole input to find it is inefficient for large input sizes.

But I think I can mine the paper for some ideas.

Confused

I am confused as to what the conclusions of the article are. OK, the languages are all using differently efficient algorithms for RE matching. But the way to speed things up is pretty obvious, I think. DFAs are easier to handle, and if you do not need the full DFA, perform on-the-fly power-set construction. So, what am I missing?

...

What you're missing (me too) is why no one knows or does this!

That's not entirely fair. I do know why many don't -- you'd still need a backtracking implementation for backreferences, so you'd be carrying around two implementations to debug. However, the number of people who actually understood this seemed to be vanishingly small: there are web sites and book shelves are filled with discussion of regular expressions as "pathological" with no mention that this is really just an artifact of the implementation choice, not something fundamental. Even implementations that use a real Thompson-style NFA (aka DFA) still go quadratic trying to find out where the subexpressions matched, but that's a topic for another time.

I did consider titling the article "Regular Expressions Cannot Be Pathological".

Slow is in the eye of the user

What you're missing (me too) is why no one knows or does this!

I'm not so sure you're missing the reason, since you wrote in the article:

It may be hard to believe the graphs: perhaps you've used Perl, and it never seemed like regular expression matching was particularly slow. Most of the time, in fact, regular expression matching in Perl is fast enough.

So if it's fast enough most of the time, what's the motivation for changing it? For example, if the Perl world were awash in complaints about slow regexes, you can bet people would be examining and discussing the problem, and I suspect awareness of the Thompson-style NFA would quickly be raised. As the saying almost goes, necessity is the mother of rediscovery of invention.

Bounded Space and Posix

I have had Ville look this over, and now I posting my proposed bounded space Posix algorithm in more detail. I put it up at the haskell wiki at RegexpDesign.

The text file was not the best wiki formatting, so I will clean up the formatting when I get more time.

Nice

For those following along at home, I posted some longer comments on the libtre mailing list, in the thread Repeat histories

Thompson's algorithm was patented

The full text of the patent is here.
Dennis Ritchie's QED notes are here.
edit:
Russ,
I wasn't disputing your conclusion, just linking.
I'm not sure whether the algorithm was widely known. I just skimmed 17 compiler books, 7 cite Thompson's paper and only Compiler Design in C explains the algorithm.

... and the patent (granted 1971) expired in 1988

That's not why everyone is using slow algorithms.

That plus historical force

That plus historical force conceivably can account for a lot, though.

What about the broader case?

First off, let's settle on a single implementation to talk about. As the most widely used, I think the Perl Compatible Regular Expression (PCRE) library is a decent choice. It's not what Perl uses, but it's compatible to a degree that nothing else I'm aware of is, and it's used by several languages and a broad selection of tools.

So, the question is: can all of PCRE (including back-references, lookahead/behind assertions, recursive subexpressions, call-outs, etc.) be implemented more efficiently for the general case, or would such optimizations allow only for the optimization of pessimistic cases such as those provided in the paper, while rendering the more arcane, but oft-used features of PCRE less optimial? I'm not going to be shocked if there's a better way to match, but the PCRE and Perl community have been struggling with the real-world gap between the theory of regular languages and the practice of programmers trying to implement the kitchen sink using regexen for quite a few years now. I would imagine that there has been more than one attempt to improve performance, some of which have been successful, and some not.