Regular Expression Matching Can Be Simple And Fast

With Peter's observation that everything good in Computer Science happened during the "Golden Age" freshly in mind, I found Russ Cox's recent article on regular expressions to be enjoyable reading.

This is a tale of two approaches to regular expression matching. One of them is in widespread use in the standard interpreters for many languages, including Perl. The other is used only in a few places, notably most implementations of awk and grep. The two approaches have wildly different performance characteristics ... The trends shown in the graph continue: the Thompson NFA handles a 100-character string in under 200 microseconds, while Perl would require over 10^15 years.

Combining implementation details, finite automata, and a foray into decades-old theory, this article shows how most of our favorite little languages have an enormous performance bottleneck for certain categories of string comparisons.

An additional data point: the Shootout benchmarks include a large string-comparison test. It's interesting that Tcl is at the top of the heap for performance. Guess which implementation uses the Thompson NFA algorithm for regular expressions?


...and occasionally buggy

Chris Kuklewicz has been doing interesting work with regular expressions in Haskell recently, motivated by the standard Haskell regex module, which is terribly slow, and strict to boot. He found a few moderately serious problems with the popular TRE, which has a name for being robust and very fast.

Details are around the middle of that page. It's not clear to me how he found these problems, alas. If he used the fabulous QuickCheck, that would be interesting to know.

Summoned

This seems like the best moment to stop lurking and post.

Short answer: QuickCheck caught errors in TRE and OS X's regex.h

Answer to a different question: I had written a response to Russ Cox's page as part of the Haskell wiki, and Russ Cox and I have been slowly discussing this via e-mail.

After wrapping posix regex.h and libpcre in Haskell, I got a tip on the mailing list to do the same for TRE, which had a very similar API, so that was quickly done. That gave me several C backends.

silly option parsing bug, sorry

Sorry, that's what I get for cleaning up the code quickly before posting. The -l flag wasn't actually making it do "longest" matches:

--- nfa-perl.y.old	2007-02-21 08:21:32.000000000 -0500
+++ nfa-perl.y	2007-02-23 12:08:20.000000000 -0500
@@ -578,7 +578,7 @@
 			argv[1] = argv[0]; argc--; argv++;
 		}
 		if(argc > 1 && strcmp(argv[1], "-l") == 0){
-			matchtype = LeftmostBiased;
+			matchtype = LeftmostLongest;
 			argv[1] = argv[0]; argc--; argv++;
 		}
 		else if(argc > 1 && strcmp(argv[1], "-p") == 0){


Interesting...

The first paper looks *very* interesting, as it tackles ambiguity, and the rules for picking the best result seem to be the left-biased greedy rules I am looking for!

The second paper has this line: "As we see below, a single pair of a regular expression and a word may lead to more than one parse tree. In such a case, the lexical analyzer is free to return any of these." This makes it less interesting.

The third paper punts on ambiguity: "Foster describes an algorithm for rewriting p1+ that can be adapted to solve this problem."

I have not finished studying it, but the first paper does describe an initial right-to-left pass over the string to be searched. This is not practical in the general case of writing a regexp implementation, since one is looking for the leftmost match and scanning the whole input to find it is inefficient for large input sizes.

But I think I can mine the paper for some ideas.

Confused

I am confused as to what the conclusions of the article are. OK, the languages are all using RE-matching algorithms of differing efficiency. But the way to speed things up is pretty obvious, I think. DFAs are easier to handle, and if you do not need the full DFA, perform on-the-fly power-set construction. So, what am I missing?
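By "on-the-fly power-set construction" I mean something like the following minimal sketch, where the current set of reachable NFA states plays the role of a DFA state. The NFA here is a hypothetical hand-built one for a(b|c)*d, just for illustration:

EPS = None  # epsilon label

# state -> list of (label, next_state); hand-built NFA for a(b|c)*d
nfa = {
    0: [('a', 1)],
    1: [(EPS, 2), (EPS, 4)],
    2: [('b', 3), ('c', 3)],
    3: [(EPS, 2), (EPS, 4)],
    4: [('d', 5)],
    5: [],
}
start, accept = 0, 5

def eps_closure(states):
    # All states reachable from `states` via epsilon transitions.
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for label, t in nfa[s]:
            if label is EPS and t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)

def match(text):
    current = eps_closure({start})      # one "DFA state" = a set of NFA states
    for ch in text:
        step = {t for s in current for label, t in nfa[s] if label == ch}
        current = eps_closure(step)
        if not current:                 # dead state; no match possible
            return False
    return accept in current

print(match("abcbd"))   # True
print(match("abca"))    # False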

...

What you're missing (me too) is why no one knows or does this!

That's not entirely fair. I do know why many don't -- you'd still need a backtracking implementation for backreferences, so you'd be carrying around two implementations to debug. However, the number of people who actually understood this seemed to be vanishingly small: web sites and bookshelves are filled with discussions of regular expressions as "pathological", with no mention that this is really just an artifact of the implementation choice, not something fundamental. Even implementations that use a real Thompson-style NFA (aka DFA) still go quadratic trying to find out where the subexpressions matched, but that's a topic for another time.

I did consider titling the article "Regular Expressions Cannot Be Pathological".

Slow is in the eye of the user

What you're missing (me too) is why no one knows or does this!

I'm not so sure you're missing the reason, since you wrote in the article:

It may be hard to believe the graphs: perhaps you've used Perl, and it never seemed like regular expression matching was particularly slow. Most of the time, in fact, regular expression matching in Perl is fast enough.

So if it's fast enough most of the time, what's the motivation for changing it? For example, if the Perl world were awash in complaints about slow regexes, you can bet people would be examining and discussing the problem, and I suspect awareness of the Thompson-style NFA would quickly be raised. As the saying almost goes, necessity is the mother of rediscovery of invention.

Bounded Space and Posix

I have had Ville look this over, and now I am posting my proposed bounded-space Posix algorithm in more detail. I put it up on the Haskell wiki at RegexpDesign.

The text file did not translate into the best wiki formatting, so I will clean up the formatting when I get more time.

Nice

For those following along at home, I posted some longer comments on the libtre mailing list, in the thread Repeat histories.

Thompson's algorithm was patented

The full text of the patent is here.
Dennis Ritchie's QED notes are here.
edit:
Russ,
I'm not sure whether the algorithm was widely known. I just skimmed 17 compiler books; 7 cite Thompson's paper, and only Compiler Design in C explains the algorithm.

... and the patent (granted 1971) expired in 1988

That's not why everyone is using slow algorithms.

That plus historical force

That plus historical force conceivably can account for a lot, though.

First off, let's settle on a single implementation to talk about. As the most widely used, I think the Perl Compatible Regular Expression (PCRE) library is a decent choice. It's not what Perl uses, but it's compatible to a degree that nothing else I'm aware of is, and it's used by several languages and a broad selection of tools.

So, the question is: can all of PCRE (including back-references, lookahead/behind assertions, recursive subexpressions, call-outs, etc.) be implemented more efficiently for the general case, or would such changes allow only for the optimization of pessimistic cases such as those shown in the article, while rendering the more arcane, but oft-used, features of PCRE less optimal? I'm not going to be shocked if there's a better way to match, but the PCRE and Perl communities have been struggling with the real-world gap between the theory of regular languages and the practice of programmers trying to implement the kitchen sink using regexen for quite a few years now. I would imagine that there has been more than one attempt to improve performance, some of which have been successful, and some not.

one of my few areas of expertise

I am the author of a free software regexp matcher called "Rx". The most up-to-date version can be found deeply embedded in the source code for "GNU Arch" (a revision control system). Be cautious of certain widely distributed earlier versions (most commonly known as "GNU Rx") because they have some known, severe bugs.

My implementation is, for most purposes, one of the top 3 performers among widely distributed regexp engines. Last I checked, anyway. None of the top 3 is a clear winner, performance-wise (it depends on the pattern and the string being tested).

I spent over a decade on that piece of software, with plenty of mistakes along the way. Naturally, I regard it as a fascinating and widely under-appreciated, under-documented area of systems software. I am capable of, yet manage to mostly refrain from, boring fellow hackers for hours on end with observations about subtle details.

A few words about some highlights:

Careful distinction must be drawn, as comments above note, between true regular expressions (always equivalent to a DFA) and "regexps" as commonly found in the wild. "Regexps" have features like back-referencing and so forth that take them out of the regular language category, so simple conversion to a DFA is not an option.

Conversion of a true regular expression to a DFA is problematic for on-line algorithms, anyway, since certain patterns result in an exponentially larger number of DFA states. (This is why offline tools like unix's "lex" spend so much effort on state-reduction.)

The famous "dragon book" (compiler textbook) has a couple of paragraphs describing the lazy, on-the-fly construction of a DFA in a system that merely caches DFA states and transitions within a bounded amount of memory. Private communication from some of the old unix-lab Bell Labs folks tells me that is because they experimented with that approach - and ultimately gave up on it as "too complicated". Nevertheless, that technique works very nicely - mostly giving DFA performance and degrading to the performance of a back-tracking matcher when the DFA is too large and DFA state access patterns too irregular.

Extending a caching-lazy-DFA for true regular expressions to handle such things as back-referencing is tricky. I learned how via personal communication from Henry Spencer who described to me his invention of an algorithm he dubbed "recursive decomposition". What you can find in my Rx implementation is recursive decomposition augmented with a bunch of heuristics to speed it up in common situations.

Recursive decomposition does the searching needed for such things as back-references and the left-to-right bias of Posix regexps, but uses the lazy-DFA technique wherever it can. When recursive decomposition "blows up" performance-wise, it tends to blow up on that back-tracking, in time.

The algorithm currently used in GNU libc is an interesting variation: it does its back-tracking in parallel. When it "blows up" it blows up in space.

The linked article disappoints me a bit because it rehearses the Thompson construction, swell, but only mentions in passing how the non-regular features people want (like back-references) make everything harder.

There is a whole class of optimization techniques for these "more than regular, less than Turing" languages that are strictly black arts in two senses: (1) how to implement them well is poorly documented; (2) how to use them while avoiding performance "blow-ups" is poorly documented.

Regexps (true regular expressions and the extended varieties) are under-utilized, imo. With a good regexp engine, you can really push the limits of these things and all kinds of crazy applications show up (think about adaptive real-time protocol-analyzers, for example). It'd be nice to see the field better documented but all we ever get are these starter articles that retrace the initial steps taken about 30 years ago.

-t

Pray continue...

I am capable of, yet manage to mostly refrain from, boring fellow hackers for hours on end with observations about subtle details.

The nice thing about forum software is it has infinite patience... and those who aren't interested can skip and those who are (like me) can delve.

One of my favourite tricks is to map arrays of completely different things onto chars and then feed the resulting string to a regexp engine.

I have played this trick with vectors bounding the edges of pixels, with tokens, and with non-terminal symbols in a grammar, and... let your imagination run riot.
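For a concrete toy example (hypothetical token names, using Python's re):

import re

# Map each token kind to one character, then regex the encoded token stream.
CODE = {'IDENT': 'I', 'NUMBER': 'N', 'PLUS': '+', 'LPAREN': '(', 'RPAREN': ')'}

tokens = ['IDENT', 'PLUS', 'LPAREN', 'NUMBER', 'PLUS', 'IDENT', 'RPAREN']
encoded = ''.join(CODE[t] for t in tokens)            # "I+(N+I)"

# A parenthesized sum of two atoms, expressed as a pattern over token codes.
m = re.search(r'\([NI]\+[NI]\)', encoded)
print(m.group(0), m.start())    # "(N+I)" starting at token index 2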

The advent of Unicode (and others) has broadened the atoms of regexps considerably beyond 7-bit ASCII.

I.e., the scope for this class of trick has grown massively.

Perl, and now Perl 6, has expanded regexps way beyond what you call "true regular expressions".

Packrat parsers and PEGs also open up new opportunities.

One of the Doh! moments everyone has when first encountering Unicode in regexps is that the [^abc] construct suddenly results in a vast set. Then you ask yourself: does [a-f] still mean what you think it means? (Yes, but only by the "accident" that Unicode was designed to include 7-bit ASCII as a subset.) How do I match all "word" characters? What does '.' ("match any single character") mean in a variable-length encoding?

Now imagine if, instead of eating characters (or codepoints), regex engines ate sequences of objects on which you could invoke predicates like "vowel?", "word?", "lower_case?", "white_space?"...

Now imagine if we freed ourselves from limiting the predicates to talking just about characteristics of characters, and allowed them to be _any_ predicate method...

Now imagine if we freed ourselves from limiting our interpretation of "adjacent" as "next to each other on a page" to meaning "partially ordered"...

Regexps engines are powerful beasts indeed... start your engines folks... the race is just beginning.

This approach is similar to

This approach is similar to the APL/J approach of formulating computational problems in such a way that they are solvable using matrix operators without explicit loops. I love the puzzle-solving flavor of these approaches...

I was kinda expecting you to prod Thomas to write more about implementing REs. I would sure love to read more (though LtU is probably not the best place for this. How about Software: Practice and Experience?)

flattery will get us.... where?

How about Software: Practice and Experience?

There are few things I would enjoy better as an expression of my being as a hacker than to take many hours over a few months for such a paper. Two big things are missing: One is money - I lack the financial freedom for such a thing by far. I worry (lately with increasing urgency) about my (relative to market) rent and day-to-day living expenses, as things stand. It's pathetic, I know. Color me blushing with embarrassment over that. The other is a need for a partner / writing mentor for that kind of project. I handle English fairly well, I think, and can from time to time knock out a really swell essay but I have, in spite of my advanced age of 43, precious little experience in doing good academic writing (actually, zero experience) and I would need kind help. I am largely socially disconnected from the community of hackers other than via forums like this so I have nobody to call upon for such a thing. That's slightly exaggerated in that I live near UC Berkeley and I know of one or maybe two faculty members whose doors I could (for good cause) knock upon and ask for some introductions / help. I'm not anxious to abuse those potential connections until there is a really convincingly good reason.

-t

Just do it

Technical writing isn't that hard. Make an outline that covers your results, flesh it out, add an abstract and references & submit. Referees & editors will tell you what you need to fix. Ehud's right, S-P&E loves to publish this sort of stuff.

Berkeley, eh? Are you the Thomas Lord on Dohr Street? That's about a mile from here. (I'm up the hill, near Whole Foods.)

+ I am sure that if you have

+ I am sure that if you have a draft you want comments on, people here will be willing to help.

nike

I will add it to my list. Being (well, soon to be) between gigs, maybe it's worth a shot. Thank you for the encouragement.

Yes, I'm Thomas Lord of Dohr St. I guess that you are Duff of Eponymous Device? Small world. Maybe some day I can tell you funny stories about my first boss (Tilbrook). Quite a harsh taskmaster: when he found out peers and I were arguing Emacs v. VI, naturally it became a condition of employment that I learn QED :-)

-t

continuing, but barely

Regexp engines are, I agree, powerful beasts indeed.

One little "practice an experience" tid-bit I'll pass along, since you mentioned Unicode, is an element of my extending Rx from an 8-bit ASCII (say, Latin-1) matcher to a Unicode matcher. One thing that became interesting in that area was to compress various tables (e.g., of Unicode codepoint properties) into sparse arrays. Look at the code (again, found in GNU Arch).
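The flavor of the idea, as a toy sketch rather than the actual Rx code - a two-level table over a made-up "word character" property; the page size and the property are assumptions for illustration:

import unicodedata

PAGE = 256                                    # codepoints per page

def build_tables(prop, limit=0x110000):
    pages, index, seen = [], [], {}
    for base in range(0, limit, PAGE):
        page = tuple(prop(cp) for cp in range(base, base + PAGE))
        if page not in seen:                  # identical pages are shared
            seen[page] = len(pages)
            pages.append(page)
        index.append(seen[page])
    return index, pages

def lookup(index, pages, cp):
    return pages[index[cp // PAGE]][cp % PAGE]

is_word = lambda cp: unicodedata.category(chr(cp)).startswith(('L', 'N'))
index, pages = build_tables(is_word)
print(lookup(index, pages, ord('é')), lookup(index, pages, ord('!')))   # True False
print(len(index), len(pages))   # thousands of index entries, far fewer unique pages

Most pages of the codepoint space are identical to one another, so they collapse to a handful of shared pages and the memory cost is dominated by the small index.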

-t

The linked article disappoints me a bit because it rehearses the Thompson construction, swell, but only mentions in passing how the non-regular features people want (like back-references) make everything harder.

But he also mentions that they are hard in general: matching with back-references is NP-complete. I assume the nature of this manifests in the Thompson construction as a doubling of the number of paths which have to be followed simultaneously in each step.

This isn't an implementation problem though. I suppose the Thompson algorithm will always match only truly regular regexps, but this isn't much of a big deal, because back-references, for example, can be implemented as additional constraints which eliminate paths to be followed. The Thompson algorithm creates a somewhat bigger set of admissible solutions which has to be minimized. I don't see a reason why this isn't feasible.

complexity of regular language recognition

(edit: Nuts. I mis-threaded this. It is in response to Kay Schluehr, above.)

No. Thompson's construction deals with regular languages and those are not NP-complete. It's the non-regular addition of back-references and sub-pattern-position-reporting to the pattern language that I believe Spencer showed was NP-complete (although I am not, blush, familiar with his proof).

For a handy upper bound on the complexity of matching a regular expression r against a string s, you can consider a simple algorithm that in O(|r|) steps converts r to an NFA, and a simple NFA algorithm that then takes O(|r| * |s|) steps to test for a match.

If you batch-convert r into a DFA in O(2^|r|) steps, giving Dn states, then you can match using that DFA in O(|s|) steps (or O(|s|*log(Dn)) if you are a stickler for the true cost of memory access).

If you do the "lazy, caching DFA construction" that I was talking about you wind up with an algorithm that floats between O(|s|) and O(|s|*|r|). You can get very close to O(|s|) in a lot of common cases without paying the O(2^|r|) worst case for building the DFA.

If you add in sub-expression position reporting and back-references, then (per Spencer) you are in NP-complete territory. The new complexity arises because you have to do a search. In the general case there can be more than one way to match a given parenthesized sub-expression. Among all the possible matches, you have to find the one that best satisfies a rule like "give preference to subexpressions on the left - give them the longest match that works - but if a non-longest match of them is necessary, use that". The added complexity doesn't have much of anything to do with the exponential size of a DFA for a true regular expression - it has to do with the size of the search space for aligning sub-pattern matches and handling back-references.

-t

No. Thompson's construction

No. Thompson's construction deals with regular languages and those are not NP complete.

There is some subtlety here. I do not assume that the number of states in the original NFA is somehow reduced to yield optimal runtime behavior of the regexp matcher. The Thompson NFA alone still only deals with a regular language but without state reductions in a grossly suboptimal way (exponential complexity by default). However, we permit this because the algorithm is augmented by predicates reflecting context-sensitive decisions, e.g. those of back-reference matching, which lead to a reduction of states at runtime - or not, in the worst case. The latter means that we remain in NP territory.

In order to illustrate this, just consider the regexp example given here, provided along with the NP-completeness proof of regexp matching with back-references.

$regex = '^(x?)(x?)(x?).*;
(?:\1|\2|\3x),
(?:\1|\2x|\3),
(?:\1x|\2x|\3),
(?:\1x|\2x|\3x),'


When you replace each back-reference \1, \2 and \3 with the referred-to expression in parentheses, one would expect major reductions to be possible: for example, (?:\1|\2|\3x) just becomes (?:x?|x?|x?x), which can be reduced to (?:x?x?).

Without performing the reductions all paths are taken. Those paths which are not permitted by means of back-reference matching are ruled out using predicates which are supplementary and correspond to non-regular extensions of regular expressions.

I hope the idea of how to deal with back-references when using Thompson NFAs has become a little clearer.

Complexity

The Thompson NFA alone still only deals with a regular language but without state reductions in a grossly suboptimal way (exponential complexity by default).

I don't understand what you're trying to say here. Thompson's algorithm doesn't involve any exponential behavior. Its complexity is bounded by the product of the length of the text and the length of the pattern.

implementing backreferences

A short answer here is that Duff is right. Thompson's algorithm does not handle patterns like the one you are asking about. Given a true regular expression, r, in O(|r|) steps it builds an NFA whose size is O(|r|). It matches a string s in O(|r| * |s|) steps. End of (that) story.

I see that you have some idea in mind for implementing back-references. In particular, you want to transform expressions by replacing backreferences, like "\1", with the pattern subexpression to which they refer, like "x?".

The problem is that that transformation is not, in the general case, valid. For example:

(x?)\1   is not equivalent to   (x?)x?


The first pattern will not match the string "x", but the second pattern will match that string. A backreference does not mean "must match the same pattern as that earlier expression"; it means "must match the same string as was matched by that earlier pattern". Your proposed transform breaks that, as the example I gave shows.
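A quick way to see the difference, as a check using Python's re module (a backtracking engine with the same backreference semantics):

import re

# (x?)\1 requires \1 to repeat the exact text captured by (x?).
print(re.fullmatch(r'(x?)\1', 'x'))    # None: (x?) must capture '', so \1 is '' and the 'x' is left over
print(re.fullmatch(r'(x?)x?', 'x'))    # matches: the second x? consumes the 'x'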

Here is a pattern (a "regexp" not a true regular expression) that is hard to match efficiently:

([ab]*)b*\1


That pattern matches a string of the form XYX where X is a possibly empty string of "a" and "b", and Y is a possibly empty string of "b". Note that the substring X must occur twice for the overall string to match.

Consider trying to match that against:

abbbbbabbb


A matcher, having seen the first "a", knows that the first parenthesized sub-expression matches at least one character. Upon encountering the first "b", there is a question: should that be matched by the "([ab]*)" part or by the "b*" part? The same question recurs for each of the first five "b" characters. Any algorithm that solves this must, in effect, do some searching to discover that the "([ab]*)" must match three "b" characters, in this case. (And note that if the string were, instead, "abbbbbabb", then "([ab]*)" would have to match only two "b" characters.)
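For instance, a backtracking engine arrives at exactly those group assignments (a quick check with Python's re):

import re

print(re.fullmatch(r'([ab]*)b*\1', 'abbbbbabbb').group(1))   # 'abbb' -> "a" plus three "b" characters
print(re.fullmatch(r'([ab]*)b*\1', 'abbbbbabb').group(1))    # 'abb'  -> "a" plus two "b" characters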

You might think that gnarly heuristics could find a clever way out of that trap. In the example pattern I just gave, a matcher might notice right away that by counting the number of "b" characters at the end it can quickly find out how many "b" characters the "([ab]*)" must match. If you or I were to hand-compile our example regexp to a C algorithm, we would have little trouble making a matcher that ran in O(|s|) time (for this one pattern).

But the problem quickly gets harder and harder. For example, consider:

(([ab]*)b*\2)*[ab]*\1b*


That's a fairly devilish one.

It is not so hard to write a "backtracking matcher" that will always give correct answers for expressions with backreferences. Unfortunately, such matchers have exponential performance in the length of the string being matched: O(n^|s|) for some n.
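For concreteness, here is a minimal sketch of such a correct-but-potentially-exponential backtracking matcher, in Python, over a small hand-built AST instead of a parsed pattern; the node shapes are made up for illustration:

def match(node, s, i, groups, k):
    # Try to match `node` at position i; on success call continuation k(j, groups).
    kind = node[0]
    if kind == 'lit':                       # single literal character
        return i < len(s) and s[i] == node[1] and k(i + 1, groups)
    if kind == 'seq':                       # concatenation
        def chain(rest):
            return k if not rest else (
                lambda j, g: match(rest[0], s, j, g, chain(rest[1:])))
        return chain(node[1])(i, groups)
    if kind == 'alt':                       # alternation: try left, then right
        return match(node[1], s, i, groups, k) or match(node[2], s, i, groups, k)
    if kind == 'star':                      # zero or more, greedy, with backtracking
        def loop(j, g):
            more = lambda j2, g2: j2 > j and loop(j2, g2)
            return match(node[1], s, j, g, more) or k(j, g)
        return loop(i, groups)
    if kind == 'group':                     # capture group number n
        n, sub = node[1], node[2]
        return match(sub, s, i, groups, lambda j, g: k(j, {**g, n: s[i:j]}))
    if kind == 'backref':                   # \n must repeat the captured text
        text = groups.get(node[1], '')      # unset group treated as empty (a simplification)
        return s.startswith(text, i) and k(i + len(text), groups)

def fullmatch(node, s):
    return match(node, s, 0, {}, lambda j, g: j == len(s))

# ([ab]*)b*\1, built by hand:
ab = ('alt', ('lit', 'a'), ('lit', 'b'))
pattern = ('seq', [('group', 1, ('star', ab)), ('star', ('lit', 'b')), ('backref', 1)])
print(fullmatch(pattern, 'abbbbbabbb'))   # True
print(fullmatch(pattern, 'abbbbbabba'))   # False

The search described above shows up here as the or/and backtracking in the 'alt' and 'star' cases; in the worst case the number of paths explored grows exponentially with the length of the string.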

Spencer proved (it is said, again, I don't know his proof) that the general problem of doing better than backtracking - getting a polynomial rather than exponential matcher - is of the class NP-complete. So if you come up with a polynomial algorithm then, by gosh, you will be a good candidate to win a Turing Award or some such. There is probably challenge prize money out there, too. And, on the other hand: it is very, very, very unlikely you'll have such a breakthrough and it may very well be impossible (but nobody has proved that). Conversely, if you prove that no polynomial algorithm is possible, you will again be rich and famous -- but again, best of luck on that unlikely endeavor.

Some History

This is some "just so" history but I think there is some literal truth to it.

The simple, naive way to implement true regular expressions in a language like C is with a backtracking search algorithm. That was the state of the art when Thompson wrote his paper, and even after Thompson's paper it remained common practice. Thompson's breakthrough was showing a simple technique to achieve O(|r|*|s|) performance rather than O(n^|s|).

The theoretical complexity of regular expression matching is O(n^|r| + |s|). The O(n^|r|) part can be done in "batch" and if you pay that cost just once for a given expression r, then thereafter you can match strings in O(|s|). That's what lex is for.

People had these exponential, backtracking-search regular expression algorithms around, and they noticed that it was easy to add certain non-regular features - such as backreferences and subexpression position reporting. You still have an exponential algorithm, but it can do more tricks. For simple patterns and short strings, the poor worst-case performance doesn't matter much - so people were happy with those features.

Those features got codified in various contexts like POSIX, perl, and GNU Emacs. We're stuck with them for a long time to come.

As software grew more complex and computers grew larger, the problem of more complicated patterns and longer strings started to matter. So, several of us went off to try to find fast ways of matching those - e.g., to try to avoid exponential behavior.

Spencer then proved that we probably won't succeed at that and we probably won't anytime soon prove that we can't succeed.

Just so. The features that put us in the NP-complete category are features that were added because, even though they aren't true regular expression features, it was easy to hack them into early backtracking matchers.

Main Research Activities

The main implementation action in recent years has been concerned with four different problems:

1. Efficient batch-oriented DFA construction and state minimization. Some people work on making things like lex work better.

2. Efficient on-line true regular expression algorithms. Some work on tuning true regular expression algorithms with O(|r|*|s|) worst-case but O(|s|) common-case performance.

3. Tuning and extending the functionality of backtracking matchers. While these seem doomed to always have exponential performance, people work on different space/time trade-offs and heuristics to make them fast in common cases. People also work on adding new non-regular features to regexp languages.

4. Handling large character sets, Unicode in particular. Regexp languages typically require implementations to manipulate sets of characters. When the character set is ASCII or an 8-bit character set, there is no big deal. When the set of all characters is as large as Unicode (and when the encoding of characters in strings is correspondingly complicated) there are interesting performance challenges.

-t

Oh, yeah. Thanks a lot,

Oh, yeah. Thanks a lot, Thomas, for the corrections. Now I can see more clearly what I missed and that my assumptions were too simplistic.

In the general case of "... (P) ... \1", the back-reference \1 doesn't depend functionally on a known string S matched by P; P's matching of S also depends on what \1 will match.

I wonder if there are characteristics of a regexp that suppress those global interdependencies? Here is a first attempt:

In an expression:

... (P) ... \1


the back-reference \1 can be separated from P when the following holds:

1. P doesn't contain back-references, which means that it is self-contained.

2. It is possible to write the expression in the form

... L(P)R ... \1


where L and R are left and right delimiters of P, which means P has no characters in common with L and R. L can be empty when (P) is at the start of the expression.

The no back-reference condition can be checked syntactically.

In practice L and R might be patterns, and the delimiter condition is satisfied when the following is true:

LAST-SET(L) /\ FIRST-SET(P) = {}

LAST-SET(P) /\ FIRST-SET(R) = {}


This means that FIRST-NEXT ambiguities as we observe them in terms like "([ab]*)b*\1" are not permitted by the rules.

FIRST-SET(P) /\ LAST-SET(P) = {}
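For concreteness, here is a sketch of how the delimiter conditions could be checked, using FIRST and LAST sets computed over a tiny regex AST; the ('lit', ...), ('seq', ...), ('alt', ...), ('star', ...) representation is just an assumption for illustration:

def nullable(n):
    k = n[0]
    if k == 'lit':  return False
    if k == 'star': return True
    if k == 'alt':  return nullable(n[1]) or nullable(n[2])
    if k == 'seq':  return nullable(n[1]) and nullable(n[2])

def first(n):
    k = n[0]
    if k == 'lit':  return {n[1]}
    if k == 'star': return first(n[1])
    if k == 'alt':  return first(n[1]) | first(n[2])
    if k == 'seq':  return first(n[1]) | (first(n[2]) if nullable(n[1]) else set())

def last(n):
    k = n[0]
    if k == 'lit':  return {n[1]}
    if k == 'star': return last(n[1])
    if k == 'alt':  return last(n[1]) | last(n[2])
    if k == 'seq':  return last(n[2]) | (last(n[1]) if nullable(n[2]) else set())

# P = [ab]* and R = b* from the "([ab]*)b*\1" example:
P = ('star', ('alt', ('lit', 'a'), ('lit', 'b')))
R = ('star', ('lit', 'b'))
print(last(P) & first(R))    # {'b'}: non-empty, so the condition rejects this pattern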