Derivatives of Regular Expressions

Derivatives of Regular Expressions, Janusz Brzozowski, Journal of the ACM 1964.

Kleene's regular expressions, which can be used for describing sequential circuits, were defined using three operators (union, concatenation and iterate) on sets of sequences. Word descriptions of problems can be more easily put in the regular expression language if the language is enriched by the inclusion of other logical operations. However, in the problem of converting the regular expression description to a state diagram, the existing methods either cannot handle expressions with additional operators, or are made quite complicated by the presence of such operators. In this paper the notion of a derivative of a regular expression is introduced and the properties of derivatives are discussed. This leads, in a very natural way, to the construction of a state diagram from a regular expression containing any number of logical operators.

This is one of my favorite papers. It describes a very cute algorithm for building deterministic finite automata directly from a regular expression. The key trick is the idea of a derivative of a regular expression with respect to a string, which is a non-obvious but fun idea.
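The core of the construction can be sketched in a few lines of Haskell (the datatype and names here are my own toy versions, not the paper's notation):

```haskell
-- Sketch of Brzozowski derivatives over a toy regex type
-- (datatype and names are mine, not the paper's notation).
data Re = Empty          -- 0: matches nothing
        | Eps            -- 1: matches only the empty string
        | Sym Char       -- a single symbol
        | Alt Re Re      -- union
        | Cat Re Re      -- concatenation
        | Star Re        -- Kleene iterate
  deriving (Eq, Show)

-- Does r accept the empty string?
nullable :: Re -> Bool
nullable Empty     = False
nullable Eps       = True
nullable (Sym _)   = False
nullable (Alt u v) = nullable u || nullable v
nullable (Cat u v) = nullable u && nullable v
nullable (Star _)  = True

-- deriv a r matches exactly { w | a:w is matched by r }.
deriv :: Char -> Re -> Re
deriv _ Empty     = Empty
deriv _ Eps       = Empty
deriv a (Sym b)   = if a == b then Eps else Empty
deriv a (Alt u v) = Alt (deriv a u) (deriv a v)
deriv a (Cat u v)
  | nullable u    = Alt (Cat (deriv a u) v) (deriv a v)
  | otherwise     = Cat (deriv a u) v
deriv a (Star u)  = Cat (deriv a u) (Star u)

-- To match a string, derive by each character in turn and
-- check whether the final regex is nullable.
match :: Re -> String -> Bool
match r = nullable . foldl (flip deriv) r
```

Each distinct derivative reachable this way becomes one state of the DFA, with `nullable` marking the accepting states.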

Note: This is an ACM DL link; I couldn't find the paper freely available online. :(


zippers and previous mentions

This paper was referenced previously: http://lambda-the-ultimate.org/node/1329

I have also read of such strange derivatives in the context of the zipper data structure.

Regexps and zippers

I spent a little while trying to find the relationship between the Brzozowski and zipper derivatives. There is a sense in which they aren't analogous. The Brzozowski derivative chews a matching symbol from the left of a regexp but the derivatives that McBride uses to compute zippers make a 'hole' anywhere in a datatype, not just at the left. This is because McBride uses a derivative operator that looks more like Leibniz's original derivative operator than Brzozowski's.

If you modify Brzozowski's definition to make it look more 'Leibnizian' you get an operator that chews matching characters from anywhere inside a regexp, not just from the left.

I.e., define

d_a(u+v) = d_a(u)+d_a(v)
d_a(uv) = d_a(u)v+ud_a(v)
d_a(u*) = (u*)d_a(u)(u*)
d_a(a) = 1
d_a(b) = 0 (b a symbol /= a)

So d_A(DATA) = DTA+DAT, i.e. the word DATA with one A deleted from anywhere inside it. This looks a little more like differentiating a datatype. I'm not sure what the applications of this operator might be: perhaps matching regular expressions against strings with missing characters, though the expressions rapidly become cumbersome if you want to match against a string with several missing characters.
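That operator is easy to play with in Haskell. Here is a hedged sketch over a toy regex type (the names are mine), with a small enumerator for star-free expressions so the DATA example can be checked:

```haskell
-- A sketch of the 'Leibnizian' derivative above: delete one
-- occurrence of a symbol from anywhere in a matched word.
-- Toy datatype and names are mine, not from any library.
data Re = Empty | Eps | Sym Char | Alt Re Re | Cat Re Re | Star Re
  deriving (Eq, Show)

dAny :: Char -> Re -> Re
dAny _ Empty     = Empty
dAny _ Eps       = Empty
dAny a (Sym b)   = if a == b then Eps else Empty
dAny a (Alt u v) = Alt (dAny a u) (dAny a v)
dAny a (Cat u v) = Alt (Cat (dAny a u) v) (Cat u (dAny a v))  -- product rule
dAny a (Star u)  = Cat (Star u) (Cat (dAny a u) (Star u))

-- Enumerate the (finite) language of a star-free regex.
lang :: Re -> [String]
lang Empty     = []
lang Eps       = [""]
lang (Sym c)   = [[c]]
lang (Alt u v) = lang u ++ lang v
lang (Cat u v) = [x ++ y | x <- lang u, y <- lang v]
lang (Star _)  = error "not enumerated in this sketch"

-- A literal word as a regex.
word :: String -> Re
word = foldr (Cat . Sym) Eps
```

With these definitions, `lang (dAny 'A' (word "DATA"))` yields the word DATA with one A deleted from either position, matching the example above.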

Clowns to the left of me, jokers to the right

Perhaps look at this continuation of derivatives of data type: Clowns to the left of me, jokers to the right PDF.

(And hopefully Conor's site won't randomly decide to give out 403s this time.)

I did already

Those new derivatives correspond to operators on regexps that chew everything including and to the left (or right) of a matching character. E.g. D_E(REGEXP) = GEXP+XP (or something like that).

Again, I don't know if this gives a useful operation, but it's interesting to see the connections.

Divided differences

By chance I found myself back here again 3 years later. The 'reply' button is still there so now I can give a good explanation of the relationship between Conor's old and new 'derivative' operators. The new operator corresponds to the classical divided difference of numerical methods, not the derivative. The derivative is the special limiting case in which the divided difference is over an interval of size zero. You can then connect back to regular expressions via this.

Derivative of a googling

There's a link to the paper near the bottom of this page. I hope I don't get Thierry Coquand in trouble with the ACM enforcement wing...

On that same page is a link to some course notes which include "a complete definition and the algorithm to test if a string is accepted by a regular expression", with Haskell code.

sigfpe has an overview of the technique here, with some related links at the end.

shapr points out that the citation list for this paper on Citeseer is interesting, including such papers as Programmable Type Systems for Domain Specific Languages (Thiemann).

A related paper is Partial Derivatives of Regular Expressions and Finite Automata Constructions (Antimirov).

Tangentially, I see that Brzozowski is a prolific guy, who is still publishing.

Brzozowski derivatives and RELAX NG

The XML schema language RELAX NG is implemented using Brzozowski derivatives: see James Clark's paper on his implementation.

Efficiency tradeoffs

When tinkering with regex-tdfa (a Haskell package) I went to raid the library to look up an old paper on deriving regexps. Thanks to Anton for the link to one.

The tradeoff against simpler algorithms is that the derivative technique examines the actual characters being matched and the "future" regexps they lead to. After much (inefficient) processing this yields smaller finite automata. The Thompson, Glushkov, and related techniques efficiently build automata from the structure of the regexp without comparing the actual character atoms, but the resulting automata are larger and less specialized. The derivative technique can also spot opportunities to merge parts of the automaton that the quicker techniques miss.

Freely Available Copy

Here is a link to a freely available copy of the "Derivatives of Regular Expressions" paper...

http://www.cs.chalmers.se/~coquand/AUTOMATA/brzozowski.pdf

To me, it became obvious when...

To me, it became obvious when I realised that each equivalence class of derivatives maps 1:1 to a state, and represents the regular expression that would be matched if that state were the start state. It's an idea I had some time ago (no credit claimed - the idea came from reading material that may well have its roots in this paper), but my term for it is 'tail grammar': the grammar representing the set of possible tail strings after some head string has been recognised. The same trick can be done with LR parser state models for context-free grammars, but issues arise over how many tokens to pop from the stack on a reduce.
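One way to see this concretely: explore the set of derivatives reachable from a starting regex, identifying 'similar' ones with a few smart constructors (a coarse stand-in for the paper's equivalence classes). A toy Haskell sketch, with my own datatype and names:

```haskell
-- Toy regex type plus 'similarity' smart constructors, so the
-- set of reachable derivatives stays small (a rough stand-in
-- for equivalence classes of derivatives).
data Re = Empty | Eps | Sym Char | Alt Re Re | Cat Re Re | Star Re
  deriving (Eq, Show)

alt :: Re -> Re -> Re
alt Empty v = v
alt u Empty = u
alt u v | u == v    = u
        | otherwise = Alt u v

cat :: Re -> Re -> Re
cat Empty _ = Empty
cat _ Empty = Empty
cat Eps v   = v
cat u Eps   = u
cat u v     = Cat u v

nullable :: Re -> Bool
nullable Empty     = False
nullable Eps       = True
nullable (Sym _)   = False
nullable (Alt u v) = nullable u || nullable v
nullable (Cat u v) = nullable u && nullable v
nullable (Star _)  = True

deriv :: Char -> Re -> Re
deriv _ Empty     = Empty
deriv _ Eps       = Empty
deriv a (Sym b)   = if a == b then Eps else Empty
deriv a (Alt u v) = alt (deriv a u) (deriv a v)
deriv a (Cat u v)
  | nullable u    = alt (cat (deriv a u) v) (deriv a v)
  | otherwise     = cat (deriv a u) v
deriv a (Star u)  = cat (deriv a u) (Star u)

-- The DFA states: every distinct derivative reachable from r0
-- over the alphabet; the nullable ones are the accepting states.
states :: [Char] -> Re -> [Re]
states sigma r0 = go [r0] []
  where
    go []     seen = seen
    go (r:rs) seen
      | r `elem` seen = go rs seen
      | otherwise     = go (rs ++ [deriv a r | a <- sigma]) (seen ++ [r])
```

For example, over the alphabet "ab" the regex a* yields two states (a* itself and the dead state), while (a|b)* collapses to a single state, since every derivative is similar to the original.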

Adrian Thurston (http://www.cs.queensu.ca/~thurston/) has written a program called Ragel, which I believe works on equivalent principles, but representing the regular expressions using the state models that implement them. There are simple state models built in for simple machines, and more complex machines are built using algebraic combinations of these machines. Minimisation is done at each stage, along with elimination of epsilon (no token swallowed) transitions.

Behavioral Derivative

Automata, Power Series, and Coinduction: Taking Input Derivatives Seriously, J. J. M. M. Rutten (1999). (Check the references too.)