Parsing Expression Grammars

Parsing Expression Grammars: A Recognition-Based Syntactic Foundation by Bryan Ford, MIT, 2004.

For decades we have been using Chomsky's generative system of grammars, particularly context-free grammars (CFGs) and regular expressions (REs), to express the syntax of programming languages and protocols. The power of generative grammars to express ambiguity is crucial to their original purpose of modelling natural languages, but this very power makes it unnecessarily difficult both to express and to parse machine-oriented languages using CFGs. Parsing Expression Grammars (PEGs) provide an alternative, recognition-based formal foundation for describing machine-oriented syntax, which solves the ambiguity problem by not introducing ambiguity in the first place. Where CFGs express nondeterministic choice between alternatives, PEGs instead use prioritized choice. PEGs address frequently felt expressiveness limitations of CFGs and REs, simplifying syntax definitions and making it unnecessary to separate their lexical and hierarchical components. A linear-time parser can be built for any PEG, avoiding both the complexity and fickleness of LR parsers and the inefficiency of generalized CFG parsing. While PEGs provide a rich set of operators for constructing grammars, they are reducible to two minimal recognition schemas developed around 1970, TS/TDPL and gTS/GTDPL, which are here proven equivalent in effective recognition power.

An excellent paper! I read it for the first time today and was surprised not to find it in the LtU archive.

Packrat parsing

PEGs have been discussed here, but mostly under the name packrat parsing, cf. Andrew Cooke's story Packrat parsing on Bryan Ford's PhD thesis.

Algorithm-oriented syntax design considered harmful

I think that it is wiser to use the traditional tools for designing the syntax of languages than grammars that are specifications of deterministic parsing algorithms, because the traditional tools focus attention on the rules that govern syntactic structure, and because, being nondeterministic, they make it easier to ask questions about awkward corner cases.

I'm guessing that a wholesale move towards PEGs would result in a tendency for new languages to have less pleasant syntax.

Algorithm-oriented syntax design considered harmful

Even with traditional tools you can't escape from algorithm-oriented syntax design. The most obvious difficulty comes from restrictions on left recursion or right recursion which cause problems expressing the correct associativity in the grammar.

That's what operator-precedence parsing is for

Associativity is not a problem if you use an operator precedence based algorithm, like the Shunting Yard algorithm. And as current versions of gcc show, you can use the Shunting Yard algorithm alongside a traditional top-down parser, letting the Shunting Yard algorithm handle just the infix expression portions of the grammar.
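
For concreteness, here is a minimal sketch of the shunting-yard idea in Python -- a toy, not gcc's actual code; the operator table and names are made up for illustration. Precedence and associativity live in a table rather than in the grammar, which is why the surrounding top-down parser doesn't have to encode them structurally:

PREC = {'+': 1, '-': 1, '*': 2, '/': 2, '^': 3}
RIGHT_ASSOC = {'^'}

def shunting_yard(tokens):
    """Turn an infix token list like ['2', '-', '3', '-', '4'] into postfix (RPN)."""
    output, ops = [], []
    for tok in tokens:
        if tok in PREC:
            # Pop operators that bind at least as tightly (respecting associativity).
            while ops and ops[-1] in PREC and (
                    PREC[ops[-1]] > PREC[tok]
                    or (PREC[ops[-1]] == PREC[tok] and tok not in RIGHT_ASSOC)):
                output.append(ops.pop())
            ops.append(tok)
        elif tok == '(':
            ops.append(tok)
        elif tok == ')':
            while ops[-1] != '(':
                output.append(ops.pop())
            ops.pop()  # drop the '('
        else:
            output.append(tok)  # operand
    while ops:
        output.append(ops.pop())
    return output

print(shunting_yard(['2', '-', '3', '-', '4']))  # ['2', '3', '-', '4', '-']: left-associative
print(shunting_yard(['2', '^', '3', '^', '4']))  # ['2', '3', '4', '^', '^']: right-associative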

Good paper indeed

I read a pre-publication version of this paper when I was an undergrad, and in fact that's what got me seriously thinking about pursuing further education on the topic of programming languages. Yesterday, SamK requested interesting papers to be posted, and I was just about to post the PEG paper as my entry for "groundbreaking papers" when I saw this entry.

While mixing parsing algorithms into the grammar specification can be considered iffy, I think that PEGs generally provide a very easy-to-understand tool for grammar specifications. Non-determinism is power, but with power comes responsibility and it often bites students in the behind. I think there is more to learn about parsing in studying top-down parsers and how PEG grammars can be directly mapped to deterministic algorithms, rather than battling the intricacies of lex and yacc, which for the student only leads to unreadable and "magical" bottom-up parsers.

For the interested, there is a PEG mailing list and more material here.

PEGs in Lua

Roberto Ierusalimschy, the creator of Lua, implemented PEGs for Lua and has written an interesting paper about the design and implementation in SP&E. PEGs are now available in Lua, as the LPeg library, as an alternative to Lua's built-in pattern matching.

A preprint PDF of the SP&E

A preprint PDF of the SP&E article is here. The Lua implementation is really interesting; it supports both a Snobol-style pattern language and a notation that combines Ford's with regular expression style. Both are translated into domain-specific virtual machine instructions.

PEGs and Packrat Parsing are not the answer

As someone who is working very hard on my next-gen parsing framework Gazelle, people often ask me why I'm not using PEGs. PEGs are all the rage right now. Everyone seems to think they're the future.

I attribute a lot of this to yacc/bison LALR backlash: the error messages ("shift/reduce conflict") are inscrutable, and the parsers are totally opaque to average programming mortals. But LALR is not the only option you have before resorting to PEGs. Everyone seems to think in terms of yacc/bison, but ANTLR (LL(*)) has been around for a while now, and improves on this situation a lot. And with Gazelle (also LL(*)), I intend to improve on it even more (take a look at what a Gazelle grammar dump currently looks like -- and this will evolve to be prettier and contain more explanation).

So why not PEGs? Well two main reasons.

1. The performance of packrat parsing is significantly worse than that of deterministic parsers like LALR or LL(*).

In terms of pure asymptotic complexity, while both algorithms are linear in time, packrat parsing is O(|G| * N) in space, where |G| is the size of the grammar and N is the size of the input. LL(*) and LALR are O(1) space.
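
To make that space bound concrete, here's a rough sketch in Python of the memoization behind packrat parsing (the rule names and representation are made up for illustration; real packrat implementations differ in detail):

from functools import wraps

def packrat(rule):
    """Cache a rule's result at each input position. With |G| rules and an
    input of length N, the table can grow to |G| * N entries, which is where
    the O(|G| * N) space bound comes from. (A real implementation would
    reset the cache per input; this toy keeps one table per rule.)"""
    table = {}
    @wraps(rule)
    def memoized(text, pos):
        if pos not in table:
            table[pos] = rule(text, pos)  # (new_pos, value) on success, None on failure
        return table[pos]
    return memoized

@packrat
def digit(text, pos):
    if pos < len(text) and text[pos].isdigit():
        return pos + 1, text[pos]
    return None

@packrat
def number(text, pos):
    # number <- digit number / digit   (prioritized choice, written out by hand)
    first = digit(text, pos)
    if first is None:
        return None
    rest = number(text, first[0])
    return rest if rest is not None else first

print(number("1234x", 0))  # (4, '4'): four digits consumed, stopped before 'x'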

How does this play out in practice? Having not implemented packrat parsing before, I turn to the Aurochs parser generator and their performance numbers, which for parsing 140kb of JavaScript are 1s of time and 380 Mb of space! You have to admit that 380 Mb of RAM to parse 140kb of text is pretty extreme. The current version of Gazelle is 134x faster and (for input of this size) takes 0.0009% of the RAM that Packrat Parsing does.

These are significant differences. And the difference for the memory will only get greater as the input size grows.

2. PEGs define away ambiguity, which means that you don't know where the real ambiguity is

PEG advocates use the fact that PEGs can never be ambiguous as a selling point. I think it's harmful, because it means that you never know where the ambiguity actually is. In a PEG, if you write:

a: b / c;

... you don't know if there could be situations where "b" and "c" could both match. Imagine that b is "if/then/else" and c is "if/then" (the classic case of ambiguity). With PEGs you are explicitly resolving this ambiguity to say that an else always binds to the most recent "if", but you don't realize you've done this because your PEG tool will never warn you that an ambiguity ever existed! Remember that this ambiguity is a user-facing issue that needs to exist in the documentation for your language ("an else always binds to the most recent if"), but with PEGs you're flying blind about where these user-facing ambiguities actually lie.

With a PEG, you never know if changing "a: b / c" to "a: c / b" changes your language or not.

What Gazelle supports is the traditional CFG formalism with explicit ambiguity resolution for cases where you need it. So by default, you use non-prioritized choice ("|"), but if Gazelle finds that the choices are ambiguous, you explicitly resolve the ambiguity with prioritized choice ("/"). But Gazelle won't let you put prioritized choice in places where there was not an ambiguity to begin with. With this scheme, you can keep tabs on exactly where your grammar is ambiguous, and exactly how you've resolved those ambiguities.

PEG != packrat

PEGs don't need to be implemented as packrat parsers. One of the more popular PEG parser generators, Rats!, lets you specify that some productions should not be memoized.

In fact, this paper (on DCGs rather than PEG parser generators) runs a bunch of tests on a large body of Java code and concludes that not memoizing is a better default for most productions. They speculate that human-readable languages generally avoid the kind of excessive, large-scale backtracking that packrat parsers are supposed to protect against. At large scales, we hate backtracking just as much as the parser does. It's only at very small scales, where we apparently see things more holistically, that backtracking can get excessive.

ANTLR, Gazelle, Trail, ...

As someone who is working very hard on my next-gen parsing framework Gazelle, people often ask me why I'm not using PEGs. PEGs are all the rage right now.

Nice to see some people who are attempting to make real progress in a domain that was considered dead or case-closed for a while. Your criticism of PEGs is spot on, btw.

Since we are at it I'm going to shamelessly advertise Trail which is part of EasyExtend. Currently I'm working on careful generalizations of Trail NFAs towards function networks. Those FunNets let Trail finally deal with left recursion / cycles in EBNF grammars without transforming the grammar.

Deterministic vs non-deterministic choice

The question of whether a nondeterministic choice operator (as is found in myriad LALR(1) tools such as yacc) or a deterministic choice operator (as found in PEG) is better--reminds me of similar arguments concerning if statements and the like in programming languages themselves.

Many of the same issues arise.

Virtually all production PLs have deterministic if statements, or something resembling them. (Chained ifTrue: ifFalse: methods and the like in Smalltalk have the same effect). Many programming languages don't have an unconstrained nondeterministic choice operator--the C-style switch statement is nondeterministic in its operation, but the language requires all the branches to be disjoint (default: is handled as a special case which preserves this rule). Pattern-matching in functional programming languages is also deterministic. Two notable areas where deterministic choice is generally not employed (because the choices are specified in separate modules, making establishment of an order problematic) are multiple dispatch resolution and multiple inheritance/delegation resolution. Both of those are thorny problems, which are generally dealt with either by arcane (and often surprising) rules to break ambiguity, or by making ambiguous cases an error which the programmer must explicitly resolve.

The advantage of deterministic choice is obvious--it permits using wildcards. The ubiquity of this in pattern-matching in Haskell or ML needs no further comment. In the context of parsing, it allows the mixing of rules matching specific keywords, e.g. "true", with rules matching arbitrary alphanumeric strings, e.g. [a-zA-Z]*. In nondeterministic parser tools, one can't do that easily; so lexing is treated as a separate phase by a tool which is deterministic. (Or at least sort-of; the "longest leftmost" rule used by many lexers is positionally deterministic with regard to strings of equal length).
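
A tiny illustration of the keyword-vs-wildcard point, as a Python sketch (the rule names and the not-followed-by-a-letter guard are my own choices, not taken from any particular tool): with prioritized choice you simply try the keyword rule before the catch-all identifier rule, whereas an unordered CFG choice would flag the two as overlapping:

import re

IDENT = re.compile(r'[a-zA-Z]+')

def match_keyword(text, pos, kw):
    # PEG-style: the literal keyword, not followed by another letter
    # (otherwise "trueish" would be split into the keyword "true" plus "ish").
    end = pos + len(kw)
    if text.startswith(kw, pos) and not (end < len(text) and text[end].isalpha()):
        return end, ('keyword', kw)
    return None

def match_ident(text, pos):
    m = IDENT.match(text, pos)
    return (m.end(), ('ident', m.group())) if m else None

def atom(text, pos):
    # Prioritized choice: keyword first, identifier as the fallback wildcard.
    return match_keyword(text, pos, 'true') or match_ident(text, pos)

print(atom('true', 0))     # (4, ('keyword', 'true'))
print(atom('trueish', 0))  # (7, ('ident', 'trueish'))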

The main objection to determinism (besides the fact that deterministic choice is non-commutative) is that in many cases, the order is arbitrary--and if you think two productions are disjoint (or one is a subset of another) and they're not; you're potentially in for a surprise. You're still in for a surprise with nondeterministic choice (the tool might do whatever it wants), but in many cases you get warned by the tool that your language is ambiguous. Of course--one of the easiest ways to resolve such ambiguities is with--deterministic choice.

With many existing tools which don't have explicit deterministic choice, ambiguities (at the parser level) are handled by various other means, such as:

a) changing the grammar so that the ambiguity is no longer present;

b) ignoring them and accepting the tool's default behavior (a common way of dealing with shift/reduce conflicts);

c) using associativity annotations to resolve the ambiguity--something which works in yacc, but is kinda kludgy when the production in question isn't an infix binary operator;

d) using the semantic rules to modify the parser's behavior (such as rejecting a production);

e) annotating the text to be parsed with a preprocessor (lex is, after all, just an entrenched case of getting around the limitations of yacc); or

f) hope and prayer.

The conclusion to this rant: Why must we choose between deterministic and non-deterministic choice? Ignoring parsers that exploit parallelism when faced with multiple possible productions for a given nonterminal, the implementation has to choose which one to try first anyway... I think that there is ample room for both.

What are formal languages for?

I had understood Joshua's comment, "PEGs define away ambiguity, which means that you don't know where the real ambiguity is", to be making the same point I had made--that non-determinism makes it easy to ask the right questions about language syntax--though making it rather better. Both of us, I think, are coming from the point of view of what classes of formal language provide the best basis for designing the syntax of fully-fledged programming languages.

Your comment about the advantages of cleaving to typical programming language semantics is fair enough, but it is not very appropriate for this design space; it serves rather as a justification for tools such as Perl 6's rules. I agree, for these purposes, but this is nothing like making a case that PEGs are a substitute for traditional grammars.

Of course, once PEG-like engines are widely deployed, more and more domain specific languages will be designed using them, so we had better get used to the idea that they are de facto grammars for designing programming languages.

Which is why I suggest "both".

Given a parser design tool which permits both forms of determinism, you can have the following workflow:

1) Specify your grammar using ND choice, other than "obvious" cases like the keyword-vs-identifier case, which requires determinism to resolve.

2) If the tool informs you of an ambiguity, decide how to deal with it. If the solution is to modify the grammar to eliminate the ambiguity, do so, and go back to step one.

3) If you decide to resolve the ambiguity by fiat (such as how the dangling else problem is resolved in C-ish languages), simply re-write the rule in question to use deterministic choice. In this case, the rule would be if {expr} else {expr} / if {expr}. If for some reason you prefer the else to bind to the outermost if, then if {expr} / if {expr} else {expr}.

I must admit, my in-depth understanding of the numerous useful subsets of the CFGs, and how efficient they are to parse, is a tad rusty; so there may be some Very Good Reason that determinism and nondeterminism might not mix well--a language needing both might no longer admit an efficient parser. If so, that is obviously a design issue to consider.

But if one is designing a formal language syntax for human programmers (the class of parsing most relevant to LtU)--where one must avoid ambiguity, but must provide the human with textual cues and respect certain stylistic conventions that we silly humans have come to expect--one is probably not dealing with terribly deep parse trees (and lots of backtracking) to begin with. (How much of a compiler's time and memory is spent parsing, vs. other things such as reading in the program text, or semantic analysis, code generation/optimization, etc?) Constructs which do "go deep" in most PLs tend to involve fully-bracketed forms, dangling else excluded.

Parsing natural language is another matter altogether; and here, nondeterminism *is* highly appropriate, given that natural languages themselves are inherently ambiguous. Even so, there are undoubtedly situations where a deterministic rule might be desirable. For machine-generated structural data, you shouldn't need to roll your own parser, just use XML or sexprs or some other unambiguous, fully-bracketed syntax and be done with it. :) Of course, I'm being a bit disingenuous here, as there is undoubtedly the need to parse existing data that wasn't generated with a nice easy-to-parse format...

Not that simple

Context-free grammars and parsing expression grammars are full of traps...

2) If the tool informs you of an ambiguity

But ambiguity in context-free grammars is undecidable. Your tool might tell you there's something fishy about your grammar, like an LALR(1) conflict--and the absence of such warnings then indicates that the grammar is unambiguous. Or you've learned to stop worrying and to love generalized parsing, but then you'll learn about ambiguities at run time--or your users will, because your test cases did not unveil one particular ambiguity.

If for some reason you prefer the else to bind to the outermost if, then if {expr} / if {expr} else {expr}.

This parsing expression does not match any if {expr} else {expr} construct: observe that the first part if {expr} is always matched by the first alternative, so the expression matches this first part and leaves you with else {expr} on your hands, most likely ending up in a parse error. And, maybe because it's only fair that important usability issues should be equally hard for PEGs as for CFGs, the question of whether one alternative of a PEG expression is "useless" (as the if {expr} else {expr} above) is not decidable in general.
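
Both effects -- the silent disambiguation complained about upthread and the "useless alternative" just described -- are easy to see in a toy recursive-descent sketch of prioritized choice (Python; the token-list encoding and names are mine, purely for illustration):

def parse_stmt(tokens, pos, alts):
    # Prioritized choice over the if-alternatives, then a plain atom.
    for alt in alts:
        result = alt(tokens, pos, alts)
        if result is not None:
            return result
    if pos < len(tokens) and tokens[pos] not in ('if', 'else'):
        return pos + 1, tokens[pos]
    return None

def if_with_else(tokens, pos, alts):
    # "if" cond stmt "else" stmt
    if tokens[pos:pos + 1] != ['if'] or pos + 2 > len(tokens):
        return None
    body = parse_stmt(tokens, pos + 2, alts)
    if body is None or tokens[body[0]:body[0] + 1] != ['else']:
        return None
    other = parse_stmt(tokens, body[0] + 1, alts)
    if other is None:
        return None
    return other[0], ('if', tokens[pos + 1], body[1], 'else', other[1])

def if_without_else(tokens, pos, alts):
    # "if" cond stmt
    if tokens[pos:pos + 1] != ['if'] or pos + 2 > len(tokens):
        return None
    body = parse_stmt(tokens, pos + 2, alts)
    return None if body is None else (body[0], ('if', tokens[pos + 1], body[1]))

def parse(tokens, alts):
    result = parse_stmt(tokens, 0, alts)
    if result is None or result[0] != len(tokens):
        return 'parse error (leftover input)'
    return result[1]

toks = ['if', 'c1', 'if', 'c2', 's1', 'else', 's2']
# Long alternative first: parses, and the else silently binds to the nearest if.
print(parse(toks, [if_with_else, if_without_else]))
# Short alternative first: the with-else alternative can never fire, and the
# leftover "else s2" sinks the whole parse, as described above.
print(parse(toks, [if_without_else, if_with_else]))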

I'm not sure anyone prefers

I'm not sure anyone prefers not to use "longest match" in case of the dangling else disambiguation.

Organizing longest matches by hand can become very tedious for larger grammars. So let's say we want a tool that organizes our patterns according to longest matches (doing left factoring, indeed). Now the required analysis for those tools turns us back into the waters of CFG parsers with lookahead strategies (or Thompson NFAs, alternatively). But that's precisely where we came from. PEGs are a step backwards that is advertised as a step forward.

Maybe we need a "longest match choice" operator

though when dealing with parse trees containing some mixture of terminals and nonterminals, how one defines the "length" of something is an interesting question. In the case of shift/reduce conflicts, choosing to shift rather than reduce makes sense; but s/r conflicts are easy to deal with.

More generally, what we really need is better ways of dealing with disambiguation (at the specification level). Longest match is oftentimes a good rule, though not always. Leftmost match--i.e. deterministic choice--makes good sense for cases like keyword-vs-symbol; though what we probably really want there is "most specific match", something that is undecidable in the general case. (This is how, to a first approximation, multiple-dispatch issues are resolved in many languages). In still other cases, what we really want is for the choice to be context-dependent; the correct choice depends on some information on the semantic stack that is not directly expressible in the grammar itself. Ugly as it is, quite a few programming languages that start with the letter "C" work this way.

I don't see this as a battle of PEG vs classic LALR(1) vs classic LL parser strategies, and I'm a bit puzzled by the hostility to PEG or any other tool. (Perhaps PEG has been over-promoted by its proponents; I don't know). In the end, a parser is Just Another Computer Program, one that takes text in and spits annotated syntax trees out (consuming some amount of time and temporary storage in the process). How it is implemented is of key importance to the developer, but of minimal importance to the user, who is mainly concerned that a) it works correctly on correct input; b) it produces reasonable error messages on invalid input (this is the hardest part of writing a production-grade parser); and c) it has acceptable time/memory performance, with "acceptable" being highly application-dependent.

Tools

I'm a bit puzzled by the hostility to PEG or any other tool.

Making tools is also about making good things, and for practitioners it can mean the difference between a good and a bad day at work. So people argue endlessly about their qualities and try to improve them and see how not to waste their time. They are certainly not indifferent, even if the production process becomes invisible in the end product. I tend to think this is generally a good thing. On the downside are religious flame wars, which are just another medium for acting out male aggression.

Complex grammar transformations

[...] In the context of parsing, it allows the mixing of rules matching specific keywords, e.g. "true", with rules matching arbitrary alphanumeric strings, e.g. [a-zA-Z]* [...] so lexing is treated as a separate phase by a tool which is deterministic

Having written a lexerless LALR(1) parser generator, I'd just like to say that this (important) problem can be resolved with a simple (maybe some would disagree with "simple") transformation of the grammar that's very easy to recover from when constructing the parse tree.

Essentially the solution I came up with was to build a trie out of the set of keywords, and use it to "remove" the keywords from the sort of regular expression you mentioned (yielding a set of regular expressions under "non-deterministic choice"). To avoid an explosion in table size, I also rewrite keyword references (and their substrings in the "chopped up" regex, along with some other related regexes) to maximize sharing of states.

Some might argue that the solution above is more complex than a lexer, but at this point I'm comfortable with it and I think that it's worth it to avoid having a lexer. I think that it's similar to the additional complexity of converting an LALR generator to GLR (with efficient sharing of stack prefixes and such), plus I can say that the parsers I generate are fully deterministic.

Performance, Ambiguity of CFGs vs PEGs

I don't think that your first point is valid, in principle anyway. We have a pretty good idea how to implement CFGs, but PEGs are still in their infancy. I'm not sure what the current state of the art is in PEG-land, or how close Aurochs is to that; but they've improved memory consumption by more than an order of magnitude since you posted this comment. I don't think anybody really knows how good PEG-based implementations will become.

Arbitrary LL(1) grammars cannot be recognized in O(1) space in the worst case: matched parentheses, for example, might require O(log n) space if you use integers to represent the stack, but most parser generators would be O(n), basically using a Peano representation.
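
As a toy illustration of that point (my own example, not from any particular parser generator): recognizing matched parentheses with a single integer counter needs only O(log n) bits of state, whereas pushing one entry per open paren -- roughly what a generated parser's stack does -- needs O(n) cells:

def balanced_counter(parens):
    """One integer of state: O(log n) bits. (Assumes the input is only parens.)"""
    depth = 0
    for ch in parens:
        depth += 1 if ch == '(' else -1
        if depth < 0:
            return False
    return depth == 0

def balanced_stack(parens):
    """One stack cell per open paren: O(n) space in the worst case,
    roughly what an LL/LR parser's parse stack does on deeply nested input."""
    stack = []
    for ch in parens:
        if ch == '(':
            stack.append(ch)
        elif not stack:
            return False
        else:
            stack.pop()
    return not stack

print(balanced_counter('(()(()))'), balanced_stack('(()(()))'))  # True True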

You are comparing worst-case to common-case performance, which is definitely a fallacy. Packrat parsing is only one possible way to implement PEGs. Even without venturing into the realm of entirely new algorithms, hybrid approaches integrating existing techniques should be able to greatly improve common-case performance.

More to the point, most parser generators are used to build an abstract syntax tree, so unless you offer the ability to fuse certain syntax-based computations into your parser (the trivial case would build nothing if you just want to recognize the grammar), the overall result will be O(n) space anyway, both best- and worst-case.

On the other hand, your second point is definitely good food for thought. My impression is that the popularity of PEGs is largely due to the perception that PEGs are more easily embedded as library-based DSLs; parsing combinators set off the whole PEG craze. My impression and/or this perception might be false, however.

Parsers/Patterns/Logic

Parsing per se isn't, to me, the interesting part of grammars. To me the interesting part is the unification of control operators with pattern matching more generally. OMeta uses pattern matching as a general control operator, at the language level. PEGs are a very simple way to achieve this (and are what's implemented under Warth's OMeta/JS).

Can the algorithms under Gazelle be adapted to operate against arbitrary patterns of objects, rather than characters/tokens? If they can, Gazelle becomes a general object pattern matcher as well as a parser.

OMeta's language is akin to Scala's parser combinators, which now support arbitrary object mapping as well as character-oriented data. If we blend in combinators that correspond to the control operators discussed in The Reasoned Schemer, we have a pattern-based control scheme that can perform logic computation as well (in particular, we need interleaved choice and other operators that help achieve fairness).

It's reasonably easy to see how the straightforward execution of something like that can happen with PEGs. In a more complex parsing/matching model with state-based optimization, how does that happen? I'm not saying it can't -- with proper reification of all of the above, the raw materials are there to be optimized... it just seems fairly far, conceptually, from state-based parsing algorithms.

Scala's parsing combinator library is currently very anonymous in how it builds out the tree of combinators, which precludes multi-stage construction of the tree (and makes runtime optimization very difficult). I hope it will assign specific and well-known names to common combinators...which will allow the combinators to operate against other combinator trees, which will give us a higher-order pattern matching capability. I suspect the type signatures will be terrifying ;)

An interesting (though perhaps dated) book on the subject

is

A Grammatical view of Logic Programming by Deransart and Maluszynski (1993).

A lot of similarities exist, of course, between parsing (and writing parsers in a formalism such as BNF or similar, augmented however you like) and logic programming in a Prolog-like language (expressing a logical system as a set of Horn clauses). You've got terminals (facts), nonterminals (rules), etc. You have nondeterministic operators (the choice operator in parsing, the disjunction operator in Prolog) and various backtracking strategies.

At a more fundamental level, logic systems expressed in Horn clauses, and the class of languages expressed by CFGs, are both "compromises". They are proper subsets of the larger spaces of "interesting" logical systems/languages, which are commonly used because they are efficiently decidable. And in both cases, the constraints of the formal system are commonly "hacked around" in various ways--one wonders how one expresses a "cut" in yacc. :) (%left or %right, skillfully employed, perhaps?)

One difference that comes to mind between the two worlds is the effect of nondeterminism. It isn't as big of an issue in the logic programming world, as a statement in a system is true if it can be satisfied by any solution over the axioms in that system; having multiple such solutions doesn't cause a problem. (Until you try to change the system to reflect a new reality, which is why the RDBMS guys worry about normalization so much). If the only output of a parser were a yes/no decision--"is the string in the language?"--then nondeterminism would similarly be harmless. But the output of a parser is a syntax graph (tree), not a boolean, and different ways of parsing the same text might produce different but valid parse trees. In some applications (natural language processing) that's OK; other (non-context-free) algorithms are used to decide what is meant; but in programming languages that's usually Not a Good Thing.

Can the algorithms under

Can the algorithms under Gazelle be adapted to operate against arbitrary patterns of objects, rather than characters/tokens?

Is there any relevance to this? The OMeta paper and presentation don't give enough indication that re-targeting to anything non-textual is a problem that had to be solved.

Non-textual

Kay: See OMeta Optimizer for an example. The object-based output of the compiler step is passed through the optimizer, which is just another "parser".

My understanding of the intent behind OMeta is that it seeks to use pattern matching and transformation as a control structure, generally. The "mood specific language" capability is an offshoot of that.

They're using OMeta now for

They're using OMeta now for naïve code generation (you know, the backend of a compiler), Piumarta's tiny TCP implementation, code analysis, and so on.

Necessary Ambiguity

I haven't fully perused the paper, but how can an ambiguity-free solution cover all languages? The canonical example is C's "x * y", which parses as either a pointer declaration or a multiplication. Only after you resolve the types can you distinguish the two (and C++ lets you overload, so you're really up the creek there).

Unambiguous grammars are nice, but not universal even in computing practice. Perhaps this could eliminate some pain in working with academic languages, which can remain blissfully ideal in many ways. One generally also gets exposed to academic languages by definition as much as by example, so a clearer and more intuitive description could be a big win by itself.

Depends on what you are doing

Writing a parser for an existing language is a bit different from designing a new one. Natural languages aren't context-free, likewise for many computer languages (including C/C++)--though many have a context-free "structure" that can be employed to do much of the work, with contextual clues (such as a symbol table lookup, in this case) used to handle ambiguity.
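
A toy sketch of that kind of contextual clue, in Python (the function and table names are made up; this is just the flavor of the C "lexer hack", not any compiler's real code):

def classify(tokens, typedef_names):
    """Decide what "x * y ;" means: a pointer declaration exactly when x is
    known (from the symbol table) to name a type, otherwise a multiplication."""
    if len(tokens) == 4 and tokens[1] == '*' and tokens[3] == ';':
        if tokens[0] in typedef_names:
            return f'declaration: {tokens[2]} is a pointer to {tokens[0]}'
        return f'expression: {tokens[0]} times {tokens[2]}'
    return 'something else'

types_in_scope = {'size_t'}
print(classify(['size_t', '*', 'y', ';'], types_in_scope))  # declaration
print(classify(['x', '*', 'y', ';'], types_in_scope))       # expression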

If you are designing a New Programming Language, and have control over its concrete syntax, one of the choices is to make it easily parseable (without ambiguity, or with ambiguity which is easily resolved both in the parser implementation and the language manual); knowledge of the limits of CFGs (and the subsets thereof supported by parser tools) is essential. Designing an ambiguity-free language isn't hard; S-expressions are a trivial example. However, such languages tend to look "artificial" to humans, who generally converse in context-sensitive and ambiguous languages--natural languages like English or German being examples, and the language of mathematics as well.

Syntax matters far less than semantics, ultimately. But it does matter.