Yacc is dead

In Yacc is dead (2010) Matthew Might and David Darais of the University of Utah, Salt Lake City...

present two novel approaches to parsing context-free languages. The first approach is based on an extension of Brzozowski’s derivative from regular expressions to context-free grammars. The second approach is based on a generalization of the derivative to parser combinators. The payoff of these techniques is a small (less than 250 lines of code), easy-to-implement parsing library capable of parsing arbitrary context-free grammars into lazy parse forests. Implementations for both Scala and Haskell are provided. Preliminary experiments with S-Expressions parsed millions of tokens per second, which suggests this technique is efficient enough for use in practice.

It seems every problem in computer science can be solved with either one more level of indirection or a derivative.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

§1

I agree a lot with section 1. Not least because I am an average programmer who has to deal with Yacc. I know the motivations for the Lex/Yacc team, but it is a serious and noteworthy issue when people prefer setting up loops and fget() calls to using a tool that has worked for 50 years.
Then again, such questions—qualitative and concerned with irrational things—have no place in this more-dismal science.

This sounds like interesting

This sounds like interesting work, but I'll add a few comments by way of perspective:

1) There is an existing class of bottom up algorithms for parsing arbitrary context free (possibly ambiguous) grammars, called GLR algorithms. This page includes an incomplete list of different GLR parser generator implementations (even Bison these days has a GLR option though it is theoretically not the most efficient).

2) Ambiguity is a fact of life in parsing natural language, but folks designing computer languages often don't want a parsing formalism that accepts ambiguous parses that must be disambiguated post parse. For most artificial language and development purposes, it seems like a better bet to use a grammar class that enforces non-ambiguity already in the language/acceptor design. There is a big space of grammars in between LALR(1) and context free.

I can only agree with this

I can only agree with this:

Ambiguity is a fact of life in parsing natural language, but folks designing computer languages often don't want a parsing formalism that accepts ambiguous parses that must be disambiguated post parse. [...] There is a big space of grammars in between LALR(1) and context free.

as the "Yacc is dead" paper is indeed clear about the fact their solution deals with the ambiguity problem (for some inputs and some CFGs) in a pretty "raw" fashion — re: the parsing algorithm result, in the general case, it's all about parse forests, there, and not about a (single) parse tree.

One has to pay attention upfront to this specifics, as this obviously may or may not be compatible with one's mileage w.r.t the rest of the language processor's implementation.

In fact it can't be any more

In fact it can't be any more "cooked" in the general case because determining whether an arbitrary CFG has sentences with ambiguous parse derivations is a formally undecidable problem. So it's important here to be circumspect about what one actually wishes for from the parser generator.

Yes, exactly right: [...]

Yes, exactly right:

[...] determining whether an arbitrary CFG has sentences with ambiguous parse derivations is a formally undecidable problem. So it's important here to be circumspect about what one actually wishes for from the parser generator

One thing for instance that would have been probably nice to find in their paper is a treatment of how their approach/method, and implementations, specifically compare to other PEG-based and/or Packrat parsing algorithms, out there, w.r.t, e.g.:

1. the respective "sizes" of classes of grammars within reach of theirs vs. PEGs/Packrat's

2. the respective upper bounds of algorithmic complexity for those, again vs. PEGs/Packrat's implementations

to start with.

(as it seems that this "Yacc is dead" solution side for the parsing problem, if "computationally quite close" to anything, that would probably be to those, in my understanding.)

Expressive strength

In Total Parser Combinators (ICFP 2010) I use Brzozowski derivatives to show that parsing is decidable—but not necessarily efficient—for every "grammar" expressible using my combinators. I also prove that any total, finitely ambiguous parser which can be implemented in the host language can also be implemented using the combinators (up to reordering of the results).

Thank you for this

Thank you for this clarification/complement for LtU readers. That will surely go in favor of a less subjective reading of the Yacc is dead paper.

Ambiguous parsing

Canonical counterexample to "we don't need ambiguous parsing":
x*y;
In C++ this could be a multiplication, pointer declaration, or arbitrary overloaded meaning of "*". You have to hit name and type resolution before you can distinguish them.

Sure, lots of modern/well-designed languages have unambiguous parsing. But what about the billions of lines of code written in C, C++, (multiple dialects of) FORTRAN, COBOL, etc? Are you treating preprocessors and macros? The class of "machine languages" and "ambiguous parsing languages" is rather more conjoined than may be readily apparent to those who only parse in an academic environment. But wait, there's more!

I have the pleasure of working with a full GLR parser, and I see no advantage in ever using a more restricted generator again. I don't maintain an ambiguous grammar, but even for unambiguous the advantages are readily apparent. I use left recursion with willful abandon, keeping grammars small and close to the formal specifications. Speed optimization has never been an issue; my quick bench run shows 1,000 lines parsed per second for an unambiguous grammar, and the penalty for in-line semantic ambiguity resolution is rather small AFAIK. If that isn't enough, there is plenty of parallelism that we haven't taken advantage of. Parsing is a naturally coarse-grain parallel activity per file, and GLR parsing has small-grain parallelization possibilities on top. Running full tilt parallel on a modern machine could easily bring it to 4,000 lines/second. We haven't bothered with this because the size of trees in memory vastly dominates parser scale limitations; filling up RAM a little faster hasn't been a priority.

Parser generators are a beautiful case study of software development colliding with practical limitations and solving it with excellent computer science and rigorous tradeoff analysis. Lots of tooling infrastructure still stands on various LR variants, and understanding their limitations is of course necessary to utilizing them properly. Unless you plan on doing parsing in a highly resource-restricted environment, though, these techniques don't buy you much in the modern computing environment (and if you're in the "unless" case, you probably want to hand-code your parser anyways).

I mentioned that one

I mentioned that one *designing* a new bit of language does better to avoid ambiguity and you seem to agree with that sentiment.

As you point out, lots of existing languages have various types of syntactic ambiguity and there are various approaches to eliminating that - it was also explicitly part of the motivation for some GLR developments. It's also worth pointing out that in many cases, the ambiguity simply cannot be eliminated by syntactic considerations alone (including precedence rules) as the correct choice depends on semantic analysis of the symbols. I certainly don't go from there to thinking that always building a bottom up parse forest must be *the* right approach.

Of course I am not saying throw away GLR parsers - above I pointed out there existence of many implementations, and I fully support your view that in additional to handling natural language they could be good choices for parsing some historical computer languages.

But I see an analogy between strong type systems that require more effort to pass the compilation stage, but then prevent more runtime errors, and grammar formalisms that balk at ambiguities in their compilation stage but therebye eliminate some unanticipated runtime problems. If the design situation is such that simple precedence rules always give the right answer, then more power to them. But that is clearly not always the case.

Sylvain Schmitz's Ph.D. thesis contains some good examples of ambiguity problems in GLR parsing, arguing that relying on precedence is just a form of risky chance taking.

language design

Ah, I see what your meant there much more clearly. I directed my commentary at a much more general point, and intended my comment to be semi-independent of yours (whether that came across or not). I stand in violent agreement that ambiguous machine languages are a frightfully bad idea, and dealing with them is a necessary evil; language design should rightly avoid such.

However, I think the point still stands on the advantages of GLR. It provides the full power of context-free grammars, and as you mention one can utilize additional constraints to prevent abuse and common mistakes. On these premises it is certainly considered prima facie evidence of a bug for a parser to emit an ambiguity.

What are you using left

What are you using left recursion for? I'm trying to figure out how useful left recursion is, and how much performance you should be willing to give up to get left recursion.

lists

Writ large, it represents the ability to associate recursion to the left as well as right. Writ small, it comes in handy for compact representations of various lists, especially tail-lists:

TypeName = TypeName '.' Identifier ;
TypeName = Identifier ;

VariableDeclaratorId = VariableDeclaratorId '[' ']' ;
VariableDeclaratorId = Identifier ;

(examples nabbed from my translation of the internal Java spec grammar to our GLR BNF notation. Should be readable; single-quoted elements are literals, unquoted are productions.)

It's not that left recursion is superior in isolation, but that it is superior to have it available. General recursive descent parsers can't handle left recursion and force grammar writers to bend their grammars around the limitation. Having both available allows CFG grammars to be as short and expressive as possible, short of E-BNF style macros. In my experience, E-BNF is not nearly as useful once one has a full CFG-capable BNF.

My point earlier is that a modern GLR implementation is about as fast as any use will need, so denying yourself the capability is a rather pointless exercise. Josh Stern rightly pointed out that the use of the full GLR capabilities (such as ambiguous parsing) is not always wise, but I have found external checks quite sufficient to keep myself out of that morass. I do have viewpoint bias here, as all my grammar development has been against large samples of existing code, which makes experimental ambiguity checking somewhat easier than for, say, new DSL development. I see no reason for the underlying parser generator, however, to be limited even if a language author obeys (internal or external) restrictions on feature use. Those kind of architectural assumptions can bite hard down the road.

Left Recursion Desirability

I don't think a generalization about recursive descent and left-recursion is apt here. If backtracking is supported, left-recursion can be as well, and it can also detect ambiguous parses (or you can stop on the first good one).

If the parser is being compiled, the question is simply whether or not the grammar can be translated into a controlled form, one that the language designer might like to use directly too, such as:

TypeName = Identifer ('.' Identifier)*

VariableDeclaratorId = Identifier ( '[' ']' )*

although there are cases that are not quite so straightforward, of course.

At one time I looked at detecting left-recursions dynamically (easy) and automatically setting the continuation tail(s) as a retry alternative (messier).

But if you can normalize the left recursions as part of compiling parser code, and you implement backtracking, the job is done.

The next thing to do is learn how to prune the continuations of untried alternatives/extensions when the parse reaches a point where, if the input is well-formed, there cannot possibly be an alternative to the part of the parse that just succeeded. For example, if there were a rule like

Choice = TypeName | VariableDeclaratorId

one can prune alternatives when either the '.' or '[' show up. It is also useful to know when a left-recursion behaves regularly (that is, it is more successful to try the widest successful recursion first). Note that the ambiguity when there is just an Identifier is also recognized in this contrived case.

Whether or not one wants to deal with ambiguous languages, I once speculated that backtracking provisions may be useful not only for (optionally-constrained-)CFL generality but for error-detection, providing useful diagnostic information, and being able to continue parsing with provisional fixes that allow further errors to be detected with reasonable quality.

That the language designer could rely on a natural form of expression of the grammar seems desirable for that reason as well, since that may cue how the parser climbs out of an error case.

I concede that one can contrive some pretty pathological languages to undermine this idea, but I haven't surrendered completely.

I took the time to read

I took the time to read the whole paper, though not the time, yet, to play with their two implementations to get a more concrete taste about the paper's claims.

Quoting another part:

Grammars and formal languages date to the earliest days of computer science. Parsing is almost as ancient. Even the derivative of a regular expression is decades old. Given all that has been done, it is surprising to find something was still left undiscovered. Perhaps not surprisingly, derivative-based parsing defies the classical taxonomies. Historically, methods for parsing have been divided into two groups: top-down and bottom-up. Derivative-based parsing does not fit cleanly into either category.

(Which I have no problem to buy as it is phrased, a priori.)

For my own note, the high-level algebraic strategy they followed (based on derivatives) in their reasoning to tackle the membership- and parse tree/forest construction problems for CFGs did remind me a lot of James Clark's algorithm for RELAX NG validation, in the context of schema validation with RELAX NG (based on Hedge Automata theory, successfully used in RELAX NG implementations, precisely).

Now, as to whether "Yacc is dead" already and just because of this work (or... not quite dead, still), well, I wouldn't venture in those waters yet, or at least not before seeing a few other convincing implementations in languages other than Scala and Haskell (I may give a try to it myself in C# just out of curiosity if I can find the time for that), but the results and their conclusions are indeed impressive and do look promising, IMHO.

Thanks for the link.

How is this approach more

How is this approach more efficient than backtracking + memoization, which is also easy to implement, has good asymptotic time and not so good space complexity and rather high constants?

Yacc is dead? In short: No

Because if I want to use Yacc, I can find dozen of tutorials, even books on how to use Yacc.

Whereas if anyone want to use their approach, all they have is a paper difficult to read, and a library implemented in two non-mainstream languages and no tutorial (no reading the library's code isn't the same thing).
So if they really want to kill Yacc as they claim, they really did a incomplete job..

Yacc is dead ? for me yes ...

Yacc (or bison or other LALR(1) or GLR parsers) is dead for me because of parser combinators and especially Packrat parsing.
However, I don't think Yacc is ready to die yet in general ... and I am not found of self-proclaimed "killer" papers... though beyond the title the paper seems interesting.

Yes, if there was only one

Yes, if there was only one thing to keep as not serving the paper very well... that would likely be the paper's title, precisely.

As IMHO, that kind of "Yacc is dead" statement's acceptance vs. rejection, is mostly, a matter of one's own taste and of one's speed of closure (and as always... one's mileage: availability, exec time performances, memory footprints, etc) w.r.t algorithms outperformed (or "killed") by new, recently published ones.

There, it does seem it's just the title is "a bit too much" for the age of its content, since the latter does imply the realm of implementations (i.e., usable parsers contemplated for the reader, in the end), and not just of purely formal things.

YAYIDA

yet another yacc is dead article...

Also interesting

What am I missing? At first

What am I missing? At first reading, this looks like non deterministic LL parsing implemented using heavy language support (sequence comprehensions, lazy evaluation and data structure) and hidden under another theorical background.

And BTW, I wonder how one can claim to have a parsing methode able to handle any grammar efficiently and then
- state that to handle sexp (one of the simplest non regular grammar), one had to handle closures specially in order to avoir exponential running time;
- present experiment results only for sexp.

There are some interesting

There are some interesting comments on this paper at Hacker News especially Matt Might's link to the reviewers' comments.

Link seems dead

However a copy could be found on one of the mirror sites.

Russ Cox : Yacc is Not Dead

A quite harsh answer by Russ Cox.

Russ Cox is confused

The paper does not argue that derivative-based parsing is the best algorithm ever (asymptotic performance or otherwise). The paper does not argue that people will abandon YACC (Bison, ANTLR, etc.) in droves and take-up derivative-based parsing. To the extent Russ is countering those positions, he is not talking about the paper.

Russ also seems to want to "prove" that derivative-based parsing is downright bad because of its exponential worst-case performance. The paper, though, is clear that that worst-case is a trade-off for having an on-line parser with a tiny implementation that is easy to reason about and evidently sufficient for many uses for which "cargo cult" parsing is currently popular. Of those virtues of derivative parsing, the only one shared by the algorithms Russ names is "good enough performance in many cases." For the stronger guarantees of Russ' favorite parsers, the programmer must statically compile patterns and carefully work around hard-to-understand limitations of the parser generators.

Take, for example, Russ' comment: "Concretely, the paper fails at goal #2: "parses efficiently on average". Most arbitrary context-free grammars are ambiguous, and most inputs are invalid, so most parses will take exponential time exploring all the ambiguities [....]"

The paper is not really all that concerned with "arbitrary context-free grammars". Rather, it's very specifically interested in the grammars programmers do and will in the future work with. The asymptotic worst-case complexity of the parser generator is of no practical concern in many situations.

Meanwhile, YACC is dead - long live YACC. Just look at all the "cargo cult parsing" going on.

Disappointed, maybe, but not confused.

Tom Lord writes, "The paper, though, is clear that that worst-case is a trade-off for having an on-line parser with a tiny implementation that is easy to reason about and evidently sufficient for many uses for which "cargo cult" parsing is currently popular."

I disagree. I don't think the paper is at all clear about what the worst case is, which would be the first step toward trading it off for the list of benefits. Search for the word efficiency or efficient or exponential. This is the treatment of efficiency:

The “efficient parsing” condition is necessary because programmers will avoid tools branded as inefficient (however justly or unjustly this label has been applied). Specifically, a parser needs to have roughly linear behavior on average. Because ambiguous grammars may yield an exponential number of parse trees, parse trees must be produced lazily, and each parse tree should be paid for only if it is actually produced.
...
Our original implementation did not contain the closure operation, and when we tested it by parsing large, randomly generated S-Expressions, parsing took time exponential in the size of the input; it took seconds to parse a hundred tokens, but over an hour to parse a thousands tokens. Using right-recursive lists reduced the performance penalty to minutes to parse thousands of tokens, but this was still unacceptably high.
...
By introducing the special closure construct, we achieved roughly linear scaling by hand-optimizing the implementation of the parse, parseFull and derive methods.
...
Ultimately, derivative-based methods share the best properties of both words: they are easy to implement, and easy to implement as libraries (like top-down parsers), yet they are efficient and expressive (like bottom-up parsers).

They are simply not efficient like bottom-up parsers.*

The code is indeed a tiny implementation, but I did not find it easy to reason about, and I do not believe it is sufficient for good enough performance if there is any ambiguity at all in the grammar tickled by the input. Suppose you are implementing a grammar where a common expression is ambiguous. If that expression appears once in the input, no big deal. If it appears ten times, probably no big deal. Twenty is pushing it. Thirty is bad. Forty is deadly. This is not a sane performance curve, especially when it is well known how to do better.

My main disappointment is this: the paper makes a case for a nice parsing API and then implements it with a bad algorithm. This code is not straightforward enough that users are going to write it themselves (like, say, Doug Crockford's TDOP parsers). They are going to use the API as provided by a library, so it is okay if the library takes a little more effort to implement. And you could implement that same API with a GLR parser or one of the other actually efficient methods for parsing arbitrary context-free grammars.

The on-line parser part is a red herring: it is about as hard to write an on-line GLR parser as it is to write an off-line GLR parser generator, and both are only a little harder than writing a DFA implementation.

* In fact, Daniel Spiewak has pointed out in comments on my blog that I may have misunderstood the description of the Scala code: it may be not as lazy (which is to say as efficient) as I thought. If that is true, then ordinary backtracking—such as, ironically, in the cargo cult parsing via (ir)regular expressions—would be faster.

[misthreaded -- removed]

(sorry)

legacy languages

It seems to me that with the maturation of clang, the difficulty of parsing c++ should be a thing of the past. I just wish some would package up a clang lib that outputs an AST for some relavent AST api. So that's Difficult to Parse Languages --, now we just need some defacto standard parsers for the other n Difficult to Parse Languages. Of course this is somewhat tongue in cheek as there are surely many dialects of c++ clang doesn't cover. But wouldn't it be nice if there was library of common language parsers? I know, the fun is in the parsing algorithms, not in the implementation. Just saying.