yacc death revisited

Earlier we discussed a provocative paper concerning rumors of the death of YACC.

Hey, look, they've got more to say ("Parsing with Derivatives, an Update").

I think this quote sums up one key bit nicely, although the more substantive stuff they have to say is also interesting :-)

An implicit failure of the paper is that we didn't convince the peer reviewers that this was a fun, easy way to implement a general parsing library.


How does this relate to

How does this relate to Memoization of Top-Down Parsing

Yet as Norvig notes in passing, using his approach the resulting parsers in general
fail to terminate on left-recursive grammars, even with memoization. The goal of this
paper is to discover why this is the case and present a functional formalization of
memoized top-down parsing for which this is not so. Specifically, I show how to
formulate top-down parsers in a ‘continuation-passing style’ which incrementally
enumerates the right string positions of a category, rather than returning a set of such
positions as a single value. This permits a type of memoization not described to my
knowledge in the context of functional programming before. This kind of memoization is
akin to that used in logic programming, and yields terminating parsers even in the face of
left recursion.
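The memoization that abstract describes can be sketched very roughly in Python (all names here are invented for illustration): each (nonterminal, position) table entry records the right string positions enumerated so far plus the continuations waiting on them, so a left-recursive call subscribes instead of looping.

```python
# Sketch of Johnson-style memoized CPS recognition (hypothetical names).
def memoize_cps(rule):
    table = {}
    def memoized(pos, k):
        entry = table.get(pos)
        if entry is None:
            entry = {"rights": [], "conts": [k]}
            table[pos] = entry
            def notify(right):
                # enumerate each right position at most once,
                # feeding it to every continuation subscribed so far
                if right not in entry["rights"]:
                    entry["rights"].append(right)
                    for cont in list(entry["conts"]):
                        cont(right)
            rule(pos, notify)
        else:
            # re-entrant (possibly left-recursive) call: just subscribe
            entry["conts"].append(k)
            for right in list(entry["rights"]):
                k(right)
    return memoized

def recognize(text):
    """Recognizer for the left-recursive grammar S -> S 'a' | 'a'."""
    def s_rule(pos, k):
        def after_s(mid):                            # alternative: S 'a'
            if mid < len(text) and text[mid] == 'a':
                k(mid + 1)
        S(pos, after_s)
        if pos < len(text) and text[pos] == 'a':     # alternative: 'a'
            k(pos + 1)
    S = memoize_cps(s_rule)
    rights = []
    S(0, rights.append)
    return len(text) in rights
```

On input "aaa" the left-recursive alternative terminates because the inner call to S at the same position merely registers a continuation rather than recursing.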

re "How does this relate to"

How does this relate to Memoization of Top-Down Parsing

Not much, as far as I can tell. Both happen to use fixpoints, sorta kinda, and both benefit from memoization and/or caching .... but the paper you link to is concerned with finding a base case for a recursive depth-first exhaustive search while the derivatives paper is concerned with a simple and efficient (in "interesting" cases) approach to breadth-first exhaustive search.
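For contrast, the derivative idea itself is tiny in the regular-language case; the papers' contribution is extending it to CFGs with laziness, memoization, and fixpoints. A minimal regex-only sketch (encoding invented for illustration):

```python
# Brzozowski derivatives for regexes, encoded as tuples:
# ('empty',) matches nothing, ('eps',) matches "", ('chr', c),
# ('alt', r, s), ('seq', r, s), ('star', r).

def nullable(r):
    tag = r[0]
    if tag in ('eps', 'star'):
        return True
    if tag in ('empty', 'chr'):
        return False
    if tag == 'alt':
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])        # seq

def deriv(r, c):
    """The language of deriv(r, c) is {w : c+w matched by r}."""
    tag = r[0]
    if tag in ('empty', 'eps'):
        return ('empty',)
    if tag == 'chr':
        return ('eps',) if r[1] == c else ('empty',)
    if tag == 'alt':
        return ('alt', deriv(r[1], c), deriv(r[2], c))
    if tag == 'seq':
        first = ('seq', deriv(r[1], c), r[2])
        return ('alt', first, deriv(r[2], c)) if nullable(r[1]) else first
    return ('seq', deriv(r[1], c), r)               # star

def matches(r, s):
    for c in s:
        r = deriv(r, c)
    return nullable(r)

ab_star = ('star', ('seq', ('chr', 'a'), ('chr', 'b')))  # (ab)*
```

The whole recognizer is "differentiate once per input character, then test nullability"; the breadth-first flavor comes from `alt` carrying every live alternative forward at once.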

Abstract Complexities

I looked at the blog and saw them struggling with the complexity of the algorithm (which I didn't read).

I wonder, are the lower bounds just not known? I.e., is, for instance, Earley proven optimal on its class of grammars? That would give an immediate cubic lower bound, I guess.

IIRC, CFG parsing (for a

IIRC, CFG parsing (for a fixed grammar) has the same complexity as boolean matrix multiplication, for a matrix of size n x n. So, unless we have a huge breakthrough, that puts a lower bound at n^(2+eps). In fact, it's also an upper bound, although not necessarily practical given the space usage and constant factors involved.

Can you explain how it is

Can you explain how it is possible for an algorithm to be O(n^(2+eps))? This means that you have a sequence of algorithms that get arbitrarily close to O(n^2), right? So couldn't you run the n-th algorithm in the sequence if your input size is n to get a O(n^2) algorithm?

What if there is a

What if there is a preprocessing (input independent) step preparing the n-th algorithm that is worse than O(n^2)?

Right that I swept under the

Right, that's the part I swept under the carpet.

Assume that the time it takes to generate the k-th algorithm is p(k), and that the runtime of the k-th algorithm on input of size n is r_k(n). Now the question is whether we can find a function k(n) that makes the total running time p(k(n)) + r_{k(n)}(n) be in O(r(n)), where r(n) is the limit of the r_k(n) -- i.e., a way to slowly shift to more advanced algorithms so that the cost of generating the algorithms is not too much, but the shift to more advanced algorithms is sufficiently quick to make the final algorithm fast.

For example, suppose that the preparation time of the k-th algorithm is 2^k, and the runtime of the k-th algorithm is n^(2+1/k). Take k(n) = lg(n); then the total running time becomes O(2^lg(n) + n^(2+1/lg(n))) = O(n^(2+1/lg(n))) = O(n^2).
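The example above is easy to check numerically: with k(n) = lg n, the preparation cost 2^k becomes n, and n^(1/lg n) is exactly 2, so the total is n + 2n^2 and the ratio to n^2 approaches 2.

```python
import math

# Hypothetical costs from the example: preparing the k-th algorithm
# costs 2^k, running it costs n^(2 + 1/k). Choose k(n) = lg n.
def total_cost(n):
    k = math.log2(n)
    return 2 ** k + n ** (2 + 1 / k)

# ratio of total cost to n^2 for n = 2^10, 2^20, 2^30
ratios = [total_cost(2 ** e) / (2 ** e) ** 2 for e in (10, 20, 30)]
```

The ratios hover just above 2, confirming the O(n^2) bound for these particular (made-up) cost functions.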

The general case: suppose that p(k) is some super fast growing function, and that r_k(n) is some super slow shrinking function (of k). The total runtime is O(p(k(n)) + r_{k(n)}(n)) = O(max(p(k(n)), r_{k(n)}(n))). We want the runtime to end up being O(r(n)), so we take k(n) sufficiently slow growing that p(k(n)) is in O(r(n)). Then we have to check that r_{k(n)}(n) is in O(r(n)). This is true if k(n) gets arbitrarily large, which can be done because presumably p(k) is finite for finite k.

So it seems that no matter how expensive it is to generate better algorithms, you can devise a method to make an asymptotically fast algorithm.

Ah you're right. Wrong

Ah you're right. Wrong knee-jerk reaction. I think the problem with your argument is simpler: each of those O(n^(2+eps)) algorithms has a different constant factor. The constant factor on the n-th algorithm could be 2^n. Your hybrid algorithm could be exponential!

Oh, right. And that problem

Oh, right. And that problem is unfixable...

Actually, thinking about it

Actually, thinking about it a little more, I think my first objection wasn't fixed by your construction, either. Suppose the total operation count of algorithm f_n on instance m is 2^n + m^(2 + 1/n). Choosing n = 2 lg(m) gives a count of m^2 + m^(2 + 1/(2 lg m)). But that latter term still isn't O(m^2). Of course, an operation count of 2^n * m^(2 + 1/n) just makes things worse.

Takeaway: be careful with sequences of Big-Ohs. "Proof" that merge-sort is O(n): by induction, sort the halves in O(n); merge is O(n); thus the whole operation is O(n). QED

I think the latter term *is*

I think the latter term *is* O(m^2), since m^(1/(2 lg m)) = sqrt(2), a constant (plug in m = 2^k).
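A quick numeric check: substituting m = 2^k gives m^(1/(2 lg m)) = 2^(k/(2k)) = sqrt(2) for every m, so the latter term is just sqrt(2) * m^2.

```python
import math

# m^(1/(2*lg m)) evaluated at m = 2^8, 2^16, 2^32:
# always 2^(1/2) = sqrt(2), independent of m.
vals = [(2.0 ** k) ** (1 / (2 * k)) for k in (8, 16, 32)]
```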

Other take aways: Don't

Other takeaways: don't calculate in public, and check your work.

Someone showed it is reducible to matrix multiplication

That someone was Valiant, I guess; that guy has been all over the place proving fundamental stuff. Anyway, IIRC, with something akin to Karatsuba multiplication (Strassen's trick), matrix multiplication drops in complexity from O(n^3) to something lower -- O(n^2.376) for Coppersmith-Winograd. (As far as I know, it's not epsilon. Not sure.)

As far as I know it is a theoretical result; it doesn't pay off in practice. No idea if matrix multiplication in the setting of grammars can be made practical, and there are always hairy details.

By all accounts, as far as I know, 'practical' general parsing complexity is cubic.

What I don't understand is

What I don't understand is why everyone in this thread (except the original author) assumes that "epsilon" necessarily means an infinitesimal variable which is arbitrarily small.

It's true that epsilon usually carries this meaning, but we programmers know that variable names can be misleading.

What I was taught is that the matrix multiplication exponent is 2 plus a small constant (around 0.376), and that it was conjectured we could get it even smaller. Hence the habit of naming the small constant by a variable, to describe "the current best value obtained by complexity research, on which I don't specifically depend". I think "epsilon" is a reasonable name for that, even if a bit misleading.

Nowadays there are group-theory results that may *also* legitimize the "arbitrarily small" meaning of "epsilon". But I don't know much about this work and probably wouldn't mention it when making a quick statement about the complexity of an algorithm.

I did mean an arbitrarily

I did mean an arbitrarily small, but positive, epsilon. ISTR a conjecture that we could find algorithms that get arbitrarily close to n^2. According to http://www.siam.org/pdf/news/174.pdf, a conjecture that it's exactly n^2 (http://arxiv.org/abs/math.GR/0511460) is credible.

All practical implementations are cubic

This result is just paper until they implement a quadratic algorithm.

Of course it is. You just

Of course it is. You just asked for lower bounds.

it's all about the sweet spot

Here's one way to look at it:

A CFG describes a push-down automaton (PDA), as we well know.

In this context, we think of the PDA as a finite state machine that will run with a stack and against a given input.

CFG notation -- the way we think of and about grammars -- tends to be a PDA description language that uses non-deterministic finite automata (NFA).

The conversion of NFA to equivalent DFA is well known (it involves taking closures of transition relations over sets of NFA states -- treating sets of NFA states as DFA states).

In part, the algorithm described in this paper does that DFA conversion "on-demand" in response to input. It is a lazy, memoizing or caching NFA->DFA algorithm situated in a PDA. The authors take care to actually get the description of this pretty much right for recognition purposes (e.g., well-founding left recursion), but there isn't a lot of mystery there (other than why others didn't do this sooner).
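To see what "on-demand" conversion means, here is a sketch in the simpler setting without the stack (NFA encoding invented for illustration; epsilon transitions omitted): DFA states are sets of NFA states, and each transition is computed and cached only when the input first demands it.

```python
def make_lazy_dfa(nfa, starts, accepts):
    # nfa: dict state -> {symbol: set of next states} (made-up format)
    cache = {}  # memo: (frozenset of NFA states, symbol) -> frozenset

    def step(states, sym):
        key = (states, sym)
        if key not in cache:            # build this DFA transition on demand
            nxt = set()
            for s in states:
                nxt |= nfa.get(s, {}).get(sym, set())
            cache[key] = frozenset(nxt)
        return cache[key]

    def run(word):
        states = frozenset(starts)
        for sym in word:
            states = step(states, sym)
        return bool(states & accepts)

    return run

# NFA for strings over {a, b} that end in "ab"
nfa = {0: {'a': {0, 1}, 'b': {0}}, 1: {'b': {2}}}
ends_in_ab = make_lazy_dfa(nfa, {0}, {2})
```

Only the subset states actually reachable on the inputs seen so far ever get materialized, which is the lazy half of the trick; the paper's algorithm additionally threads this through a PDA's stack.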

What's startling about the paper (and, apparently, the act of writing it) is that, in fact, in practice, it yields oddly useful parsers from very little code. Wait, it's *that* easy?!? Huh. Go figure.

Theory seems to point out that it isn't that easy. You can always create pathological grammars that don't do well under this technique (it looks like, and is expected). Randomly generated grammars don't appear to be handled well, on average. But.... screw all that.... this algorithm is both pretty easy to predict if you understand it a bit AND it also appears to perform quite well on a large space of grammars of the sort that people actually want to use. There is something about Nature that nudges humans, at this time, towards certain kinds of grammars and, interestingly, this algorithm does really well for many of those grammars.

It is a lazy, memoizing or

It is a lazy, memoizing or caching NFA->DFA algorithm situated in a PDA.

Unfortunately, the set-of-state simulation described in the paper doesn't seem to take advantage of the context-freeness of PDAs, hence their inability to guarantee polynomial runtimes. However, I believe it's possible to exploit the structure of PDAs better and guarantee Earley-level performance bounds.
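For reference, Earley-level bounds come from indexing items by their origin position so that each completed constituent is shared, which is one way of exploiting context-freeness. A minimal Earley-style recognizer sketch (grammar encoding invented for illustration; epsilon productions omitted for simplicity):

```python
def earley_recognize(grammar, start, tokens):
    # grammar: dict nonterminal -> list of right-hand sides (tuples)
    chart = [set() for _ in range(len(tokens) + 1)]  # items per position
    for rhs in grammar[start]:
        chart[0].add((start, rhs, 0, 0))             # (lhs, rhs, dot, origin)
    for i in range(len(tokens) + 1):
        agenda = list(chart[i])
        while agenda:
            lhs, rhs, dot, origin = agenda.pop()
            if dot < len(rhs):
                sym = rhs[dot]
                if sym in grammar:                   # predict
                    for alt in grammar[sym]:
                        item = (sym, alt, 0, i)
                        if item not in chart[i]:
                            chart[i].add(item)
                            agenda.append(item)
                elif i < len(tokens) and tokens[i] == sym:   # scan
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))
            else:                                    # complete
                for l2, r2, d2, o2 in list(chart[origin]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        item = (l2, r2, d2 + 1, o2)
                        if item not in chart[i]:
                            chart[i].add(item)
                            agenda.append(item)
    return any(lhs == start and dot == len(rhs) and origin == 0
               for lhs, rhs, dot, origin in chart[len(tokens)])

# ambiguous, left-recursive toy grammar: S -> S '+' S | 'n'
toy = {'S': [('S', '+', 'S'), ('n',)]}
```

Because items are deduplicated per position, ambiguity and left recursion cost polynomial work rather than an exponential set-of-states blowup.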

question of goals

Guaranteeing performance bounds is expressly a non-goal of the effort.

Even if it's a non-goal of

Even if it's a non-goal of the original authors, such work would definitely be interesting.

AFAIC, +1. I have no

AFAIC, +1. I have no difficulty agreeing with this:

[...] this algorithm is both pretty easy to predict if you understand it a bit AND it also appears to perform quite well on a large space of grammars of the sort that people actually want to use. There is something about Nature that nudges humans, at this time, towards certain kinds of grammars and, interestingly, this algorithm does really well for many of those grammars.

My general intuition is we can allow ourselves a (moderate) dose of pragmatism in these matters as well.

Surely, languages such as C, in which millions of LOC of more-or-less "low level" code are expressed (rather more than less, in the case of C), do indeed demand the fastest lexing and parsing phases one can possibly achieve; hence the common preference for the purely linear complexities found in the LR family of algorithms.

But also, the higher the language's expressive power and the average abstraction level of programs written in it, the more likely it is -- if you understand the language's syntax/semantics well and make the most relevant use of them -- that you'll end up writing relatively smaller and/or less numerous phrases meant as compilation (or interpretation) units.

Then, I believe, the tension between the parsing algorithms' performance on one side and the size of the classes of grammars they can handle on the other (along with the ease of implementing grammars/parsers for/with them) becomes less and less prevalent, as seen by the implementor in the end anyway (as compared to other performance-related issues, e.g., w.r.t. the runtime support, JITting, etc.).

Imho, the "Yacc is dead" paper and the authors' satellite papers open up another pretty interesting field of very lightweight solutions to the general parsing problem -- if not for all PLs imaginable to come, of course, then at least for a large number of them.