The fundamental limitations of parser combinators... and how to fix them.

I am presenting a paper at PADL 2011, in which I discuss the fundamental limitations of current Haskell parser combinator libraries like Parsec and UUParse. The limitations are inherent to the modelling of non-terminal references using direct recursion in a referentially transparent functional language like Haskell.

The basic problem is that because of the model used, an algorithm can only ever "look at" a finite top part of the infinite tree representation of the grammar, and the full grammar can never be observed. If you have ever wondered why libraries like Parsec or UUParse only ever use LL-style algorithms (sometimes with memoization or lazy computation of first sets), it is because only these algorithms can work with such a finite top part of a grammar. Bottom-up algorithms or grammar transformations (which are common in parsing literature and widely used in parser generators) cannot be implemented. Even printing a BNF representation of the grammar is fundamentally impossible.
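To make this concrete, here is a minimal sketch of my own (not the representation Parsec or UUParse actually use): even a deep embedding of the combinators does not help, because a non-terminal modelled by direct recursion is just a cyclic value, and a pure observer can do nothing but keep unfolding it.

data P = Chr Char | Seq P P | Alt P P

expr :: P
expr = Chr 'x' `Alt` (expr `Seq` (Chr '+' `Seq` expr))  -- refers to itself

-- A pure traversal cannot detect the cycle: without a depth bound it simply
-- diverges, so the grammar as a whole is never observable.
boundedSize :: Int -> P -> Int
boundedSize 0 _         = 1
boundedSize _ (Chr _)   = 1
boundedSize d (Seq a b) = 1 + boundedSize (d - 1) a + boundedSize (d - 1) b
boundedSize d (Alt a b) = 1 + boundedSize (d - 1) a + boundedSize (d - 1) b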

In my paper, I show how a different representation of the recursion in the grammar allows for a full "view" on the structure of the grammar, while keeping the DSL very shallow. I demonstrate the additional power with implementations of a few grammar algorithms in the accompanying technical report. An elaborate implementation is available in the grammar-combinators library (featuring a packrat parsing algorithm, left-corner grammar transform, many grammar manipulation primitives, basic UUParse and Parsec compatibility etc.).
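To give a flavour of the alternative (this is a simplified sketch, not the grammar-combinators API itself, which additionally keeps non-terminals and their semantic values typed): non-terminals become an explicit, finite data type, and the grammar is a total function from non-terminals to right-hand sides in which references to other non-terminals are ordinary, observable values.

data NT  = Expr | Term deriving (Show, Eq, Enum, Bounded)
data Rhs = Chr Char | Ref NT | Seq Rhs Rhs | Alt Rhs Rhs

grammar :: NT -> Rhs
grammar Expr = Ref Term `Alt` (Ref Expr `Seq` (Chr '+' `Seq` Ref Term))
grammar Term = Chr 'x'

-- Because 'Ref' is an observable constructor, the whole grammar can be
-- enumerated: printing a BNF view, computing non-terminal dependencies, or
-- transforming the grammar all become ordinary functions.
refs :: Rhs -> [NT]
refs (Chr _)   = []
refs (Ref n)   = [n]
refs (Seq a b) = refs a ++ refs b
refs (Alt a b) = refs a ++ refs b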

Details are in the paper. I'm very interested in feedback from the community. For those of you interested in playing with the code, there is a tutorial available.

Can you tell why you felt

Can you tell why you introduced parser combinators in your paper by inventing 7 symbols? Is this intended to increase the paper's perceived 'math factor'?

The symbols for the operators

The choice of symbols is of course not fundamental and I'm not 100% happy with the situation myself. In the paper, I (unfortunately) use two sets of operators: the UUParse operators when discussing that library and the operators from my grammar-combinators library. The reason I have not used the UUParse symbols (from Applicative and associated classes) in my library is technical, and partly explained here.

Have you tried sending this

Have you tried sending this to any of your non-Haskell-using peers? It's unfortunate to see such work potentially ignored simply due to its presentation style.

Non-Haskell users

Well, I'm presenting the paper at the non-Haskell-specific PADL conference, so I hope to find some interested people there. We are also considering writing a journal version of the work at some point, and then I plan to improve the notation.

The nature of parser combinators prevents many things.

Bottom-up algorithms or grammar transformations (which are common in parsing literature and widely used in parser generators) cannot be implemented. Even printing a BNF representation of the grammar is fundamentally impossible.

I am not an expert in BNF specification, but I was told it was designed for context-free grammars.

Parser combinators can easily express context-sensitive grammars; an example was given with Parsec:

-- 'openTag', 'closeTag', and 'Node' are assumed to be defined elsewhere.
-- The close tag must match the open tag: a context-sensitive constraint,
-- expressed by binding 'tagName' monadically and reusing it below.
xmlContent = do
    (tagName, params) <- openTag
    content <- many xmlContent
    closeTag tagName
    return $ Node (tagName, params) content

I think you cannot use BNF for parser combinators.

CFGs and parser combinators

This is a valid remark, but it does not invalidate my point. Even if parser combinators can actually do context-sensitive things in the individual rules, the question of how you model recursion remains. For many applications you still need a full view of the grammar, and that requires an encoding which allows you to observe the full grammar and the structure of the relationships between its non-terminals.

By the way, I think it is possible to take my paper and replace the applicative-style production rule combinators with monadic ones to obtain the same concepts for a form of context-sensitive grammars.
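For intuition, here is a toy contrast (my own example, not code from the paper): the shape of an applicative rule is fixed before parsing starts, so it can in principle be reified as a context-free production, whereas a monadic rule can choose its continuation from a parsed value, which is exactly the extra, context-sensitive power.

import Text.Parsec
import Text.Parsec.String (Parser)

-- Applicative rule: the structure is static.
applicativeRule :: Parser (Char, Char)
applicativeRule = (,) <$> letter <*> digit

-- Monadic rule: the rest of the rule depends on a parsed value (the same
-- trick as the xmlContent example above), so it has no fixed production shape.
monadicRule :: Parser Char
monadicRule = do
  c <- letter
  char c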

Monadic combinators

By the way, I think it is possible to take my paper and replace the applicative-style production rule combinators with monadic ones to obtain the same concepts for a form of context-sensitive grammars.

Ulf Norell and I represented grammars as functions from well-typed nonterminals to (somewhat restricted) monadic right-hand sides in the paper Structurally Recursive Descent Parsing.

I miss the term Reification

I glanced over the technical report, so take the following with a grain of salt.

Am I correct that the paper boils down to, in short, the idea that by "reifying" parser rules, different parsing strategies may be used for a defined grammar?

I would be interested in seeing this work in languages which allow a simple form of introspection/reification in the language. It would give a nice argument for a Haskell extension which makes that possible. (I envision that one could just reuse the function names as symbols for the generating grammar.)

[ The question being, does your manner of reification generalize to other problem domains as well? What are other means of reification in Haskell? How do Haskell extensions help you in reifying certain data? What other choices are there? How do we know that this is not a 'Rube Goldberg' solution? ]

Language features

I'm not 100% sure what you mean by reification, but I think my solution is specific to a pure programming language. If you have compile-time or run-time meta-programming or even mutable state, I think other solutions may be possible and perhaps simpler.

Reification by Wikipedia

By means of reification something that was previously implicit, unexpressed and possibly unexpressible is explicitly formulated and made available to conceptual (logical or computational) manipulation. Informally, reification is often referred to as "making something a first-class citizen" within the scope of a particular system. (From Wikipedia)

One way of looking at your report is that it is difficult, but not impossible, to exploit parser combinators to the fullest, since you can't really reuse what you expressed as grammar rules: function definitions are not first-class citizens of the language.

Concretely, the definition 'def expr = constant + (expr . op . expr)' may read like a grammar rule, but since in most languages you don't have access to the defined terms (introspection), it is difficult, but not impossible, to reason about the grammar it defines.

Your solution, to me, reads like: 'We use a manner of reifying grammar rules in the language [through type trickery?], such that they become first-class citizens, and from there we can easily derive properties of the expressed grammar and switch between different parsers.'

I am not so surprised by the fact that by reification you can implement more functionality and more complex parsers. I wonder more about whether the strategy chosen for reification is in any sense, for lack of a better word, 'right.' Does it generalize?

Naive point

I'm not sure whether this point is helpful or stupid, as I don't have to write specs this formally. So, on the off chance it is the former and not the latter, I'll take up the invitation as part of the community.

The idea of a parser is to take a string to a structure; in other words, to take something relatively unstructured and make it structured, not to take something structured and evaluate it (that is more like an evaluator). Looking at the base structure:
data Domain = Line | Expr | Term | Factor | Digit

You are picturing a source much more hierarchical than what actually exists in most places where someone has to parse. Terms can span multiple lines. Digit seems pointless: the first and only thing you do with the digit structure is convert it into a number (where I assume in your notation number :: Number -> something).

I'm all in favor of advances in parsing, and I love your idea of better organization leading to desirable properties of parsers. IMHO, think about parsing the TeX underlying your paper and trying to pull out the paragraphs before and after references 2, 6, and 14, including equations.

That is what a parser has to handle. It has to be able to grab what it needs from an unstructured source without understanding the underlying structure, that is, without being a TeX compiler.

The example

The example is kept simple for clarity of presentation. Concerning TeX, I'm not familiar with the details, but I understand it is fundamentally based on macro-expansion. As such, I expect it will be hard to separate parsing from execution, so I'm not sure that classical parsing solutions apply.

Anyway, I agree with you that any parsing tool should be usable for real languages, but I believe that with some extensions for context-sensitivity, as suggested in another thread, my library will be okay on that front.

Name reuse, semantic actions, and performance

I liked your paper very much, but I found it a bit obfuscated. I concur with the comment on the symbols; they seemed out of place. What bothered me more was the reuse of the very same names in different namespaces. I'm familiar enough with Haskell to know that the two Line identifiers in the declaration Line :: φarith Line are not related, but I still found it confusing.

It wasn't until I read your comparison with Oleg's finally-tagless work that I finally realized how you resolved the question of parse tree nodes and semantic actions, by the way. You first suggest the parse tree construction, then dismiss it, but never seem to clearly explain what's replacing it. Perhaps I was reading too quickly.

The last comment I want to make is that I was surprised by your unguarded admission, in section 3.2, that the added abstraction inevitably has a performance cost. I mean, that is no doubt true if you use the same parsing algorithms. I thought the whole point of the abstraction was that you can now employ different algorithms, including more efficient ones, more suitable for a particular grammar or input. Is this impossible for some reason I'm missing?

Name reuse, semantic actions, and performance

Mario,

Thanks for your comments. Regarding the use of identifiers like Line in different namespaces (Haskell's type namespace is separate from its value namespace): this is inspired by Yakushev et al., who do the same in Multirec. I find it readable, even though I agree it might be confusing on a first read.

About the performance discussion: when I say "the added abstraction inevitably has a performance cost", I was indeed referring to a comparison against using the underlying parsing algorithms more directly. In theory, a compiler might be able to compile away a lot of the abstraction, and I think that is partly what GHC does when you use inlining flags as Magalhaes et al. do, even though I haven't checked the generated Core.

By the way, for raw performance, there is some work in the library to let you use a given grammar with a kind of parser generator through Template Haskell, but this does not yet inline the semantic actions, and I do not yet get the same performance as when I hand-implement the parser directly.

Note that my library is currently not faster than parser combinator libraries for any example. Note also that my library currently does not have a direct implementation of an LR-style algorithm, but I believe that you can actually simulate something like LALR(1) by running an LL(1) parser on the uniform Paull transformation of the grammar (just like you can simulate a left-corner parser by running a top-down parser on the left-corner transform of the grammar).
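To give an idea of what such a transform buys you, here is a textbook illustration of left-recursion removal, much simpler than, but in the spirit of, the transforms mentioned above (this is my own sketch, not the library's implementation): the left-recursive rule Expr -> Expr '+' Term | Term is rewritten to Expr -> Term Expr' and Expr' -> '+' Term Expr' | epsilon, after which a plain top-down parser terminates on it.

import Text.Parsec
import Text.Parsec.String (Parser)

expr :: Parser Int
expr = term >>= exprRest

-- The new, non-left-recursive "rest" non-terminal, threading the left operand.
exprRest :: Int -> Parser Int
exprRest acc = (do { _ <- char '+'; t <- term; exprRest (acc + t) })
           <|> return acc

term :: Parser Int
term = read <$> many1 digit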