Treetop: Packrat Parser Generator for Ruby

Hey LtU. I learned about parsing expression grammars on this site, so I wanted to share what I've done with them. Treetop combines the fact that PEGs are closed under composition with Ruby's mixin semantics, equating grammars with modules that can include one another and override each other's rules, with access to the `super` keyword. I think a lot of the ideas in it have been invented in parallel in other frameworks during the year I have been writing it, but I nonetheless think it offers an expressive set of features in a usable package.
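
A minimal sketch of what that composition looks like, assuming Treetop's grammar notation; the grammar and rule names here are invented for illustration, not taken from the Treetop distribution:

    # core.treetop -- a base grammar
    grammar Core
      rule keyword
        'if' / 'while'
      end
    end

    # extended.treetop -- grammars compile to Ruby modules,
    # so one grammar can mix in another and override its rules
    grammar Extended
      include Core

      rule keyword
        # try the new alternative first, then fall back to Core's rule
        'unless' / super
      end
    end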

PEGs are awful

They [PEGs] are a simple but powerful generalization of regular expressions that are easier to work with than the LALR or LR(1) grammars of traditional parser generators.

Ordered choice becomes a real pain as the language grows. Everyone using regexps for tokenizers has this problem right now: finding the right ordering for the token patterns in the tokenizer of a medium-sized language becomes a lengthy and tedious game of trial and error.
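
To make that concrete, here is a tiny Ruby sketch (the patterns are invented for illustration). Ruby's regexp alternation, like PEG ordered choice, tries alternatives left to right, so listing '=' before '==' silently shadows the longer operator:

    bad  = /=|==|\d+/    # '=' listed first shadows '=='
    good = /==|=|\d+/    # the longer match has to come first

    "a == 1".scan(bad)   # => ["=", "=", "1"]   '==' is split into two tokens
    "a == 1".scan(good)  # => ["==", "1"]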

Growing pains

Ordered choice becomes a real pain as the language grows

Is there any formal grammar system that doesn't cause pain as the language grows? If so, I haven't used it.

Do you want to make a

Do you want to make a statement or do you just want to say something?

I think you're being asked

I think you're being asked for alternatives that don't exhibit this or similar problems, and perhaps an explanation of how they're avoided.

PEGs Seem Desirable, But Are They Usable for Large Projects?

Everything I've read about PEGs makes them seem desirable from a PL design perspective, starting with their ability to handle tokenization, but I haven't tried to construct a large grammar with one yet.

Has anyone personally tried writing grammars for moderate to large languages using them?

Did they run into trouble getting the rule order right, as Kay suggests? Are there particular language constructs that have caused the most difficulty in that regard? Maybe Kay could give us some examples of the rule-ordering problem.

Ordered choices

Ordered choice gives you some work. Unlike the context-free case (BNF), where A|B commutes [1], ordered choice forces you to place the patterns for longer matches before those for shorter ones. There is no way to achieve this automatically.

My prime example is the tokenizer for Python, which uses regular expressions. Python has ~50 different tokens that are recognized by the parser. Some of them have substructure, e.g. the NUMBER token, which is an opaque representation of all kinds of numerals: ints, floats, exponentials, etc.

Suppose you want to introduce a new pattern for recognizing IP addresses (according to IPv4), so that IP addresses become literal expressions in your DSL that extends Python. You can't provide an API where the programmer just specifies a new token and adds it to the set of token definitions; she has to take care of the correct position of the corresponding pattern within a huge regular expression, otherwise localhost 127.0.0.1 will be split into '127.0', '.', '0.1', i.e. two float literals and a dot.
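
The same failure mode can be reproduced in Ruby, whose regexp alternation is ordered in the same way (the patterns here are simplified sketches, not Python's actual token definitions):

    float_first = /\d+\.\d+|\d+\.\d+\.\d+\.\d+|\./  # IP pattern appended after FLOAT
    ip_first    = /\d+\.\d+\.\d+\.\d+|\d+\.\d+|\./  # IP pattern placed before FLOAT

    "127.0.0.1".scan(float_first)  # => ["127.0", ".", "0.1"]  two floats and a dot
    "127.0.0.1".scan(ip_first)     # => ["127.0.0.1"]          one IP address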

The tokenize.py module of Python's standard library gives a nice illustration. All the relevant regular expressions are defined there, along with their grouping. One can tinker with it and transform, for example, the ordered choice:

PlainToken = group(Number, Funny, String, Name)

into

PlainToken = group(Funny, Number, String, Name)

and see what happens (I don't know whether this particular permutation causes havoc, but some certainly do).

[1] I don't want to leave the impression that (E)BNF is without problems. It's just not fair to say that PEGs solve them when they simply load more work onto the user elsewhere.

EBNF doesn't fix it either

EBNF doesn't fix it either, though: you just end up with ambiguity, which still needs resolving.

That said, I'm using a separate lexing pass with Parsec these days (admittedly partly because that lets me implement layout).

Longest match fixes

There are always one or more hungry state machines, and each of them wants to eat as many characters as it can according to its specification (production rule). The one with the biggest appetite is allowed to build the token. This resolves the ambiguity: the lexer will always prefer producing IP_ADDRESS over FLOAT DOT FLOAT, even if the latter is a valid token sequence accepted by the parser.
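
Not from the comment itself, but a toy Ruby version of that longest-match loop (token names and patterns invented for illustration):

    require 'strscan'

    RULES = {
      IP_ADDRESS: /\d+\.\d+\.\d+\.\d+/,
      FLOAT:      /\d+\.\d+/,
      INT:        /\d+/,
      DOT:        /\./,
      WS:         /\s+/
    }

    def lex(input)
      s = StringScanner.new(input)
      tokens = []
      until s.eos?
        # let every rule try to match at the current position; the one that
        # consumes the most characters wins (ties go to the rule listed first)
        kind, text = RULES.map { |k, re| [k, s.check(re)] }
                          .reject { |_, t| t.nil? }
                          .max_by { |_, t| t.length }
        raise "no rule matches at #{s.pos}" unless kind
        s.pos += text.length
        tokens << [kind, text] unless kind == :WS
      end
      tokens
    end

    lex("127.0.0.1 1.5")
    # => [[:IP_ADDRESS, "127.0.0.1"], [:FLOAT, "1.5"]]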

So longest match is a dumb but effective, and often even sufficient, disambiguation strategy. There are of course cases where it is inadequate, but those tend to be languages, like C, that need individual, context-sensitive parsers anyway.