archives

Recursive Descent Parser Generators

I'm writing a parser generator, to avoid being TLDR (which is most of my messages) I wanted to ask if anyone knows of other LL(*) parser generators, what I should aim for performance wise (what's acceptable) and if you've had experience writing something similar what's one of the toughest nuts to crack on this style of parser?

For me (more than one thing) :

  1. Generating Parsers with unbound look-ahead uses a lot of memory to do statically so none (or very little) of the processing is done during run-time.
  2. Follow predictions are a PITA: you have to analyze all possible parse paths coming into the given rule, walk the context stack to disambiguate the offending lock(s), and keep stepping ahead and only accept if the parent contexts can parse and that state can also accept (this was a nightmare, but I think I have it...?) I'm writing the prediction discriminator logic now (my most recent task.) Dangling else isn't an issue because iirc it is aware that the parent context is on an edge state once the current rule exits, and I apply to the maximal munch principle.
  3. Lexical Ambiguities: I'm a glutton for punishment, I used includes to allow complex grammars to span multiple files. To avoid issues with parse order and precedence on the user's side (you can still specify which trumps the other, but this trumping order is 'always on' if used) I allow for lexical ambiguities to create new parse states and following those ambiguity identities you can figure out which path was actually intended.
  4. To solve the memory issue, I decided that I would use: a symbol stack, an active rule context, recursive descent, and reductions.
    • Reductions some of solve the issue of memory by identifying where multiple paths of a prediction all enter the same rule on the same state, parsing that rule, and evaluate what comes after that rule. This abates some of the issues with the look-ahead requirements, but adds more complexity when the follow-state of that rule is ambiguous with the prediction! I haven't even gotten to this issue :(
  5. AST Generation: making this look easy is the hard part. It should just work (I will focus on refining it once everything else is done.)
  6. It's so easy to create grammars that are simple but yield an exponential state explosion. I need to do more research to detect this ahead of time so I don't waste time on processing something useless, so the user has a better experience than 'Out of Memory Exception'.

If this ends up working as intended, I'll be astonished, because when I started this eight years ago, I didn't have an effing clue. I still probably don't, but I at least know what I want to do now. The grammars define no semantic predicates, nor actions on lexical matches, so the generator has to do all the work. So far (without error context) it's 50-80 times faster than my own hand-written parser for its own language, until I found that I wasn't handling lexical ambiguities on follow states and now have to fully implement them to do a proper comparison.

Edit:
The goal with this LL(*) approach is there is very little back-tracking employed. The only time it should need to is when it's handling left-recursive rules and it has no choice to do so because it has to fall to a weaker prediction mechanism to yield a preliminary identity onto the stack that it can use to move forward, then rinse and repeat.

Edit 2: Redacting the following, turns out the issue is more complex :) The sample I gave it had zero issues with :(
'No Alt Indirect Left Recursion' is when you have something like:
A ::= B | C;
B ::= A '[' ']';
C ::= "Other";

Without code to properly identify these 'No Alt Indirect Left Recursion' states, it will recurse infinitely because there's no fall-back to choose first, it would always predict 'B' if it has: Other[].