Looking for pointers: mixfix error recovery

I'm looking for pointers to ideas, approaches, papers, and implementations of error recovery in mixfix parsing.

As BitC is progressing, we're becoming progressively more reliant on the mixfix processing layer. In particular, it looks like we're about to handle type annotations in the mixfix layer, where the expression e:T has an "expression hole" on the left and a "type hole" on the right. This isn't a problem per se, but it is one of the natural places to insert a parse error recovery point, perhaps along the lines of (pardon the bison-ese):

expr: expr ':' type
expr: error ':' type
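
For concreteness, here is a minimal sketch of how such a recovery point might behave in a hand-written parser. It is Python rather than BitC or bison, and every name in it is invented for illustration: if the left-hand expression fails to parse, the parser reports it once, skips ahead to the ':', and still parses the type, leaving an explicit error node in the expression hole.

def try_parse_expr(tokens, i):
    # Toy expression: a single IDENT token; (None, i) on failure.
    if i < len(tokens) and tokens[i][0] == "IDENT":
        return tokens[i], i + 1
    return None, i

def parse_type(tokens, i):
    # Toy type: a single TYPENAME token.
    if i < len(tokens) and tokens[i][0] == "TYPENAME":
        return tokens[i], i + 1
    return ("error",), i

def parse_annotation(tokens, i, errors):
    # Parse `expr ':' type`, resynchronizing on the ':' when the expression
    # is broken, much as bison's `error` token discards input up to the ':'.
    expr, j = try_parse_expr(tokens, i)
    if expr is None:
        errors.append("bad expression at token %d" % i)
        j = i
        while j < len(tokens) and tokens[j][0] != "COLON":
            j += 1                      # skip to the recovery point
        expr = ("error",)               # explicit hole in the tree
    if j < len(tokens) and tokens[j][0] == "COLON":
        ty, j = parse_type(tokens, j + 1)
        return ("annot", expr, ty), j
    return expr, j

errors = []
toks = [("NUM", "3"), ("COLON", ":"), ("TYPENAME", "int")]   # broken left side
print(parse_annotation(toks, 0, errors))
print(errors)

The only point is that the ':' gives the parser a place to pick up again, so the type (and everything after it) can still be checked in the same run.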

When mixfix only handled simple expressions, the consensus in our community was that error handling at the granularity of sub-expressions wasn't well motivated. But as more and more of the language processing is handled by the mixfix layer, it becomes important to think about error recovery in that layer.

By error recovery, I mean recovery such that further useful parsing can continue after an error. It's easy enough to print diagnostics when an error occurs; the goal is that each execution of the compiler should identify as many errors in the input unit of compilation as it feasibly can.

Does anybody have pointers to work that has addressed error recovery approaches in mixfix processing?

A paper

This paper strikes a glancing blow on your problem:

Parsing Mixfix Operators, by Nils Anders Danielsson and Ulf Norell.


Instead of rejecting ambiguous grammars we suggest that ambiguous parses should be rejected, preferably together with error messages showing all possible parses, thus aiding debugging.

That's not an error recovery issue

I'm aware of that paper and that suggestion, but it doesn't really address the issue I'm after. I'm concerned about the case where we have an input that clearly (and correctly) fails to parse in some portion. What we would then like to do is have a means to re-synchronize the parser so that subsequent, unrelated errors can be detected and reported in the same compile run.

I haven't seen anything in the mixfix/distfix literature that addresses this at all.

Perhaps an area to look is

Perhaps an area to look is parallel parsers: for divide-and-conquer, they'll need to speculate how to jump in at any point anyway. I think I misunderstood the question, however.

weak tyro advice

Without a whit of research, my strategy is: always generate a parse tree, so errors are explicit parse tree nodes. Unfortunately this seems to require summarizing all possible wrong inputs in classes of error, so every bad input maps to some kind of error. When you see something you don't like, you must describe specifically what is not liked.

The hard part is deciding where to resume recognizing good lexical syntax or grammar again, which is probably what you're asking, so I'm not helping. If you think of an error as punching a hole in a valid parse tree, I try to make that hole smaller so feedback on other code occurs sooner.

I think you should just wing it. If language research is like the rest of the development world, no one wants to define what to do after errors very carefully. I bet users won't mind if you sometimes generate extra error messages cascading from a first error before resuming in good code.

If you think of an error as

If you think of an error as punching a hole in a valid parse tree, I try to make that hole smaller so feedback on other code occurs sooner.

So basically, instead of backtracking on error and trying another rule, assume a syntax error and continue parsing with the current rule, placing an explicit syntax error node in the tree. Then collect the set of all parse trees, basically one per rule, and pick the one with the fewest number of errors.

For a successful parse, there will be a tree without errors. If no successful parse exists, the tree with the smallest number of errors is the program requiring the fewest number of changes to become syntactically correct.

That does sound optimal, albeit expensive, but it seems like a good place to start.

Edit: You could probably make this efficient by making parsing incremental. Only continue parsing with a given rule if the number of errors encountered so far is less than the number of errors encountered by the previously lowest error parse. This would require evaluating each rule in parallel until they start dropping off as they start accumulating errors. Some will come back into the running for unparseable programs as errors accumulate.
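
A rough sketch of that scheme, with a deliberately tiny "grammar" of flat token patterns and invented names, just to make the bookkeeping concrete: every rule keeps matching past a mismatch by recording an explicit Error node, and the candidate with the fewest Error nodes wins.

from dataclasses import dataclass

@dataclass
class Error:                    # explicit syntax-error node in the tree
    expected: str
    got: str

def parse_seq(pattern, tokens):
    # Match a flat pattern of token kinds; on a mismatch, record an Error
    # node and keep going instead of giving up.
    tree, errors, i = [], 0, 0
    for kind in pattern:
        if i < len(tokens) and tokens[i][0] == kind:
            tree.append(tokens[i])
        else:
            got = tokens[i][0] if i < len(tokens) else "<eof>"
            tree.append(Error(kind, got))
            errors += 1
        i += 1
    return tree, errors

def best_parse(rules, tokens):
    # Try every rule and keep the parse with the fewest error nodes.
    candidates = [(name,) + parse_seq(pattern, tokens) for name, pattern in rules]
    return min(candidates, key=lambda c: c[2])

rules = [("annot", ["EXPR", "COLON", "TYPE"]),
         ("call",  ["EXPR", "LPAREN", "EXPR", "RPAREN"])]
tokens = [("EXPR", "x"), ("SEMI", ";"), ("TYPE", "int")]
print(best_parse(rules, tokens))        # "annot" wins with a single Error node

The incremental variant suggested above would instead advance all candidates one token at a time and drop any whose error count already exceeds that of the best complete parse seen so far.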

Experience Report

I tried something like this for a DSL parser some time ago.

The problem was how to re-synchronise when all parse trees had errors. I thought I was smart by giving each rule a re-sync symbol it could look ahead for. Sadly this didn't work unless all the current rules happened to have the same symbol. If the re-sync symbols were different (which was the case most of the time in the particular language), then each tree had to continue to be parsed to end of input, and that became too expensive, especially where subsequent errors multiplied the number of trees.

Also, when several rules had different causes for the error, which one should be displayed to the user? Plain "syntax error" wasn't very helpful, but "you have error abc or def" was worse :-)

Eventually gave up and failed on first error.

The problem was how to

The problem was how to re-synchronise when all parse trees had errors.

What do you mean by "synchronization"? Following naasking's approach, each branch of the parser always eats the same token. In that sense they are "synchronized" by default.

Synchronize meant

By synchronize I meant synchronizing the rule to the token stream.

After a mismatch, continuing to try to match subsequent tokens at the same place in the rule gave long strings of errors, which wasn't very useful. The idea (not original) was to define a token or pattern that was very likely to indicate a specific point in the rule (e.g. the ';' as statement end in a C-like language).

From that point on, any further mismatches are likely to be genuine syntax errors rather than irrelevant consequent errors, so all tokens up to the sync point can be skipped.
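
In code, a panic-mode version of that might look like the following sketch; the token shapes, the toy statement rule, and the ';' sync token are all assumptions for illustration, not anything from the languages discussed here.

def parse_statement(tokens, i, errors, sync_kind="SEMI"):
    # Toy statement: IDENT '=' EXPR ';'. On a mismatch, report once, then
    # skip ahead to the sync token and resume just after it.
    for kind in ["IDENT", "EQUALS", "EXPR", "SEMI"]:
        if i < len(tokens) and tokens[i][0] == kind:
            i += 1
        else:
            errors.append("expected %s at token %d" % (kind, i))
            while i < len(tokens) and tokens[i][0] != sync_kind:
                i += 1                  # panic mode: discard up to the sync token
            if i < len(tokens):
                i += 1                  # consume the sync token itself
            return i
    return i

def parse_program(tokens):
    errors, i = [], 0
    while i < len(tokens):
        i = parse_statement(tokens, i, errors)
    return errors

tokens = [("IDENT", "x"), ("EQUALS", "="), ("EXPR", "1"), ("SEMI", ";"),
          ("IDENT", "y"), ("EXPR", "2"), ("SEMI", ";"),        # '=' missing here
          ("IDENT", "z"), ("EQUALS", "="), ("EXPR", "3"), ("SEMI", ";")]
print(parse_program(tokens))    # one error for the 'y' statement; 'z' still parses

The cost, as noted above, is that everything between the error and the sync token is thrown away, and a badly chosen sync token makes that window large.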

I assume it is hard to find

I assume it is hard to find a dedicated checkpoint in a mixfix expression, which can be arbitrarily nested.

Have you tried to repair the token-stream?

Let's say the token-stream T is corrupted at index n. We build a new token-stream T' from T with the following properties:

1. The edit distance dist(T, T') must be minimal.
2. For a fixed k, the stream T'[0..n+k] must represent a valid expression (T[0..n-1] is valid by assumption).

It is important to choose k not too small, because otherwise we might run into consequent errors, but also not too big, because we don't want to run into new, unrelated syntax errors at this point.
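
A brute-force sketch of that repair step for single-token edits; is_valid below stands in for whatever prefix-validity check the real parser provides, and the toy vocabulary and grammar are invented (a real implementation would also search larger edit distances rather than stopping at 1).

VOCAB = ["NUM", "PLUS", "LPAREN", "RPAREN"]

def candidate_repairs(tokens, n):
    # All streams at edit distance 1 from `tokens`, edited at index n.
    yield tokens[:n] + tokens[n + 1:]               # delete the offending token
    for t in VOCAB:
        yield tokens[:n] + [t] + tokens[n + 1:]     # substitute it
        yield tokens[:n] + [t] + tokens[n:]         # insert before it

def repair(tokens, n, k, is_valid):
    # Keep the repairs whose prefix up to n+k is valid again (property 2).
    return [t2 for t2 in candidate_repairs(tokens, n) if is_valid(t2[:n + k])]

def is_valid(prefix):
    # Toy check: valid prefixes of the grammar NUM (PLUS NUM)* alternate kinds.
    return all(t == ("NUM" if i % 2 == 0 else "PLUS") for i, t in enumerate(prefix))

broken = ["NUM", "PLUS", "PLUS", "NUM"]             # corrupted at index 2
print(repair(broken, n=2, k=2, is_valid=is_valid))  # delete the extra PLUS, or insert a NUM

Which surviving repair to prefer is exactly the ambiguity problem raised below; one option is to carry all of them forward and let the cheapest complete parse decide, as in the fewest-errors scheme.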

Yes, and not much

I assume it is hard to find a dedicated checkpoint in a mixfix expression, which can be arbitrarily nested.

Yes, informally the requirement for accuracy seemed to be that the sync pattern must be illegal either syntactically or semantically in any preceding nested expressions. Experience showed that a pattern was still useful if it rarely occurred in practice in preceding nested expressions.

Of course, unless your language treats the whole input as one expression, there are parts of the syntax which can't occur in expressions (e.g. traditional statements). These are much better sync points, but the question was how to handle errors within the mixfix expressions. Still, maybe the lesson is to define your language so that in practice the mixfix expressions don't get too big, so that they can be abandoned and the parser can then sync on the next thing that's not an expression.

Have you tried to repair the token-stream?

Not much; due to lack of time, only a quick consideration was given to single-token edits. But due to the ambiguity of the grammar it was not clear how to decide on a substitution or an insertion. Any guidance on recent literature welcome.

Instead, as I said, we abandoned expressions on the first error, skipped tokens until the first one that could not occur in an expression, and continued from there.

Not much, due to lack of

Not much; due to lack of time, only a quick consideration was given to single-token edits. But due to the ambiguity of the grammar it was not clear how to decide on a substitution or an insertion.

That's not decidable up front. As in naasking's approach, the parser still has to handle multiple parses in the general case. It only removes the need for a "sync point" to deal with consequent errors. However, new errors might be introduced: for example, inserting a bracket '[' might locally fix your expression but lead to a global imbalance. This doesn't reduce the value of the approach, though, because one still counts fixes and the parse with the fewest becomes the winner of the competition.

(Other experience report)

On this:

So basically, instead of backtracking on error and trying another rule, assume a syntax error and continue parsing with the current rule, placing an explicit syntax error node in the tree. [...] For a successful parse, there will be a tree without errors. If no successful parse exists, the tree with the smallest number of errors is the program requiring the fewest number of changes to become syntactically correct.

After several attempts with other approaches (including the obvious stop-on-first-error), I've implemented exactly what naasking just described, and I can confirm it works pretty well as far as friendly (say, not "totally dumb") multi-error reporting is concerned, at least with a PEG-based fluent interface like this one.

I can also confirm that it's not especially cheap, though I haven't yet tried the rest of naasking's suggestion (the incremental parsing part).

My .02

Just wing it man

Hi Jonathan, good to see you back at it again.

I used to swear by JavaCC, which would fail on the first error (with an exception no less), but when writing the new parser for my Virgil in Virgil, obviously I had no parser generator to rely on.

I wrote a scannerless recursive descent parser without much thought to error recovery. I did design it to "just keep going" after seeing an error, and by lucky accident perhaps, it seems to do just fine at recovering from parse errors within expressions, at least the kind that are common, like missing a closing parenthesis or semicolon, or misspelling a keyword. I had to take care that it wouldn't get stuck in infinite loops, though. It helped to write a few utilities that implement the *, ?, and + operators for productions; having first-class functions in the language the parser is implemented in makes this really easy.

For the problem of a cascade of parse errors, my parser ignores all but the first error generated at any particular position in the input stream. Typically that error is going to be generated by the deepest production, and so it is going to be the most specific, such as "expected integer literal" instead of "expression expected", and the most useful.
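
A sketch of those two tricks together; the names and the toy grammar are invented and this is not the Virgil parser: repetition helpers built from first-class functions, plus error reporting that keeps only the first message recorded at the farthest input position.

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0
        self.err_pos = -1       # farthest position at which an error occurred
        self.err_msg = None     # first message recorded at that position

    def error(self, msg):
        if self.pos > self.err_pos:     # ignore all but the first, deepest error
            self.err_pos, self.err_msg = self.pos, msg
        return None

    def expect(self, kind):
        if self.pos < len(self.tokens) and self.tokens[self.pos][0] == kind:
            tok = self.tokens[self.pos]
            self.pos += 1
            return tok
        return self.error("expected %s" % kind)

    def star(self, rule):
        # Zero or more `rule`: a failed attempt is rolled back, but any error
        # it recorded at a deeper position stays remembered.
        results = []
        while True:
            mark = self.pos
            node = rule()
            if node is None:
                self.pos = mark
                return results
            results.append(node)

# Toy grammar: IDENT (',' IDENT)*
def ident_list(p):
    first = p.expect("IDENT")
    rest = p.star(lambda: p.expect("COMMA") and p.expect("IDENT"))
    return (first, rest)

p = Parser([("IDENT", "a"), ("COMMA", ","), ("IDENT", "b"), ("COMMA", ",")])
print(ident_list(p))                        # parses 'a, b'; the trailing ',' is left over
print(p.err_msg, "at token", p.err_pos)     # expected IDENT, at the deepest position

A real driver would then check for end of input and, on failure, report the recorded deepest message rather than whatever shallow production happened to fail last.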

I don't have error productions or error syntax tree nodes. In fact, if not too many parse errors occur, my compiler will go ahead with semantic analysis with the rest of the code and still produce useful error messages.

So my advice would be to just not worry about it. If your grammar is too unwieldy to write a recursive descent parser by hand, consider simplifying it. In the end, the parser I wrote by hand is 1/5 the size of the equivalent JavaCC-generated parser (1000 lines of Virgil vs 5000 lines of Java). It was kind of a pain to debug, but once I had built up a large regression suite I developed a lot of confidence in it, and I no longer find it tricky to modify or refactor.

Also consider how long you expect the edit part of the edit-compile-run-debug cycle to be. If it is short, then users won't make large changes between runs of the compiler, and it won't be that important to be able to give them 100 useful errors when they usually only have a couple. In fact, it might completely overwhelm them. Most of the time users won't fix more than a handful before running the compiler again, since some errors will depend on others being fixed. If they have to hunt through 100 to find the 5 most important ones to fix first, then you actually wasted their time, not saved it. Make your compiler super fast and the ECRD cycle really tight, and suddenly the usage model completely changes.

And BTW if you plan on using make for your build system, I'll cry. Just do whole program compilation. It's not such a crazy idea these days.

And BTW if you plan on using make for your build system,

I'll cry.

Hey! My Pet language uses make a lot. *laughing out loud and returning to his own corner*