Question: Graceful error recovery during parsing

I was hoping for a few pointers to good information on parsers that can recover from errors and continue to parse the remainder of their input, skipping portions of the text that are not valid in a clean way. Thanks!


JavaCC uses Java try/catch in BNF rules

You can have a look at the brief explanation of this feature here.

Here are some tips (from personal experience)

If what you mean is that you want to skip over the erroneous input and proceed with the rest of the (hopefully correct) input, here are some tips from personal experience:

  • continue parsing from the token after the one that caused the error. For example, if token N caused the parser to fail, continue parsing from token N+1. This can be slow, especially if the parsing rules are big.
  • use hints from the grammar; for example, compilers for curly-bracket languages use the brackets and keywords to find the probable end of the error region.
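The first tip can be sketched as a simple driver loop. This is only an illustration: `parse_construct` and `ParseError` are hypothetical names, assuming a parse function that returns a node plus the next position and raises on failure.

```python
class ParseError(Exception):
    pass

def parse_tolerantly(tokens, parse_construct):
    """On failure at token N, log the error and resume from token N+1."""
    results, errors = [], []
    i = 0
    while i < len(tokens):
        try:
            node, i = parse_construct(tokens, i)  # (node, next position)
            results.append(node)
        except ParseError as err:
            errors.append((i, err))
            i += 1  # skip the offending token and try again
    return results, errors
```

Because every failure only advances one token, a long stretch of bad input triggers one parse attempt per token, which is where the slowness mentioned above comes from.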

If what you mean is that you want to undo the actions performed during parsing when an error is found later in the source code, then you have to rethink the way parsing is done: do nothing during parsing except note down what should take place after parsing finishes.

For example, when parsing XML, you don't have to create a Node object for each successfully parsed node, because an error many levels deep may render the node useless. You only have to store on a stack the event of node creation, along with the start and end tokens. Then, after parsing ends, all parsing events are executed in the order stored on the stack. You don't even need a real stack for this: an array, along with an index of the top element, is more than enough. When parsing fails, the index is restored from a local variable, making 'undo' an O(1) operation.
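Here is a minimal sketch of that deferred-execution scheme; the class and method names are invented for illustration, and events are simply recorded values.

```python
class EventLog:
    """Parse events recorded in an array; undo is just restoring an index."""
    def __init__(self):
        self.events = []   # grows like a stack
        self.top = 0       # index just past the logical top element

    def mark(self):
        return self.top    # remember the top before a risky sub-parse

    def record(self, event):
        # overwrite stale entries left behind by a failed, undone branch
        if self.top < len(self.events):
            self.events[self.top] = event
        else:
            self.events.append(event)
        self.top += 1

    def undo(self, mark):
        self.top = mark    # O(1): no per-event cleanup

    def replay(self):
        # after parsing succeeds, execute the surviving events in order
        return self.events[:self.top]
```

A speculative sub-parse calls `mark()` first and `undo(mark)` on failure; nothing recorded after the mark is ever replayed.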

the concept of error is relative to objectives

This page now also links to an alternative grokout version of the groklm ruleset. In the "grokout" rules, any material that is not matched as "expr" ";" gets copied symbol by symbol to the output stream. This is the material that one would usually flag as erroneous.

In the language machine, there is no concept that anything is erroneous - it's entirely a matter of providing alternative treatments for the main unit of analysis. Only the writer of the ruleset knows what that should be. If a ruleset provides no 'catchall' alternative at the outermost context, a failure to match and consume the input causes the engine to fall backwards out of the universe - so to speak.

The idea of a "catch all" rule

The idea of a "catch all" rule, as Peri suggested, crossed my mind... maybe I need to take a look at the language machine. My experience thus far is limited to LR, LALR, and Packrat parser generators. Packrat is definitely my favorite of the three from an ease of rule development standpoint. In that system, my concern is that a catch-all rule would be defined like...

known-language-constructs -> declaration / block / ... / unknown-construct

unknown-construct -> any-character+

How do I limit the extent of the catch-all to only cover the text that can't be understood? Once the misunderstood region starts, where does it end? Is this something that can be implemented within the expression grammar itself, or will I need to extend the actual parsing algorithm? Or maybe this is something that the language machine would handle better and I should look at it more closely.

Just so everyone knows, I'm thinking about this parser in the context of a syntax-aware text editor, where the format of the language being edited would be described by attaching properties to text based on a grammar description written by the user. Assuming I could perform the parse on every keystroke (I have no idea how feasible this is), I would need to handle intermediate stages where code constructs aren't completely typed out, so that when a programmer is entering code in the midst of existing valid code, the syntax-aware features on the code they have already written aren't compromised by the partially composed addition.

I guess I could ignore parse errors on text that has already been interpreted in one way, but I was hoping for something more robust. Thanks for all your suggestions. I'm going to take a look at various links over the next few days but I thought I'd clarify the issue.

extent of the catch-all rule

The catch-all rule is just a rule like any other - as I said, the language machine itself has no concept of a syntax error - it just applies the rules that you give it in the same way that a java virtual machine or a physical machine just executes your programs. So it's entirely up to you to decide what the catch-all rule should do.

In fact the grok rulesets that I've put up demonstrate this.

The catch-all rule in the filtering translator (which doesn't flag an error) just consumes one symbol and copies it to the output. So the main analysis then tries to match an expression at the character position one on from where it failed to match. That's the behaviour you want if you are really looking for grammatical patterns that are embedded in ordinary text with no clear delimiter that says where they are.
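That filtering behaviour can be sketched as a driver loop. The `match_expr` and `translate` callables here are hypothetical, standing in for the ruleset's main analysis and its output actions; this is not the language machine's actual implementation.

```python
def filter_translate(text, match_expr, translate):
    """Try to match an expression at each position; on failure,
    copy one symbol through unchanged (the 'filtering' catch-all)."""
    out, i = [], 0
    while i < len(text):
        m = match_expr(text, i)   # assumed to return (value, end) or None
        if m is not None:
            value, i = m
            out.append(translate(value))
        else:
            out.append(text[i])   # catch-all: one symbol, verbatim
            i += 1
    return "".join(out)
```

This is exactly the "patterns embedded in ordinary text" case: everything unmatched flows through untouched, and matching is retried one position further on.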

The catch-all rule in the other two rulesets (groklm and grokreg) skips the rest of the line. This method is pretty crude, and can lead to several successive lines being flagged, but it's easy to do. One trouble with sophisticated error recovery is that detailed error messages which identify the wrong error are more confusing than a simple 'eh?' that tells you where to start looking. On the whole, you know what you were trying to say. The parser doesn't, and it's often not easy for it to guess.

I've never tried to write a syntax-aware text editor. The language machine does operate symbol by symbol, and is intended to be capable of being embedded in other (license-compatible) applications, so it might do the trick.

not

I think you could use the not (!) operator to limit the extent. Something like:

String = "\"" (!"\"" any)* "\""

or

EOF = !any

In this case, not doesn't consume anything from the stream, and any consumes exactly one symbol. Also, !any succeeds at the end of the stream. So with your above example you could do:

known-language-constructs -> ...
unknown-construct -> (!known-language-constructs any)+
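Here is a minimal sketch of that pattern with hand-rolled PEG-style combinators. All the names are invented for illustration, and a real Packrat implementation would also memoize results.

```python
def not_(p):
    """Negative lookahead: succeeds without consuming iff p fails."""
    def parse(s, i):
        return (None, i) if p(s, i) is None else None
    return parse

def any_char(s, i):
    """Consume exactly one symbol; fails only at end of input."""
    return (s[i], i + 1) if i < len(s) else None

def keyword(word):
    def parse(s, i):
        return (word, i + len(word)) if s.startswith(word, i) else None
    return parse

def unknown_construct(known):
    """(!known any)+ : eat symbols until a known construct could start."""
    bail = not_(known)
    def parse(s, i):
        j = i
        while bail(s, j) is not None and any_char(s, j) is not None:
            j += 1
        return (s[i:j], j) if j > i else None
    return parse
```

The negative lookahead is re-checked before every consumed symbol, which is what limits the catch-all's extent: it stops exactly where a known construct becomes recognizable again.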

There are two ways to think about a parse failure

There are two ways to think about a parse failure. It could be an error in the source, or it could be a failure to recognize the message. In the first case the problem is with the sender, in the second case the problem is with the parser.

I think a simple solution works well

In a manually written recursive descent parser of my language Kogut I just log a parse error on the side and return a dummy tree node. When the error is found at the bottom of the parser, where atomic expressions are recognized, I skip one token.

I don't do anything fancy with semicolons or brackets, except that when a closing bracket is missing, the error message includes the location of the corresponding opening bracket.

Later phases silently "compile" dummy nodes into dummy nodes of the next internal representation. After performing all phases which might detect user errors, I show to the user the error log sorted by source location.

I'm not convinced that inventing something fancier would significantly improve the quality of error messages.
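A toy sketch of that recovery scheme, using a tiny expression grammar rather than the actual Kogut parser: at the bottom of the parser, an unrecognized atom logs the error on the side, skips one token, and yields a dummy node that later phases can pass through silently.

```python
class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos, self.errors = tokens, 0, []

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def atom(self):
        tok = self.peek()
        if tok is not None and tok.isdigit():
            self.pos += 1
            return ("num", int(tok))
        # log the error on the side, skip one token, return a dummy node
        self.errors.append((self.pos, f"unexpected {tok!r}"))
        if tok is not None:
            self.pos += 1
        return ("dummy",)

    def expr(self):
        # later phases would silently "compile" dummy nodes into more dummies
        left = self.atom()
        while self.peek() == "+":
            self.pos += 1
            left = ("add", left, self.atom())
        return left
```

After all phases run, the accumulated `errors` list can be sorted by source location and shown to the user in one batch.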

Skipping tokens

I'm not convinced that inventing something fancier would significantly improve the quality of error messages.

The purpose of skipping tokens up to the next likely "synchronization point" (e.g. a semicolon or an end bracket) is generally not to improve the quality of error messages. Rather, it gives the compiler a better chance to report multiple errors at once. So, ideally, instead of compile-fix-compile-fix-compile-fix the programmer would compile-fix-fix-fix-compile. The disadvantage is that these attempts at error recovery can lead to spurious errors getting reported.
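Panic-mode synchronization of this kind can be sketched in a few lines; the sync-token set below is an assumption for a curly-bracket language, not a universal choice.

```python
SYNC_TOKENS = {";", "}"}  # likely statement/block boundaries

def recover(tokens, pos):
    """Panic-mode recovery: skip forward to just past the next sync token,
    so parsing can resume and further errors can still be reported."""
    while pos < len(tokens) and tokens[pos] not in SYNC_TOKENS:
        pos += 1
    return min(pos + 1, len(tokens))  # also consume the sync token itself
```

Everything between the error and the sync token is discarded unexamined, which is precisely why this strategy can both reveal additional genuine errors and manufacture spurious ones.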

Reporting multiple errors

My compiler does report multiple errors at once, and I think the effect is good enough. Perhaps this depends on the language being parsed.

I tried just now to introduce parse errors in a big source file and look at the error messages. I expected mismatched brackets to cause the worst damage. Well, adding or removing opening or closing braces of toplevel definitions merely yields messages at appropriate points. Only when there are too few opening braces or too many closing braces inside a long function does the parser end the function prematurely, treating the rest of its body as toplevel definitions, so that the compiler complains about many unknown names (the local variables defined above the premature end).

Adding an extra closing bracket of any kind fools the parser, which terminates too many pending subexpressions before eating the bracket, even if the given kind of bracket was not expected there at all. Perhaps this could be improved somehow.

Other changes yield well-localized errors and don't seem to require any sophisticated synchronization to avoid many false messages after them. Except for mismatched brackets, though, the messages usually don't explain the nature of the error; only the location is correct.

error correcting parser combinators

This might be of interest.

What about looking ahead?

I've written a recursive-descent parser which stops and reports a failure to match a rule... but I would like to report as many errors as possible, in order to minimize the compile-fix cycle (a common problem, from what I've seen). I have tried the usual solution (skip to a specific point ahead), but as someone said above, it makes the parser report many spurious errors. For example, a simple missing comma in a parameter list may lead to erroneous parsing of the rest of the tokens:

BNF:

procedure = "procedure" identifier "(" parameter-list ")" "{" statements "}"
parameter-list = parameter-list "," parameter | parameter               
parameter = identifier ":" typename
typename = identifier | "int" | "real" | "string" | "char"

code:

procedure foo (a ; int, b : int) {}

The erroneous part is the ";" after "a" (it should have been ":").

As I looked into the above piece of code, I realized that error handling would be much better if the parser could look ahead and see that the token after the error is a typename, as the grammar requires. Since there are no alternative branches for the rule parameter, the parser can report "found ; instead of :".

If the parameter rule had two branches, the parser would have to check the two branches for a complete match before selecting one of them as a possibility. If, for example, the grammar was

parameter = identifier ":" typename
          | identifier "@" integer

then the compiler would have to check both branches, look ahead past the error, and find that the first branch is the better match; the error report would then be "found ; instead of :", as above.

In other words, given a bunch of rules {r1, r2, ... , rN}, a parser could select the closest match in this set. If the match is 100%, then no error would be reported. Otherwise, the closest matching rule would be selected, and errors would be reported and parsing would continue based on that.

The approach could also work if there was something missing, instead of something being mistyped. For example, the input:

procedure foo (a int, b : int) {}

In the above, the ":" is missing between "a" and "int". The compiler could use the same look ahead trick and determine that ":" is missing.

Of course the above may not work, or may be difficult, in a recursive context. I wonder if there has been any relevant work.
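One simplified way to sketch the closest-match idea is to score each alternative by how many of its elements match the input, counting matches even past a mismatch (that is the "look ahead after the error" part), and then report the first point of divergence in the winning branch. The predicate names here are illustrative, and branches are flattened to sequences of token predicates, which sidesteps the recursive case.

```python
def match_score(branch, tokens, pos):
    """Count elements of the branch that match, looking past mismatches."""
    return sum(1 for k, expect in enumerate(branch)
               if pos + k < len(tokens) and expect(tokens[pos + k]))

def closest_match(branches, tokens, pos):
    """Pick the alternative with the highest score; if it is not a full
    match, report what was found at the first point of divergence."""
    best = max(branches, key=lambda b: match_score(b, tokens, pos))
    if match_score(best, tokens, pos) == len(best):
        return best, None
    for k, expect in enumerate(best):
        if pos + k >= len(tokens) or not expect(tokens[pos + k]):
            found = tokens[pos + k] if pos + k < len(tokens) else "end of input"
            return best, f"found {found!r} where {expect.__name__} was expected"
```

On the parameter example, the `identifier ":" typename` branch scores higher than `identifier "@" integer` against `a ; int`, because `int` satisfies the typename predicate after the mismatch, so the reported error is the intended one.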

Jikes

The Jikes Java compiler project has two papers with some very interesting and practical ideas. I've also tracked indentation to determine a programmer's scoping intentions; this pinpoints mismatched brackets in common programming styles.