Differentiating Parsers

A fascinating article by Oleg Kiselyov on delimited continuations:

We demonstrate the conversion of a regular parser to an incremental one in byte-code OCaml. The converted, incremental parser lets us parse from a stream that is only partially known. The parser may report what it can, asking for more input. When more input is supplied, the parsing resumes. The converted parser is not only incremental but also undoable and restartable. If, after ingesting a chunk of input the parser reports a problem, we can `go back' and supply a different piece of input.

The conversion procedure is automatic and largely independent of the parser implementation. The parser should either be written without visible side effects, or else we should have access to its global mutable state. The parser otherwise may be written with no incremental parsing in mind, available to us only in a compiled form. The conversion procedure relies on the inversion of control, accomplished with the help of delimited continuations provided by the delimcc library.
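
To make the shape of the conversion concrete, here is a minimal OCaml sketch (mine, not the article's code) of what such an inverted parser looks like: the read function suspends the whole parse with delimcc's `shift` when the current chunk runs out and hands the caller a resumption that accepts the next chunk. The `status` type and every name below are invented for illustration.

```ocaml
(* Minimal sketch (not the article's code) of an incrementalized parser.
   The parser is written against an ordinary character reader; when the
   current chunk is exhausted, the reader suspends the whole parse with
   delimcc's shift and hands the caller a resumption function. *)
open Delimcc

type 'a status =
  | Done of 'a                        (* parsing finished *)
  | More of (string -> 'a status)     (* feed the next (non-empty) chunk here *)

(* Run [parse], which reads characters via the [getc] we give it. *)
let run_incremental (parse : (unit -> char) -> 'a) (first_chunk : string) : 'a status =
  let prompt = new_prompt () in
  let buf = ref first_chunk and pos = ref 0 in
  let getc () =
    if !pos < String.length !buf then begin
      let c = (!buf).[!pos] in incr pos; c
    end else
      (* Out of input: capture the rest of the parse and ask for more. *)
      shift prompt (fun k ->
        More (fun chunk -> buf := chunk; pos := 1; k chunk.[0]))
  in
  push_prompt prompt (fun () -> Done (parse getc))
```

Driving it then looks like: call `run_incremental` with the first chunk; whenever you get `More k`, apply `k` to the next chunk. Supplying a different chunk to a previously captured `k` is what the article's checkpointing and re-parsing section relies on.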

Incremental but not continuous?

I agree this is really cool. But it seems that by "incremental" they mean we can interact with the parser while it's running; it doesn't support incremental re-parsing, or going back and updating an existing parse based on a change to the text buffer. The latter is necessary for a parser that runs in an IDE. At any rate, this is a lot to chew on.

Re-parsing...

The heading "Checkpointing and re-parsing with different input" has what you're looking for, in the case of incremental reparsing from the change-point to the end.

Speculation, since I've not written the code: To support incremental parsing in the sense that only the changed portions of the AST need to be reparsed, you could memoize (in a sense) the parser over parse positions (once it had seen the characters at positions X-Y, return the result from a previous parse over those positions without looking at the text itself) and use a clever scheme for position ranges such that they are stable across new inserts or deletes except in the changed areas themselves.
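
To make that concrete, here is a toy OCaml sketch of the memoization-over-stable-positions idea (all names hypothetical; this is the speculation in code form, not a worked implementation):

```ocaml
(* Hypothetical sketch: cache sub-parse results keyed by "marks" that the
   editor shifts on every insert/delete, so entries outside an edit stay
   valid and can be reused without re-reading the text. *)
type mark = { mutable offset : int }

type ast = Leaf of string | Node of string * ast list   (* toy AST *)

type entry = { start : mark; stop : mark; tree : ast }

let cache : entry list ref = ref []

(* Reuse a previous result if an untouched cached parse starts at [pos]. *)
let lookup pos = List.find_opt (fun e -> e.start.offset = pos) !cache

(* An edit replacing [lo, hi) and changing the text length by [delta]:
   drop overlapping entries, shift the marks of everything after it. *)
let apply_edit ~lo ~hi ~delta =
  cache := List.filter (fun e -> e.stop.offset <= lo || e.start.offset >= hi) !cache;
  List.iter
    (fun e ->
       if e.start.offset >= hi then begin
         e.start.offset <- e.start.offset + delta;
         e.stop.offset <- e.stop.offset + delta
       end)
    !cache
```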

Of course, you would also want to use an error-recovering parser, so that the entire parse didn't become invalid because of temporary-while-typing syntax errors; this seems like it could be implemented independently of the above though. So to use this for what you want, you need (a) to write an error-recovering parser; (b) to invent the right scheme for stable-across-edits position ranges; (c) memoize the parser over (b); and (d) differentiate it as described.

... I am behind my deadline on something completely unrelated, alas.

(( Would all the complexity above be worth avoiding re-parsing all the way to the end? It doesn't seem to be a clear win. ))

It's a huge win.

Re-parsing all the way to the end is insane, unless you are running atop a 500-node modern supercomputer. Incremental re-parsing is a huge win.

A separate problem is good error messages.

Unifying these two issues is a black art that mostly uses heuristics. I am (very slowly) working on a blog article that digs into various open source IDEs and shows how they handle these issues, and also how the language (parsing model) plays a role.

I am (very slowly) working

I am (very slowly) working on a blog article that digs into various open source IDEs and shows how they handle these issues, and also how the language (parsing model) plays a role.

I look forward to reading it. :-)

Nice

This is indeed a black art and it's something I've long been curious about. I hope you'll post something here when you've written something.

Isn't that overstating things a bit?

Re-parsing all the way to the end is insane, unless you are running atop a 500-node modern supercomputer. Incremental re-parsing is a huge win.

Without doing measurements, my intuition tells me that, for most of the work I've been doing lately, full recompilation (not just parsing) takes well under 0.25 seconds per source file.

Heck, I just measured. A C# project I have, using mono, is taking 0.005 seconds/file or 2 seconds/10,000 LOC for recompilation. How much faster does it need to be before it's useful in an IDE / live-editing context? :)

(Sadly, a scala project I have here is clocking in around 1 sec/source file. Scala's compiler is by far the slowest that I use regularly, although this particular project is probably using a dated compiler.)

Note: Unfortunately, I'm not in front of a 500-node supercomputer :)

(Sadly, a scala project I

(Sadly, a scala project I have here is clocking in around 1 sec/source file. Scala's compiler is by far the slowest that I use regularly, although this particular project is probably using a dated compiler.)

This is why the Scala IDE parser/type checker was designed to be fully incremental at the tree level. The compiler was just too slow to redo the file and provide timely feedback. Not sure where things are at now though.

re: Speed of re-parsing the entire file

Heck, I just measured. A C# project I have, using mono, is taking 0.005 seconds/file or 2 seconds/10,000 LOC for recompilation. How much faster does it need to be before it's useful in an IDE / live-editing context? :)

So your project is about 25 LOC/file? 2 seconds per 10,000 LOC is not terrible, but it's not good either. Should that be 0.05 seconds per file? 250 LOC/file seems a lot more reasonable.

Of course, that includes code generation, and not simply the parsing (and possibly type checking) that an IDE/text editor would be interested in. If we assume that takes 1 second per 10,000 LOC, then your text editor would start getting unacceptably laggy in the 1,500-2,500 lines-per-file range.

You do bring up a good point, however. Another potential benefit of an incremental approach is being able to gracefully deal with transient states with syntax errors. This may be even more important than the time taken to re-parse the entire file.

I've read about ad-hoc approaches that have been used to deal with syntax errors. Many of these do not take into account the history of the buffer. I'd like to see a principled approach to deal with transient errors that takes into account the edit history.

Modern IDEs track edit history independently

Modern IDEs track edit history independently of parsing, to provide nice colour-coding in the margin indicating added and unsaved lines. Add a little feedback from the parser indicating the start and end lines of routines, where the parser has decent error recovery, and you can simply hide syntax errors which occur in lines that were known to be good prior to an edit.

Of course, semantic errors caused by renaming referenced symbols, or modifying types, would still potentially be valid. But it seems to me that most of the benefits of a complex incremental parser could be achieved more easily through a simple, fast parser with decent error recovery and appropriate performance elisions for IDE use, combined with IDE editor buffer tracking, and doing the filtering outside the parser.
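
For illustration, the filtering step might look something like this OCaml sketch (hypothetical names; it assumes the parser reports routine extents and the editor reports which lines were touched since the last good parse):

```ocaml
(* Hypothetical sketch: combine routine extents from the parser with the
   editor's record of touched lines, and only surface syntax errors that
   fall inside a routine that was actually edited. *)
type routine = { first_line : int; last_line : int }
type diag = { line : int; msg : string }

let in_routine r line = line >= r.first_line && line <= r.last_line

let visible_diags routines edited_lines diags =
  List.filter
    (fun d ->
       List.exists
         (fun r -> in_routine r d.line && List.exists (in_routine r) edited_lines)
         routines)
    diags
```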

Me too

I've read about ad-hoc approaches that have been used to deal with syntax errors. Many of these do not take into account the history of the buffer. I'd like to see a principled approach to deal with transient errors that takes into account the edit history.

This is what I'm most interested in. Is there even one standard algorithm for this type of parser? I have in mind something that could be implemented by a yacc-alike, for instance, to take a standard grammar (LL, LR(1), whatever) and produce a parser with some of the characteristics we've been discussing. Is this even an area of active work? I'm aware of work on error-recovering parsers, but it's pretty dated and really doesn't deal with incremental edits at all.

It's bizarre that such an important and well-known question seems to suffer from a complete lack of general tools.

Sloppy math/measurements, more likely.

I wasn't trying to definitively demonstrate that reparsing is always a win; just pointing out that the obvious '500-node supercomputer' hyperbole was, well, obvious hyperbole.

I think whether incremental reparsing would be a 'huge win' depends a lot on what language you're working with, and I think the biggest consideration has little to do with parsing itself but with the complexity of further passes.

Yes!

Well said. That's why I asked Sean how many passes scalac makes. I don't even know how scalac handles its plugins feature.

Scala plugins

Scala compiler plugins mostly add a new phase somewhere after type checking. The phase can be inserted anywhere in the sequence.

Additionally, a plugin can add to the type checker. Once the type checker finishes type checking an expression, it passes it to all plugins to let them update the type if they want. This is typically used for annotated type systems; the plugin can add annotations where none are present in the original source code. For example, a string literal would be typed as String but could be retyped by the plugin as String @NonNull.

Error reporting from plugins goes through the compiler's usual error-reporting machinery, so IDEs should be able to report their errors using their usual red squigglies.

I am (very slowly) working

I am (very slowly) working on a blog article that digs into various open source IDEs and shows how they handle these issues, and also how the language (parsing model) plays a role.

Let me know if I could contribute something. The Scala IDE had a really novel way of handling incremental parsing and type checking that involves lots of memoization, and hacks to make the IDE incremental case look like the batch compiler case to existing compiler code (with mixed results of course).

Novelty

I am not sure how novel your solution is, but I would like to see you blog about it on Robotic Pandas. ;-)

I'll probably also eventually read the source code for the Scala plug-in. Has it changed much since you stopped maintaining it and somebody else took over?

Reparsing is fine

Reparsing is fine. For an IDE, you don't actually need to parse everything in the first place - you can intelligently skip lots and lots of work; but even if you did need everything, a reasonably fast compiler can cope handily enough.

To take an example, I've recently timed the Delphi compiler compiling its 417 kloc RTL in less than 11 seconds on my current (admittedly fast) machine. That's nearly 38 kloc/sec, and it includes syntax analysis and type checking for every routine, optimization, code generation, and multiple (static and dynamic) library links - in other words, most of the work is not needed by an IDE.

The heading "Checkpointing

The heading "Checkpointing and re-parsing with different input" has what you're looking for, in the case of incremental reparsing from the change-point to the end.

I don't think I would call re-parsing from the change-point to the end very incremental. What if you edit the beginning of a 10,000 line file?

Speculation, since I've not written the code: To support incremental parsing in the sense that only the changed portions of the AST need to be reparsed, you could memoize (in a sense) the parser over parse positions (once it had seen the characters at positions X-Y, return the result from a previous parse over those positions without looking at the text itself) and use a clever scheme for position ranges such that they are stable across new inserts or deletes except in the changed areas themselves.

Yes, this is pretty much what you need to do and is what I did in the Scala IDE. You also have to be clever about braces/comments if you have those in your language (which Scala does).

(( Would all the complexity above be worth avoiding re-parsing all the way to the end? It doesn't seem to be a clear win. ))

Yes it is a clear win, but only when you add type checking information into the mix. The point of remembering trees is not to save parse effort (parsing is relatively fast), but to save type checking effort that is organized around trees (type checking is much slower than parsing).

Still, I see this work as very interesting; hopefully it follows realistic use cases!

As I've said elsewhere in

As I've said elsewhere in this thread, you can skip type checking almost everything if the most important thing to you is stuff like symbolic completion. (You can also skip parsing method bodies.)

For deeper analysis, you can parse and type-check in the background, to provide squigglies or other preferred means of highlighting errors in source, or populating file navigation treeviews, etc. This work can be done after a delay from the last edit, and interrupted by new edits. But this work is not necessary for the most common analysis that is blocking and therefore time critical, i.e. symbolic completion.

As I've said elsewhere in

As I've said elsewhere in this thread, you can skip type checking almost everything if the most important thing to you is stuff like symbolic completion. (You can also skip parsing method bodies.)

This is what I've been told Xcode does. It separates the parser into multiple streams, such that one input can be parsed multiple ways. Detection of things like a method body is 'skipped' in the sense that the body is treated like a comment. I'm told they use similar ad-hoc rules for error message generation.

Also,

For deeper analysis, you can parse and type-check in the background, to provide squigglies or other preferred means of highlighting errors in source, or populating file navigation treeviews, etc.

I'm not sure what you mean by "type-check" here. Type checking doesn't necessarily give good error messages, especially error messages that can be used to highlight errors in source. A pre-expression may be typed as a "wrong" expression far from the source of the problem.

Since you work for Embarcadero, I should note my favorite IDE of all time, JBuilder, was phenomenal at interactive error reporting.

By type-check, I mean

By type-check, I mean analyse expressions other than those in the "current" method (including parent routines for nesting) for type correctness, to make sure the symbols referred to are valid, the conversions are permitted, the operators are applicable, etc.

For example, if you rename a symbol somewhere (without using refactoring support), all references to the old symbol name will become invalid, but those aren't syntactic problems. They are also not usually problems with respect to code completion - only if references occur in the current expression will it be a problem. So, you can defer such checks to the background, and draw in squigglies or other desired error notifications when (and if - aborting on input) the analysis completes.

I must root around and see if I can find the old JBuilder sources. It was my favourite Java IDE too - I never really understood the appeal of Eclipse, which always poked too many abstract concepts into its UI for my liking.

JBuilder 9 was pretty

JBuilder 9 was pretty good.

It also had this ridiculously funny "Did you know?" "splash screen" with a light bulb on it. If you clicked on the light bulb, it would toggle the yellow "light" on and off. But when the light was on, the drop shadow was still there! So I filed a bug report.

But you don't need to reparse that much....

Remember, you're talking about using this in the context of an IDE. In that context, you only need to reparse what's necessary for updating the screen display -- which, in the case of editing as opposed to scrolling, is maybe the next 25-50 lines or so, at most.

Given that most of that is likely to be within the same scope, you're probably not going to end up with all that much further subdivision from stable-across-edits ranges.

IDE doesn't need incremental parsing

An IDE doesn't need incremental parsing. I can speak directly for the Delphi compiler in the RAD Studio IDE as an existence proof.

There are various tricks you can use to speed up the job and re-parse as necessary between keystrokes. For example, you can skip method bodies using a simplified token nesting grammar, without needing to do expression parsing, type checking, etc. All you really need for the IDE is a good idea of the types defined in the current file, and a well-typed parse up to the cursor point within the current method (as the most valuable point of IDE parsing is to provide some kind of symbolic completion).
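
As a rough illustration (not Delphi's code), the body-skipping trick only needs a token-level nesting count; a toy OCaml version:

```ocaml
(* Illustration only: skip a method body with a simplified token-nesting
   grammar -- no expression parsing or type checking inside the braces.
   Assumes toks.(i) is the body's opening brace and the braces balance. *)
type tok = LBrace | RBrace | Other

let skip_body (toks : tok array) (i : int) : int =
  let rec go j depth =
    match toks.(j) with
    | LBrace -> go (j + 1) (depth + 1)
    | RBrace -> if depth = 1 then j + 1 else go (j + 1) (depth - 1)
    | Other -> go (j + 1) depth
  in
  go i 0   (* returns the index just past the matching closing brace *)
```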

The Delphi compiler works along these lines. For symbolic completion, it uses a special token representing the current cursor location, which when found by the recursive descent parser, causes a longjmp out with all the relevant data about syntax position, expected symbol kinds, and symbol context as appropriate.
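
In OCaml terms the longjmp escape would be an exception; a hypothetical sketch of the cursor-token trick (all names are mine):

```ocaml
(* Hypothetical sketch: the lexer emits a sentinel token at the caret;
   the parser calls check_cursor wherever a symbol could appear, and the
   exception (standing in for longjmp) escapes with the data code
   completion needs. *)
type completion_ctx = {
  expected : string list;   (* symbol kinds the grammar accepts here *)
  scope : string list;      (* enclosing symbol context at the caret *)
}

exception At_cursor of completion_ctx

let check_cursor is_cursor ~expected ~scope tok =
  if is_cursor tok then raise (At_cursor { expected; scope })

(* The IDE drives an ordinary parse and catches the escape. *)
let complete parse tokens =
  match parse tokens with
  | _ -> None                               (* caret never reached *)
  | exception At_cursor ctx -> Some ctx
```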

I believe your experience is

I believe your experience is extremely specific to Delphi and wouldn't apply to all languages. The techniques you describe are not applicable in many languages, or I would even go as far as to say most languages.

C# IDE support gets away with not being very incremental because they have a really, really fast compiler, measured in millions of lines per second. Scala, on the other hand, is not as fast, and it's questionable whether Scala compilation could ever be as fast as C# given the complexities of the language and its type system. The Scala compiler does a lot of work, which I see as a good thing, but this means we have to be smarter about building a fast compiler; e.g., being incremental vs. just doing the same work over and over again. Background work is definitely needed, but if it takes too long, the feedback provided by the IDE is useless because it comes too late.

The other advantage of an incremental IDE compiler is error recovery: a transient error doesn't require throwing away a lot of good parse and type information that will also be valid later when the error goes away, and allows us to isolate transient errors very effectively.

Anyways, check out the PLIDE workshop proceedings for more details.

Benchmarking is hard to predict

<Rambling>
A while back my friends and I were teasing our mutual friend ("Mark") for having "Emacs disease", where "you think and act, in everyday life, in terms of meta characters", and "this wouldn't be a problem if you just used vim." ;-) The joke got to the point of saying emacs is bad at handling large amounts of file memory, and Mark had similar problems... so he got upset at this. We just said, "Look, try loading the spell dictionary file into emacs, we'll load it into vim." Right before he took the challenge he asked, "Uncompressed?" He had a feeling he was going to lose, but as I recall it was surprising how badly vim beat emacs.
</Rambling>

P.S. How many passes does the Scala compiler make?

I believe your experience is extremely specific to Delphi and wouldn't apply to all languages.

I believe Delphi uses a single pass.

I believe Delphi also does not perform a topological sort of program modules, as in Haskell. Delphi is based on Pascal, which uses a one-pass model with forward declarations (requiring the programmer, in effect, to do the topological sort -- a task that could be delegated to the compiler by changing the language model, with a trade-off in memory and time performance).

If by pass you mean phase

I think the Scala compiler has around 10 or so phases, but Martin is the right person to ask about Scala compiler performance. The Scala compiler does some amazing stuff, so it's never going to be as fast as, say, a C# compiler. The path to performance is then to compile smarter (by re-compiling less often) rather than just brute-force recompiling everything all the time (e.g., as the C# compiler seems to do). Each language requires a different IDE solution based on its needs; sometimes brute force or aggressive heuristics are applicable, sometimes not.

How many times do you look back over your shoulder before

leaving the house in the morning....

That is how many passes a compiler has.

If you get to your Smart Car and are ready to hound down the ladies in Beijing with that sexy ride, but then find yourself saying, "Where did I put my keys?" And you go back into the house and look for them. Then that's two passes, just to go for a joy ride in your muscle car.

Most fast compilers require programmers to lexically define references before using them. This allows the compiler to zip through the file (and all files in a specified compilation order), end-to-end. For batch compilers, a challenge is to build a one-pass compiler that provides features such as type inference as well as a multi-pass compiler would. Ultimately both can produce the same results, and "one-pass" is never truly one-pass, except that it "never looks back". Most IDEs also assume, when doing performance benchmarks, that you are using a one-pass compiler, because one-pass compilers tend to be better at the sort of heuristics discussed earlier in this thread by Barry Kelly. The reason is that the state space is smaller, so the heuristics work better, by definition reporting fewer likely false positives.

If you have a language that

If you have a language that (a) has at least something resembling a context-free grammar, and (b) separates out type definitions from expressions (which is usually a prerequisite for static typing and hence the most useful form of code completion), then I don't see why the heuristics I mentioned should lose validity owing to multiple passes.

For example, a C# compiler would not usually be implemented using a single pass - you'd want to do a pass to get the type structure before parsing expressions, as that would permit you to type the expressions as you parse them, and make parsing decisions based on types, as is needed, for example, by a recursive descent parser in certain cases in expressions involving generics (the left angle bracket ambiguity issue). But this doesn't stop it using the heuristics I mentioned, in fact it makes it even easier - getting the type structure from an initial pass means you've already written the code to skip method bodies. The type structure is what you really want out of the parse anyway, as it's what you need for code completion.

I take taxis

I don't believe the notion of "pass" is very relevant in the Scala compiler. It has many phases, but each phase may visit a tree one or more times, and it has a notion of lazy type completion that means you might visit trees as a side effect of other processing. The important thing, I guess, is where the overhead is. Some phases have high overhead (I remember refchecks taking lots of time) while some are very cheap (lexing and parsing are at the bottom).

just parse + type-check/name-resolve

The Scala compiler has several passes, but to simply parse and type check, you just need to parse and type check. The two phases are cleanly distinct.

There are more phases after the basic type checking that can also generate errors. In particular, regular type checking is followed by "refchecks", which performs a laundry list of checks. Additionally, later phases can and do add new errors of their own. The bulk of possible errors, though, are found during the true type check and then during the follow-up refchecks pass.

The type-checker part is complicated. It's separated into two parts, type checking and name resolution, and those two parts are combined and run together at the same time. Here I get a little hazy, but roughly, the combined pass descends through the tree doing two things: allocating naming environments and attaching lazily evaluated type-checker routines to each name. Then the compiler goes through and forces all the type checks to be evaluated. Type checks often require a name lookup (typechecking "a" requires finding a's symbol), and name lookups often require a type check (looking up "b" in "a.b" requires knowing the type of "a").

One result is that the files don't have to be checked one at a time. The compiler can type check a little bit of one file, use that to type check part of a different file, and then go back and check some more of the first file.
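
A loose OCaml sketch of that lazy-completion idea (nothing like scalac's real data structures; all names invented):

```ocaml
(* Loose sketch: each declared name is bound to a lazy thunk that performs
   its type check on demand, so forcing one name's type may pull in names
   declared in other files, in whatever order the dependencies require. *)
type typ = TNamed of string | TArrow of typ * typ

let env : (string, typ Lazy.t) Hashtbl.t = Hashtbl.create 97

(* Declaring a name attaches the unevaluated type-checking work to it. *)
let declare name compute_type = Hashtbl.replace env name (lazy (compute_type ()))

(* Looking up "a" while checking "a.b" forces a's thunk, wherever a lives. *)
let lookup name : typ = Lazy.force (Hashtbl.find env name)
```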

I know my experience is

I believe your experience is extremely specific to Delphi and wouldn't apply to all languages. The techniques you describe are not applicable in many languages, or I would even go as far as to say most languages.

I know my experience is applicable to C, Java, C# and Delphi, and probably most Algol-influenced languages; in other words, most statically-typed languages in commercial use with IDEs.

Languages that have a (reasonably) context-free grammar, (mostly) static types and type definitions that are parseable separately from expressions within the bodies of routines seem to be intuitively amenable. Top-level type inference would probably be the most immediately obvious showstopper. That still leaves a lot of languages, though.

For example, you can skip

For example, you can skip method bodies using a simplified token nesting grammar, without needing to do expression parsing, type checking, etc.

This makes sense to me. Somehow one has to keep the effects of text buffer changes local, or at least bound them. But this might require special-case heuristics for certain characters after all; e.g., adding/removing a '"' affects the complete token stream, just like adding/removing a '/' in languages that use that character for multi-line comments. So I still expect some "black art" and language-specific optimizations and special-case treatments.

BTW I'm disappointed by what I found when I searched for "incremental parsing" on Google Scholar. I can't wait for Z-Bo's findings.

Memoization of parse trees

Memoization of parse trees does exactly this. You only have to reparse methods/blocks that could have changed. Note that in Scala you can have pretty large methods, with classes and methods nested within a method, so just doing "method bodies" isn't good enough.

I handled the C-style comment/brace case by disabling re-parse/re-type when braces or comment delimiters weren't balanced (Scala also has XML literals to consider). This means that before we even bother running the lexer, we do a quick pass to match braces (it can be done incrementally, but it's easy enough just to redo the file). Auto-completion of braces then ensures that braces/comments are already balanced (you can disable auto-complete; just don't expect the IDE to give you feedback while braces are unbalanced). This helps a lot with error recovery in the parser, since we can always assume that braces are balanced.
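
The pre-pass is about as simple as it sounds; a toy OCaml sketch (ignoring strings, comments and XML literals, which a real version must handle):

```ocaml
(* Toy version of the pre-pass: before even running the lexer, check that
   braces balance and skip re-parse/re-typecheck if they don't. *)
let braces_balanced (src : string) : bool =
  let depth = ref 0 and ok = ref true in
  String.iter
    (fun c ->
       match c with
       | '{' -> incr depth
       | '}' -> (decr depth; if !depth < 0 then ok := false)
       | _ -> ())
    src;
  !ok && !depth = 0
```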

Again, black art, very few people have experience in this area, and each person develops a unique solution to each problem.

Usually, master's theses in "Software Engineering"

are the best places to look for information on IDE-related issues.

For example, Eric Allen, co-lead of the Fortress project with Guy Steele, did his master's project on DrJava and its associated IDE. He teamed up with Brian Stoler. The papers, including Stoler's master's thesis, are available on the DrJava Papers page.

I don't recommend reading the DrJava source code; it was an unholy mess the last time I saw it (ca. 2006) and what you would expect when two master's students want to get back to the real world ASAP. (Anyone who has read much graduate student code knows it tends to be an unholy mess, and doesn't necessarily reflect the kind of code they would produce in a real-world setting.)

There is also Kirill Osenkov's master's thesis on building a structured editor-based source code editor for C#. Kirill rewrote a fair chunk of SharpDevelop so that it was not tightly coupled to line numbers and column numbers (which make no sense in an editor that presents the AST directly to the user).

By the way, none of this stuff is terribly hard to understand, which is why nobody writes about it. Once you think about the problems for a while, the solutions become obvious.

I want to find an approach

I want to find an approach which works uniformly for all context-free languages and doesn't require any manual configuration. The only information that goes in is the syntax description in the form of a CFG. With this in mind, incremental parsing becomes a feature of the parser generator rather than of the IDE, which can consume it as a service.

I don't think this problem is trivial other than that everything is obvious once you know the solution.

holy grails

It's easy enough to automatically memoize parse trees and then have a damage/repair style framework for updating damaged parse trees. What isn't easy...and this is true even in the non-incremental case, is error recovery, where, quite frankly, syntax descriptions are just not expressive enough to be used in doing decent error recovery.

CFGs are not the definitive description of a language's syntax: yes, they describe how it can be parsed in the perfect case, but nuances in precedences and matching relationships that are useful in incremental parsing and/or error recovery are just completely obscured in the CFG. If you want an automatic solution, you'd have to get away from CFGs, which are effective for non-exceptional parsing, but nothing else.

Beyond CFGs

I'm fine with that, too, if that's what it takes. Basically, it seems that there are a whole lot of things that people today expect from parsers that the grand tradition of grammars/parsers just don't provide. Obviously it's possible to provide at least some of this functionality, since people do it today in a variety of hand-crafted ways. It's just that we seem to lack a general approach or theory (or better, a set of theories like those classifying the context-free languages into classes). And of course we'd like to know which syntactic constructs will be easier or harder for this type of parsing. Regarding that question, we don't even seem to have anything beyond rules of thumb, and certainly no real classification.

One thing that seems obvious right away, for instance, is that the whole lexer/parser hierarchy is immediately bankrupt in this setting, since unbalanced double quotes need to be handled intelligently.

I'm really curious what kind of formalism would be appropriate for this setting, and I continue to be surprised that there seems to be so little research in this direction, given the importance of interactive compilation in IDEs.

One other topic not mentioned

Code formatting

One thorny pet peeve of mine: most XML editors are an extreme pain in the passes when it comes to how they handle opened and closed XML elements. I think the latest version of Eclipse does the best job of handling this.

Edit: Other things, such as code assistance in a Watch window, would be nice, too.

Basically, it seems that

Basically, it seems that there are a whole lot of things that people today expect from parsers that the grand tradition of grammars/parsers just don't provide.

I do believe programmers and authors of parser generators broke with this tradition a long time ago, when they began to transform parse trees into ASTs for further processing. I'm a resolute reactionary about this and would like to restore the tradition with a few elegant techniques which have been neglected (at least in practice - I can't say much about research/scholarship, which just might not have bubbled to the surface of public consciousness).

I believe in the same way one can also deal with precedences, locality and so on. It is all information which can be derived from the grammar alone and made available through an API.

I'm surprised that this isn't common practice after syntax has been studied almost to death over the last 50+ years. But maybe lots of this knowledge is just buried in tools, commercial or open source. So I welcome everyone who does mining and archeology.

I believe in the same way

I believe in the same way one can also deal with precedences, locality and so on. It is all information which can be derived from the grammar alone and made available through an API.

Through an API is potentially much less useful than through a capability-based user interface.

I've already thought about this issue, but my first conceptual model was dumb... or at least a temporary dead-end. I mentioned it on LtU before. It involved rubber-stamp pattern matching using statistical models. The major problem is that most useful analysis involves synchronization tokens, and almost all synchronizing pairs indicate some sort of composition/aggregation of sentences.

I've also already thought about an algorithm for automatically deriving important Watch window watches. Technically speaking, Martin Burger beat me to applying Zeller's principled approach of delta debugging to source text edits, with DDchange - which tells the user about failure-inducing changes... but this can and should be integrated with concolic testing.

Edit: You may also be interested in Program analysis as model checking of abstract interpretations. This is the original idea behind very responsive, IDE-based program analyses in Goanna (named after the Australian monitor lizard).

What isn't easy...and this

What isn't easy...and this is true even in the non-incremental case, is error recovery, where, quite frankly, syntax descriptions are just not expressive enough to be used in doing decent error recovery.

You should provide a walk-through example when you explain error recovery to somebody.

Error recovery literally means tricking the parser into thinking the input stream it just saw actually contained something other than what it just saw. You can do this by inserting tokens into the input stream, deleting tokens from the input stream, or substituting tokens in the input stream. You can do this recovery locally (at the point of error) or globally (at some point before the error, but not after, since the parser has not yet seen the rest of the stream at the point where the error occurs). Global recovery is in general limited to K steps, where K is the size of the parser's look-ahead. All global recovery algorithms in this family are basically variants of the Burke-Fisher error repair algorithm. Local recovery effectively always uses token pairs to try to pair off errors.

Finally, error recovery can be further 'localized' to a specific text buffer by organizing the source text into ropes, using a "Reduced Model". Then a parser that supports multiple entry points can be used to divide and conquer the source text. In this approach, you can perform a hybrid of local and global techniques. Common constructs such as OOP class definitions tend to be syntactically encapsulated by synchronizing constant tokens, such as LPAREN and RPAREN. The big idea is to use the LPAREN and RPAREN as synchronization tokens to define a global recovery context by putting the parser's entry point at the class body.
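
A toy OCaml sketch of the simplest local flavour, deletion up to a synchronization token (far cruder than Burke-Fisher; the token set is just for illustration):

```ocaml
(* Toy "panic mode" recovery: on a syntax error, delete tokens until a
   synchronization token such as ';' or '}' appears, then resume as if
   the erroneous phrase had ended there. *)
type token = Semi | RBrace | Ident of string

let is_sync = function Semi | RBrace -> true | _ -> false

let rec recover = function
  | [] -> []
  | t :: rest -> if is_sync t then rest else recover rest
```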

Error Recovery vs. Localization

where, quite frankly, syntax descriptions are just not expressive enough to be used in doing decent error recovery

It sounds like what you'd want is a syntax description that can be annotated with likely omissions and insertions such that it can not only recover from error, but also intelligently identify and describe the error to the developer.

Perhaps simpler, syntax itself can be designed to localize errors, which is useful when working with live code. I'm somewhat fond of simply asserting that two sequential blank lines in a file create a section separator (a 'full stop' is the term I favor), after which the next line must be in the top-level syntax. I have also used a simple design (for a file full of named scripts written in different languages, which are then compiled by pluggable abstract factories) where top-level elements are simply those lines not starting with a space character. (A multi-line script would need to slightly indent each subsequent line after the first.)

This localization of errors is more of a KISS solution. It doesn't aim to correct or recover from anything, but it still allows the sort of rapid prototyping and live programming features programmers want, and it has a very low processing cost. Defining a pair of blank lines as a section separator allows one to quickly create a workspace and start writing a new function without risk of changing how the elements defined before and after it in the same file are interpreted (except in the rare case where you are tweaking a syntax extension). The rapid-prototyping advantage is simply that dead, half-written code will essentially be eliminated by the parser, and thus won't cause problems for static analysis or interpretation of other code.
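
A tiny OCaml sketch of the 'full stop' rule under those assumptions (two consecutive blank lines, i.e. three newlines in a row, end a section), using the Str library from the standard distribution:

```ocaml
(* Toy version of the "full stop" rule: each section parses (or fails)
   independently, so a half-written section cannot damage its neighbours. *)
let split_sections (src : string) : string list =
  Str.split (Str.regexp "\n\n\n+") src
```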

While a quest for a holy grail sounds thrilling, on many nights many knights would be happy with a KISS...

As for KISS, the basic

As for KISS, the basic Einstein comment about keeping "it as simple as possible, but not any simpler" should apply. There is a trade-off between programmer convenience and tool-writer convenience: sometimes we must make things harder or more awkward for the programmer, but in general we want to move the line as close to programmer convenience as possible.

Anyways, I believe you can mostly have your cake and eat it too, at least for specific C-like languages. It just requires a lot of clever engineering, and you have to think about the issues at the beginning of the language design process.

Something I've wondered about...

...is whether the trick to easier incremental parsing doesn't lie in the parser, but rather in the lexer.

That is, if your lexer gave you a well-bracketed sequence of tokens (i.e., begin/end, {/}, <foo>/</foo> etc. pairs were all guaranteed to match up), then you could do things like bound the range of ill-parsed data much more reliably, and edit in the middle of a buffer without having to throw away the parse of what follows.
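
For illustration, a toy OCaml sketch of bounding the damage with bracket pairs (names mine; it assumes the brackets really do match up):

```ocaml
(* If the token stream is well bracketed, the damage from an edit is
   bounded by the innermost bracket pair enclosing it; everything outside
   that region keeps its old parse. *)
type btok = Open | Close | Word

(* Index pairs of matching brackets. *)
let bracket_pairs (toks : btok array) : (int * int) list =
  let stack = ref [] and pairs = ref [] in
  Array.iteri
    (fun i t ->
       match t with
       | Open -> stack := i :: !stack
       | Close ->
           (match !stack with
            | j :: rest -> stack := rest; pairs := (j, i) :: !pairs
            | [] -> ())
       | Word -> ())
    toks;
  !pairs

(* Innermost pair enclosing the edited token index: the only region that
   needs re-parsing. *)
let damage_region pairs edited =
  List.fold_left
    (fun best (o, c) ->
       if o < edited && edited < c then
         (match best with
          | Some (bo, bc) when bc - bo <= c - o -> best
          | _ -> Some (o, c))
       else best)
    None pairs
```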

I know there's been a fair amount of work on "nested word automata", which are weaker than PDAs, but which can still recognize explicit block structure. They arose in verification, to model-check programs with function calls and returns, but I don't think they have been applied to parsing.

I agree although many lexers

I agree, although many lexers are insensitive to keywords and do not distinguish between "for" and "fo" but assign the same token type to each. So while the token type sequence is preserved, there is an impact on the parser. This is an implementation issue only, though, and one might add a field to the token representation which provides keyword information and makes them distinct (or, alternatively, introduces plenty of new token types). So all changes which affect the parser, either locally or non-locally, shall be reflected in changes to the token stream.

I'm not sure what you are

I'm not sure what you are asking for, or perhaps rather why you are asking it. You seem to be suggesting you want to do away with heuristics, and I would say that is a huge usability mistake. Heuristics can provide better feedback than an algorithm automatically derived from some meaning function.

Maybe the closest thing would be Oege de Moor's recent work with one of his Oxford Ph.D. students on refactoring and attribute grammars (i.e., JunGL), sponsored by MS Research Cambridge? Oege is sort of indirectly cheating here, wisely choosing refactoring as his problem domain, since I don't believe he has any intention of providing users with good error messages. A refactoring language is just that: valid-to-valid expression transformation. Incremental edits that put the source text into a state where pre-expressions are invalid are a different matter.

As a separate matter, it is possible for an IDE to have a standard set of hooks that a parser generator's generated parser could hook into. This would allow programmers to define a grammar and use convention over configuration to automatically do a lot of tasks, but it would only work as scaffolding. Sort of like ActiveRecord in Ruby on Rails, but for IDEs. I've thought about this approach, and that basic idea is why I started scavenging open source IDEs looking for the best tricks. The basic idea was to eventually include it as a plugin to VPRI's SourceIDE, as a really fast way to bootstrap supporting any language in SourceIDE.

Bottom line: You can't eliminate the role of the IDE, even if the IDE merely consumes as a service a CFG, parser generator pair. The service must expose endpoints to the IDE to provide meaningful features to users. In other words, for a client to use a service, the service must be (1) discoverable (2) interoperable!

Lazy Functional Incremental Parsing

The last Haskell workshop had a very interesting paper by J. P. Bernardy on this topic: http://www.cse.chalmers.se/~bernardy/FunctionalIncrementalParsing.pdf

[edit: comment retracted]

Yi's incremental parsing

The Yi editor (written in Haskell) has an incremental parser combinator library.

http://www.cse.chalmers.se/~bernardy/FunctionalIncrementalParsing.pdf