Interactive Parsing Theory

Greetings,

I've posted here in the past, and I have a few questions about the basics of parsing; I'd also like a bit of insight into parsing theory.

I'm currently writing a parser compiler to better understand the formal side of parsing, and as part of it I hand-wrote the parser for the grammar description language. One of the major things I've learned through this is how little I knew when I started, and how little I still know.

In the general case, when parsing the full file, you begin from the start rule and go from there; that much is pretty basic. The parts I get hazy on are how to handle partial parses in interactive environments, such as the user typing text into Visual Studio with your parser infrastructure integrated through the Visual Studio Extensibility. The current method I use involves parsing the entire thing every time, with an asynchronous delay that resets on each key press, so it isn't literally reparsing everything on every keystroke. The major question I have is how to merge the parse trees and isolate the impact of a change to the concrete/abstract tree that results from the user's text.
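
For context, the "asynchronous delay" is just a debounce: every keystroke cancels the pending reparse and schedules a new one, so the full parse only runs once I stop typing. A rough sketch of the idea in Python (the real thing lives in the VS extension; the names here are made up):

import threading

class DebouncedReparser:
    """Run `reparse` only after the user has been idle for `delay` seconds."""

    def __init__(self, reparse, delay=0.3):
        self.reparse = reparse        # callback that parses the whole buffer
        self.delay = delay
        self.timer = None

    def on_keystroke(self, buffer_text):
        # Cancel any pending reparse and start the countdown over.
        if self.timer is not None:
            self.timer.cancel()
        self.timer = threading.Timer(self.delay, self.reparse, args=(buffer_text,))
        self.timer.start()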

Is there research on interactive parsing I could look at to get an idea?

In the future I want to understand this concept well enough to automate its integration into the project I'm writing now, even if that only happens in the second or third iteration. I have a feeling it's complicated, perhaps more than I realize.

Insight is very welcome.

Also, a note to people posting links to published documents: I don't have subscriptions to any journals or publication websites, so it might be difficult for me to follow sources that charge a fee for access.

Incremental programming

Incremental programming would be the subject you want to study. That, and live coding. Incremental processing at all levels, even parsing, is very useful for live coding and live programming.

On LtU, Sean McDirmid has been doing recent work in this subject.

There are a number of simple techniques for incremental processing that are useful and can easily augment more traditional systems, such as abundant use of memoization.

There is a whole theory of

There is a whole theory of incremental parsing that was explored in the 90s, perhaps ending with work on Harmonia. I've never found this work to be that useful as incremental parsing is quite easy to solve as a straightforward memoization problem.

My REM paper describes a bit of how we accomplish this, but it generalizes very quickly. I'm going to discuss this at Parsing@SLE at SPLASH in a couple of weeks, so I'll eventually make my position paper available. However, there are lots of details to consider:

  • Start with a hand-written recursive-descent parser. I assume this is what you might have already; otherwise you'll need to hack your parser generator/library, which doesn't sound like much fun.
  • Memoize your lexical stream so you can track parse tree positions in a way that is stable across edits. Once you know what trees you previously computed for a lexical token, you can "reuse" that tree when you derive a tree from that token again, assuming the old tree is compatible with the new tree you need.
  • Editing the buffer "damages" the innermost tree within the edit, which then needs to be "repaired" by reparsing from a worklist of damaged nodes (you might damage more nodes while repairing, so worklist processing is necessary; see the sketch after this list). Edits on a boundary should damage all trees around that boundary. If a repaired child tree ends with a new token, damage its parent.
  • Isolate errors by "reverting" a tree to its previous good state and boundaries when a syntax error is detected.
  • Match braces in a separate pass BEFORE you parse. Make brace matches sticky across edits. You'll probably need to add some brace auto-completion for this to work very well, or at least complete unclosed braces with inferred braces the user can't see, if visible auto-completion annoys them. If the user later types the closing brace in a different location, delete the inferred closing brace and use the one the user typed. By matching braces before parsing, you can use the matches to drive better error recovery, and brace matching itself becomes easy to solve incrementally.
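
Roughly, the damage/repair loop from the third bullet looks like the following sketch. This is not my actual implementation; Node, reparse_node, and the damage bookkeeping are placeholders:

def repair(damaged_nodes, reparse_node):
    """Drain a worklist of damaged parse-tree nodes.

    reparse_node(node) is assumed to reparse the text under `node` and
    return (new_subtree, newly_damaged), where newly_damaged covers cases
    like a repaired child ending with a new token, which damages its parent.
    """
    worklist = list(damaged_nodes)
    while worklist:
        node = worklist.pop()
        if node.repaired:            # may already have been fixed via a parent
            continue
        new_subtree, newly_damaged = reparse_node(node)
        node.replace_with(new_subtree)
        node.repaired = True
        worklist.extend(newly_damaged)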

I'm sure there are a lot of other details, but frankly, parsing is kind of boring; the real fun is in incremental type checking :)

Match braces in a separate

Match braces in a separate pass BEFORE you parse.

And if you're the language designer and you pick indentation for block delineation, then you don't have to worry about the interplay of braces with other lexical features (braces in strings, etc).

I've done this with the

I've done this with the version of YinYang shown in the recently posted web essay. It actually is harder to incrementally match indentation blocks than it is to match curly braces, but it works better aesthetically and there's no unwanted brace completion (at least for curly braces; I still need it for parentheses).

How do you figure?

It actually is harder to incrementally match indentation blocks

Are you trying to stay incremental within a line? Reparsing a whole number of lines is incremental enough for my purposes and integrates pretty trivially with block indentation.

The complexity is that while

The complexity is that while curly braces have a closing brace for each block closed, with indentation multiple blocks can be closed by a single un-indentation; e.g. think indent levels 0, 2, 4, 6...and then 2 again. It's not a huge problem, but a tricky one to get right. My parser actually depends on resolving blocks first so that this information can be used to make parsing (and error recovery) more robust.
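
Concretely, this is an indent stack (essentially what Python's own tokenizer does). A small sketch, assuming indentation has already been measured in columns:

def blocks_closed(indent_stack, new_indent):
    """Pop the indent stack until it matches the new line's indentation.

    With indent_stack = [0, 2, 4, 6], a line indented to column 2 closes
    the blocks at 4 and 6, so two blocks end at once.
    """
    closed = 0
    while indent_stack and indent_stack[-1] > new_indent:
        indent_stack.pop()
        closed += 1
    return closed

# blocks_closed([0, 2, 4, 6], 2)  ->  2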

Second Version

So the concepts involved here are likely fairly complex. I think I understand them on a conceptual level, but altering an existing system to use them seems like something that has to be planned from the start.

I think once I have it building the Concrete Syntax Trees I'll focus on the concept of damaging the nodes as needed in the second version of the program.

So that I understand: this process would involve the positions of the nodes being based on the tokens of the active stream. Once changes occur, you'd effectively have to construct a new version of this token stream that mirrors the original, with the new tokens inserted into the stream as needed...?

The part I'm confused on is how to do state tracking: knowing what state the parser would need to be in to begin a parse at a given point. Is there more context information I'd need to inject into the trees to help isolate the parse state, or is stepping through the actual tree structure the key to isolating the state?

Since I seem to have the attention of someone with a lot more experience than I have, perhaps you could review what's below and give some insight:

Currently the way I intend to handle the parse is to extract the key data points (elements called out by name) and construct classes/interfaces that represent them to define the CST. It'll be an LL(*)-style parser with methods that focus on path determination and direct the individual parse rules, since for every state within a rule there should be only one valid path, short of two rules overlapping on the same token stream, in which case further code disambiguates on what follows, as needed. This will increase the method count, but I figure it's the lesser of two evils: look ahead to see which path to take, then parse using the memoized token stream, versus trying every possible match, going with the longest, and paying for the backtracking involved.
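
Roughly, each rule method would look like the sketch below: peek at the lookahead, pick the single valid path, and only look further when two rules overlap. The token kinds and rule methods are invented for illustration:

def parse_statement(tokens, pos):
    """Pick the one valid path from lookahead instead of backtracking.

    `tokens` is a list of (kind, text) pairs; parse_if, parse_while,
    parse_assignment, and parse_call are the individual rule methods
    (placeholders in this sketch).
    """
    kind = tokens[pos][0]
    if kind == "IF":
        return parse_if(tokens, pos)
    if kind == "WHILE":
        return parse_while(tokens, pos)
    if kind == "IDENT":
        # Two rules start with IDENT; look one token further to disambiguate.
        rule = parse_assignment if tokens[pos + 1][0] == "ASSIGN" else parse_call
        return rule(tokens, pos)
    raise SyntaxError("unexpected " + kind)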

The lexer construction is similar to the rule construction, except it'll be a character-by-character state machine; tokens will match simpler data types: characters, strings, enumerations, and on occasion groups of those. This version of the project doesn't have match-code injection; the goal is to capture the relevant data elements so you can process a token when you actually get to it. Knowing the actual value of a number isn't relevant until you need to emit it; for pretty printing it's unnecessary. With strings, converting the escapes into their literal characters is useful when you're emitting the value, but counterproductive when you're just reprinting it. When you're concerned with simply matching something, you can tell it to perform a right-most merge on that part of the state machine, as you might with keywords (versus keeping them separate). Merge cases are likely few, but they're useful when you care that some member of a set of literals matched, and you just want to know what matched (in text), not necessarily which one.
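
For the token side, something like the following sketch is what I have in mind: the token stores only its kind and span, and the decoded value is computed on demand (decode_escapes is a placeholder):

class Token:
    def __init__(self, kind, source, start, length):
        self.kind = kind
        self.source = source          # the full input text
        self.start = start
        self.length = length

    @property
    def text(self):
        # Exactly what appeared in the source; enough for reprinting.
        return self.source[self.start:self.start + self.length]

    def value(self):
        # Only computed when something actually needs the decoded value.
        if self.kind == "NUMBER":
            return int(self.text)
        if self.kind == "STRING":
            return decode_escapes(self.text[1:-1])   # placeholder helper
        return self.text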

this process would involve

this process would involve the positions of the nodes being based on the tokens of the active stream. Once changes occur, you'd effectively have to construct a new version of this token stream that mirrors the original, with the new tokens inserted into the stream as needed...?

You must start with incremental lexing; there is no "new version" of the lexical stream. If your lexer is incremental, your token identities are stable and can be used to access existing trees.

The part I'm confused on is how to do state tracking: knowing what state the parser would need to be in to begin a parse at a given point. Is there more context information I'd need to inject into the trees to help isolate the parse state, or is stepping through the actual tree structure the key to isolating the state?

Extract the state from your parser and make it an input value of each tree node. A reparse is needed if the input value is changed by the tree's parent; and if a reparse is needed anyway, the saved input value can be used in lieu of being computed by parsing through its parent. Parsers don't require non-local state, so this is relatively straightforward memoization; things don't start getting fun until you have to deal with symbol tables during type checking.
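
In sketch form, memoizing on (token, rule) and storing the input state on the node looks like this; the node fields, token identities, and parse_rule are all placeholders:

def parse_memo(rule, start_token, state, memo, parse_rule):
    """Reuse a previously built tree if rule, start token, and input state match.

    `memo` maps (token identity, rule) to the tree previously derived there;
    `state` is whatever input value the parent would otherwise pass down.
    """
    cached = memo.get((start_token.id, rule))
    if cached is not None and cached.state == state and not cached.damaged:
        return cached                       # old tree is compatible: reuse it
    node = parse_rule(rule, start_token, state)
    node.state = state                      # save the input value for next time
    memo[(start_token.id, rule)] = node
    return node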

It'll be an LL(*)-style parser with methods that focus on path determination and direct the individual parse rules, since for every state within a rule there should be only one valid path, short of two rules overlapping on the same token stream, in which case further code disambiguates on what follows, as needed.

Sounds complicated. I've only used my technique on easily recursive-descent-parsed languages (Scala as well as my own SuperGlue and YinYang languages). I don't think it would really be an issue given Glitch's expressiveness (my underlying programming model that enables reactive/incremental computation).

The lexer construction is similar to the rule construction, except it'll be a character-by-character state machine; tokens will match simpler data types: characters, strings, enumerations, and on occasion groups of those.

Careful here; your lexer will need to be incremental, but that is relatively straightforward as long as your lexical language is regular. I would move composite behavior out of the lexer and into the parser, and focus on just tokenizing between whitespace (or whatever delimiter you use) and classifying the output. If you are designing a custom language, simplify your lexical and syntactic definitions as much as possible; don't do anything too fancy (users don't appreciate it anyway).

Lexer output structure

I guess one thing that might assist in this effort is to construct a version of the number state-machine by hand, working off of the current output the lexer generator gives me. This will help me mentally step through the processes involved in automating it and might shed light on a few issues I'm having with it.

I'll post back once I'm finished with that.

Hopeful insight

What you're basically wanting is this:

Assume you have an AST and can copy parts of it both up to, and starting from, a given character position in the input stream (your text file).

When you insert a block of text (paste OR type): retain the AST up to the point of the changed text. From there, parse the entered text onto the end of that AST. The AST of the original part of the file that now follows the newly entered text will either seamlessly fit onto the end of your new AST or it won't; one way of dealing with this is to first see whether it fits. If it doesn't, try to reparse it onto the end. If that fails, I'll bet you money it's no longer valid as it is.

Deleting a block of text works in similar fashion: see if the non-deleted first part and the non-deleted end part fit together seamlessly into a new AST (using the grammar to validate). If not, reparse; if that fails, report an error to the user.

This of course requires that you can pinpoint document changes by index and edit length.
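
In sketch form, for an insertion, with the span bookkeeping and parse_from assumed (parse_from(text, pos) parses top-level constructs from pos and reports where it stopped):

def splice_after_insert(old_top_level, new_text, edit_start, shift, parse_from):
    """Keep trees that end before the edit, reparse from there, and try to
    reuse the untouched tail rather than reparsing the whole file.

    `old_top_level` holds the old top-level trees with .start/.end offsets
    into the old text; `shift` is the length change caused by the edit.
    """
    prefix = [t for t in old_top_level if t.end <= edit_start]
    tail = [t for t in old_top_level if t.start >= edit_start]

    resume = prefix[-1].end if prefix else 0
    new_trees, reached = parse_from(new_text, resume)

    if tail and tail[0].start + shift == reached:
        return prefix + new_trees + tail       # the old tail fits seamlessly
    rest, _ = parse_from(new_text, reached)    # otherwise reparse the remainder
    return prefix + new_trees + rest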

Why bother?

Given how slowly people type and how fast computers are, you might as well do a full reparse at every keystroke. You shouldn't have trouble keeping up; it's the back-end stages of compiling that are slow, unless you are dealing with tens of thousands of lines at a time, or your grammar is more than commonly ambiguous.

The Lua statement-oriented REPL parses a line as you enter it. If there's a syntax error involving matching parens, brackets, or braces, it prompts for the next line, appends it to what it already has, and reparses. If there is any other syntax error, it complains; if there is none, it executes the statement.

almost on the nose

unless you are dealing with tens of thousands of lines at a time, or your grammar is more than commonly ambiguous

That's almost on the nose-- except that ambiguity in the common case is exactly the problem. Extremely small, modern languages minimize ambiguity, at least globally. But even something as "old and messy" as Java is locally ambiguous on tokens like '<': is it a less-than comparison, or the start of a type-arguments list? You can tell if there's a whole, complete program artifact around it, but when parsing interactively you get to hold your breath and suspend judgment. And that might be forever if your live typist dares to enter an incorrect program, or is simply in the middle of a complex reshaping that involves lots of fragments sprinkled about.

So if one is building a general interactive parser for "the usual suspects," dealing with a few thousand lines of (at the very least locally) ambiguous code is exactly the problem. And then you get to, say, C++. Not only is it globally ambiguous until name and type resolution, you also get to deal with arbitrarily structured conditional compilation.

Ambiguity handling is thus the first 90% of the work. The next 90% is what to do with the interim states. Re-parsing from the beginning is problematic because parsers are usually pretty bad about error recovery in less-structured environments, and even if they're good about it you now have a big problem of correlating anything you've done previously with your artifact, like say name/type/flow resolution. Now we're getting expensive, and redoing the work on every keypress starts looking prohibitive.

Your Lua example neatly avoids this by ruling out all the challenge shots. A textual IDE for a well-structured language doesn't have to get into all the shenanigans I listed. However, such restrictions mean it doesn't address the general case. No insult to the example intended; this is simply why so little work in this space actually tries to attack such a wide parsing environment. Be aware of the problems of a universal solution and be open about what you're not solving.

REPLs are boring

Incremental parsing is not about improving performance, but instead about:

  • Better error recovery using history.
  • Preservation of the type information embedded in the parse trees.

So incremental parsing is a pre-requisite for incremental type checking, incremental code generation...and ultimately incremental program execution.

REPLs are boring, they don't deal with the hard problem of changing existing code interactively.

Incremental parsing isn't required for incr. typing

I wouldn't agree that incremental parsing is a "pre-requisite" for incremental type checking and code generation (which I agree are probably important). You could re-parse the whole thing and compare with the old parse; the parsing would not be incremental, but the type checker would still be handed a diff on the AST, which enables incrementality on its side.

My point would be: as long as you preserve the last known input, you can give an incremental interface to a non-incremental implementation of any stage in the pipeline. You should then work on the part that is a bottleneck in practice, seems reachable, and that you find interesting (maybe all passes have all three qualities), but you don't have to design an incremental parser before working on an incremental typer.
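
A sketch of such a wrapper: reparse everything, but hand the next stage only the top-level trees that actually changed. The key() method and the structural comparison are placeholders:

class DiffingFrontEnd:
    """A non-incremental parser behind an incremental-looking interface."""

    def __init__(self, parse):
        self.parse = parse        # full, non-incremental parse of the text
        self.last = []            # last known list of top-level trees

    def update(self, text):
        new = self.parse(text)
        old = {t.key(): t for t in self.last}   # key() = e.g. the definition's name
        changed = [t for t in new if t.key() not in old or old[t.key()] != t]
        removed = [k for k in old if k not in {t.key() for t in new}]
        self.last = new
        return changed, removed   # the diff an incremental type checker consumes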

You may have meant something softer, such as "incremental typing is so much harder that incremental lexing & parsing is a good step to put your ideas in place", in which case I probably agree.

I guess it's not impossible.

I guess it's not impossible. But mapping old type information onto new would be incredibly difficult; it's not something I'm interested in exploring given how easy incremental parsing is anyway.

General Incremental Lexical Analysis

"General Incremental Lexical Analysis" (PDF) is a good paper by T.A.Wagner & Susan L. Graham at U. of California, Berkeley.

The paper is the basis of the NetBeans IDE editor, which performs incremental lexing of the source code being edited, avoiding re-lexing the whole thing whenever the user modifies something.

After the Lexer API lexes the file being edited and generates a series of tokens, a set of background tasks performs higher-level analysis of the token stream, including syntax highlighting and semantic highlighting.

This is part of the Harmonia

This is part of the Harmonia work that Graham led after Pan. Anyway, this is just the lexing part, and it's very general at that. If you take a simple lexical definition, incremental lexing is quite an easy problem. Say you make a 1-character addition at offset N in an existing token stream S:

  1. Find the token T that N occurs on or within (at worst, search from the beginning and count).
  2. If N == T.Offset, then also include the previous token T[n-1]; otherwise we are just going to re-lex T.
  3. Re-lex. If the resulting tokens are of the same "type", we are done; we don't even have to reparse. If the tokens changed, reparse the nearest containing parse tree. If the tokens changed AND the total length of the re-lex has changed (!= T[n].Length + 1), you have to re-lex T[n+1] (and if that length changes, T[n+2], and so on...).

That's about it. As with braces, "lexical braces" can royally screw this up; e.g. multi-line comments or string double quotes. You don't want an errant unterminated lexical construct destroying your entire existing token stream. For strings, I just check that they are terminated on the line they were entered on, and I avoid multi-line comments in my languages these days (though I had to deal with them in Scala, much as I did with braces).
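
A sketch of those three steps for a one-character insertion, ignoring the multi-line "lexical brace" problem just mentioned. The token fields and lex_from are assumptions, not the paper's (or my) actual code:

def relex_after_insert(tokens, offset, lex_from):
    """Re-lex locally after inserting one character at `offset`.

    `tokens` is the old token list (each with .offset, .length, .kind);
    `lex_from(pos)` lexes the edited text from `pos`, yielding new tokens.
    Returns (i, j, new_tokens): old tokens[i..j] are replaced by new_tokens.
    """
    # 1. Find the token T the edit falls on or within.
    i = next(k for k, t in enumerate(tokens)
             if t.offset <= offset <= t.offset + t.length)
    # 2. An insertion exactly on T's start boundary can also affect its predecessor.
    if offset == tokens[i].offset and i > 0:
        i -= 1

    # 3. Re-lex until a new token ends where an old token ended (+1 for the
    #    inserted character); everything beyond that point is untouched.
    new_tokens, j = [], i
    for t in lex_from(tokens[i].offset):
        new_tokens.append(t)
        while j < len(tokens) and tokens[j].offset + tokens[j].length + 1 < t.offset + t.length:
            j += 1          # this old token was swallowed by the re-lex
        if j < len(tokens) and tokens[j].offset + tokens[j].length + 1 == t.offset + t.length:
            break           # back in sync with the old token stream
    return i, j, new_tokens

If the replacement tokens have the same kinds as the ones they replace, nothing needs to be reparsed; otherwise you damage the nearest containing tree, exactly as described above.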

That's a great summary of

That's a great summary of the algorithm. I posted the paper because I think it explains it quite well and it's quite general and readable.

Many languages accept tokens that span different lines (Lua and Scheme strings, or XML CDATA), and I think that's a great feature. I wouldn't avoid them. Unterminated strings, or missing braces or separators, can be marked with special "error tokens" that pinpoint a parsing error at a given point in the text.

The existence of these "error tokens" may prevent a deeper analysis of the edited text (such as syntax coloring, or Crockford's context coloring using scope analysis, etc.), because not all token streams can be transformed into an AST if these "error tokens" are present in the token stream.

Right. If your language has

Right. If your language has features that are difficult to deal with, you can perhaps deal with them in the IDE. For example, a multiline comment can be initiated through semi-structured editing rather than having raw text be parsed. For the core language, however, you want to avoid structured editing as much as possible as it will disrupt the programmer.

For this to work, I believe we have to consider language design, language environment, and everything else holistically. E.g. as I did here.

Parsing context free

Parsing context free grammars efficiently under changes in the input looks like an interesting problem. More generally you can also consider executing Datalog queries over changing initial databases, since parsing context free grammars is a subset of what Datalog programs do. If you keep track of which derivations depend on which, you would probably be able to do some kind of invalidation scheme. Since datalog programs are basically a bunch of recursive SQL joins, this brings us to the NaiadLINQ work by MSR. They execute SQL-like queries over incrementally changing databases. This doesn't support recursive queries as far as I can see, which is required for executing general Datalog programs and CFG parsing, but it looks like a similar method would work for recursive queries.

Incremental, reactive Datalog

I'd love to see some good work on this subject.

I tried my hand at it a few years back, but didn't get very far. Mostly, the 'reactive' issues were very difficult due to the search-graph being (up to) exponential in size with the data. Reactivity requires updating search graphs. Also, I was having difficulty handling non-monotonic updates.

Glitch is much more general

Glitch is much more general than Naiad in that it supports a more complete, indirect procedural/recursive programming model rather than a restricted direct dataflow model; I also developed it for parsing first while at EPFL (later extending it to type checking and recently generalizing it into a programming model).

It's not clear to me how Naiad would support parsing; the graph topology seems to be "fixed" statically, while the graph in Glitch is built dynamically and can change, which I assume is necessary for a decent changing parse tree.

The same way you'd do

The same way you'd do parsing in Datalog. For each rule foo in the grammar:

foo ::= bar baz quux

You have a datalog rule like this:

foo(A,B) :- bar(A,X), baz(X,Y), quux(Y,B)

Then you convert all those Datalog rules into queries with joins (here expressed in set comprehension notation):

foo = {(A,B) | (A,X) in bar AND (X,Y) in baz AND (Y,B) in quux}

or in LINQ notation (with hypothetical tuple syntax):

foo = from (A,X) in bar
      from (X',Y) in baz
      from (Y',B) in quux
      where X == X' && Y == Y'
      select (A,B)

foo will be represented by a table having two columns A and B. Then you execute those joins incrementally with a similar method as NaiadLINQ.

You have a fixed number of nodes in the dataflow graph per rule in the grammar, not per node in the parse tree.
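
Ignoring incrementality, the same idea fits in a few lines: each symbol is a relation of (start, end) token spans, and each rule is a join on adjacent spans, run to a fixpoint (essentially chart parsing; a semi-naive evaluation would only re-join newly derived facts). A sketch:

def parse_datalog_style(grammar, terminals, tokens):
    """CFG recognition as Datalog-style relations over token positions.

    Each symbol S gets a set of spans (i, j) meaning S derives tokens[i:j].
    `grammar` maps a non-terminal to its right-hand sides, e.g.
    {"foo": [("bar", "baz", "quux")]}; `terminals` maps a terminal symbol
    to a predicate on single tokens.
    """
    spans = {s: set() for s in grammar}
    for s, pred in terminals.items():
        spans[s] = {(i, i + 1) for i, t in enumerate(tokens) if pred(t)}

    changed = True
    while changed:                    # naive fixpoint iteration
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                # Join the body relations on adjacent spans, just like
                # bar(A,X), baz(X,Y), quux(Y,B) above.
                partial = {(i, i) for i in range(len(tokens) + 1)}
                for sym in body:
                    partial = {(a, c) for (a, b) in partial
                               for (b2, c) in spans.get(sym, set()) if b == b2}
                new = partial - spans[head]
                if new:
                    spans[head] |= new
                    changed = True
    return spans

The input is accepted when the start symbol's relation contains (0, len(tokens)); making this incremental then amounts to maintaining those relations as the token facts change, which is where the Naiad-style machinery would come in.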

Interesting, the data-flow

Interesting, the data-flow graph in this case is the grammar, not the resulting parse tree!

If I understand Naiad correctly, however, I think they only handle "incremental" in the sense of growing input, not in the sense of actual arbitrary change. Perhaps this is something I should ask them about, but I first need to understand Naiad more deeply so I don't embarrass myself.

P.S. while this is an

P.S. while this is an interesting and fun problem, I think in practice it makes much more sense to do structured editing. This gives you what an incremental parser would output for free (namely tree-edit commands), minus the problem of broken parses when somebody types an open brace, and it's much simpler overall.

Structured editing in its

Structured editing in its naive form is slow for the programmer, who is forced to work within the structure boxes. Its advanced form, which provides a more fluid experience, approaches incremental parsing in complexity, I believe, and perhaps shares many of the same implementation techniques.

Incremental parsing promises to give us the best of both worlds.

Interpolations

The major question I have is how to merge the parse trees and isolate the impact of a change to the concrete/abstract tree that results from the user's text.

For any two parse tree nodes M and N you can determine from the CFG alone whether there is an interpolating sequence of nodes M0 = M, M1, ..., Mk = N such that M:M1:...:Mk-1:N is a valid parse tree. By "interpolating" I mean that M:...:N can be established without adding or removing a token.

Now suppose you mutate T to T'. If there is an interpolating sequence S1 = M:...:T in the current parse tree and another one S2 = M:...:T', then S2 can be substituted for S1 without re-parsing. This only works on a CST, because the possible interpolations are computed from the CFG alone. It doesn't matter, though, whether the CFG is LL(1) or LL(*).

For a general solution I'd assume it would be helpful to state a symmetric formulation of a parser, i.e. given a start symbol S, one can move the parse into the future by computing the follow set, but also into the past by computing the previous set. So the parse could start anywhere in the token sequence, not only at its beginning.

I don't know any paper which deals with those issues but re-inventing the wheel can also be part of the fun - unless you get paid for the references as in professional research.

If you want to go that

If you want to go that route: my first incremental parser was a bottom-up precedence parser. If you ignore brace matching (handle it separately), you can get a lot of mileage out of its expressiveness, though there are a few corner cases that have to be hacked. Parsing is bottom-up, so you can reparse from anywhere in the stream, and...you aren't limited by a grammar, which leads to wonderful error-tolerance characteristics (garbage is duly parsed).
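
For reference, the core loop of a precedence parser is tiny. This is just the textbook precedence-climbing sketch, not my parser: operands are single tokens and brace/paren handling is assumed to happen elsewhere:

PREC = {"+": 1, "-": 1, "*": 2, "/": 2}

def parse_expr(tokens, pos=0, min_prec=1):
    """Parse an expression starting at `pos`; the precedence table decides
    how operators group, with no grammar rules beyond the table itself."""
    left, pos = tokens[pos], pos + 1     # operand (a real parser recurses here)
    while pos < len(tokens) and PREC.get(tokens[pos], 0) >= min_prec:
        op = tokens[pos]
        right, pos = parse_expr(tokens, pos + 1, PREC[op] + 1)
        left = (op, left, right)
    return left, pos

# parse_expr("a + b * c".split())  ->  (('+', 'a', ('*', 'b', 'c')), 5)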

Using the notation I have

Using the notation I have introduced above, one could build M:..:N:T with a start symbol M and one terminal T, and begin to parse with T in N both forwards and backwards, i.e. in the direction of the follow sets and of the previous sets of T in N. This would also go bottom-up, but that's because top-down parsers have this jump-to-the-parent-node step after they have completed traversal through the NFA (or DFA). So I guess Alexander could use his already-familiar parsing engine as well, after some tweaking.

Can I see this code?

It seems to me that VS itself doesn't do incremental parsing; it waits a couple of seconds and then reparses (and redoes the semantic analysis). I thought about how to do incremental parsing a while ago and came up empty.

If you have the source code of a syntax highlighter or other VS extension that performs delayed parsing, I'd be curious to see it, because I have found VS code samples difficult to find, and I'm trying to figure out how to make my own LES classifier run partly asynchronously.