Parsing: The Solved Problem That Isn't

In the blog post Parsing: The Solved Problem That Isn't, Laurence Tratt discusses some interesting unsolved practical problems in parsing, especially around combining grammars:

The general consensus, therefore, is that parsing is a solved problem. If you've got a parsing problem for synthetic languages, one of the existing tools should do the job. [...]

One of the things that's become increasingly obvious to me over the past few years is that the general consensus breaks down for one vital emerging trend: language composition. "Composition" is one of those long, complicated, but often vague terms that crops up a lot in theoretical work. Fortunately, for our purposes it means something simple: grammar composition, which is where we add one grammar to another and have the combined grammar parse text in the new language (exactly the sort of thing we want to do with Domain Specific Languages (DSLs)). To use a classic example, imagine that we wish to extend a Java-like language with SQL [...]

He goes on to mention several example problems:

  • Two LL or LR grammars may combine to produce a grammar that is neither.
  • Two unambiguous grammars may combine to produce an ambiguous grammar.
  • Two PEG grammars may combine to produce something that doesn't do what you want due to left bias.

What's the current state of the art?

At least two approaches I know of

On composing extensions to LALR(1) grammars:

Another approach is getting better at detecting ambiguities for SGLR: I'll just link to Bas Basten's homepage here.

The two approaches differ in their trade-offs: in the former, the kinds of language extensions you can make are limited, but they'll compose gracefully. In the latter, things are enormously more flexible, but composition requires some amount of grammar engineering to resolve any ambiguities.

From an expressiveness point

From an expressiveness point of view, parsing with functional logic programming is state of the art. You write it like a pretty printer and use its inverse as a parser.

Expand?

Jules, can you explain what you mean by "expressiveness"? Technically, a Turing machine is pretty expressive, but that's not really saying much...

Expressiveness in the human

Expressiveness in the human sense rather than the theoretical sense. Here is a paper on it: http://gpd.sip.ucm.es/rafa/papers/flops99.pdf

With a bit more syntactic sugar it can be even better. A definition like the following:

expr(R) = term(T) , [+] , expr(E) , {R is T+E}
expr(T) = term(T)

should be possible to write like this:

expr(T+E) = term(T) , [+] , expr(E)
expr(T) = term(T)

The latter style of definitions can be transformed into the former with a trivial transformation. If term() is a parser for numbers, then the above gives an evaluator for a series of additions:

expr^-1("1+2+3+4") = 10

The paper has additional combinators for parsing.
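
For readers without a functional logic system at hand, a rough Python approximation of what the second definition computes might look like the following (plain generators stand in for logic variables, so this is only the parsing direction, not a true inverse of a pretty printer; all names are ad hoc):

# Sketch: read the two rules
#   expr(T+E) = term(T) , [+] , expr(E)
#   expr(T)   = term(T)
# as a non-deterministic parser yielding (value, next position) pairs.

def term(s, i):
    """Parse a single digit at position i."""
    if i < len(s) and s[i].isdigit():
        yield int(s[i]), i + 1

def expr(s, i=0):
    # expr(T+E) = term(T) , [+] , expr(E)
    for t, j in term(s, i):
        if j < len(s) and s[j] == '+':
            for e, k in expr(s, j + 1):
                yield t + e, k
    # expr(T) = term(T)
    yield from term(s, i)

s = "1+2+3+4"
print(next(v for v, k in expr(s) if k == len(s)))   # 10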

Widespread Text Book Error

It's sad but true: already in the DEC10 Prolog manual, some decades ago, the grammar for arithmetic expressions was wrong. The correct grammar is left recursive and cannot be parsed with DCG. This might have been the reason why they gave a right recursive grammar in the DEC10 Prolog manual. Here is the correct left recursive grammar:

     expr --> expr, "+", term.
     expr --> expr, "-", term.
     expr --> term.

Only with the above grammar does the expression "1-2+3" evaluate to "2" and not to "-4". A right recursive grammar can also be used when an additional accumulator is passed around, but that wouldn't make for a simple grammar that fits into a DEC10 Prolog manual. The right recursive grammar with accumulator reads as follows:

    expr(X,Y) --> term(X), rest_expr(X,Y).

    rest_expr(X,Y) --> "+", term(Z), {T is X+Z}, rest_expr(T,Y).
    rest_expr(X,Y) --> "-", term(Z), {T is X-Z}, rest_expr(T,Y).
    rest_expr(X,X) --> [].

Bye

P.S.: The left recursive grammar can be directly parsed with a chart parser and a couple of other approaches, but this is not the default execution mode of DCG.

Left recursion is easily

Left recursion can be handled by memoization (which is essentially the same as a chart parser), though I don't think the accumulator passing is too bad either (it is the same idea as Pratt parsing and allows one to parse expressions in linear time).

Chart Parser != Recursive Descent with Memoization

The problem is that with left recursion there is nothing to memoize in a recursive descent parser:

http://www.tinlizzie.org/~awarth/papers/pepm08.pdf

So a little more than memoization is needed. Also, a chart parser differs from recursive descent in that it is bottom-up.

Parsing can be described

Parsing can be described with predicates as follows:

expr(a,c) = exists b: expr(a,b) and input[b+1]=="+" and expr(b+2,c)

Meaning that `expr(i,j) == true` iff expr can parse input[i..j]. A chart parser is essentially dynamic programming on these recursively defined predicates. When a predicate pred(i,j) recursively calls itself with the same arguments, one simply makes that call return false. This way it terminates, because a predicate pred(i,j) can otherwise only call itself with a range i'..j' where |i' - j'| < |i - j|.

PEGs are a different beast (though somewhat related of course).
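
A rough Python transcription of that idea (hypothetical names, 0-based half-open ranges); the `in_progress` set implements the "return false on a call with the same arguments" rule:

def make_recognizer(text):
    """expr(i, j) is true iff text[i:j] is an expression over digits and '+'."""
    cache = {}
    in_progress = set()

    def term(i, j):
        return j == i + 1 and text[i].isdigit()

    def expr(i, j):
        # expr(a,c) = term(a,c)  or  exists b: expr(a,b) and text[b]=='+' and expr(b+1,c)
        key = (i, j)
        if key in cache:
            return cache[key]
        if key in in_progress:
            return False                      # re-entrant call with the same args: cut off
        in_progress.add(key)
        ok = term(i, j) or any(
            expr(i, b) and text[b] == "+" and expr(b + 1, j)
            for b in range(i + 1, j - 1))     # split points; ranges always shrink
        in_progress.discard(key)
        cache[key] = ok
        return ok

    return expr

expr = make_recognizer("1+2+3")
print(expr(0, 5))   # True
print(expr(0, 4))   # False: "1+2+" is not an expression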

Bottom-Up is not automatically a Fallout of the Recursive Formulation

A chart parser does not make something false when calling itself; that is the top-down view. A chart parser works bottom-up, and it would need the following grammar:

   expr(a,c) = exists b: expr(a,b) and input[b+1]="+" and term(b+2,c)

Note the term on the right-hand side. Otherwise the grammar would be ambiguous in a chart parser: you would have multiple maximal-sized reads of the same arithmetic expression, i.e., assuming we also have a rule for "-":

   For 1-2+3 we would have (1-2)+3 and 1-(2+3) as parses.

An ordinary chart parser always delivers all parses in its chart. Since it is bottom-up, a chart parser is also not required to start at the left side of the given sentence. It can start at the right side or in the middle, or at some positions depending on some markers. Or it can work in parallel.

Bye

P.S.: You might also want to look at the top-down / bottom-up section here:
http://en.wikipedia.org/wiki/Dynamic_programming#Dynamic_programming_in_computer_programming

Sure, there is a difference

Sure, there is a difference in computing bottom up vs top down, that's why I said "essentially the same", rather than "exactly the same". Returning false simply corresponds to starting with an empty parse state in a chart parser. The fact remains that left recursion can be handled with memoization, and that the resulting computation is isomorphic to what a chart parser is doing.

Not Isomorphic: Chart Parsers vs. Recursive Descent

What do you mean by the resulting computation? I don't think they can be isomorphic in principle, even with memoization and backtracking.

If we step back and refer not to dynamic programming but to deduction instead: there are a couple of papers around that explain parsing as deduction. Then bottom-up corresponds to unit resolution, and recursive descent corresponds to input resolution.

How would you make unit resolution and input resolution isomorphic? I guess this would stretch the notion of isomorphic a little bit.

Take as a test a non-left recursive grammar:

    expr = term "+" expr

What would the isomorphism between the chart parse and the recursive descent parse look like? I don't believe there is a similarity. I only believe there is a transformation.

Bye

I'm not sure what you mean

I'm not sure what you mean by "input resolution" and a Google search did not turn up any results.

Recursive descent parsers, as I understand the term, are something else entirely. They do not support ambiguous parses; instead they either terminate with one parse or they terminate with an error. They are also greedy, in that a sequence `A B` always matches the maximal amount of input with `A`, and possibly the rest with `B`. For example, if `A = a | aa` and `B = a`, then `A B` does not match the string `aa` with a recursive descent parser, because `A` already consumed the whole input, while `B` is expecting one more `a`. This greediness is what allows them to run in linear time. In contrast, chart parsers and DCGs and their functional logic variants do successfully parse the input `aa` with that grammar.
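
For concreteness, a tiny sketch of that `A B` example (ad hoc names; `A_single` commits to its longest match with no backtracking across the sequence, while `A_all` enumerates every match, as a chart parser or DCG would):

def A_single(s, i):
    """Single result, longest alternative first, no backtracking into A."""
    if s.startswith("aa", i):
        return i + 2
    if s.startswith("a", i):
        return i + 1
    return None

def A_all(s, i):
    """All results of A = a | aa."""
    return [i + n for n in (1, 2) if s.startswith("a" * n, i)]

def B(s, i):
    return i + 1 if s.startswith("a", i) else None

s = "aa"
j = A_single(s, 0)
print(j is not None and B(s, j) == len(s))            # False: A ate both a's
print(any(B(s, j) == len(s) for j in A_all(s, 0)))    # True: one split leaves an a for B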

So if you mean the same thing with recursive descent, then yes, I agree that they are a different beast (as I said).

Here is a paper that describes how to do left recursion with memoization: Memoization of Top-down Parsing.

They use Scheme and represent non-determinism with continuations that are invoked multiple times, instead of with built-in backtracking.

Maybe a Mix

It could be that in the case of left recursion, a recursive descent parser with memoization really does mimic a bottom-up parser.

But I understood your comment as saying that *all* chart parsing is just recursive descent with memoization. This is something I cannot believe.

On the other hand from deduction it is known that resolution can also mix unit strategy and input strategy.

The problem is that if there were a full similarity, then there would be no advantage to using recursive descent parsers, and everybody would use chart parsers in the first place.

But there are cases where recursive descent and its deductive counterpart, input resolution (i.e. DCG), are more efficient than bottom-up resp. unit resolution (chart parsers).

Bye

Input Resolution / Unit Resolution
http://en.wikibooks.org/wiki/Logic_for_Computer_Scientists/Predicate_Logic/Strategies_for_Resolution/Input_and_Unit_Resolution

DCG
Pereira, F.C.N. and Warren, D.H.D. (1980): Definite Clause Grammars for Language Analysis – A Survey of the Formalism and a Comparison with Augmented Transition Networks, North-Holland Publishing Company, Artificial Intelligence, 13, 231 – 278
http://cgi.di.uoa.gr/~takis/pereira-warren.pdf

Alright, let me ask you outright then

What do you mean by recursive descent parsing? Can you give an example of something that you consider recursive descent parsing? The page you linked to did not really help.

Note that as I understand it, recursive descent parsing is something else entirely, and I fully agree that it is something different from chart parsing. However, DCGs and functional logic parsers are *not* recursive descent parsers (as I understand the term). Packrat parsers *are* recursive descent parsers.

DCG is recursive descent

DCG is recursive descent in its Prolog incarnation. Prolog uses depth-first search with a left literal selection rule (SLD resolution). It is recursive descent and non-deterministic. But you can also tweak your rules with the cut and some special DCG constructs, so that it corresponds to PEG.

Or you can reduce the non-determinism via some prediction (left token set), but this is not automatically delivered by the ordinary Prolog incarnation. You would need to manually change your grammar or run it through a tool.

PEG:
http://en.wikipedia.org/wiki/Parsing_expression_grammar

Then obviously we mean

Then obviously we mean something else when we say "recursive descent". I'm not quite sure what you mean by it though, if you consider both Packrat parsing and DCG to be recursive descent. They give a completely different meaning to the sequencing operator: Packrat parsers are greedy whereas DCGs examine all possibilities, like chart parsers.

In any case, the paper I linked to explains quite clearly how to do left recursion with memoization.

In a paper published in this journal, Norvig (1991) pointed out that memoization of a top-down recognizer program produces a program that behaves similarly to a chart parser. This is not surprising to anyone familiar with logic-programming approaches to NLP. For example, the Earley deduction proof procedure [note: Earley's algorithm is a chart parser that goes from left to right] is essentially a memoizing version of the top-down SLD proof procedure employed by Prolog. Pereira and Warren (1983) showed that the steps of the Earley deduction proof procedure proving the well-formedness of a string S from the standard ‘top-down’ DCG axiomatization of a CFG G correspond directly to those of Earley’s algorithm recognizing S using G.

Yet as Norvig notes in passing, using his approach the resulting parsers in general fail to terminate on left-recursive grammars, even with memoization. The goal of this paper is to discover why this is the case and present a functional formalization of memoized top-down parsing for which this is not so. Specifically, I show how to formulate top-down parsers in a ‘continuation-passing style’ which incrementally enumerates the right positions of a category, rather than returning a set of such positions as a single value. This permits a type of memoization not described to my knowledge in the context of functional programming before. This kind of memoization is akin to that used in logic programming, and yields terminating parsers even in the face of left recursion. -- Memoization of Top-down Parsing

Chart Parsers use Memoization for A --> B Productions

Hi,

The term recursive descent does not imply the absence of backtracking. See for example:

http://en.wikipedia.org/wiki/Recursive_descent_parser

Thanks for the link on left recursion and memoization. This is of course relevant for those people building parsers different from chart parsers.

In chart parsers there are also some problems with recursion. The initial definition requires the grammar to be in Chomsky normal form (Chomsky reduced form). So basically productions of the form:

A(t) --> B(s)

have no place. Such productions can be handled by a form of memoization in chart parsers. The problem is what to do if these productions have attributes with increasing size, i.e. t > s. Normally this would blow up the chart when these rules are cyclic. So basically a heuristic can be applied that limits the size of the chart. The memo is the chart itself, but it can be viewed as a "cut-off" memoization technique.

Bye

Of course recursive descent

Of course recursive descent requires backtracking, I did not claim otherwise :) They just don't examine all parses for a sequence like `A B`. They only examine the parses where the piece that `A` parses is extended to maximal size ("greedy"). Therefore they parse a wholly different class of languages than chart parsers & DCGs, and as such are not really comparable.

As the paper mentions, the parser it develops using memoization is a type of chart parser:

Norvig (1991) pointed out that memoization of a topdown recognizer program produces a program that behaves similiarly to a chart parser [...] Yet as Norvig notes in passing, using his approach the resulting parsers in general fail to terminate on left-recursive grammars, even with memoization. The goal of this paper is to discover why this is the case and present a functional formalization of memoized top-down parsing for which this is not so.

In particular, DCGs with Prolog's evaluation strategy correspond to an Earley parser, which is a chart parser. Except for left recursion of course, which the paper sets out to remedy using memoization.

Not Scribbling, Some Practical Testing

Although practical tests are no substitute for theoretical considerations, at least doing some reality check can be useful to avoid absurd claims.

The end of this post here contradicts what I wanted to say at the end of my initial post here.

All I wanted to say is that DCG cannot deal with left recursion, whereas chart parsers and other approaches can. But now suddenly I hear that DCG is a chart parser.

I can understand the confusion, since neither DCGs nor chart parsers are common in the functional programming camp: DCG probably because of its non-determinism, and chart parsing probably because of its bottom-up nature.

Recursive descent parsers with memoization offer a thread of hope here, since they are neither bottom-up nor fully non-deterministic, and they give the promise of covering what a chart parser can do.

But I never wanted to exclude recursive descent parsers with memoization in my initial post. I wrote chart parsers AND OTHER APPROACHES. I just didn't expect to land in a discussion about whether chart parsers and THE OTHER APPROACHES are the same.

Since it is often more useful to find discrepancies than commonalities, I now want to show a discrepancy between DCG and a chart parser. Namely, I want to show in practice that DCG cannot deal with a left recursive formulation of arithmetic expressions, whereas a chart parser can.

This is an SWI-Prolog execution of a DCG for arithmetic with left recursion:

Welcome to SWI-Prolog (Multi-threaded, 64 bits, Version 6.0.0)
Copyright (c) 1990-2011 University of Amsterdam, VU Amsterdam
?- [user].
expr(X) --> expr(Y), "+", term(Z), {X is Y+Z}.
expr(X) --> expr(Y), "-", term(Z), {X is Y-Z}.
expr(X) --> term(X).
 
term(X) --> [X], {integer(X)}.
^D
?- phrase(expr(X),"1-2+3").
ERROR: Out of local stack

One can also experiment with some reordering of the DCG rules. It might then correctly accept positive sentences, but loop on redo. It mostly loops on negative sentences and thus does not reject them.

Now let's look at a chart parser. The execution protocol of a chart parser looks a little bit different, but one can manage to use almost the same rule format as for DCG; there is just a hint needed, "forward", which says that forward chaining should be applied.

This is an experimental chart parser execution of the same arithmetic grammar with left recursion:

?- [user].
:- forward expr/3.
:- forward term/3.
:- forward word/3.

expr(C,N,M) :- expr(A,N,H), word(+,H,J), term(B,J,M), C is A+B.
expr(C,N,M) :- expr(A,N,H), word(-,H,J), term(B,J,M), C is A-B.
expr(A,N,M) :- term(A,N,M).

term(A,N,M) :- word(A,N,M), integer(A).
^D
?- volunteer(word(1,0,1)).
?- volunteer(word(-,1,2)).
?- volunteer(word(2,2,3)).
?- list.
expr(1, 0, 1).
expr(-1, 0, 3).
expr(2, 2, 3).
term(1, 0, 1).
term(2, 2, 3).
word(1, 0, 1).
word(-, 1, 2).
word(2, 2, 3).
?- volunteer(word(+,3,4)).
?- volunteer(word(3,4,5)).
?- list.
expr(1, 0, 1).
expr(-1, 0, 3).
expr(2, 2, 3).
expr(2, 0, 5).
expr(5, 2, 5).
expr(3, 4, 5).
term(1, 0, 1).
term(2, 2, 3).
term(3, 4, 5).
word(1, 0, 1).
word(-, 1, 2).
word(2, 2, 3).
word(+, 3, 4).
word(3, 4, 5).
?- 

In the above I have listed the content of the chart at two moments in time: after the arrival of "1", "-" and "2", and after the arrival of "+" and "3". The result can be read off from the chart entry expr(2, 0, 5). I have uploaded the source code at (*).

So we have confirmed:
- that DCG in Prolog CANNOT handle left recursive arithmetic.
- that a chart parser CAN handle left recursive arithmetic.
- that most probably DCG in Prolog is not a chart parser.

So I guess DCG in Prolog is not a chart parser. I hope this has clarified the notion of a "chart parser". And I hope it also shows that there is more out in the world than only recursive descent with memoization.

An indication that recursive descent with memoization and a chart parser are not the same can be seen in the fact that the chart parser is not invoked by some top-down "recognize". Also, the chart parser generates irrelevant chart entries (expr(2, 2, 3), expr(3, 4, 5) and expr(5, 2, 5)). But further digging would be necessary to work out the full differences, which I would enjoy.

If you are surprised by the above fact, don't blame me, blame mother nature for giving us so much diversity.

Bye

(*)
http://plus.google.com/u/0/b/103259555581227445618/103259555581227445618/posts/SmNM94ZLj5Y

Sorry, I should have said

Sorry, I should have said "In particular, DCGs with Earley deduction instead of Prolog's evaluation strategy correspond to an Earley parser, which is a chart parser.".

That DCGs don't automatically handle left recursion (and neither does the equivalent encoding in Scheme with continuations to do non-determinism) is the whole point! That's the problem that the kind of memoization described in the paper I linked to solves.

The idea is roughly that you define a "lazy set" data structure that keeps a list of elements and a list of continuations. Every time an element is added, all the continuations are called on this element. Every time a continuation is added, it is called in turn with all the elements. Now a parser for a constant string is simply a function that takes the input and compares it to the string; if that string is a prefix then the result is a set with the rest of the input as the lone element, otherwise it is the empty set. Grammar union corresponds to set union. Grammar sequence corresponds to applying the parser of the first grammar to the input, and then applying the parser of the second grammar to all the remaining inputs returned by the first parser (which are returned in this "lazy set"). Memoize these parsers and you're able to handle left recursion (and parse CFGs in polynomial rather than exponential time). Note the similarity to forward chaining.
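
Purely as my own reconstruction (not the paper's Scheme code, and with made-up names), that "lazy set" idea might be sketched in Python like this:

class LazySet:
    """A set of end positions that notifies subscribers as elements arrive."""
    def __init__(self):
        self.elements = []
        self.subscribers = []

    def add(self, x):
        if x not in self.elements:
            self.elements.append(x)
            for k in list(self.subscribers):
                k(x)

    def subscribe(self, k):
        self.subscribers.append(k)
        for x in list(self.elements):
            k(x)

def lit(s):
    """Parser for a literal: feed end positions of matches into `out`."""
    def parse(inp, i, out):
        if inp.startswith(s, i):
            out.add(i + len(s))
    return parse

def alt(*ps):
    def parse(inp, i, out):
        for p in ps:
            p(inp, i, out)
    return parse

def seq(p, q):
    def parse(inp, i, out):
        mid = LazySet()
        mid.subscribe(lambda j: q(inp, j, out))
        p(inp, i, mid)
    return parse

def memo(p):
    table = {}                     # per-position results (one input string in this toy)
    def parse(inp, i, out):
        if i not in table:
            table[i] = LazySet()   # create the entry *before* running p,
            p(inp, i, table[i])    # so left-recursive re-entries just subscribe
        table[i].subscribe(out.add)
    return parse

# Deliberately left-recursive grammar:  expr -> expr "+" term | term
term = memo(alt(*(lit(d) for d in "0123456789")))

def expr_rule(inp, i, out):
    alt(seq(expr, seq(lit("+"), term)), term)(inp, i, out)

expr = memo(expr_rule)

result = LazySet()
expr("1+2+3", 0, result)
print(len("1+2+3") in result.elements)   # True: the whole input is an expr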

That said, your implementation of a chart parser is very nice, compared to the usual business in an imperative language. I believe Norvig makes a similar observation that bottom up parsing corresponds to forward chaining in his AI book.

[Note that contrary to your terminology, a chart parser does not have to be top down or bottom up according to wikipedia: "We can distinguish top-down and bottom-up chart parsers", but this is merely a matter of terminology]

My Fault / Doubt

I usually identify chart parsers with CYK parsers rather than Earley parsers. The continuation memoization is an interesting idea, although I currently don't see how exactly it relates to Earley.

You wrote: "parse CFGs in polynomial rather than exponential time."
Yes, non-deterministic recursive decent can be exponential for ambigous grammars.

CYK is O(n^3), which can be seen from the chart construction: the chart is of size O(n^2), and for each cell, on average O(n) pairs of other cells have to be inspected.
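
For concreteness, a minimal CYK-style recognizer sketch (my own illustration, over a hand-binarized arithmetic grammar): the loop over spans fills the O(n^2) chart, and the inner loop over split points is the O(n) work per cell.

def cyk_expr(tokens):
    # Binarized grammar: Expr -> Expr PlusExpr | digit ;  PlusExpr -> Plus Expr
    n = len(tokens)
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]   # chart[a][b]: span a..b
    for i, t in enumerate(tokens):                                  # length-1 spans
        if t.isdigit():
            chart[i][i + 1].add("Expr")
        if t == "+":
            chart[i][i + 1].add("Plus")
    for length in range(2, n + 1):                                  # longer spans, bottom-up
        for a in range(0, n - length + 1):
            b = a + length
            for m in range(a + 1, b):                               # O(n) split points
                if "Plus" in chart[a][m] and "Expr" in chart[m][b]:
                    chart[a][b].add("PlusExpr")
                if "Expr" in chart[a][m] and "PlusExpr" in chart[m][b]:
                    chart[a][b].add("Expr")
    return "Expr" in chart[0][n]

print(cyk_expr(list("1+2+3")))   # True
print(cyk_expr(list("1+2+")))    # False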

The Earley parser is said to also be O(n^3), and less in some non-ambiguous cases. But since I currently don't see how the continuation memoization relates to Earley, I cannot verify your claim by myself.

The problem I have right now is that a continuation is an intensional object that spans the whole grammar, whereas in Earley only (A --> B, *, C; j) fragments are stored, i.e. single rules of the grammar with a position.

But it seems that the memo table is indexed by the args. What are these args? Word lists? Then it also does not match Earley. But if the args are positions and each production is individually memoized, then it could somehow match Earley. The single rule that corresponds to the memo would be (A --> *, B).

Cool! Interesting!

In the unoptimized version

In the unoptimized version in the paper the args are strings, and the tables are represented as linked lists. This causes O(n) slowdown on table lookups leading to O(n^4) parse times. In a practical implementation you'd want to use arrays indexed in O(1) with integer indices (and the author also says this in the paper). This reduces the complexity to O(n^3) with less for grammars that are less ambiguous.

In a CYK parser we have for each predicate a table pred[i,j] telling whether pred can match the substring [i..j]. In the paper's algorithm we have for each predicate a table pred[i] which contains not booleans but sets containing all those j's for which pred[i,j] == true in CYK. If an extra index gets added to such a set, the continuations that are called make sure that other sets get updated. At least, that's how I understand it.

I'm not sure whether or how this corresponds to Earley.

Data-Driven vs Function-Driven

There is also another pair of names for bottom-up and top-down that can sometimes be used interchangeably: data-driven vs. function-driven. For example, DCG in ordinary Prolog is function-driven, i.e. top-down. There is no data that drives the parsing in DCG, except the arguments of the grammar predicate that is invoked. So DCG is also not a chart parser, since it is not data-driven; there is no chart.

Earley deduction seems to be the resolution-strategy counterpart of Earley's algorithm. It is usually applied top-down, so no wonder it corresponds to what Norvig is doing.

But by chart parser I don't mean this kind of parser; I mean a bottom-up parser. And I still believe that bottom-up parsers create a different execution pattern from top-down parsers, or from parsers that are a mix of bottom-up and top-down. In particular I assume that bottom-up breadth-first differs considerably from top-down depth-first, so that the two cannot be called isomorphic computations.

Here is a paper that shows how Earley deduction can be turned into bottom-up deduction:

Bottom-Up Earley Deduction (1994)
by Gregor Erbach
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.12.4183

And here is the paper that introduced Earley deduction:

PARSING AS DEDUCTION, 1983
Fernando C. N. Pereira David H. D. Warren
http://acl.ldc.upenn.edu/P/P83/P83-1021.pdf

If I have time I will scribble down some top-down parsing executions and some bottom-up parsing executions, so that it can be clearly seen how the two differ. But I guess it will not be today.

Or maybe you can convince yourself by looking at some data-driven vs. function-driven executions in another context.

Bye

Depth/breadth

I've only just quickly read through this sub-thread, but I think it's worth noting explicitly that it's possible to use continuations (or related delimited control or transformations) to efficiently write a breadth-first traversal in a depth-first style.

Having glanced at the paper Jules referred to, it seems that this might be what's going on, and I wonder if it is the source of the disagreement. Certainly I would argue that in the presence of laziness or continuations, the distinction between data-driven and function-driven cannot be as sharp as you seem inclined to make it.

Not Language Elements, Parsing Algorithms

Hi,

Our discussion was about parsing algorithms, right? Notions such as data-driven vs. function-driven or bottom-up vs. top-down are only used to characterize the algorithms.

The discussion is not about the language elements of the host language where the parse algorithms are implemented. These language elements are irrelevant and shouldn't change the character of the parse algorithms under consideration.

A more precise characterization would involve complexity classes. But we are not there yet, since we are still struggling with identifying the parse algorithms themselves.

And before diving into complexity classes, it may be more interesting to check the applicability of the parsing algorithms, i.e. for which kinds of grammars do they terminate?

Bye

Magic Set vs Earley

Maybe there is a relationship between the magic set method (*) in deductive databases and Earley deduction. I just made the following experiment, which is 100% bottom-up at runtime. I started with an arithmetic grammar:

expr = expr + term | expr - term | term.
term = term * factor | term / factor | factor.
factor = integer.

Now I added magic sets. mexpr and mterm indicate that an expr resp. term is needed top-down. But the rules for mexpr and mterm are themselves described bottom-up. Strictly, the magic set method would also generate the rules mterm = mterm and mexpr = mexpr, but I didn't add them, since they don't produce anything new. If the original grammar had a rule factor = ( expr ), we would have more recursive grammar predicates and thus more magic rules, but as it is we only have:

mterm = mexpr expr + | mexpr expr - | mexpr.

Then I used the mexpr and mterm to guard the rules in the original grammar. Namely the new grammar is now:

expr = mexpr expr + term | mexpr expr - term | mexpr term.
term = mterm term * factor | mterm term / factor | mterm factor.
factor = integer.

This is the run of the original grammar for the input 1 + 2 * 3:

expr(1, 0, 1).
expr(3, 0, 3).
expr(2, 2, 3).
expr(7, 0, 5).
expr(6, 2, 5).
expr(3, 4, 5).
term(1, 0, 1).
term(2, 2, 3).
term(6, 2, 5).
term(3, 4, 5).
factor(1, 0, 1).
factor(2, 2, 3).
factor(3, 4, 5).
word(1, 0, 1).
word(+, 1, 2).
word(2, 2, 3).
word(*, 3, 4).
word(3, 4, 5).

And this is the run for the magic set grammar. Before giving the input 1 + 2 * 3 to the forward chaining rules, the magic fact mexpr(0) is given to them, saying that we need to parse an expr starting at 0. The result is:

expr(1, 0, 1).
expr(3, 0, 3).
expr(7, 0, 5).
term(1, 0, 1).
term(2, 2, 3).
term(6, 2, 5).
factor(1, 0, 1).
factor(2, 2, 3).
factor(3, 4, 5).
word(1, 0, 1).
word(+, 1, 2).
word(2, 2, 3).
word(*, 3, 4).
word(3, 4, 5).
mexpr(0).
mterm(0).
mterm(2).

Thus fewer irrelevant chart elements are deduced. Namely, the following chart elements are no longer generated:

expr(2, 2, 3).
expr(6, 2, 5).
expr(3, 4, 5).
term(3, 4, 5).

But the following irrelevant chart element is still deduced:

expr(3, 0, 3).

And extra storage is needed for the magic sets:

mexpr(0).
mterm(0).
mterm(2).

Smells like Earley.

Best Regards

(*)
MAGIC SETS AND OTHER STRANGE WAYS TO IMPLEMENT
LOGIC PROGRAMS, Francois Bancilhon, 1986
http://ssdi.di.fct.unl.pt/rcr/rcr0607/docs/magicsets.pdf

Nice insight, Dyna example

Presentations on the Dyna logic language frequently use the same example of turning a CYK parser into an Earley parser. You might enjoy checking that out. See for example:

http://cs.jhu.edu/~jason/papers/eisner+goldlust+smith.emnlp05.ppt

+1

Pity there is no +1 button on LtU. Nice post, thanks.

P.S.: How the two relate:

Prolog has Horn clauses:
a(I,K) :- b(I,J) , c(J,K).

Dyna has “Horn equations”:
a(I,K) += b(I,J) * c(J,K)

Optimization (Asymmetric Conjunction and "Right to Left" Order)

Hi,

I just recently noticed an optimization in chart parsers. I introduced the operator &/2 for this purpose, so I can have the following two types of forward chaining rules. With symmetric conjunction:

a(I,K) :- b(I,J) , c(J,K).

And with asymmetric conjunction:

a(I,K) :- b(I,J) & c(J,K).

The symmetric conjunction reacts when either the left or the right side reacts. The asymmetric conjunction will only react when the left side reacts.

The symmetric conjunction would be needed when the tokens for parsing arrive in an unspecified order. But when the order can be fixed to be "right to left" (sic!), then the asymmetric conjunction can be used.

The "right to left" order can also be used for non binary rules. The asymmetric conjunction needs to be put after the first goal in the body of the rule. Additionally we can put auxiliary conditions that depend on all goals in the rule, since with "right to left" order the attributes of goals will be present when the first goal reacts.

I have done some measurements; there is quite a gain with this approach. "Right to left" order is unusual; parsers often work "left to right". But since this is only a procedural variation of a declarative concept, the "right to left" order parser delivers the same parses, so there is no issue with correctness.

For more info see here.

Expressive Power

A widely accepted formal notion of expressive power was developed by Matthias Felleisen, On the Expressive Power of Programming Languages.

In general, people will object if you confuse computational power (Turing equivalence) with expressive power.

What about Marpa

Jeffrey Kegler has been working on a variation of Earley's algorithm to produce a general purpose parser generator. Here's the latest draft of his claims:
http://cloud.github.com/downloads/jeffreykegler/Marpa-theory/recce.pdf

And his project:
http://www.jeffreykegler.com/marpa

I think I read something on his blog about composability as well.

ETL

I have been working on a project in this area. It was previously discussed here.

There are two ideas in the project that are not well articulated in the docs:

  • The grammar is to the source code what a type is to data. The project goes along the lines of static typing for the sources.
  • The grammar language should follow natural engineering thought lines. It should explicitly talk about contexts, operators, statements, grammars, imports, and extensions (we can see all of this in language specs, but not in the common grammars).

I need to rework some of its core elements (like adding true keywords, relaxing some restrictions, and adding interfaces for dependency management). I will possibly do it as part of one of my experimental projects.

Fast or composable: pick one

I think the situation is not so much that "parsing is an unsolved problem," as that "parsing is a complex problem space with some fundamental tradeoffs."

I mean, no one would say that "encoding is a solved problem that isn't." Everyone knows that any given encoder is optimized for a particular kind of content, and that there are some essential tradeoffs (lossy vs. lossless, compression vs. CPU, etc.) that you can't avoid. It's more accurate to say there are many encoding problems, and many solutions to each, and you have to choose intelligently.

Exactly the same situation pertains in the parsing world. The fundamental issue of parsing is ambiguity -- if you have an unambiguous language, parsing it is relatively easy, but that limits expressiveness. It is fundamentally impossible to parse an ambiguous language as quickly as an unambiguous language. Yet people often want to compose languages, which generally introduces more ambiguity.

GLR techniques and others allow returning all alternatives in ambiguous situations, which has performance issues but can be straightforward to reason about. LL* techniques (such as ANTLR) address ambiguity with aggressive DFA-based lookahead and semantic rules to control the disambiguation.

But you still need to understand the subtleties of the algorithm to some extent, just as someone encoding a video stream needs to understand the tradeoffs inherent in selecting how often keyframes should be interspersed, what compression ratios should be selected to minimize noise, etc.

Personally I think some of the most interesting and relevant recent work is being done by Eelco Visser, including work on declaratively specifying debuggable, composable DSLs with IDE support and composing languages by importing language extension libraries. These are based on the Spoofax language workbench, which internally uses the SDF language framework, which is inherently SGLR-based, therefore supports and explicitly manages ambiguities.

For production use in a non-composable/extensible setting, it's pretty hard to beat ANTLR v4 these days, especially with code completion in ANTLRWorks v2. In fact, for most practical purposes, I consider ANTLR v4 the most complete single solution to the parsing problem.

Fast and composable: it is possible to pick both

You are missing an additional dimension of the trade-off. It is possible to be both fast and composable, but only if we drop the requirement of supporting old languages.

The fundamental issue is to allow language composition along natural extension lines. The production is too low-level an extension point, so it is difficult to compose grammars using it.

To see the extension lines of languages, one must look at language specifications. In language specifications, statements and operators (with precedence rules and associativity) are described in certain contexts, and there are reusable blocks of syntax that are reused across the grammar. But at the end of the language specification, these nice descriptions are compiled to BNF, sometimes using unnatural translation rules.

In ETL, I have tried to allow grammar composition along these extension lines (operators and statements in contexts), and grammars became easily composable. They are also quite fast, as everything is internally translated to LL(1) grammars (an LL(2) lexer and an LL(1) parser). However, grammars defined using ETL play by certain rules: they all share common lexical and phrase layers (block and statement composition).

So it is not possible to define SQL and Java using ETL. But it is possible to define Java-like and SQL-like languages, and compose them together, and they could reuse a common expression syntax.

Pratt parsing is IMO the

Memoized recursive descent + Pratt parsing is IMO the best practical technique. It is linear time (no backtracking) and it can handle operator precedence more easily than traditional grammars (since that's basically all it does). Here's a nice article explaining it.

And here is an on-the-fly scanning Pratt parser by naasking. This can be extended to get better composability. For example, as the article in the original post notes, you need a different scanner for a programming language than for SQL expressions embedded in that language. With on-the-fly scanning, you can switch to a different scanner dynamically.
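
For reference, a minimal Pratt-style (precedence climbing) evaluator sketch, independent of the linked articles and with ad hoc names; it is linear time and handles precedence and left associativity directly:

import re

PRECEDENCE = {"+": 10, "-": 10, "*": 20, "/": 20}

def tokenize(src):
    return re.findall(r"\d+|[+\-*/()]", src)

def parse_expr(tokens, pos=0, min_bp=0):
    """Return (value, next position); evaluates while parsing."""
    tok = tokens[pos]
    if tok == "(":                                   # "nud" for parentheses
        left, pos = parse_expr(tokens, pos + 1, 0)
        pos += 1                                     # skip the closing ")"
    else:
        left, pos = int(tok), pos + 1                # "nud" for numbers
    while pos < len(tokens):
        op = tokens[pos]
        bp = PRECEDENCE.get(op)
        if bp is None or bp < min_bp:
            break
        right, pos = parse_expr(tokens, pos + 1, bp + 1)    # bp+1 gives left associativity
        left = {"+": left + right, "-": left - right,
                "*": left * right, "/": left // right}[op]  # // = integer division
    return left, pos

print(parse_expr(tokenize("1-2+3"))[0])     # 2: left associative
print(parse_expr(tokenize("1+2*3"))[0])     # 7: * binds tighter
print(parse_expr(tokenize("(1+2)*3"))[0])   # 9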

The Pratt parser has changed

The Pratt parser has changed a little since that article. You can see a simple calculator example in the tests (called MathSemantics, line 167). You can also see how I extended the math grammar to add let bindings in EquationParser (line 213), by simple inheritance from MathSemantics.

As long as grammar rules don't overlap, it should be extensible in this fashion. Ambiguous extensions might be resolvable by overriding the default actions and disambiguating manually. I haven't really tried this though, so it's speculation.

I think composing grammars should work as well, as long as the grammars are properly nested in tree-like structure. So a LanguageGrammar contains an SqlGrammar, which the LanguageGrammar uses in the positions where SQL is permitted. Again, speculation though. Wish I had more time to try these out.

Is there a reason why you

Is there a reason why you have a separate scanner language, rather than just using regular expressions?

I don't have a separate

I don't have a separate scanner language. I infer scanners from the grammar.

I have no strong opinions

I have no strong opinions about what the "best" parsing algorithm is (or whether there is / could be such a thing!). I would be interested in finding out whether, and what, the Pratt parser's limitations are when it comes to composition, as the published literature makes it pretty hard to guess what they might be. For example, might it suffer from shadowing in the same way as PEGs? How might it handle ambiguity between different languages? etc.

Compositional Pratt

I've played with using an enhanced Pratt parser (sorry, no code worth showing at the moment) to implement Haskell's Read class where there is no possibility for manual disambiguation. Instead ambiguity was handled by nondeterminism, giving up linearity.

Because Pratt parsers construct the parse tree in-order, there can be less ambiguity on the right hand side of a subtree, partly because the operator context may limit the different languages that can be parsed on that side, but also, I believe, because tokens which never occur on the left hand side of a subtree can be properly scoped top-down without any ambiguity. Of course the subtree operator may itself be ambiguous, and so forth. In the end my experiments went in other directions and I've never tried to see if this has any effect.

This also didn't have to worry about composing different tokenisers for different languages.

Bol Processor

On a slight tangent that might be of interest, Bernard Bel and Jim Kippen's Bol Processor music composition system is based around a grammar engine for generating compositions in the form of "polymetric expressions". Though generating instances in the language is the main purpose of the engine, as opposed to parsing, the grammar engine can do context sensitive "parallel" derivations and has some practical innovations such as meta variables, "reference/slave" expressions (for constrained derivation), remote context, "programmed grammars" and "sub grammars".

The system was originally developed for the early macintoshes (system 7 thereabouts) and still bears that stamp, though some amazing porting work was done by Anthony Kozar and Bernard Bel to make it run on MacOSX.

Immovable objects and unstoppable forces

Is this really a parsing problem? The reason you can't compose arbitrary grammars is that arbitrary grammars don't compose. I think this is more a question of what constraints we should be placing on syntax extensions such that we get a good degree of flexibility and also composition.

make solutions, not problems

I concur. I see grammar composition come up from time to time in speculative commentary on parsing, but I've never seen a case that makes it a compelling problem. We don't have anti-gravity technology, so we don't build vehicles that require it for air travel.

In practice, one does not "blend" languages together. Either one uses a single language that embeds the extension within its own syntax (e.g., Racket), or one uses separate languages with some embedding mechanism (e.g. PHP inside HTML). The first has only one grammar; the second has two with a clear mode switch. One's choice of parsing technology hasn't even come into play.

To my eyes, crossing language boundaries is the more interesting, pressing problem. There are legion methods of embedding SQL derivatives into ALGOL derivatives; they vary from bad to terrible. Enabling the host language to hold the SQL syntax adds design desiderata, but is simple on the engineering side. Making these different worlds cohere against each other's semantics is where the rubber hits the road.

I agree somewhat. On the one

I agree somewhat. On the one hand, arbitrary grammar composition is unnecessary (especially the kind that can lead to ambiguity). On the other hand, many parser generators don't even support trivial grammar composition like PHP in HTML, or SQL in Java. The reason is that traditionally parsing is done as a two pass algorithm: a tokenizer phase and a parse phase, but we need a different tokenizer for SQL than for Java. So you either need some mechanism to switch tokenizers or you do away with tokenization and instead do everything in your parser.

When the two pass model is

When the two pass model is avoided and two grammars are unified, then ambiguities occur at the level of non-terminals for numbers, strings, identifiers, operators etc., unless the composition is fairly trivial, i.e. one grammar owns the start symbol and the other one remains disconnected. I think a practical definition of "grammar composition" should result in a grammar which isn't a disjoint union of two grammars.

Anyway, aside from useless special cases grammar composition is not trivial i.e. it requires manual work of the programmer and it might be more or less the same with a tokenizer as it is without.

Obviously when we're talking

Obviously when we're talking about grammar composition we're not talking about union, but rather having one grammar as a subgrammar of another:

program = statement*
statement = variable `=` expr
expr = number | function-call | variable

=>

program = statement*
statement = variable `=` expr
expr = number | function-call | variable | SQL-expression
SQL-expression = ...

With arbitrary context free grammars such kind of composition can lead to ambiguity. With PEGs for example this will not lead to ambiguity because PEGs are prioritized choice.

BTW I'm not convinced by the original article's quick dismissal of PEGs because they don't support left recursion. As far as I know, in practice left recursion only happens for left associative operators, and there are well known techniques to parse them without left recursion that can be expressed elegantly with parameterized PEGs (e.g. Pratt parsing, or dealing with the associativity in a later stage by parsing '1-2-3' into a node (- 1 2 3) with 3 arguments; the meaning of the multi-arg `-` then determines associativity).
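
A small sketch of the second trick mentioned there (flat multi-argument nodes; ad hoc names), assuming a chain of '-' over integers:

import re

def parse_chain(src):
    """Parse '1-2-3' without left recursion into one flat node ('-', 1, 2, 3)."""
    tokens = re.findall(r"\d+|-", src)
    args = [int(tokens[0])]
    i = 1
    while i < len(tokens) and tokens[i] == "-":
        args.append(int(tokens[i + 1]))
        i += 2
    return ("-",) + tuple(args)

def eval_minus(node):
    """The consumer decides associativity: fold left, so 1-2-3 = (1-2)-3."""
    _, first, *rest = node
    result = first
    for x in rest:
        result -= x
    return result

print(parse_chain("1-2-3"))               # ('-', 1, 2, 3)
print(eval_minus(parse_chain("1-2-3")))   # -4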

In retrospect, I put the

In retrospect, I put the problems with PEGs in the wrong order in that article. The lack of support for (arbitrary) left recursion in PEGs is an annoyance. Ordered choice is more problematic, because earlier entries in the ordered choice "shadow" later ones. For example, if you were to do:

expr = number / function-call / variable / SQL-expression

is "COMMIT" a PL variable or an SQL command? [You can construct much more complex shadowings using the above grammar, but this simple example shows the general point.] As a PEG, it'll be parsed as a PL variable, with no hint that it is also a valid SQL expression. With a suitable CFG parsing algorithm, you will at least be offered both possibilites as a result of an ambiguous parse.

That's an advantage of PEGs,

That's an advantage of PEGs, not a disadvantage. If you want it to be parsed as a SQL expression, simply put that in front. Ordered choice is not the most elegant construction theoretically, I agree. But in practice it works well.

I fear that shadowing in

I fear that shadowing in PEGs is a much deeper problem, of which I gave one specific example. A related one is that if the LHS of an ordered choice successfully parses a "short" string, it shadows the longer string that the RHS might have parsed. Unless you can successfully analyse the LHS and RHS in advance, I doubt you could work out what the "correct" order is. In general, my experience suggests that there is often no "correct" order, with every possibility having issues.

In theory I agree

Can you give a real world example?

I think rather than with ordered choice your problem is with PEGs' sequencing operator, which is greedy. Hypothetically you could support a non-greedy sequencing operator by sacrificing linear time parsing (it would then take worst case O(n^3) with memoization). I'm not sure if there are practical situations where this buys you anything though.

Ordering isn't a

Ordering isn't a particularly robust conflict resolution strategy, and it really only works when a rule is a prefix of another one, e.g. in cases like IPv4 / Float / Int. But suppose Float can be written with a leading dot, and one also supports a '.' separator between identifiers and also the ellipsis '...'. Things start to become complicated, and one wishes one had a built-in conflict resolution strategy such as that of LL(*) parsers, which can deal with it without requiring manual work.

Can you perhaps define the

Can you perhaps define the problem more clearly? As written I don't see why the simple solution wouldn't work:

Expr = IPv4 / Float / Int / ID / '...'
ID = [a-z] [a-z0-9.]*
etc.

In fact I think every other ordering would work equally well since there is no ambiguity even if this was a CFG.

If you wanted to support identifiers that start with numbers and dots (which most languages don't support so we're well into theory land), you could do it simply by putting `[0-9.]*` in front of the previous definition for ID:

Expr = IPv4 / Float / Int / ID / '...'
ID = [0-9.]* [a-z] [a-z0-9.]*
etc.

Again no ambiguity.

Once you allow '.4' to be a

Once I allow '.4' to be a float, I have to reorder Expr so as not to hide '...'. Of course this is doable, but then I add the next obstacle, e.g. proposing string prefixes such as u'...' and r'...' as in Python, which shadow ID or get shadowed by it. So each addition immediately has global consequences for the order of expressions, and I have to check that they are consistent with my expectations.

Now suppose I want to write a tool which helps me to do those consistency checks or orders the expressions such that shadowing can be avoided or detects proper ambiguities where this is not possible. This happens on a sophistication level where I already begin to build a static conflict resolution strategy into my parser generator. But then I don't need ordered choice to begin with.

Why would '.4' hide '...'?

Why would '.4' hide '...'?

If you give Expr '...' as input, then it will fail to parse IPv4, so it will try Float, which will also fail, then it will try Int, which will also fail, then it will try ID, which will also fail, and then it will succeed in parsing '...'.

This is exactly what PEGs are good at: they support unlimited backtracking in linear time. With traditional parser generators it would indeed introduce difficulties because as soon as Float consumed the first '.' you'd have a problem.
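
A quick sketch of that trace (regexes standing in for the sub-grammars, and a plain loop standing in for a PEG engine's ordered choice; this is only an approximation of PEG semantics):

import re

ALTS = [("IPv4", r"\d+\.\d+\.\d+\.\d+"),
        ("Float", r"\d*\.\d+"),
        ("Int", r"\d+"),
        ("ID", r"[a-z][a-z0-9.]*"),
        ("Ellipsis", r"\.\.\.")]

def expr(s, i=0):
    for name, pat in ALTS:                 # try alternatives in order,
        m = re.compile(pat).match(s, i)    # commit to the first that matches
        if m:
            return name, m.end()
    return None, i

print(expr("..."))        # ('Ellipsis', 3): the earlier alternatives all fail first
print(expr(".4"))         # ('Float', 2)
print(expr("10.0.0.1"))   # ('IPv4', 8)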

In case of a PEG one might

In the case of a PEG one might have to add a syntactic predicate in order to avoid SQL-expression being shadowed by variable. In the case of EBNF I also expect at least an LL(1) conflict between those two (not necessarily an ambiguity in the strict sense of two or more equally valid resulting parse trees). So grammar embedding, while simple on the surface, will cost lots of work.

I also don't think that not being able to write grammar rules in left-recursive form is a big loss, but there might be hidden recursions and conflicts a programmer can't even guess. That's why ANTLR has to fall into backtracking mode occasionally.

One thing I don't understand about GLR and its fans: what does anyone want with a parse tree forest? What are the strategies of dealing with it? Picking an arbitrary tree as a solution? This might be justified in some cases and the true ambiguity in rules such as

E: E '+' E | 'a'

might not be harmful ( does GLR produce a unique tree in that case? ) but in many other cases this is not so easy to see.

So you're saying that you'd

So you're saying that you'd rather have LL(1) conflicts than ordered choice? As somebody who tried to deal with LL(1) conflicts, I disagree. Ordered choice lets you naturally express all the relevant design possibilities. Want to give priority to variables? Do `variable / SQL-expression`. Want to give priority to SQL? Do `SQL-expression / variable`. Want to explicitly exclude SQL keywords from variables? Define variable as `[a-z]+ & !("SELECT" / "COMMIT", etc.)`.

I also agree well with your

I also agree well with your position, Jules, but I would add some clarification to re-connect to the original context. You mention, correctly, that many parser generators simply can't handle the case of one language nesting within another. My clarification is that this is a problem of those parser generators, not of parsing. In my experience, leaving this unclear commonly leads to a false generalization from the properties of popular parser technology to the problem of parsing writ large.

My company rolled its own stack over 15 years ago, using a GLR parser over tokens (that is, "scannerful"). We use exactly that first solution you mention: both the scanners and parsers can handle multiple languages by "jumping" to a separate one. This solution has been known and feasible (in computational requirements and engineering) for decades; the only difficulty lies in clearing out the old assumptions from one's parser technology.

not a problem for programming languages

IMO parsing is a problem for natural languages because we have no choice but to use them. But for programming languages we do have choices and we can change them easily. We can make languages with a syntax that is really easy to parse. So not only is the problem solved for PL, but it may not need to exist at all. More complicated grammars and more complicated ways of manipulating them will only cause us more trouble.

Yes, if there was only one

Yes, if there was only one language, one data format and one tool the world would be a simpler place. But where is the fun?

Representing code as an AST

Representing code as an AST does not imply that there can be only one language. The fun is indeed perhaps part of the problem. Parsing is just such a fun challenge :)

And such an oh so open

And such an oh-so-open problem when tooling is considered. Full incremental parsing... very error tolerant parsing. An IDE would benefit more from an NLP-style statistical parser just because the error recovery would be so much better.

Right, though those things

Right, though those things are moot if you edit ASTs (of course you get other problems in return). Has anybody tried just applying brute force Viterbi parsing to programming languages for error tolerance?

Actually, there is a middle

Actually, there is a middle approach: if you have incremental parsing, you can analyze each edit delta in the context of the previous, which can go a long way in enhancing error tolerance (worked when I was using it for Scala, at least)...same thing actually applies to type checking (don't retype the program, type the delta!). The result very much resembles what you would get if you were editing the AST directly in syntax/type correct chunks.

Not sure about Viterbi, I haven't heard of anything. It would be an interesting experiment to take a large corpus of tokenized PL code and the resulting parse trees and train a statistical parser for it. Would ML be able to figure out the syntax of a language reliably by looking at the corpus?

The middle approach seems

The middle approach seems like a fairly hairy thing to do by hand!

What I meant was not learning the grammar automatically, but defining a probabilistic grammar that allows for errors with small probability and then parsing the most likely AST for a given input string. For example in a rule like this:

expr -> '(' expr ')'

You'd instead do something like:

expr -> '(' expr ')'     with probability 98%
expr -> '(' expr         with probability 1%
expr -> expr ')'         with probability 1%

This would allow you to handle unbalanced parentheses. There are parsing algorithms that can find the most likely parse given an input string.

http://en.wikipedia.org/wiki/Stochastic_context-free_grammar

It's actually not that bad

It's actually not that bad. Once you memoize your parse trees at some granularity along with the parsing context, you just reparse the sub-tree affected by the edit in isolation. If a parse error occurs, you attribute the error to the memoized parse tree and keep everything else the same. Probability isn't necessary at all.

There are some spill over cases if the edit changes the parsing context or crosses memoized parse tree boundaries, but these are easy enough to detect and special case. Also, you really have to special case braces, make sure they are at least balanced before you start re-parsing, and add some magic to your editor that adds invisible closing braces automatically when they type an opening brace (or make the completed braces visible, but many programmers hate this even if it allows them to get better semantic/syntax feedback).

See my live programming paper for a very incomplete discussion of the topic.

What you are proposing is similar to what we refer to as an over-engineered grammar that contains productions specifically for better error recovery. In practice, these grammars are difficult to write, even with the addition of probability in the grammar.

I'm quite interested in seeing what ML could tell us about parsing in general that we don't already know through our more formal methods.

Local variation

On several occasions I encountered problems of locality/local variation. Re-tokenization and re-parsing are obvious examples but there are others as well. For example a program computes the edit-distance of two strings S1 and S2, then one character is changed/inserted/removed in one of the strings: how to recompute the edit distance efficiently?

I'm curious if there is a systematic treatment of those algorithmic problems? From tinkering with the edit distance I got the impression that the most efficient one-shot algorithms may not be suited well for efficient handling of small, "continuous" variations.

Belief Contraction

Concerning chart parsing, a parse can be done via forward chaining. A reparse can be done by first doing belief contraction and then forward chaining again.

This works fine for example when a single word is corrected. Just contract the old word and all that would be generated by it from the store, and then add the new word and apply forward chaining again.

Unfortunately, I guess if a whole sequence of words is deleted or inserted, then this doesn't work so easily anymore. For example, if m words are inserted, then we could cheat by moving the chart elements chart[i][j] to chart[i+m][j+m] if they are after the insert position.

If the chart elements overlap with the insert position, I guess they need to be removed and must await the results of forward chaining. I have no practical idea how to retrigger the forward chainer accordingly.
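
A minimal sketch of that index-shifting cheat (assuming a chart keyed by (i, j) positions; names are ad hoc):

def shift_chart(chart, p, m):
    """After inserting m tokens at position p: keep entries left of p, shift
    entries right of p by m, and drop entries that straddle p (they must be
    re-derived by forward chaining)."""
    new_chart = {}
    for (i, j), items in chart.items():
        if j <= p:
            new_chart[(i, j)] = items          # entirely left of the insertion
        elif i >= p:
            new_chart[(i + m, j + m)] = items  # entirely right: shift by m
        # straddling entries are dropped
    return new_chart

chart = {(0, 1): {"term"}, (2, 3): {"term"}, (0, 3): {"expr"}}
print(shift_chart(chart, 1, 2))
# {(0, 1): {'term'}, (4, 5): {'term'}} -- (0, 3) straddles the insert and is dropped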

Bye

Here is a post that shows the easy case of replacing a word:
Fun with Forward Chaining

The above post shows a chart parse of the token list 1 + 2 + 3. The withdraw verb (belief contraction) is then used to remove the second + token. A further volunteer verb (forward chaining) then adds the token * where the second + token was. After these two modifications the chart parser shows the correct parse of the new token list 1 + 2 * 3 = 7: we see a fact expr(7, 0, 5).

What you can do is instead

One simple thing you can do is, instead of working with string positions i..j, work with the substring input[i..j] itself. Then you use memoization. That way, if you make a change to the input, you just reparse everything, but the substrings that stayed the same can be looked up from the memoized values.

Unfortunately, in the worst case if you change a word in the middle of your input, then re-computation can still take O(n^3). A more refined version of this that recomputes even less can be built on top of self-adjusting computation (see below). Perhaps that would run fast enough for practical usage.
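
A toy sketch of the idea in Python (the grammar and names are made up; the point is only that the cache key is the fragment text rather than its position):

    from functools import lru_cache

    # Toy grammar: Expr -> Digit '+' Expr | Digit. Because parse results are
    # keyed on the fragment text, fragments that survive an edit unchanged are
    # looked up instead of being reparsed.
    @lru_cache(maxsize=None)
    def parse_expr(s):
        if len(s) == 1 and s.isdigit():
            return ('num', int(s))
        if len(s) >= 3 and s[0].isdigit() and s[1] == '+':
            rest = parse_expr(s[2:])
            if rest is not None:
                return ('+', ('num', int(s[0])), rest)
        return None

    print(parse_expr("1+2+3"))
    print(parse_expr("9+2+3"))   # "2+3" is found in the cache, not reparsed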

There is:

There is: Self-adjusting Computation.

Here is a nice talk. The first couple of slides explain the basic concept. http://www.mpi-sws.org/~chenyan/papers/implicitSAC-talk.pdf
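
As a very rough caricature of the basic concept in Python (changeable inputs remember who read them, and changing an input re-runs exactly those readers; the real thing adds timestamps, memo tables, and much more):

    # Rough caricature only: changeable cells plus computations that record
    # which cells they read, so a set() re-runs just the affected computations.
    _current = []                      # stack of computations being (re)run

    class Cell:
        def __init__(self, value):
            self.value, self.readers = value, set()
        def get(self):
            if _current:
                self.readers.add(_current[-1])     # record the dependency
            return self.value
        def set(self, value):
            if value != self.value:
                self.value = value
                for comp in list(self.readers):
                    comp.run()                     # change propagation

    class Computation:
        def __init__(self, fn):
            self.fn = fn
            self.run()
        def run(self):
            _current.append(self)
            try:
                self.result = self.fn()
            finally:
                _current.pop()

    a, b = Cell(1), Cell(2)
    total = Computation(lambda: a.get() + b.get())
    print(total.result)    # 3
    a.set(10)
    print(total.result)    # 12, recomputed because `total` read `a`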

Coming home with full hands

+1 For the idea and the link.
Today coming home with full hands from LtU.

I never understood how this

I never understood how this area connects back to other constraint-based programming areas as well as even FRP (which has constraint roots).

I've basically been doing what this work describes by hand: trace dependencies dynamically and select reasonable granularity units of memoized computation. I'm not really sure how they can do this automatically and get decent performance, will have to read their papers...

I'm not sure about the

I'm not sure about the connections to constraint-based programming but the connections to FRP are tight. Self-adjusting computation is the Behavior/Signal subset of FRP: the outputs are a pure function of the inputs, and the outputs get recomputed when you change the inputs. There does not seem to be a way to handle time and accumulation over time, like FRP event streams. On the other hand, the incremental recomputation of outputs is more advanced in self-adjusting computation: more work is reused from the previous timestep due to support for memo-functions, traceable data types, etc.

Traceable data types don't

Traceable data types don't seem new -- they were a clean characterization of common practice. Memo functions might be if they start to give control over how to memoize (all FRP I've seen defaults to only remembering the immediately prior local state). I don't recall anything along these lines, however.

I've struggled to understand Acar's work because of issues like this, which is a shame because his is the leading (publishing) group in pushing the compiler/optimization side of things right now. I asked one of the leading collaborators about the connection years ago, and they were unaware of it, but I think at least that's better now. (Likewise, a lot of this stuff is better understood in attribute grammar optimization literature, but that connection hasn't been significantly exploited/reinvented so far).

Clean characterization of

Clean characterization of common practice is what a lot of computer science is :)

Memo functions do seem new. For example think of a time varying list:

type List a = Nil | Cons a (ListB a)
and ListB a = Behavior (List a)

How do you write a map function over ListB that reuses as much of the previous run as possible? If a cons in the middle of the list changes, then with just FRP primitives the entire rest of the mapping will be recomputed. This is the problem that memo functions solve.

I think we're generally in

I think we're generally in agreement. Data structures in incremental languages are an outstanding challenge. Currently, each structure (lists, trees, etc.) is created with custom memoization schemes (cleanly characterized by the ADT paper), and the space of implementation decisions that have been explored is huge (and domain specific): how much history, static vs. dynamic, ordered or not, etc. We need constructs for black box / orthogonal / declarative control over implementation choice rather than reinventing the wheel each time -- perhaps we're talking about different memo functions when looking at the scope of the solution.

Memo functions are a general

Memo functions are a general construct provided by self-adjusting computation to reuse results from the previous run that cannot be reused without them. You can build changeable lists with them, but they *are* not changeable lists. If you want to know more about what they are you can find it in the self adjusting computation papers.

For example for changeable lists you just write the usual map function but instead of `let map f xs = ...` you write `let memo map f xs = ...` (memo is a keyword) and then it reuses the results from the previous run even after the cons cell that was changed. The point is that you don't need a special primitive changeable list type: you can build it on top of existing self adjusting computation primitives (namely behaviors and memo), which is not possible with FRP (which lacks memo).
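
A rough Python imitation of the effect (keying the memo table on the identity of the cons cell stands in for the real mechanism, which scopes reuse by timestamps):

    # Imitation only: the map is memoized per cons cell, so replacing one cell
    # rebuilds the mapping for that cell alone and reuses the mapped tail.
    class Cons:
        def __init__(self, head, tail):
            self.head, self.tail = head, tail

    memo = {}   # id(cell) -> mapped cell; keyed on identity (ignoring f) just for illustration

    def memo_map(f, xs):
        if xs is None:
            return None
        if id(xs) not in memo:
            memo[id(xs)] = Cons(f(xs.head), memo_map(f, xs.tail))
        return memo[id(xs)]

    tail = Cons(2, Cons(3, None))
    xs1  = Cons(1, tail)
    ys1  = memo_map(lambda x: x * 10, xs1)
    xs2  = Cons(5, tail)                               # only the front cell changed
    ys2  = memo_map(lambda x: x * 10, xs2)
    print(ys2.tail is ys1.tail)    # True: the mapped tail was reused, not rebuilt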

I think you're missing my

I think you're missing my point. They're both general but at different specific design points. For a truly general construct, I want to see how to declaratively pick different design points.

[Btw, a 'memo' annotation is definitely not new. Again, the value here is putting it into a better model than previous work. ]

Your last sentence in your

Your last sentence in your previous message makes me believe that you are missing the point:

perhaps we're talking about different memo functions when looking at the scope of the solution

Which memo construct we're talking about is pretty clear: the one provided by self-adjusting computation. Notably, self-adjusting computation memo functions are NOT the same as the memoization that you're used to and that is common in e.g. Lisp (although in a way I guess they are similar).

A memoized function takes a

A memoized function takes a list of arguments and the function body, looks up the memo table based on the arguments, and computes and stores the result in the memo table if the result is not already found in it

We can define a family of lift functions based on the memoization behavior. There are arguments for and against different ones (remember previous value, remember all values, only sometimes remember values, etc). Full memoization seems like a poor default in that it penalizes the common case, which is probably why it isn't done in other systems. My point is that the interesting work is in declaratively synthesizing the right implementation: this stuff is low-level and cross-cutting.

Yes, but that's not quite what

Yes, but that's not quite what self-adjusting computation memo functions are. Doing that blindly leads to space leaks. Memo functions in self-adjusting computation integrate with its notion of time and timestamps to remember just those results that can be reused in the next update. In a way, FRP's update mechanism with the priority queue and such is just one half of self-adjusting computation.

Suppose function `a` calls `b` calls `c` calls `d`, and suppose that something in the arguments to `b` caused it to call `e` instead of `c`. The way FRP works is that the results from `a` that don't depend on `b` get reused from the previous run. But now suppose that `e` calls `d` just like `c` did. FRP's change propagation does not reuse the old computation of `d`, even though we still have that result sitting in memory. With self adjusting computation memo functions you can reuse the old computation of `d` without blindly memoizing everything and avoiding space leaks.

Response to edit: Right, my point isn't that there isn't work to be done, or that you couldn't do the same on top of FRP. The point is that there is new stuff in self-adjusting computation. Its functionality is not just a subset of FRP.

Hmm I did not mean to imply

Hmm I did not mean to imply it was a subset, but that it falls into the same framework (lifting, dynamic scheduling, etc.) and the main change is the amount of memoization. One thing that is unclear to me is if the memoization is non-local --- in f(a) + f(a), is the memoization of f() shared? Such questions may be the appropriate perspective of this work: determining the appropriate extent of memoization, both lexically and over time. Kind of analogous to concerns in laziness and continuations, but with a performance focus :)

If I understand it

If I understand it correctly, f(a) is not shared. Memo only reuses values in the timestamp range of the parent computation. When we rerun a subcomputation, only those values are looked up that were cached in the previous run of that subcomputation. This results in an alternating reuse-recompute process: when a memoized call is looked up, stuff that that call depends on might have changed as well, so we have to recompute just those parts, recomputation of those parts may result in memo table hits which cause reuse, which may cause more recomputation of parts of the reused results, etc.

Parallel and self-adjusting

I very recently read musings on self-adjusting and parallel computation being related by fine-grained dynamic dependencies: #TWIL: monads, parallelism, geometry.

Again, see attribute

Again, see attribute grammars. The connection has been extensively explored; the interesting thing here (to me) is generalizing it -- we want to hit general imperative code. AGs essentially gave us the non-first order FP case, FRP gave us the first-order case, and now we're moving even further. (STM and speculative execution paved a lot of the way for the imperative case.)

Is Imperative Self-Adjusting

It looks like what I

It looks like what I remembered. 'Imperative' support is analogous to 'cells' in FRP systems (among other names). Instead of a history of one, they have a full history (full memo table). This is a simple change in the design space.

Can you point me to a

Can you point me to a description of the equivalent of that in the context of FRP? And if that's not what you meant by imperative support, then what is?

Thanks. I hadn't even

Thanks. I hadn't even expected a PLT angle, but since Acar's work relies on dependency graphs the connection becomes obvious.

Dependency analysis, however, can only show what it costs to make a static algorithm dynamic, i.e. the costs of updates. Creating deep dependency graphs should be avoided, though, when an algorithm is designed for frequent updates.

For example, in the case of edit distance computations we would expect that for a good "dynamic algorithm" the complexity of the update doesn't depend on the update position. Inserting a character at the beginning or the end of a string should not impact the average computational effort.

One can make the case against parsing from this perspective: the insertion of a parenthesis can have effects on the whole of a parse tree and its mere existence implies non-local dependencies. Context free grammars are still too context sensitive.

I've done this before so let

I've done this before so let me add my two cents.

Most edits are O(1), even with respect to type checking (might as well throw that in, b/c that is what we really want). But there are some degenerate cases, mostly with respect to type checking rather than parsing. You can usually process your trees fast enough in any case; the main focus is on preserving the parse and type structure of good trees in the presence of intermediate edits.

Braces...you have to special case those. Run a brace matcher with the new edit BEFORE you run the incremental parser. If the braces aren't balanced, find a reasonable way to balance them (brace completion if an opening brace was just added), or disable the parser/typer and emit a brace unbalanced error. Once you eliminate braces from concern, everything else (about parsing) is fairly easy.
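
A minimal sketch of that brace pre-pass (curly braces only, with invented helper names):

    # Minimal sketch: check brace balance before handing the edit to the
    # incremental parser; auto-complete missing closers, defer on bad closers.
    def brace_delta(text):
        depth = 0
        for ch in text:
            if ch == '{':
                depth += 1
            elif ch == '}':
                depth -= 1
                if depth < 0:
                    return depth          # unmatched closing brace
        return depth

    def ready_to_reparse(text):
        d = brace_delta(text)
        if d == 0:
            return True, text
        if d > 0:
            return True, text + '}' * d   # add "invisible" closing braces
        return False, text                # unbalanced closer: don't reparse yet

    print(ready_to_reparse("class C { void m() {"))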

Right, you often can't slap

Right, you often can't slap self-adjusting computation on an existing algorithm and expect to get the best incremental algorithm out of it. This is similar to how you can't take a functional program written for a strict language, evaluate it in a lazy language, and expect that all irrelevant sub-computations are skipped. To make these things work you need to think carefully about the data flow.

It does still seem helpful though, because they were able to design incremental algorithms with better complexity than previously described in the literature.

I would go much further than

I would go much further than that: you can't slap self-adjusting computation on parsing at all, since it is about error recovery (preserving good information) as much as it is about performance. By that I mean: we might be able to make a batch parser incremental automatically, but the result would not be practical even if it was insanely efficient.

Well, you can slap

Well, you can slap self-adjusting computation on an error-tolerant batch parser. You're right that making a computation incremental does not automatically make it error tolerant.

Surely you could, but you

Surely you could, but you lose access to any change history, given what is basically time obliviousness. I would think that for 90% of the tasks that you want to be incremental, some amount of history is important and your computations can't transparently self-adjust.

It's still good research; it just doesn't seem useful for what I have experience with.

Parse as a Stable Model

I recently blogged about achieving stable, incremental systems while maintaining time-oblivious semantics. The price to pay is accepting non-deterministic semantics. There are many advantages to such mechanisms with respect to system resilience and live programming.

A stable model makes a good validation semantics for incremental parsing. As text changes, we make minimal changes to the AST in order to achieve a new parse. Stable semantics provide the valuable invariant: we always have a valid parse according to the current code and parse rules.

I've explored a few generic approaches, including constraint-logic programming. I doubt a generic technique could offer the performance of a dedicated incremental parser. But perhaps it would be good enough given stability and learning-based optimizations. And it might better handle some problems I'm interested in, such as concurrent editors.

Given a preference model, I can also manage error-tolerant parsing by making errors very low preference (so I don't stabilize on them for very long).

That said, I'm not currently planning on doing incremental parsing. I favor deterministic parsing, as managed by PEGs. I plan instead to use very small modules so I can parse them fast enough. I do plan to use stable models for linkers, for maintaining a stable dependency graph despite modules being maintained at runtime or being provided by the ambient computing environment (based on geographic location or local broadcast communication).

I can't look at the wordpress

I can't look at the wordpress blog since I'm behind the wall, but it sounds pretty much what I do with all my incremental algorithms, not just parsing.

Invalidate, clean, invalidate more, clean more, until there is nothing left to clean (the system is "stable"). When a memoized object (code + state) block executes, a dynamic variable captures it and any traced dependencies are attributed to it. When those dependencies change, the block is then cleaned and its dependencies are cleaned if a change is detected in its output. I'm sure this is basic stuff.

The trick to incremental parsing and type checking (or anything else really) is not to change an object's output if it is in an error state, which prevents its dependencies from being cleaned against bad outputs that destroy the good state they previously had. This is the part that is missing for me in all automated systems: can I preserve my previously-valid outputs if I'm in a bad state, or do I have to take down the rest of the world with me?
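
In sketch form (Python, with made-up names), the invalidate/clean loop and the error rule look something like this:

    # Sketch: blocks are memoized units of computation with traced dependencies.
    # A block that throws keeps its last good output, so its dependents are not
    # cleaned against a bad result.
    class Block:
        def __init__(self, fn, deps=()):
            self.fn, self.deps, self.dependents = fn, list(deps), []
            for d in self.deps:
                d.dependents.append(self)
            self.output, self.dirty = None, True

        def invalidate(self):
            self.dirty = True

        def clean(self):
            if not self.dirty:
                return
            self.dirty = False
            try:
                new_output = self.fn(*[d.output for d in self.deps])
            except Exception:
                return                        # error state: keep the old output
            if new_output != self.output:
                self.output = new_output
                for dep in self.dependents:   # output changed: clean dependents
                    dep.invalidate()

    def stabilize(blocks):
        while any(b.dirty for b in blocks):
            for b in blocks:
                b.clean()

    src   = Block(lambda: "1+2")                      # stand-in for the edit buffer
    parse = Block(lambda s: eval(s), deps=[src])      # stand-in for parse/typecheck
    stabilize([src, parse])
    print(parse.output)                               # 3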

I'm Skeptical

My comment isn't really about parsing, but this non-deterministic stability in general. It seems like it's just sweeping the state under the rug. The state is still there and now it's hard to tell how much of it there is, because it's not out in the open. With the air conditioner example, wouldn't it be better to make the cooling and heating on/off bits an explicit piece of state and then define the update rules for the system? Rather than letting there be some unknown amount of implicit state that depends on how you got to the present point?

I'm having trouble following the analogy as it applies to parsing. What kinds of things would you make non-deterministic to avoid introducing explicit state?

New kinds of state emerge

New kinds of state emerge from being incremental. You trace what would be considered an immutable variable in a timeless environment, and it becomes mutable because the environment is no longer timeless. You could of course make this state explicit, but then you lose obliviousness while the explicit state becomes a spaghetti mess.

You still have to deal with your existing state also, like symbol tables, which become "more" stateful than they were. So say you have nested symbol tables that are filled but never updated (you don't replace a name-symbol binding after it has been established). So it's state, but not really heavy state. Being incremental turns it into heavier state because it must now support updates (and these must appear like non-updates to the original code via tracing).

Well, there's that state --

Well, there's that state -- implementation state -- but that's not the state I was asking about. I'm asking about logical state that would prevent the parse system from behaving as if it was a deterministic function of the code. That seemed to be what David had in mind. Maybe it's something like preserving the state of constructs that are undergoing edits and that are temporarily in error, as you were suggesting earlier. In any event, I'm still skeptical of this idea.

Once you add history to your

Once you add history to your algorithm, parsing won't be deterministic anymore, and we have to add a bunch of tests to make sure that in the good/correct case the trees are always the same. In the case of an error...whatever: you preserve good state and guess on the rest. Determinism isn't really a goal, except that you do want to be consistent in your non-determinism.

Why does the parsing need to

Why does the parsing need to become non-deterministic? The history is just there to speed up reparsing. That doesn't mean that the result has to become dependent on the history.

That's the problem right

That's the problem right there: being incremental isn't just an efficiency thing; actually, it's often not really about efficiency at all (parsing is fast)!

If you do error recovery, then you are going to leverage history in unclean ways. If you aren't doing error recovery, then there isn't much point to being incremental; just rerun the parser, it is fast enough even for a large file.

Now type checking is another story...you need both the efficiency and the error recovery capabilities that being incremental gives you.

Why does doing error

Why does doing error recovery imply using history? You can do error recovery given only the code file.

You can do a much better job

You can do a much better job of error recovery with history. In the simple case, you can assume that the good parse trees around an edit are still good even if the edit is erroneous. For example, if someone types a brace and the braces were previously balanced, you can assume that they didn't mean to steal a mate from an existing good brace pair. If someone types "p = " and there is a parse error, you can assume they didn't mean to assign to a successive statement. If they type "if" you can assume they didn't mean to use the next non-paren statement as the condition-body for the if. It just goes on and on. As a heuristic, we can just preserve previous good parses around a bad parse tree; which is taking advantage of history in a simple but very effective way (same applies to type checking).

I see. That does seem like a

I see. That does sound very hairy and difficult to get completely right with tons of edge cases. I wonder if there is a clean & general solution to this problem, but there probably isn't (well except ditching text).

Countering some Skepticisms

While the state might still be there, the application isn't depending on it. If you lose the alleged 'state', the application will recover to a valid and stable condition. If partitioning occurs we can still keep each partition valid, though they may diverge a little (limited redundancy, independent updates), then converge back to a common valid state once the partition recovers. Semantically stateless systems have the advantage of being resilient and never getting `stuck` in an invalid state.

Regarding your preference for "explicit" state - no, it is not "better", at least by many metrics. The state might become `invalid` whenever you change those update rules, in the sense that the current state couldn't have been reached by those rules. For the AC example, this can happen when you change the temperature settings. You'll need to account for this in your design, and even in the simple AC example it can be a hassle. You are trading simplicity for a level of control that you probably don't even care about.

Regarding the "unknown amount" of implicit state, I think that is less a problem than you are imagining. One point to keep in mind is that the space to store a state derivation is only logarithmic with the complexity of the model. We might keep several times that to support performance, learning, history. Yet, it could remain a predictable quantity.

The analogy doesn't apply to parsing in general, but rather to the position Sean was describing - maintaining a parse during live programming. Grammars already expose much non-determinism in their rules (A = xAz | xxB), and a grammar augmented for error-tolerance will have more non-determinism than most. (If you do favor deterministic parsing, of course that is possible. In RDP, it is the default way of doing things. But a small change in input could cause much bigger changes in decisions than the grammar requires.)

There are many other things I would make non-deterministic to minimize the hassles of managing explicit state. I name a few in the article - planning, configuration, layout, resource scheduling, dependency management. I think it could also be useful for interactive fiction, world modeling, storytelling, natural language processing.

I'm interested in getting much closer to the `essential` level of state in applications. I'd also like to avoid indeterminism, of course. It is useful to trade off between state, stability, and determinism based on what is most important to a particular problem.

A distinction

First, and just to make sure we're on the same page, shouldn't the word "should" be used instead of "may" in the quote below (from your blog)?

For temperatures less than 76, the air conditioner may be off.
For temperatures greater than 73, the air conditioner may be on.

Assuming so, I understand why you want to avoid explicitly encoding the transition rules for state and instead stick with rules that constrain what properties the state should satisfy. That, I agree, leads to a declarative approach. It seems to me most of your points apply to the advantages of that approach.

But what I don't see is why you don't also declare the state explicitly. This would eliminate the worry of accidentally introducing state and would give you a deterministic semantics. What are the disadvantages of doing so? It also seems to me you will probably want some way to prioritize the stability of various pieces of state against each other (when a constraint of the form F(x,y) is broken, should we prefer to change x or y to fix it?). That suggests making the stabilization constraints themselves explicit.

May not Should

The word `may` was used correctly. That program expresses allowances. Taken together, we can see that:

  1. in the range of temperatures between 73 and 76 degrees (exclusive) the air conditioner may be on or off
  2. at 76 degrees and above, the air conditioner no longer may be off; it may only be on
  3. at 73 degrees and below, the air conditioner no longer may be on; it may only be off

If the word `should` were used instead, then condition 1 would be inconsistent. With `may` it is merely non-deterministic. (Note: the set of allowances is effectively exhaustive, because a constraint solver cannot verify any solution that is not explicitly allowed.)

OK

I actually meant that the rules should be:

1. If the temperature is under 73, the AC should be off.
2. If the temperature is over 76, the AC should be on.

I was worried about what "X may be true" means. I think your last sentence explains -- you're requiring the possibilities to be exhaustive.

Allowances and constraints

Allowances and constraints are both useful ways to express models. (Allowances correspond to generative grammars, or extra entailment rules for logic predicates.) I find it valuable to use both when seeking extensibility and modularity properties for constraint-logic systems.

Art of the State

But what I don't see is why you don't also declare the state explicitly. This would eliminate the worry of accidentally introducing state and would give you a deterministic semantics.

I could express similar rules to the constraint model above using Reactive State Transition or Reactive Term Rewriting. Either of those would effectively achieve just what you say - declaring some state then declaring some time-varying rules to guide it from one state to another, and getting the benefits of determinism.

Using reactive state transition, the same AC model would be expressed as:

enum { OFF, ON } .
When temperature is less than 73, include transition from ON to OFF.
When temperature is greater than 76, include transition from OFF to ON.

My RDP model would be responsible for continuously observing the temperatures and demanding the transition rules be included in the state model. (RDP has no internal state or indeterminism. It depends on controlling external models for state or indeterminism as needed.) We would observe the state of the model and use it to control the AC. Additionally we'd need to set the initial state. A reset demand would clear to state 0, which happens to coincide with OFF.

The AC example is too trivial to motivate the use of stateless stability. (I even said as much in the article.) Rather, it is intended as an example to demonstrate how we can do without state something we typically do with state. If I did use stateless stability for an AC, I could easily choose a simpler model than constraint-logic as the source of non-determinism.

I favor constraint-logic because it supports my vision of using stateless stability for several difficult problems that I was finding painful (even in pseudocode) to accommodate even with rich state models like reactive term rewriting. In many of these cases the `right` amount of state is an unknown anyway, and the non-determinism is natural - planning and cooperation, world modeling, tracking `context` in natural language processing, managing dependencies and links for ambient computing, and so on.

It also seems to me you will probably want some way to prioritize the stability of various pieces of state against each other (when a constraint of the form F(x,y) is broken, should we prefer to change x or y to fix it?). That suggests making the stabilization constraints themselves explicit.

I agree. Modeling preferences and controlling what parts of the model are more likely to change can be very valuable. Many constraint programming models are augmented with soft constraints to support preferences. There was a section in my article discussing such a mechanism.

We don't need `stabilization constraints`, though. And I'd rather keep my semantics oblivious to time and transition. Instead, a cost model that encourages stabilizing in certain ways (e.g. by describing some changes as `expensive`) should be sufficient.

More complex air conditioners

Matt,

David's air conditioner is trivial. Commercial HVAC systems are way more sophisticated in how they regulate temperature. They are designed for the specific building or room they cool and heat, and they are also designed to meet energy rebate guidelines: energy utility companies actually issue rebates to stores for reducing energy consumption during certain hours, or days of the year, or during brownouts. Running HVAC, as most data center designers are well aware, is incredibly expensive.

For example: Jordan's IMAX Movie Theatre in Reading, MA.

For those who don't know, an IMAX Theatre is a theatre with stadium seating and a three stories tall projector screen. Consider the following scenario:

The steady state uninhabited temperature of the room will be 55 degrees, warm enough to keep pipes from freezing due to external cold air pressure in case of insulation issues. The steady state inhabited room temperature will be 70 degrees. How long before the room is predicted to be in use should we start raising the temperature in order to meet the desired performance parameters? A lot of the major factors in this control system equation have nothing to do with state, such as the size of the room, which is generally a constant. Room size is actually probably the single most important parameter, because if your HVAC is not appropriately suited to the room it is cooling/heating, then weird things can happen. For example, using a house AC unit: if the room is too small, you will see puddles of water on the floor beneath the AC. This happens because the internal thermostat essentially freezes due to the feedback rate at which air circulates through the room. You get constant stops and starts of the AC unit as well, creating further condensation. It is a vicious cycle due to a non-robust control system.

"How you got to the present point" is irrelevant. When systems can observe their own oscillation, it is irrelevant.

What actually matters is building models of what you expect will happen and what has happened (measuring error in probability expectation).

As for self-adjusting computation, in the general case there is no way to determine how you got to the present point, due to the Brock-Ackerman Anomaly; deducing it would be erroneous. To decide provenance, you would likely need something more well-formed, like Kahn Process Networks. For this reason, most theoretical techniques for managing self-stabilizing networks deal with nodes that exhibit "weak" and "meager" computational behaviors. For example, in Montanari's work on graph grammars, he explicitly handles the Brock-Ackerman Anomaly through the use of "contextual nets". Later, functional reactive programming researchers re-discovered similar techniques for dealing with non-deterministic computations, which is a frequent problem due to pushing state out to the edges.

Finally, besides correctness, there are of course other dangers with the "stateless core" approach, if you are basing your arguments on "performance" since you will likely never have an accurate simulation lab equivalent in size to the real world Internet. Ion Stoica was the first networking researcher I know of who called attention to these, when he described his Core-Stateless Fair Queuing algorithm design, which won the 2001 ACM Doctoral Dissertation Award. Ion's general idea was to push state out to the edges of the Internet into "edge routers" and that all "core routers" should approximate stateful routing (fair queuing) while remaining stateless. The potential downsides and assumptions under which this works well or not are discussed in the link above.

A puddle in a small room in an IMAX movie theatre in Reading, MA

I gather from the foregoing that we are in disagreement about something, but I'm not yet sure what it is. Does it help that I'm not arguing to keep track of "how you got to the present point?" The state in question in David's simplified example is whether or not the AC is powered on at present, not the history leading to this state. I understand David to be advocating an approach where we specify only some constraints on when the AC should be on, leaving the solver to decide, which requires state that should be stabilized. I'm arguing against / expressing skepticism of that implicit approach.

The state in my example is

The only state in my example is the `solution` itself. The solution is what is `stabilized`. The solution just happens to be a sentence that describes a desired state for the AC. The solver knows nothing about ACs or AC state; it only understands solutions, and possibly histories of solutions that seem to have worked well or poorly in the past.

Relevantly, the solution does not (semantically) depend on the previous solution from one nanosecond past. That's just a practical concern. Stability is not enforced by the semantics; it is just a high-value desideratum for the implementation of constraint solvers, up there with performance.

Indeterminism, btw, does offer much opportunity to `learn` better or faster solutions. This is also described in my article. But it doesn't apply to the trivial AC example described earlier. If we had some distributed sensors, more than two AC settings, and some control over fans OTOH, the richer model would offer more opportunity to learn solutions that survive for a while and reduce oscillation and save costs. (And the amount of state to stabilize the solution would be a lot less obvious.)

Still skeptical

Right, I understand. I just meant that in your example, the solution that's stabilized amounts to the AC state. As to whether or not implicit stabilization is more natural than identifying the state, I guess I'd need to see examples, but I probably won't spend much time replying for a while.

Stateful Semantics

I do effectively control AC state, but I do so without any stateful semantics. I cannot express an accumulator using only stateless stable models. I cannot count how many times the AC has been active.

With stateful semantics, developers control transitions. And an implementation must maintain a causal relationship between past and future state. Stateless stable models lack those semantics. And their implementations lack any obligation for causality (and indeed are allowed to transition without any cause).

I'm not particularly concerned with what is "natural" - only with what is easy to use, compose, and reason about, especially with regard to live programming and distributed systems.

For this trivial example, it is easy to use explicit state. But the AC example was never intended to motivate stateless stable models, only to help explain them.

Yes, it helps.

Does it help that I'm not arguing to keep track of "how you got to the present point?"

I'm arguing you don't need to care what the present point is. You only need to care about quality of service. Saying whether an AC is on or off would be like an ISP writing into its SLA that its routers always use the largest window possible for TCP window scaling. That is totally irrelevant to how responsive your network is and whether you are delivering on your SLAs. In the same way, feeling hot or cold while watching a movie with soda and popcorn is the only thing that matters.

Are we on the same page now?

I understand you to be

I understand you to be saying that the AC example is simplistic and that for a real control system you'd need much more than just the current AC state. I agree with that.

Parsing is still needed

The nice thing about parsing is that it lets us type what we want to see on the screen: we type "1 + 2*f(x)" and it gets parsed into an AST that pretty-prints to roughly what we typed.

You could do the same in an

You could do roughly the same in an AST editor, without requiring any fancy parsing algorithms. You just have an AST with holes, and keyboard commands insert something into those holes.

For example. Square brackets indicate the selected expression (the cursor if you will, but it's on a sub-expression rather than between two characters). A dot indicates a hole.

AST          character typed   command
[.]          "1"               insert literal
[1]          "+"               apply operator to selection
1 + [.]      "f"               insert variable
1 + [f]      "("               apply operator to selection
1 + f([.])   "x"               insert variable
1 + f([x])

It is very natural to program with this kind of editor, and it's incredibly easy to create too, especially if you want "advanced" features like code completion and syntax highlighting.
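
To make the table above concrete, here is a small Python sketch (the node kinds and command names are invented; a real editor would add navigation, deletion, precedence handling, and so on):

    # Sketch of the editing model above: an AST with holes, a selection that is
    # always a node, and commands that fill or wrap the selected node.
    class Node:
        def __init__(self, kind, children=(), text=""):
            self.kind, self.children, self.text = kind, list(children), text

        def show(self, selected=None):
            mark = lambda s: "[" + s + "]" if self is selected else s
            if self.kind == "hole":
                return mark(".")
            if self.kind in ("lit", "var"):
                return mark(self.text)
            if self.kind == "binop":
                l, r = self.children
                return mark(l.show(selected) + " " + self.text + " " + r.show(selected))
            if self.kind == "call":
                f, a = self.children
                return mark(f.show(selected) + "(" + a.show(selected) + ")")

    def hole():
        return Node("hole")

    class Editor:
        def __init__(self):
            self.root = hole()
            self.selection = self.root

        def _splice(self, old, new):           # put `new` where `old` was
            if self.root is old:
                self.root = new
                return
            def walk(n):
                for i, c in enumerate(n.children):
                    if c is old:
                        n.children[i] = new
                    else:
                        walk(c)
            walk(self.root)

        def insert_leaf(self, kind, text):     # fill the selected hole
            node = Node(kind, text=text)
            self._splice(self.selection, node)
            self.selection = node

        def apply_operator(self, op):          # wrap the selection, select the new hole
            new_hole = hole()
            kind = "call" if op == "(" else "binop"
            node = Node(kind, [self.selection, new_hole], text=op)
            self._splice(self.selection, node)
            self.selection = new_hole

    e = Editor()
    e.insert_leaf("lit", "1"); e.apply_operator("+")
    e.insert_leaf("var", "f"); e.apply_operator("(")
    e.insert_leaf("var", "x")
    print(e.root.show(e.selection))            # 1 + f([x])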

Care to elaborate?

If the editor can convert the typed string into an AST, then it is obviously parsing. I don't know whether your counter to this is hiding in "roughly" or "fancy".

See edit :)

It's hiding in a little bit of both ;)

How are the key sequences

How are the key sequences "1+2*x" and "1*2+x" handled by your editor?

Depends on what you want. On

Depends on what you want. On inserting an operator you could walk up the AST until you hit a node with lower precedence. Or you could make a language without precedence where you instead would have to use navigation. For example, if you want (1+2)*x then you first type 1+[1], then you press a command to select the parent: [1+1], then you press *.
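
The walk-up itself is only a few lines. A Python sketch with assumed precedences (nodes carry parent pointers; everything here is invented for illustration):

    # Sketch of the precedence walk-up: on typing an infix operator, climb past
    # ancestors that bind at least as tightly, then wrap whatever we stopped at.
    PREC = {'+': 1, '*': 2}

    class Node:
        def __init__(self, kind, text, children=()):
            self.kind, self.text, self.children = kind, text, list(children)
            self.parent = None
            for c in self.children:
                c.parent = self

        def show(self):
            if self.kind == 'binop':
                l, r = self.children
                return "(" + l.show() + " " + self.text + " " + r.show() + ")"
            return self.text                   # literals and the "." hole

    def apply_operator(op, selection):
        target = selection
        while (target.parent is not None and target.parent.kind == 'binop'
               and PREC[target.parent.text] >= PREC[op]):
            target = target.parent
        old_parent = target.parent
        wrapper = Node('binop', op, [target, Node('hole', '.')])
        if old_parent is not None:             # splice the wrapper into the tree
            old_parent.children[old_parent.children.index(target)] = wrapper
            wrapper.parent = old_parent
        return wrapper                         # the new root if old_parent is None

    # "1 + 2" with the cursor on the 2, then the user types "*":
    two  = Node('lit', '2')
    plus = Node('binop', '+', [Node('lit', '1'), two])
    apply_operator('*', two)
    print(plus.show())                         # (1 + (2 * .))

    # "1 * 2" with the cursor on the 2, then "+": climb past the tighter "*".
    two   = Node('lit', '2')
    times = Node('binop', '*', [Node('lit', '1'), two])
    print(apply_operator('+', two).show())     # ((1 * 2) + .)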

The advantage is not just that it's easier to create than a simplistic parser that doesn't report readable parse errors, is too slow for interactive use, and has no error tolerance during editing; the advantage is that you get these things for free. Another advantage is that once you desire IDE facilities like syntax highlighting and code completion, they are trivial rather than almost impossible. In traditional editors, anything but very basic syntax highlighting is already a hard problem because during editing you go through non-parseable states. With ASTs, each keyboard command results in an O(1) operation with a very small constant, whereas incremental parsing is a hard problem.

I replied to your post

I replied to your post before you split it, see my reply to parent.

I'm not arguing for traditional editing

I'm just pointing out that if you support WYSIWYG code entry, then you support parsing (call it what you want, but it contains parsing as a sub-problem). I'm using something of a hybrid approach for my own language. I treat blocks (lists of indented lines) structurally, but still do parsing. I get easy support for syntax highlighting and code completion, though I'm sure the mechanism is more complex than what you'd get with a pure structural editor. For my language, the choice of mechanism here heavily interacts with syntax extension and overloading.

Cool! I also think language

Cool! I also think language extension is one of the killer features of structural editing, by letting the code that is being edited extend the AST node types. Sort of like Lisp macros, but in addition to a compile-time component they also have an edit-time component (which simply manifests itself as a display() method for AST nodes in addition to a compile(env) method).

I'm not sure how overloading comes into the mix. Can you elaborate?

Right, that's basically what

Right, that's basically what I mean by saying I have structural blocks -- the parent construct of a block can completely change the parsing of the block.

Overloading is in the mix because the first thing I do is tokenize and then resolve overloading on the tokens to get to symbols. For example, the token * might get mapped to two possible symbols, for multiplication and dot product, that have different precedences or that take part in different syntax. Types are also in the mix here, and I probably even want to support typo detection eventually.

So basically, I've moved to a complicated edit time parser that works on short fragments (and would otherwise be computationally intensive), and a very simple compile-time parser.

That sounds cool. Maybe you

That sounds cool. Maybe you would like TeXmacs. It has some sort of context-aware overloading. For example, in math mode, when you type "a" and hit TAB, you get the Greek letter \alpha. You can also type "<=" and hit TAB repeatedly; it cycles through TeX symbols of similar semantic nature, such as "less than or equal", "set inclusion", etc.

Right, I do that too, except

Right, I do that too, except you don't even have to hit TAB for those completions (unless there is ambiguity). Symbols can pick how they're rendered at the definition site and can include Unicode, so that you can get Unicode symbols without keying them in at each use site.

In the Scala editor, I would

In the Scala editor, I would rewrite => as a unicode arrow as it was typed. It seemed to work well enough from a perspective of mechanics, but it was confusing enough that I think they stopped doing that.

I think you can't really

I think you can't really call it "parsing" if the decoding is trivial. The key point is that the design of encoding should make decoding easy. I wouldn't complain if we just need to parse "matched parentheses".

For example, this prototype structural editor supports some recognition of "keywords". So when you type "cl..." in a context which allows class creation, it knows that you want to create a class node. When you hit return, it immediately fills in a structure of a class for you, with holes for you to fill in. But you can never get your cursor outside of these holes.

I would argue that

I would argue that incremental parsing is not a very hard problem, having done it myself. Performance of an incremental parser, on the other hand, is often O(1) but not guaranteed as such (there are many degenerate cases). But the advantage we are looking for is simply error recovery, and not performance as re-parsing is often fast enough.

Actually, the big advantage of incremental parsing is incremental type checking, where type checking is much more expensive than parsing. With incremental parsing, you can keep your trees with their types and symbol tables, and you don't need to destroy them on an invalid edit because you have decent error recovery. You get to apply this error recovery to type checking also, which is an even bigger win.

And with all that preserved type information, semantic and syntactic highlighting is easy and code completion is incredibly responsive. This is something the new Scala plugin has not been capable of replicating (where they use lazy evaluation rather than being incremental). When you use very responsive code completion in a large file, it completely changes your editing experience.

You can get all of that with an AST editor also. I'm just claiming that you can have your cake and eat it too with about 6 months of engineering work (about the amount of time it took me to retrofit incremental into batch scalac).

I wouldn't consider textual

I wouldn't consider textual syntax cake though. It's more like "you can have your poison and eat it too" ;) I'd gladly trade poison for cake AND 6 months of free time.

Touche

Pragmatically speaking, people love their free form textual language aware editors and legacy textual languages.

We went through AST editing in the 80s, and it didn't work out very well. There are actually things we can do to fix structured editing, and I've been thrown into that given my current tablet work (for example, auto extend in my current paper allows you to program bottom-up even in a structured editor). So I'm not really arguing with you on this point, but many others will :)

It is a parser, what I call

It is a parser, what I call an incremental delta edit parser. The parser handles deltas rather than the entire file as a giant string. Rather than define a grammar, you define transitions to construct a tree based on user input.

The parsers that we tend to talk about are the boring ones that take huge strings en-masse and produce huge trees en-masse. But they are all parsers.

What makes this easier than

What makes this (way) easier than a delta editor for a textual language is that the true state of the document is always a valid parse tree, rather than a possibly malformed string of characters.

It is the same thing I

It is the same thing I think; you're not considering individual edits though, but edit aggregates that then fire valid transitions in your parser.

We'd really have to go back to the 80s to uncover the literature on this topic, but I'm sure it's been covered before. Ronald Reagan, Genesis, Knight Rider, and enhancements to syntax-directed editing!

I'm not so sure it's the

I'm not so sure it's the same thing. With a text based editor you can e.g. delete a closing brace of a method definition, or insert a single quotation mark for a literal string. That really trips up your parsing and you have to insert a lot of special cases. Such things are simply impossible with AST based editing.

I should clarify: the kind

I should clarify: the kind of editing you are suggesting replaces the grammar with a state/transition system that essentially acts as and replaces a grammar. So Matt is right: you are still doing parsing (as input recognition), just in a different way than what we are used to.

Definitely, a state/transition system has different implications to editing than what we can do with a free-form text editor.

Sure, but you dodge all the

Sure, but you dodge all the *hard* parsing issues, like error tolerance (& error messages) and incrementality. Of course it is still parsing in the sense of turning a bunch of key presses into an AST :)

An interesting question is (IMO): if we accept that we're editing an AST rather than an array of bytes, how can we take most advantage of that? Many things now become much easier. For example I think that intellisense was originally invented in the context of a syntax directed editor. What are other things like this?

I don't know if this is

I don't know if this is actually true. Once you go down this path, you will find that some bottom-up programming capabilities are needed, that programmers will need to go through incorrect phases to express what they want to do efficiently without disrupting their flow. This can be handled in a variety of ways (I prefer inference), but it requires engineering effort. You have made some things easy but now must solve the flow problem head on (that's where I am).

From my research for Section 2 in my recent paper: IntelliSense was invented in Alice Pascal (1985), a syntax directed editor, and it was migrated to language aware editors for the same purpose in 1997, although programmers later found it more useful for exploring libraries (evidence: Microsoft 1997 press release on VB5 vs. current IntelliSense documentation on MSDN). They seem to care less about correct input, actually, and don't want to sacrifice their flow to be bound by it (hence the emergence of language aware editors rather than syntax-directed ones).

Very interesting topic!

What kind of incorrect

What kind of incorrect phases do you mean? Can you give examples? Perhaps I program differently, but my edits are always structural. Except on a micro level, but that's because current textual editors are optimized for non-structural micro level edits.

I certainly agree that for example type checking should be able to go through incorrect phases (I prefer the view that type checking is an optional static analysis in the first place). But things like completely messing up the parse tree by deleting a single parenthesis seem to be unnecessary flexibility.

Very interesting topic indeed. This is where the future of programming lies :)

Semantically, as you say,

Semantically, as you say, there are many situations where you don't want to constrain the programmer's flow by forcing them to write programs in a certain order.

Syntactically, just consider all the text hacking we do in a text editor, copy-paste-modify at the micro-level, even of expressions; commenting arbitrary segments of the file out, things like that. I think it's difficult to do this completely at the AST level, although you should be able to get close. The problem is that the programmer will have to learn special functions to do what was otherwise easy to do in a free-form text editor (think of VIM vs. emacs).

Things you want to do will

Things you want to do will be as easy or even easier except if you want to work with something that is not logically an expression, for example copying 2)* out of (1+2)*3. You don't need more special functions than in a free form editor, but you do need to adapt the existing ones to work structurally. Copy/paste works the same. Selection with the mouse is the same, except that the selection snaps to an expression boundary. Insertion of new code is almost the same (just type what you want). Deletion deletes expressions rather than characters. Navigation with the keyboard will have a learning curve, and I'm not completely sure what the best structural navigation primitives are. Should the cursor be on an expression, or between them? What navigation commands should there be? Go to parent, go to right/left sibling? Is there good prior work on this?

No one has figured it out

No one has figured it out yet. Honestly, I think it's easier to start from free-form text editing and hack a compiler to provide the responsive feedback (e.g., through the tricks I described previously) than it is to start with a restricted editor and add functions to make it feel more flexible. The user will often make changes where you'll have to throw up your arms and give up for a while (e.g., b/c braces are unbalanced), but they will quickly bring the code back into a state where the compiler is useful again.

once you get the hang

Believe me, once you have used one such editor, you will never want to go back to free-form text editing. I have been using paredit-mode (an emacs mode for structural editing of Lisp-like languages) for more than three years. I needed to learn two to three special keys for moving S-expression nodes around (mostly reparent operations), but after that I notice how much more efficient this is than text editing. Now I can not write Lisp programs without it! I'm spoiled.

Imagine how you enclose the next node into the first ([] is the cursor):

(let ([x 1]) [])
(f x)

In free-form text editing, you need to: 1) delete the right paren after the cursor, 2) move the cursor to the space after (f x), and 3) type the closing paren.

With paredit-mode, you just type one key: C-right. (f x) will be swallowed by the first node. Notice how the AST changes: the node (f x) is reparented into (let ...). I found this key immensely useful. Similarly C-left will "barf" (f x) out of (let ...).
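
The operation itself is tiny once the buffer is a tree. A toy Python sketch (S-expressions as nested lists, the cursor's node given as a path of child indices; the hole in the example above is dropped for brevity):

    # Toy sketch of "swallow" (slurp) and "barf" on S-expressions as nested lists.
    def slurp_forward(root, path):
        """C-right: move the sibling after the node at `path` inside that node."""
        parent = root
        for i in path[:-1]:
            parent = parent[i]
        node = parent[path[-1]]
        node.append(parent.pop(path[-1] + 1))

    def barf_forward(root, path):
        """C-left: push the last child of the node at `path` back out after it."""
        parent = root
        for i in path[:-1]:
            parent = parent[i]
        node = parent[path[-1]]
        parent.insert(path[-1] + 1, node.pop())

    # (let ([x 1])) (f x)  -- cursor on the `let` form, i.e. path [0]
    prog = [['let', [['x', 1]]], ['f', 'x']]
    slurp_forward(prog, [0])
    print(prog)    # [['let', [['x', 1]], ['f', 'x']]]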

Observing this kind of interaction between the programmer and the editor and extracting the essential operations is the key to the design. Sometimes programmers think that they need things that they don't really need, until somebody shows them a better way. I'm glad somebody showed me paredit-mode three years ago :-)

Sounds analogous to Vim

But then many people like Vim, after they get over the learning curve that scares most people away from it. It sounds very interesting!

generation, not recognition

The difference is that now we are doing AST generation from the CFG instead of recognition from text. We only call recognition parsing.

not a parser

It's not a parser. The keys "1", "+", "f", "(", "x" are editing commands, not a string. It is not a delta parser either. You are not really typing text. The editor creates a large portion of the code (directly as AST nodes) for you.

The definition of a parser

The definition of a parser can be more general than something that converts strings into trees, but this definition is not very useful. At some point, we are always translating input actions into computation, so some parsing tech applies.

too general

"Translating input actions into computation" is too general. If we use this definition, it will be hard to find anything that's not parsing ;-)

Nonetheless, if I can obtain

Nonetheless, if I can obtain the parsed AST for a string by sending it one character at a time to your algorithm, I think it's hard to argue you're not parsing.

This is getting

This is getting philosophical :-) I guess it can be explained by thinking about the direction of transformation regarding CFGs, as I noted in another comment. The difference is that now we are doing AST generation instead of recognition. Normally only recognition is called parsing. Generation is trivial.

For standard programming

For standard programming languages, I assert that parsing is fine, and indeed necessary, because of the vast amount of legacy code in flat text files. For most complex uses, I agree that parsing seems to have fundamental flaws. As I noted in my conclusions, syntax directed editing seems to offer the only general solution for complicated language needs (notably composition). As your article notes at the end, there are various people working in this general area.

On a more philosophical note, the way we parse text - causing both its strengths and weaknesses - predates the Unix philosophy by at least a decade and, in my opinion, is only tangentially related to the fundamental thrust of your article.

I agree that parsing seems

I agree that parsing seems to predate the Unix philosophy. So maybe Unix has this philosophy because of parsing, or maybe the two have the same root. If we consider Turing machines, which operate on a tape containing strings, then this is really a prehistoric problem originating from natural languages :-)

much truth

There is much truth in this comment, IMHO. If it is really hard to parse a language mechanically (e.g. C, C++), then the language probably sucks for humans too.

There is also a lot of consideration for performance. How important is that? Not much. A well designed programming language will allow independent parsing of each file, and factoring large files into small ones. Together this implies parsing is fast because the programmer can make it so by factoring, and even faster because independence implies caching the parse result.

My Felix system has these properties, the parsing of the whole standard library is a bit slow, but it only has to be done once.

In fact, the parser itself is constructed dynamically when the compiler is launched from user defined grammar specifications (and of course the automaton is cached so it's only rebuilt when you change the grammar).

The system is a whole-program analyser, so it processes the whole standard library and user program each time, but it is still faster than the backend which compiles the generated C++, because the design of C++ (all those header files) just sucks despite separate compilation reducing the cost of code generation.

The parser I'm using is Dypgen, equipped with a bootstrap grammar which extends itself with the entire (rest of the) Felix programming language grammar.

What makes all this practical is the very point made above by yinwang0. I want an extension; if it leads to unexpected results I have a look, and sure enough the system is telling me there really is an ambiguity there that a human would probably have trouble with too.

This does not mean that composition isn't an issue: it is a vital one. The implication is that performance and ambiguity aren't nearly as much of an issue as you might think if you're only concerned with designing new, extensible, programming languages.

What we need is structure: at the minimum, scopes for non-terminals, parametric non-terminals, and of course a type system. At the moment grammars, as languages, are at the level of... well, approximately 1970s Basic.

Helvetia

There is some nice work being done by Lukas Renggli on embedding languages into a host language without breaking the tools:

http://scg.unibe.ch/research/helvetia