Pure and Declarative Syntax Definition: Paradise Lost and Regained, Onward 2010

Pure and Declarative Syntax Definition: Paradise Lost and Regained by Lennart C. L. Kats, Eelco Visser, and Guido Wachsmuth from Delft

Syntax definitions are pervasive in modern software systems, and serve as the basis for language processing tools like parsers and compilers. Mainstream parser generators pose restrictions on syntax definitions that follow from their implementation algorithm. They hamper evolution, maintainability, and compositionality of syntax definitions. The pureness and declarativity of syntax definitions is lost. We analyze how these problems arise for different aspects of syntax definitions, discuss their consequences for language engineers, and show how the pure and declarative nature of syntax definitions can be regained.

I haven't compared this version with the Onward 2010 version, but they look essentially the same. It seems timely to post this paper, considering the other recent story Yacc is dead. There is not a whole lot to argue against in this paper, since we all "know" the other approaches aren't as elegant and only resort to them for specific reasons such as efficiency. Yet, this is the first paper I know of that tries to state the argument to software engineers.

For example, the Dragon Book, in every single edition, effectively brushes these topics aside. In particular, the Dragon Book does not even mention scannerless parsing as a technique, and instead only explains the "advantages" of using a scanner. Unfortunately, the authors of this paper don't consider other design proposals either, such as Van Wyk's context-aware scanners from GPCE 2007. It is examples like these that made me wish the paper were a bit more robust in its analysis; the examples seem focused on the authors' previous work.

If you are not familiar with the authors' previous work in this area, the paper covers it in the references. It includes Martin Bravenboer's work on modular Eclipse IDE support for AspectJ.


Praying for the End of Time

It shows the merits of SDF syntax definitions. Did they implement a C parsing tool for those?

Did they implement a C

Did they implement a C parsing tool for those?

I think you've got your question backwards. The paper states that they had a C tool chain from the start, and that Bravenboer implemented a Java-specific tool chain, which was blocking adoption by some software engineers who, in my own words, had a "Must Be Java" attitude.

The current version of the code at syntax-definition.org has a broken download link, unfortunately. Bad timing for those guys; they get a plug and the site doesn't work.

Why?

I think you've got your question backwards.

Why? My opinion is that they shouldn't complain about people using tools but just do their job and supply better tooling if they think they can.

They're not complaining

Where does the paper sound like whining?

It sounds to me like they are describing legitimate pains that software engineers inflict upon themselves, and then arguing, "Hey, it doesn't have to be this way."

If anything, the paper could have been more practical, but it would have risked sounding whiny; e.g., they could have said something to the effect of "Eclipse JDT's approach sucks! Only mortal fools would design an IDE language framework this way! Blasphemy to them." In other words, there are plenty of real-world case studies they could have presented, such as Eclipse JDT, the ways various plug-in authors need to integrate with it, and the nightmare that causes, which suggest the approach in Eclipse JDT is fundamentally flawed and monolithic. However, talking about JDT would be beyond the scope of the paper, since it is really just about modular syntax definition. Also, it is not good to slam people who fund you. :)

Well

They made a point about a trend towards practical but suboptimal solutions for building parsers, and they show that it is possible, in a formalism like SDF, to define syntax that is not hampered by practical concerns.

Personally, I wish they had bitten the bullet and implemented better parsers for C and/or Java/C#. The point being that practical concerns are often the real concerns of software engineers.

Not that SDF isn't a nice formalism, but the proof remains in the pudding.

Actually

Actually, there is already an SDF definition for Java's syntax (available here). The authors of the paper used this, e.g., for compiling WebDSL to Java.

The SDF for Java has also been used for embedding DSLs: MetaBorg.

JSGLR

Hmm, great link. I found something else which I think is more in line with what I was expecting: not a definition of Java in SDF, but a replacement for flex/yacc/bison/... tools that generates Java. MetaBorg might fit too; it's a bit hard to see from the prose whether it is a compiler-compiler system.

(That was why I asked; it's a bit of historical curiosity. I once worked a bit with an old version of ASF/SDF in the nineties, before it was even bootstrapped, and ditched it then because it lacked proper semantics support. That is, I was used to compiler compilers, most of which had sufficient support for semantic analysis through attribute rules. I remember a friend of mine abusing the ASF/SDF system by opening several windows and doing the semantic analysis by rewriting the terms in those. That got him his master's dissertation in a year or so, but at the expense of a tool no one in their right mind would touch. So I wonder how much of a compiler compiler it is now.)

Shared many sentiments

Having failed to adapt ANTLR to a really big grammar, I ended up with my own parser engine. Here are some extracts from the accompanying whitepaper:

Parsing is a mature technology. The field flourished with ideas in the 60s. At that time computer power was quite limited, which is why the practical emphasis was on efficient parsing methods. LL(1) and LALR quickly became standard, both having linear parsing time. Still, many grammars, such as C's, escape the relatively narrow class of grammars for which there is a standard linear-time parser. The crux of the problem is the ambiguity of certain language constructs, such as if-then-else. The SQL and PL/SQL grammars are even more ambiguous, because keywords are allowed as identifiers. There are various hacks, usually presented as satisfactory solutions from the end-user perspective, but our general philosophy is that manually transforming the grammar into LL(1) or LALR form is inferior to leveraging a more powerful (but slower) parsing method.

Parser engines are already evolving in this direction. The popular Bison parser has been extended to implement the GLR method, which handles arbitrary context-free grammars. Microsoft's Oslo framework for building Domain Specific Languages (DSLs) is based on the GLR parsing method as well (this article was clearly written before Oslo's demise :). Likewise, the ANTLR parser engine has evolved to accommodate more general grammars. In version 3 it leverages LL(*) parsing, which admits some grammar ambiguity. Yet LL(*) parsing is weaker than GLR, so ANTLR is still not powerful enough to handle arbitrary context-free grammars.

Both methods -- GLR and LL(*) -- extend an algorithm that originally had linear parsing time. Presumably this won't significantly affect performance in practical cases where there is little ambiguity. In the worst case, however, such an approach can result in exponential time.

There are at least two more methods that can also handle arbitrary context-free grammars: CYK and Earley. The CYK parsing algorithm (http://en.wikipedia.org/wiki/CYK_algorithm) was invented as a proof that an arbitrary CFG can be parsed in time cubic in the size of the input. It is a remarkably simple idea, as the core of the algorithm essentially contains only 4 nested loops. There is no automata theory, no states, no backtracking. The method also provides a clean prescription for handling parse failure.
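
To make the "only 4 nested loops" remark concrete, here is a minimal CYK recognizer sketch in Python. It is not the whitepaper's engine; the grammar encoding, function name, and toy example are illustrative assumptions, and it expects a grammar already in Chomsky normal form:

    def cyk_recognize(tokens, unary_rules, binary_rules, start):
        # unary_rules: list of (A, terminal); binary_rules: list of (A, B, C).
        n = len(tokens)
        if n == 0:
            return False  # a CNF grammar needs a special case for the empty string
        # table[length][i] holds the nonterminals that derive tokens[i:i+length].
        table = [[set() for _ in range(n)] for _ in range(n + 1)]
        for i, tok in enumerate(tokens):
            for lhs, terminal in unary_rules:
                if terminal == tok:
                    table[1][i].add(lhs)
        # The nested loops over span length, start position, split point, and
        # rules are what make the worst case cubic in the input length.
        for length in range(2, n + 1):
            for i in range(n - length + 1):
                for split in range(1, length):
                    for lhs, left, right in binary_rules:
                        if left in table[split][i] and right in table[length - split][i + split]:
                            table[length][i].add(lhs)
        return start in table[n][0]

    # Illustrative toy grammar: S -> A B, A -> 'a', B -> 'b'
    # cyk_recognize(["a", "b"], [("A", "a"), ("B", "b")], [("S", "A", "B")], "S")  # True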

CYK never matches traditional parsers in terms of performance. While experiencing slow parsing times on large blocks of PL/SQL source, I went on and investigated the Earley method, which is often claimed to have better performance than CYK. My experience, however, was the opposite. In addition, it is not clear how to handle parse errors in the Earley method. This, together with a proprietary CYK optimization technique (covered in subsequent sections), elevated CYK to the winner's position...

Practical Earley Parsing

Have you read Aycock's "Practical Earley Parsing"? Even with an implementation written in Python, I've not experienced significant problems with parsing speed.

I'm aware of this paper,

I'm aware of this paper, although I'm not sure of its impact. Both Earley and CYK need empty productions to be eliminated, which expands the number of rules by a small factor. This small factor is much less of a problem, IMO, than cubic performance in the length of the source. The paper's conclusion -- "twice as fast as standard Earley" -- is not good enough from the aforementioned cubic-complexity perspective.
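
To make that preprocessing step concrete, here is a sketch of empty-production (nullable) elimination in Python. It is an illustrative assumption, not the construction from Aycock's paper: each nullable occurrence in a right-hand side is either kept or dropped, which is where the growth in rule count comes from.

    from itertools import product

    def eliminate_empty(rules):
        # rules: set of (lhs, rhs) pairs, rhs a tuple of symbols; () is an empty production.
        nullable, changed = set(), True
        while changed:
            changed = False
            for lhs, rhs in rules:
                if lhs not in nullable and all(s in nullable for s in rhs):
                    nullable.add(lhs)
                    changed = True
        new_rules = set()
        for lhs, rhs in rules:
            spots = [i for i, s in enumerate(rhs) if s in nullable]
            # Keep or drop each nullable occurrence; skip the resulting empty right-hand sides.
            for choice in product((True, False), repeat=len(spots)):
                dropped = {p for p, keep in zip(spots, choice) if not keep}
                new_rhs = tuple(s for i, s in enumerate(rhs) if i not in dropped)
                if new_rhs:
                    new_rules.add((lhs, new_rhs))
        return new_rules  # note: a nullable start symbol needs separate handling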

One of the big reasons for my affection for the CYK method is its intuitive visualization. The CYK matrix hints that the complexity of the method is at least quadratic (there are O(n^2) cells), and it takes just a little more effort to realize why it is actually cubic (each cell considers O(n) split points). Then, if we find a way to skip computations in some matrix cells, this can be reflected visually on the matrix -- hence a tool that visualizes the CYK derivation is valuable, and I'm not aware of anything comparable in the Earley world.

BTW, the "Paradise Lost" paper contains an intriguing reference to GLL method which claims performance between n and n^3...

Why go for quadratic or cubic performance?

Most compilers need to be fast to gain widespread adoption. The gcc adage is not to allow, in general, quadratic or worse algorithms in the compiler. (One of the few exceptions is inlining optimization, and people do complain about the perceived slowness of that.)

Although parsing is usually a minute part of the time spent in a compiler, there seems to be little incentive to go with a cubic parser.

Earley parser normally has

An Earley parser normally has linear complexity, just like LL or SLR parsers. It has cubic performance only in the worst cases, where the grammar is so ambiguous that linear parsers cannot handle it at all. If you have to work with such grammars, cubic algorithms are the best performance you can get.
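
A minimal Earley recognizer sketch in Python may make this concrete (the grammar encoding and names are illustrative assumptions, and empty productions are assumed to have been eliminated). On an unambiguous grammar the chart sets stay small and the work per token is effectively constant, which is where the usual linear behaviour comes from; a highly ambiguous grammar inflates the completer's work towards cubic.

    def earley_recognize(grammar, start, tokens):
        # grammar: dict mapping a nonterminal to a list of alternatives,
        # each alternative a tuple of symbols (nonterminals or terminals).
        n = len(tokens)
        # chart[i] holds items (lhs, rhs, dot, origin).
        chart = [set() for _ in range(n + 1)]
        for rhs in grammar[start]:
            chart[0].add((start, rhs, 0, 0))
        for i in range(n + 1):
            changed = True
            while changed:  # run predictor and completer to a fixed point
                changed = False
                for (lhs, rhs, dot, origin) in list(chart[i]):
                    if dot < len(rhs) and rhs[dot] in grammar:
                        # Predictor: expand the nonterminal after the dot.
                        for alt in grammar[rhs[dot]]:
                            if (rhs[dot], alt, 0, i) not in chart[i]:
                                chart[i].add((rhs[dot], alt, 0, i))
                                changed = True
                    elif dot == len(rhs):
                        # Completer: advance items waiting on this nonterminal.
                        for (l2, r2, d2, o2) in list(chart[origin]):
                            if d2 < len(r2) and r2[d2] == lhs and (l2, r2, d2 + 1, o2) not in chart[i]:
                                chart[i].add((l2, r2, d2 + 1, o2))
                                changed = True
            if i < n:
                # Scanner: consume the next token.
                for (lhs, rhs, dot, origin) in chart[i]:
                    if dot < len(rhs) and rhs[dot] not in grammar and rhs[dot] == tokens[i]:
                        chart[i + 1].add((lhs, rhs, dot + 1, origin))
        return any(lhs == start and dot == len(rhs) and origin == 0
                   for (lhs, rhs, dot, origin) in chart[n])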

cubic performance

The biggest problem with stuff like cubic parser performance is not with the parsers, but with the really bad overall design of languages like C, where the load on the parser is proportional to the complexity of the program.

It does not have to be this way and isn't in most modern languages like Ocaml and Felix (and probably Java, Python, and other languages too).

In such languages, each unit of text is parsed exactly once and, with suitable caching and build tools, is only reparsed when the text changes.

In Felix (but unfortunately not Ocaml) it is possible for the programmer to break code down into more or less arbitrarily small units of text (files), thereby bounding parsing time even for a cubic parser.

Therefore I submit that, given a language specifically designed to allow each file to be parsed independently of all others, and the ability to decompose code into more or less arbitrarily small files, the performance of the parser, even for the ordinary case, is really not a significant issue.
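
As a concrete illustration of that point, here is a small content-addressed parse cache sketch in Python. It is not Felix's or Ocaml's actual machinery; the cache directory, the parse_file callable, and the use of pickle are all assumptions, and the parse trees are assumed to be picklable.

    import hashlib
    import os
    import pickle

    CACHE_DIR = ".parse_cache"  # hypothetical cache location

    def parse_cached(path, parse_file):
        # Parse `path` with `parse_file`, reusing a cached tree if the text is unchanged.
        os.makedirs(CACHE_DIR, exist_ok=True)
        with open(path, "rb") as f:
            text = f.read()
        key = hashlib.sha256(text).hexdigest()
        cache_path = os.path.join(CACHE_DIR, key + ".pickle")
        if os.path.exists(cache_path):
            with open(cache_path, "rb") as f:
                return pickle.load(f)            # cache hit: no reparse
        tree = parse_file(text.decode("utf-8"))  # cache miss: parse this file once
        with open(cache_path, "wb") as f:
            pickle.dump(tree, f)
        return tree

With such a scheme, even a cubic parser only pays for the files whose text actually changed, and small files keep each individual parse cheap.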

Beautiful trees

I find it somewhat hilarious how every researcher in the field seems to have an agenda and tries to push his own favourite approach like a merchant or a preacher, even if the latter is meant to be funny.

BTW who is interested in "beautiful trees"? Surely not programmers who just retrieve data from them.

Practitioners in the field seem to have an agenda

and try to label every researcher as an idealistic, dogmatic, agenda-pushing blowhard. :)

SGLR, by the way, was initially rejected as a paper when Eelco was a student. One of the things he mentions in his blog is that it was rejected in part because it lacked practical applications. Unfortunately, his blog is currently down. Those guys at Delft are really unlucky with the free plugs this week. That's a shame, too, since I thought Eelco's blog entry reflecting on his Ph.D. thesis was a really well written retrospective, and very few people take the time to write these sorts of things.

It seems rational to, after

It seems rational, after working on something to address a problem, to advocate using it for said problem. It's more concerning when that doesn't happen. Take theoretical physics, or at least the nuclear-era stuff: the best theoreticians were being pushed by experimenters ('explain this surprising or bad result!') and vice versa ('please test this!').