Paul Vick: Language Design & Paying the UI "Tax"

Weâ€™re at the point now with LINQ where weâ€™re really starting to take a serious look beyond the language features, cool as they are, and pondering how these features are going to integrate with the UI. (Itâ€™s not like we havenâ€™t thought about it at all, we have, but now weâ€™re really taking a look at things end-to-end.) And so we started by coming up with a â€œlanguage UI taxesâ€ list that Iâ€™m shocked we havenâ€™t put into a more concrete form until now. The various taxes that a language feature is going to have to pay include, but are not limited to...

What follows is quite a long list of things you have to think about designing the IDE.

I am one of those that think it is [o]ne of the wonderful things about working on a compiler... that... [y]ou donâ€™t have any UI, but that doesn't mean, of course, that IDEs aren't here to stay...

We still have a long way to go, I think, in terms of understanding the subtle relationship between language features and IDEs (and vice versa). I wonder when this issue will enter the PL curriculum.

Comment viewing options

IDE need a part of the compiler

As far as I understand it, IDEs need to access directly some features of the compiler, such as typing. For instance you can't provide Intellisense for a type inferenced language without reproducing the inference algorithm itself.

The compiler can still output an XML or other data telling which variable has which type, but if you want to do Intellisense while typing you can't send a syntax-erronic program to the compiler. Is there known ways to deal with this ?

The way I'm hoping this probl

The way I'm hoping this problem will be solved is by representing programs in a more expressive format than ASCII. The IDE does parsing (or else only lets you enter programs in a structured way to begin with), the compiler handles code generation, and they both share a type-checking library that works on pre-parsed code. Sections of the abstract syntax tree can be marked as invalid, which causes the type-checker to skip over them. Even the compiler might choose to ignore invalid sections of code: The Eclipse IDE can ignore syntax or type errors in Java code, simply raising a runtime exception if the code would have executed.

compiler writers should think about IDEs

I'm currently integrating the Scala compiler with a Scala Eclipse plugin. One of the problems I've found is that the compiler has often thrown away a lot of presentation-useful information too early for convenience of implementation.

For example, a compiler might inline final primitive fields at or before its type checking phase. For a compiler that just generates code, this is fine. However, in an IDE, we need information about the uses of the fields for highlighting and refactoring purposes. All in all, it helps if a compiler writer thinks about the IDE from the very beginning, otherwise the compiler will have to be drastically re-architected later when it is used in an IDE.

As far as intellisense goes, this is something we can't handle yet. I hope this won't be very hard. My plan is to improve the compiler's error recovery capabilities so it can perform comprehensive type checking in syntacitcally ill-formed files. Brace auto-completion can help this along (if braces are all balanced, error recovery is easier). A little bit of heauristic semi-colon inference also helps.

In the long run, I think a specialized compiler that can re-compile incrementally in the foreground thread would be very useful. The compiler can then re-compute changes before the user types another key, which can make intellisense much more useful. Because it runs the compiler in a background thread, intellisense can be really annoying in large files where the background thread catches up way too late. Incremental techniques in Harmonia doesn't seem to solve this problem, although this might only be because of a flaky implementation.

Building a foreground presentation compiler requires a complete change in compiler construction techniques. For example, parsing according to a grammar is not very practical, and heauristic parsing is probably needed instead. I'm building such a beast in my spare time if anyone is interested in chatting about it.

It's also arguable that a ver

It's also arguable that a very _explicit_ language like Java is easier to manipulate. There simply aren't any complex interactions to worry about. That's nice for compiler and IDE writers, but perhaps not so nice for programmers.

Scala provides many ways in which changes can have significant impacts across programs, and type inferencing can alter the picture dramatically. If every part of the AST keeps its list of dependencies and the terminating dependencies point all the way back into source code, to the file and ranges of character positions, it becomes workable, I guess. Editor changes can be matched against ranges, and marked as dirty. But at that point you need a localized compilation model, and the "phasing" stuff just doesn't work. Everything needs to be able to recompute itself in-situ, as needed...

Well, good luck. :)

Scala

Thanks!

Especially with type inference, I often get easily lost when reading Scala code written by someone else. So IDE support is essential for Scala, as it can provide type inference information to a code reader. The compiler I'm working with OK, but is lacking in some areas. For example, symbol offsets are not stored consistently, so I have to do some amount of re-lexing to figure out what offset a symbol is actually at, and this will completely break for type aliases (writing "Int" as "int"). After the first release (soon after we quash a major resident compilation bug), we are probably going to have to re-architect some parts of the compiler to get complete information (necessary when we add refactoring support).

Right now we are going with "run the batch compiler in resident mode." Separately, I'm working on a completely incremental compiler for another language I've been working on. You are right, the "phase" model has to be abandoned. Instead, phases are replaced by a stack of transformations over the text buffer. When the text buffer changes, the update can be propagated up the transformation stack.

I'll go with "use immutable/f

I'll go with "use immutable/functional data structures that share as much of the prior representation as possible but permit easy rollback and tentative compilation with failures". I'll add in "simultaneous outside in and inside out compilation, coupled with sincere hope that they meet in the middle".

use signals :)

I'll go with use declarative mutable data structures that transmit changes in their state using signals. I'll also go with sticky transformations that preserve previous valid computed values when the current computed values are invalid.

When we are finished, we can compare our approaches and see which one is more effective :)

Alternative (complementary) angle

You can extend the compiler, but you might also want (or need) to extend or modify the language and its syntax.

For one, extending the syntax allows you to make explict things that otherwise you'd need the compiler to figure out. By making them explicit and easily recognizable you make the IDE's job easier.

More importantly, you need to be able to figure out things by local analysis since (1) the rest of the system may not have been written yet and (2) you want to support separate compilation. Examples: The use of package specifications (i.e., module signatures that can be compiled independently of the implemenation), import clauses establishing the relationships between modules.

Notice that these issues aren't new. Hoare, by the way, writes about them in his hints on language design, IIRC.

Would these issues exist if the compiler was a dll?

If the compiler was one or more dlls that any program can re-use, would the issues between the IDE and the compiler exist? I think not. The problem then can be categorized as a software re-use problem, not an IDE/compiler problem.

reuse is more than a deployment issue

Compilers are generally written to be used in batch mode operation, where the only incremental feature is separate source file compilation mode. Additionally, compilers often are not designed to preserve semantic information in a way that can be mapped back to the original source code.

So making a compiler a "component" would not solve the major problem. Instead, a compiler has to be re-architected to support the needs of an IDE, which involves a bit more work beyond just reporting errors and generating code.

You did not understand.

When I said that the compiler should be one or more dlls, I meant that the compiler should come in a form of reusable components that provide the necessary APIs that make up the compiler. These components could then easily be used by any IDE.

It is totally irrelevant that a compiler is used in batch mode.

Easier said than done!

According to my personal experience, building reusable compiler components that are easily reused in an IDE requires a significant amount of special effort.

Can't just componentize existing compilers

Well, yes, but it would take quite a bit of rearchitecting to split up any existing compiler - or adapt the textbook research - into a set of DLLs. It might be feasible for a new compiler, but adapting an existing one would be basically as tough as writing a new compiler from scratch.

It's a lot more than just separating out the lexer, parser, typechecker, and code generator into different components. A lot of the design assumptions of traditional compilers need to change, for example:

• The parser has to be able to work incrementally, on small chunks of program, and operate with minimal pauses
• The parser and lexer can't bail out in the event of a syntax error - they have to recover and continue parsing as much of the program fragment as possible
• The typechecker has to flag-but-allow inconsistent types, since they may be corrected lately. If it's a type-inferencing language, the typechecker has to be able to recover and figure out a sensible type to assign to the expression so it reports the offending expression as wrong and not the whole rest of the program
• The compiler has to deal with missing variable or function references, because they might not have been written yet
• Line/file info has to be propagated throughout the program so errors can be easily marked. Most compilers do this already, but an IDE has to worry about these line numbers changing for already-compiled code fragments because the file has been edited
• The code generator needs to maintain a map of generated instructions to source lines, for the debugger
• The compiler has to be able to operate in a background thread, and be interrupted with events from the UI thread that might need it to restart its operation. Thread-safety issues suddenly become important
• The language needs to allow the compiler to quickly decide which methods are legal in a given expression, for autocompletion. Many dynamically-typed languages fail in this regard - Python IDEs like Wing do a valiant attempt at type and scope analysis for this, but they still make enough errors that autocompletion is a real pain to use.
• The compiler needs to generate a navigable project database that can be updated in real time yet searched quickly, sometimes for tens of thousands of classes.

I've been dealing with some of these issues at work, working with the Netbeans and Eclipse IDEs, and it's amazing some of the hacks IDE writers use. Many have hand-written lexers used for syntax coloring and building projects databases - presumably using a generated one or a full parse-tree is too processor intensive for a realtime environment. Autocompletion isn't based on your position in the parsetree - rather, they scan for the dot, semicolon, or "Class" and use that to determine whether you're completing a method, type, or class declaration.

So some compilers are not designed properly...

A compiler with the problems you mention is simply not designed properly. The first thing one learns in CS is "separation of concerns". If the tasks of a compiler can not be easily isolated, then it means that the program's design was flawed.

Integration of concerns

A lot of the problems are more like a lack of "integration of concerns". The traditional compiler architecture actually does a pretty good job of separating concerns - when I did my compiler design course, the scanner, lexer, parser, typechecker, intermediate representation, and code generator were each separate projects, and we needed to make minimal modifications to the previous phase for the next one.

Many of the problems I mention involve propagating info from one phase to another, a clear violation of separation of concerns, yet necessary for usability. For example, the scanner info (line/file number) needs to be propagated to the code generator and maintained for the debugger. The parser info needs to be available to the refactoring browser, and the refactoring browser needs to validate its changes with the typechecker.

Such cross-linkages in a batch compiler would be terrible design, but they're necessary in an IDE, because all that information gets shown and used by the user.

Also, the individual compiler components need to be fast, thread-safe, and reentrant, a requirement that batch compilers don't have. We all know the difficulties of writing concurrent software. That machinery is just dead-weight in a batch compiler, but is very important for a responsive IDE.

I'm convinced it's possible, but the design requirements are very different than for batch compilers. It's unfair to characterize them as "badly designed" for this: they're well designed for what they were intended for, but they were never intended to be used in an incremental, interactive way.

Not really.

Many of the problems I mention involve propagating info from one phase to another

Nothing stops compiler writers to have functions return the proper information.

a clear violation of separation of concerns

No, using a function that does one thing to get information, then passing that information to another function is not violation of separation of concerns.

For example, the scanner info (line/file number) needs to be propagated to the code generator and maintained for the debugger

That's not a problem of separation of concerns. That's a problem of returning the proper information.

Such cross-linkages in a batch compiler would be terrible design

What happened to returing data? has it been disallowed? :-)

the individual compiler components need to be fast

since a parser (for example) needs to store its information somewhere, and that information has to be kept, the actual container of information for the parser should be user-defined, not hardcoded into the application. But that does not affect the compiler's speed.

I don't see how that is a problem. If a parser (for example) does not have global variables, then thread safety is there without any extra effort.

and reentrant

Same goes for reentrant.

a requirement that batch compilers don't have

Because they are not provided as a set of APIs, but as a tool with hardcoded global variables.

We all know the difficulties of writing concurrent software

I still do not understand what do you mean about concurrency and compilers. If I had a compiler lib, I would say to it: "here is my source code, compile it, and give me back type info, error list and bytecode" for example. And all that in a worker thread.

they're well designed for what they were intended for, but they were never intended to be used in an incremental, interactive way

In other words, the design can not be changed because there are lots of global variables spread all over the program...a sign of a bad design.

Not a sign of bad design...

a sign of highly-optimized design. The highest priority of production compilers (after correctness) is throughput , measured in terms time required to compile X thousand files. The highest priority of IDE parser components (after correctness) is speed of incremental reparsing, measured in terms of responsiveness to keystrokes. These are completely different performance requirements, and require completely different architectures. Compilers normally use a table generated parser, which provides good throughput. IDEs normally use some finely hacked version of a recursive-descent parser, which provides both good responsiveness and better understanding of context.

In theory one could design a parser in the way you describe, but it would probably be too slow to use in a production compiler or too unresponsive to use as an IDE. Most likely both. Worse, since all of these performance requirements are relative to your competition, Moore's law won't save this design.

Allow me to say that I disagree.

Throughput does not have to do anything with having global variables. The same throughput can be achieved by passing the compiler's context as a parameter to a compiler API.

IDEs can use parse tables, just like compilers use them.

It's all a matter of design.

IDEs cannot use traditional table-based parsers because traditional table-based parsers don't do incremental parsing.

"Throughput does not have to do anything.."

Throughput does not have to do anything with having global variables.

In the case of a batch compiler, throughput is gated by one extremely important global variable: the file system. It takes significant coding to optimize access to that global variable, and unsurprisingly that affects the parser code.

In the case of an IDE, throughput is gated by a half dozen global variables, including the edit buffer(s), the global parse tree cache, the global symbol resolution index, the file system, and the version control system. All of these variables need to be updated in concert, if not necessarily in complete synchronization, and these updates need to happen no less often than every few keystrokes. It takes significant coding to optimize access to these global variables, and unsurprisingly that affects the parser code.

Now it is certainly possible to abstract the parsing out, and have an parser library which takes sets of streams. No one does this, because the resulting systems are too slow to be commercially viable.

IDEs can use parse tables, just like compilers use them.

Sure, but they would be to slow to be usable.

It's all a matter of design.

No, it's a matter of system architecture. In this case system architecture places constraints on design that you aren't taking into account.

In theory, there's no difference between theory and practice. In practice, there is.

This isn't about separation of concerns, it's about data propagation. This imposes additional requirements on the tasks above and beyond being usable separately.

This is precisely backwards

The necessary effort to make language features work in an IDE shouldn't be thought of as a "tax" to be paid, but as an "investment", whose returns can be maximized.

To maximize the return, you would look for ways to design your language features for a usage environment in which all developers have modern IDEs. Also, you would look to avoid features whose presence makes implementation of modern IDEs difficult.

This assumes that one's purpose in designing computer languages is to improve the actual practice of programming. If you've got other purposes, you're probably better off sticking with text, and ignoring IDEs completely.

Political semantics

The necessary effort to make language features work in an IDE shouldn't be thought of as a "tax" to be paid, but as an "investment", whose returns can be maximized.

Same thing—in theory, a tax brings returns, too. The only real difference between a tax and an investment is who decides whether you're going to pay it. So, if you're writing a compiler, and you want IDE integration, then you consider the work it takes to be an investment. On the other hand, if you don't care about IDE integration, but you believe that you have to support it (say, if you think people won't use the compiler otherwise), then you consider it a tax.

Emacs was the "UI" tax, plus add math

Many languages are simply not feasible to use without emacs. I don't know of any serious Lisp or Scheme projects that could have been approached without it, just as a f'rinstance, so I kind of see this as an old if unspoken topic. Language developers still spend a huge amount of effort on elisp files. It's a real tax.

On making a richer syntax, it's well past high time. Let's at least get simple greek, subscripts, superscripts, and binary operators into it. TeX has shown us that it's really not all that hard to keyboard 'backslash-lambda' (though it may be challenging to remember exactly how to quote the backslash in this or that environment).

Viaweb

Paul Graham wrote somewhere that his Lisp development style is to edit in vi and copy-and-paste into the REPL. I know a few prolific programmers with similarly "eccentric" habbits. This makes me suspicious of the true value of these features that I "couldn't live without" and even of things like being able to type fast.

That said, I can't help pulling faces when I watch most people use their computers :-)

I'm willing to bet programmer

I'm willing to bet programmers prolific in vi would be even more prolific using a modern IDE.

yes and no

Modern IDEs do a wonderful job of things that vi will never do (not even vim), but most modern IDEs are terrible at the actual editing of the text itself. I'm eagerly awaiting an IDE with a vi text editing interface. One day, out there, beneath the pale blue sky.... :)

Done!

IDEA currently has a vi editor mode, and I believe Eclipse does as well. I haven't personally used these (I tend to mistrust vi emulation modes), but I've heard they're quite good, and I know people who use the IDEA vi mode very extensively.

The Different Ways to do it

Several different ways to handle this, come to mind:

1) The tool duplicates the compiler, at least the front end (which is the part easiest to duplicate). Works for any language or compiler, even those without built-in IDE support. Lots of duplication of effort, though. Helps if there is an open-source implementation of the language around which can be used.

1a) A generic protocol/interface is deviced for IDE parsing support. Implementations of this can be targeted at many different languages and tools.

2) The language includes, as part of its definition, libraries to support the translation/compilation of itself. Great for monolingual environments; not so good for a generic environment which needs to support many languages beyond the one(s) it was written in.

3) Compilers are encouraged (and a mechanism is published) to provide appropriate hooks for development tools to get additional information (parse trees, other internal phases) about the compiler's behavior. This puts much of the work on the compiler writer rather than the IDE writer. May allow companies with proprietary compilers to hook into a third-party IDE, without revealing source code. (Of course, the primary vendor of proprietary compilers used in IT, would prefer that you use Visual Studio and not Eclipse or some other third party tool....)

Most of these usually require

Most of these usually require the code be compilable, and carry high overhead (full compile). However, do check out the ASIS standard for Ada...

Wirth

Thinking of trade-offs between language, compiler, IDE, etc makes me think of Niklaus Wirth's book Project Oberon. He designed and implemented a computer, programming language, compiler, and operating system with very explicit trade-offs between parts. Sucks that the cheapest second-hand copy I can find is always around \$100, because it's a book worth owning.

available from: http://www.oberon.ethz.ch/books.html

Hi,
you can get the E-version from
http://www.oberon.ethz.ch/books.html

http://www.oberon.ethz.ch/WirthPubl/ProjectOberon.pdf

Heiko

Incidentally...

I don't really have much to add here, but I think this is a really, really interesting topic and area of work. An increasingly large amount of the standard compiler architecture is being invalidated by the rise of IDEs, and I think there's a lot of room here not only for technical innovation, but also for some pretty heavy-duty curricular innovation. I'm excited to see the first generation of compiler texts that really account for these new requirements.

(Not to mention the impact on language design, which is really the most interesting part.)

Quite

Notice that I wrote I wonder when this issue will enter the PL curriculum, the emphasis being on when, since I am sure this is going to happen sooner or later.

(Not to mention the impact on language design, which is really the most interesting part.)

Obviously :-)

One of the most important thi

One of the most important things about this curricular change is that it mirrors the general trend in software development away from black-box batch processors and toward APIs, plugin architectures and frameworks. I think it will have a beneficial effect on both sides to really have this stuff fully integrated into the college curriculum.

On the one hand, the smart folks working on research compilers will have a hand in improving the state of the art for frameworks and APIs (the excellent work in "Scalable Component Abstractions", for example, was motivated in part by the desire to implement an IDE-friendly Scala compiler), and on the other hand, students will hopefully have an opportunity to see more in-depth case studies of component development, and writing software in this way will come to seem more natural.