The AST Typing Problem

A few weeks ago I was visiting Mark Jones, and feeling particularly foolish I asked him what people do about "the AST typing problem" in a language like Haskell or SML. To my great surprise, he answered that he knew of no clean solutions. I'm curious whether a bunch of people with lots of PL-fu can identify some solution that we have failed to see.

The "AST typing problem" is a problem in static typing that consists of three parts:

  1. We intend to build an AST in the usual way. New information is added by several phases - most notably by the symbol resolver and the type checker - but we are required by strong typing to provide dummy initializers for fields associated with later phases or to make heavy use of "option" (or equivalent), with resulting code cruft (a minimal sketch of this appears just after this list). It seems that we cannot build a series of successively extended AST types for each phase, both because the data structure is recursive and cyclic and because later passes rely on being able to re-run the various decorating phases to re-generate (e.g. by type inference) information on newly created AST nodes. How can we exploit strong static typing without making this code unmaintainably ugly?

  2. Within the parser, we get tree fragments from sub-parses and compose these as we work our way up the tree. A given sub-parse usually returns an AST of known node type, or in a few cases (e.g. expression) one of a very short list of possible node types. It seems that we should be able to express the fact that there is no need, in most cases, to perform union tag dispatch on such nodes, but there does not seem to be any evident way to do that. Can this be expressed in Haskell or ML in human-usable form?

  3. Each pass of the compiler essentially proceeds in recursive-descent form over the ASTs. In a given pass, many node types will receive the same treatment (typically an oblivious recurse), but most safe languages provide no means to handle more than one union leg type in the union dispatch syntax. For the remaining AST types (the ones we handle in detail), each child generally has a well-known type or must be one of just a few node types; we would like to exploit this knowledge to avoid unnecessary union dispatch. This is, in some sense, the dual of part [2]. Once again, how can this be expressed in Haskell or ML in human-usable form?
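
For concreteness, here is a minimal OCaml sketch of the kind of cruft part [1] describes; the node and field names are invented for illustration, not taken from any real compiler.

(* Hypothetical sketch: fields owned by later phases must be given dummy
   values (or wrapped in option) when the parser builds the node. *)
type ty = TyInt | TyBool | TyFn of ty * ty

type ast = {
  kind : string;                (* which syntactic form this node is *)
  children : ast list;          (* filled in by the parser *)
  mutable sym : symbol option;  (* filled in later by the symbol resolver *)
  mutable typ : ty option;      (* filled in later by the type checker *)
}
and symbol = { name : string; mutable def : ast option }

(* every downstream use of [typ] now pays for a dispatch on the option *)
let node_type a =
  match a.typ with
  | Some t -> t
  | None -> failwith "type not yet inferred"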

Note that I'm looking for techniques here that are practical in a serious commercial grade compiler. Solving the problem at a high cost of indirections, e.g. by building a map of some sort, is not a production-grade solution; we're looking to compile at a rate very close to 100,000 lines of source code per second. Similarly, solving the problem by unifying the symbol resolution and type inference passes would compromise maintainability.

It's hard to imagine that this isn't well-trodden ground, but if Mark doesn't know, and I don't see it, it seems like it's at least a question worth posing. Dependent types can certainly handle some of this problem, but I'm really looking for clarification on how to do this in non-dependent type systems.

I should perhaps add: the first issue may not be apparent in current Haskell or ML compilers. The data structure presumed is unnatural in a functional language, so implementers building compilers in these languages have likely selected other ways to handle the evolution of AST structures across phases. I'm interested in those solutions, but I would expect (naively) that they fail the compile speed goal due to memory indirection overhead. I would be very interested to learn otherwise!


I would suggest using Scala

I would suggest using Scala (I am not sure how this fits with 100,000 lines/sec, though :-)) instead of Haskell or ML. You can use case classes to represent the node types. Via subclassing (each case is a type in its own right because it is a case CLASS) you can handle the situation where different cases are similar enough to appear in the same places.

So the crucial difference is the existence of "case classes" in Scala. Also, because "null" is a valid value of any class, you can have your dummy initializer without clumsy typing.

Here is an excerpt from the AST structure for Babel-17 programs:

...
abstract class Statement extends Term
...
case class SVal(pat : Pattern, e : Expression) extends Statement
case class SAssign(pat : Pattern, e : Expression) extends Statement
...
abstract class Pattern
...
case class PInt(value:BigInt) extends Pattern
case class PBool(value:Boolean) extends Pattern
...
abstract class Expression extends Term
case class ESimple (se:SimpleExpression) extends Expression
case class EBlock (b:Block) extends Expression
case class EWith (se:SimpleExpression, b:Block) extends Expression
...

Does Scala benchmark at 100,000 lines/sec?


I have done no measurements

I have done no measurements in this regard, but if Haskell can, I guess Scala can, too.

I think the Scala compiler

I think the Scala compiler was in the 10k lps range or less the last time I looked (a few years ago to be sure), while performance depended wildly on what type system features you were using. For comparison, C# and Java are in the 1 million lps range and that's pretty consistent. But there is less code to compile, so it doesn't feel that bad (implementing an IDE is tough though).

I'm not sure how Haskell gets away with 100k lps. Perhaps the Haskell type system doesn't do non-deterministic lub computations, or doesn't have an extensive implicit mechanism? I would love to see formal benchmarks of compiler performance, especially for heavily typed functional languages (Haskell...Scala...F#).

I think you have to

I think you have to distinguish between the speed of the language compiler and the speed of a compiler written in that language. As Scala compiles to Java, Scala could be as fast as Java is. The actual speed of course depends on the kind of Java code the compiler is compiled to.

As Scala compiles to Java,

As Scala compiles to Java, Scala could be as fast as Java is.

This is not necessarily the case, and it doesn't even place an accurate upper bound on performance.

For example, there are many ways to encode pattern matching, especially on types, and the performance details will depend on the performance profile of (a) your back-end instruction set architecture and (b) the amount of memory the (virtual) machine allows you to allocate [this can be important for special cases like nested pattern matching, since the code generation strategy here can easily be combinatorially explosive if naively coded].

I used to be able to point out differences in VB.NET and C# and could explain why VB.NET would be more efficient than C# in some cases, due to slight differences in their component models, but my memory is fading and I haven't compared the generated IL across the two languages in 3 years. You could do certain encodings in VB.NET that would compile down to a single IL instruction, but in C# you would have additional overhead.

Foggy memory

I know there is one such case where VB.Net uses an IL facility for filtered exception handling that C# entirely ignores (see Partition II, Section 18.4 for some of the details). Ignoring it is probably for the better; it's a very thorny area.

Early on, VB.Net lacked an equivalent of the C# "as" expression, for combining a type test with a cast avoiding both double-checking and exception-throwing. That changed a long time ago, though.

I share a vague recollection that there are a few other corners along the same lines.

I thought that they were

I thought that they were going to use Scala as some sort of embedded DSL, rather than to build a compiler. For the former, Scala compiler performance is definitely relevant because you are essentially reusing it as your own, while for the latter it doesn't matter.

The speed of the Scala compiler has nothing to do with the target platform; most of the time is spent in the type checker.

subtyping, nullification

I agree with the intuition that subtyping is part of a clean solution here, and I'll have a look at case classes.

Concerning "all objects can be Null", this seems to make things worse rather than better. Leg selection is presumably eliminated, but at the cost of a failure to check.

But your comment does raise an interesting point: things like "option 'a" induce lots of union leg checks. The particular case of "nullable 'a" may be sufficiently common and important to justify language support by doing the checks automatically and throwing some form of exception. This is a point I had not previously considered, so thanks for provoking the thought. :-)

Library function

The particular case of "nullable 'a" may be sufficiently common and important to justify language support by doing the checks automatically and throwing some form of exception.

Don't most languages with option types provide a library function that does just that? For example, the option type in F# has a Value method that gets the value if it has one and throws an exception otherwise.

Rebuild IR at each stage is one solution

Instead of having default or null initializers that make sense for the optional fields that are computed by each phase, you could instead define different IRs for each phase, and thus the type signatures for each phase indicate what they compute. E.g. a typechecking routine would have the type Node -> TypedNode. TypedNode may be a separate algebraic data type or class hierarchy from Node, necessitating construction of the new IR from the old IR (rather than simply adding fields to the old IR).

For the basic semantic analysis and typechecking phases, two to four IRs can be quite manageable if your implementation language has decent pattern matching and algebraic datatype support. You can choose to encapsulate the new IR (i.e. indirect back to it), extend the new IR via subclassing (if your implementation language offers classes), or define them independently and rebuild completely (i.e. keeping nothing of the old IR around). I've used the latter approach in a Scheme compiler implemented in SML and it worked quite well. I found that, combined with the type inference of ML, defining the correct signatures for the types of transformative phases (e.g. to A-normal form) led me immediately to correct, recursive solutions.
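
A minimal OCaml sketch of the idea (the types and names are invented here, not the poster's actual code): the parser's IR and the checker's IR are separate datatypes, and the phase's signature says exactly what it adds.

(* hypothetical sketch of per-phase IRs, typecheck : node -> tnode *)
type ty = Int | Bool

type node =                       (* IR out of the parser: no types anywhere *)
  | Num of int
  | Add of node * node
  | If of node * node * node

type tnode = { t : ty; shape : tshape }   (* IR out of the checker *)
and tshape =
  | TNum of int
  | TAdd of tnode * tnode
  | TIf of tnode * tnode * tnode

exception Type_error

let rec typecheck : node -> tnode = function
  | Num n -> { t = Int; shape = TNum n }
  | Add (a, b) ->
      let a' = typecheck a and b' = typecheck b in
      if a'.t = Int && b'.t = Int then { t = Int; shape = TAdd (a', b') }
      else raise Type_error
  | If (c, x, y) ->
      let c' = typecheck c and x' = typecheck x and y' = typecheck y in
      if c'.t = Bool && x'.t = y'.t then { t = x'.t; shape = TIf (c', x', y') }
      else raise Type_error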

The Ikarus implementation of Scheme R6RS uses this in extreme form. It is implemented in Scheme, so not strongly typed, but it works well and is quite fast (on the order of 1 second for a bootstrap, as I recall).

I don't have direct experience with Haskell compilers but I'd be surprised if they didn't use exactly the same approach.

This corresponds to my

This corresponds to my rather unscientific intuition (perhaps informed by the short series of IRs in Appel's "Compiling With Continuations") that 1) with no type discrepancies, one "transforms" (or accumulates information, etc.) using a given IR; otherwise 2) when encountering any type discrepancies, one instead "translates" to the next IR in a series.

This isn't simple

For the basic semantic analysis and typechecking phases, two to four IRs can be quite manageable if your implementation language has decent pattern matching and algebraic datatype support.

A variant of this is the first solution that Mark and I came up with. We concluded that it doesn't work well. The graph has internal cycles (e.g. decl node points to def and vice versa). This means that the graph cannot be regenerated by a bottom-up rewrite in any simple way.

Let's simplify slightly, and assume that only the symbol resolver and the type inference passes are responsible for filling in fields after construction. If we can make it work for these, we can probably extend the solution to other cases if they arise.

The tacit assumption when multiple levels of IR are proposed is that we always proceed from the less populated AST to the more populated AST. Unfortunately, this isn't true. Many passes proceed by performing local rewrites on the AST graph and then re-running the symbol resolver and type check passes to re-construct the resolver- and type-related fields. Until the resolver and checker are re-run, we have what can be viewed as a mixed-type AST.

The alternative, I suppose, is that we strip back the more decorated AST to the minimal form, and then re-run the resolver and checker to regenerate the enhanced AST. That seems like a lot of redundant work.

It depends

I guess further discussion is going to hinge on the other details of your compiler design. From your response it would appear that you are planning on doing quite a bit of AST rewriting, and I am wondering whether you plan on using this strategy for optimization or just for desugaring complex syntactic forms into simpler ones. I have found rewriting, rebuilding, or otherwise molesting ASTs to be extraordinarily frustrating in the past and have finally rid my compiler of AST-based strategies for optimization and code generation and would counsel against it unless your implementation language has excellent support for pattern matching. Instead I am firmly of the opinion that ASTs should be decorated only, and the decorations--e.g. resolved symbol references--should guide a translation to an intermediate representation that is appropriate for advanced optimizations, without necessarily any resemblance to an AST.

Compilers--fast, optimizing ones, at least--seem to have this unfortunate curse that no matter which piece you try to simplify, it complicates all the other parts. The art then comes down to choosing where to place complexity; in some cases centralization works best, in others decentralization works best.

If I am not wrong about your direction, it would seem you are favoring decentralization, since you seem to be planning many potential passes on this AST. This is going to reduce modularity and hinder the speed at which you can evolve the compiler. E.g. changes to the source language will propagate farther into the back end of your compiler if it is based on ASTs and relies on them for optimization.

So to summarize, my basic argument is that the separation into distinct IRs may not necessarily reduce the overall complexity of your compiler, but it will certainly modularize it and make it easier to evolve.

As for the compilation speed target, coupled with a desire to use the static type system of the implementation language to enforce a higher dimension of correctness than seems to be borne out in existing compilers, and your overall goal of creating a new language, be careful not to ask for too many miracles :-)

Typed IR

My first reaction when reading your question is the same as Oleg's about performance: are you sure that using a map (with a good implementation) is too slow?

Using a language with type inference is convenient for the programmer, but the compiler can still use an explicitly typed IR once you're past the type checker. Having a typed IR might sidestep the specific example you mention about rewriting an AST and having to re-run type inference afterwards. My experience with lifting rewrite systems to a typed setting is that it appeared much harder than it actually was once I sat down and did the work.

Having the option of running a type checker after each pass is still very useful for finding and isolating bugs though, and this is one of the things the core-lint mode in GHC does.

Yay for multiple IRs

+1 for typed IR separate from the initial parse tree. I haven't had the pleasure of working on a compiler that uses an out and out new AST structure, but all multi-phase compilers I've worked on are written to have different assumptions about the AST as you go along. I'm more and more sympathetic to simply having a separate untyped tree and a typed one, and for Java-targeting compilers, perhaps a third AST after type erasure.

Short of having two or more ASTs, though, here are two tricks for having different invariants in the same AST:

1. There can be alternate constructors that take symbols as arguments instead of strings, and that auto-compute the type of the AST at construction time. The parser uses the constructors that take strings as arguments, but most everything else in the compiler uses the constructors that take symbols. It takes some discipline to use the right constructors at the right time, but the nice thing is that the symboly ones tend to be more convenient to use.

2. Adding flags to the AST that you can toggle as the phases go by. If you can't get a static check, then a dynamic check is honestly pretty good in practice. For example, the GWT compiler includes a flag for whether or not you can create new string literals. Once the string interning phase has passed, you can't create a new one! The string interning phase toggles the flag, and if you try to create a literal with the flag off, the compiler aborts with an assertion failure.
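
A minimal OCaml sketch of trick 2 (the flag, node type, and phase names are invented; this is not the GWT code): the phase flips a flag as its last act, and the constructor checks it.

(* hypothetical sketch of a phase flag guarding node construction *)
type expr =
  | StringLit of string
  | InternedString of int
  | Concat of expr * expr

let interning_done = ref false      (* toggled by the string-interning phase *)

let mk_string_literal s =
  (* creating a fresh literal after interning has run is a compiler bug *)
  if !interning_done then failwith "string literal created after interning";
  StringLit s

let run_interning_phase prog =
  (* ... rewrite StringLit nodes into InternedString nodes here ... *)
  interning_done := true;
  prog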

Speaking of performance, I'd reconsider the off-hand comment about needing to run type checking and/or symbol resolution after the initial type check phase. It's first of all very slow, easily dominating the cost of a phase. Additionally it risks correctness. If you try to build a reference to a variable via its string name, your implementation is only correct if the name lookup in fact gets you back to the right variable. If you build the reference via a direct reference to the variable's symbol, it's foolproof.

Bound and unbound AST

I'm more and more sympathetic to simply having a separate untyped tree and a typed one

My old bytecode compiler for Magpie made that exact distinction. The "unbound" AST nodes were what came out of the parser and used strings for identifiers. There was a mirrored set of bound AST classes where references pointed directly to their target.

The main pass of the compiler would take an unbound expression and convert it to a bound one by resolving identifiers and building a symbol table.

It helped me keep things clear in my head about what stage I was in in any given region of code. I managed to minimize code duplication by having the bound and unbound AST classes share a common base class when it made sense. Like so:

// Base class.
public class CallExpr<TExpr> {
    public TExpr Target;
    public TExpr Arg;

    public CallExpr(TExpr target, TExpr arg) {
        Target = target;
        Arg = arg;
    }
}

// Unbound form.
public class CallExpr : CallExpr<IUnboundExpr>, IUnboundExpr {
    public CallExpr(IUnboundExpr target, IUnboundExpr arg)
        : base(target, arg) {}

    public TReturn Accept<TReturn>(IUnboundExprVisitor<TReturn> visitor) {
        return visitor.Visit(this);
    }
}

// Bound form.
public class BoundCallExpr : CallExpr<IBoundExpr>, IBoundExpr {
    public BoundCallExpr(IBoundExpr target, IBoundExpr arg)
        : base(target, arg) { }

    public IBoundDecl Type {
        get { return ((FuncType)Target.Type).Return.Bound; }
    }

    public TReturn Accept<TReturn>(IBoundExprVisitor<TReturn> visitor) {
        return visitor.Visit(this);
    }
}

Simple AST classes like literals would implement both IUnboundExpr and IBoundExpr and return themselves to transform to bound form.

I don't know how well it would scale to multiple passes, but for my very simple compiler it worked OK.

DRY

You should only do type inference once: after that, every node in the IR should be decorated with its type, and IR transforms should maintain the type information. Likewise name resolution: it should be done once, as part of alpha renaming. Name representations can even be changed at that point to something more efficient (like integers).

Defining IRs as variant types has a lot of advantages. For one thing, they can be used to enforce invariants. For example, after lambda lifting there should be no lambda expressions in the AST, so the post-lambda-lifting IR would simply not have the lambda expression variant. For generic operations, such as type validation or debug printing, you can define a superset of all the IRs. Each individual IR just provides a one-way conversion from the specific IR to the generic one. For example, the generic IR would need a lambda expression node, but the post-lambda-lifting IR would never generate it. Since you're not going from the general to the specific, a specific instance doesn't have to worry about how to handle the things it doesn't support.
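
A hedged OCaml sketch of that idea (all type and constructor names invented): the post-lifting IR simply lacks the lambda constructor, and each IR converts one way into a generic superset used for shared utilities.

(* hypothetical sketch: "no lambdas after lambda lifting" enforced by the types *)
type expr =                        (* IR before lambda lifting *)
  | Var of string
  | App of expr * expr
  | Lam of string * expr

type lifted =                      (* IR after lambda lifting: no Lam case *)
  | LVar of string
  | LApp of lifted * lifted
  | LCall of string * lifted list  (* calls to lifted top-level functions *)

type generic =                     (* superset used for debug printing etc. *)
  | GVar of string
  | GApp of generic * generic
  | GLam of string * generic
  | GCall of string * generic list

let rec generic_of_expr = function
  | Var x -> GVar x
  | App (f, a) -> GApp (generic_of_expr f, generic_of_expr a)
  | Lam (x, b) -> GLam (x, generic_of_expr b)

let rec generic_of_lifted = function
  | LVar x -> GVar x
  | LApp (f, a) -> GApp (generic_of_lifted f, generic_of_lifted a)
  | LCall (f, args) -> GCall (f, List.map generic_of_lifted args)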

To add information to ASTs,

To add information to ASTs, you can use the "two-level types" pattern, which is to separate the datatype into one representing the type constructor, and another which actually takes the fixed point. Then, you can add information at every recursive node as needed, which solves 1. Defining the functorial action of the type constructor lets you solve 3. (I didn't understand what you meant by 2, though.)

Here's an example in Ocaml, for a simple arithmetic expression language.

type var = string

(* Here, we give a datatype which is *almost* the usual recursive
   data type, except that it's not recursive -- we just leave all
   of the recursive positions polymorphic. *)

type 'a shape =
  | Num of int
  | Bool of bool
  | Plus of 'a * 'a
  | IsZero of 'a
  | If of 'a * 'a * 'a
  | Var of var
  | Let of var * 'a * 'a 

let map_shape f = function
  | Num n -> Num n
  | Bool b -> Bool b
  | Plus(a, a') -> Plus(f a, f a')
  | IsZero a -> IsZero (f a)
  | If(a1, a2, a3) -> If(f a1, f a2, f a3)
  | Var x -> Var x
  | Let(x, a1, a2) -> Let(x, f a1, f a2)

(* To get an actual expression type, we can tie the knot of the
   recursive type, and take advantage of the fact that we are doing it
   explicitly to augment the recursion with some extra data.  If you
   don't want extra data, you could use

   type exp = Fix of exp shape
*)

type 'b exp = Fix of 'b * ('b exp) shape

(* This function shows how to enrich a term with additional
   type information, by adding type information to each node, 
   while parametrically preserving any other information that 
   is already there: 

   val typecheck : ctx -> 'a exp -> ('a * tp) exp 

   This would be even nicer with row polymorphism, since
   you could name the data you added. 
*)

type tp = Integer | Boolean
type ctx = (var * tp) list 

let rec typecheck ctx  (Fix(data, e)) =
  match e with
  | Num n -> Fix((data, Integer), Num n)
  | Bool b -> Fix((data, Boolean), Bool b)
  | Plus(e1, e2) ->
    let Fix((_, Integer), _) as e1' = typecheck ctx e1 in
    let Fix((_, Integer), _) as e2' = typecheck ctx e2 in
    Fix((data, Integer), Plus(e1', e2'))
  | IsZero e1 ->
    let Fix((_, Integer), _) as e1' = typecheck ctx e1 in
    Fix((data, Boolean), IsZero e1')
  | If(e1, e2, e3) ->
    let Fix((_, Boolean), _) as e1' = typecheck ctx e1 in
    let Fix((_, t2), _) as e2' = typecheck ctx e2 in
    let Fix((_, t3), _) as e3' = typecheck ctx e3 in
    let true = (t2 = t3) in
    Fix((data, t2), If(e1', e2', e3'))
  | Var x -> Fix((data, List.assoc x ctx), Var x)
  | Let(x, e1, e2) ->
    let Fix((_, t1), _) as e1' = typecheck ctx e1 in
    let Fix((_, t2), _) as e2' = typecheck ((x, t1) :: ctx) e2 in
    Fix((data, t2), Let(x, e1', e2'))
    
(* Next, here's a simple function to rename a variable within an
   expression, which illustrates how we can use map_shape to automatically
   handle all of the boring cases.

   Note that we only have to explicitly give the cases where we do
   something other than a structural recursion, and that if we added
   new branches to shape, rename would remain unchanged. (And it would
   remain correct, too, if those branches didn't mess with the binding
   structure.)
*)

let rec rename x y (Fix(data, e)) =
  Fix(data,
      match e with
      | Var z when x = z -> Var y
      | Let(z, e1, e2) ->
	if x = z
	then Let(z, rename x y e1, e2)
	else Let(z, rename x y e1, rename x y e2)
      | _ -> map_shape (rename x y) e)

I agree with this answer.

I agree with this answer. As I understood it, point 2 was just about the ability to pass specific node types around, rather than requiring everything be of type Tree - shouldn't be a problem with this pattern. Another reference is the Datatypes a la Carte paper.

I've also used this

I've also used this approach with success. I also used a similar design to support 'open' and extensible types in languages that usually lack them, though that works best if there is a bottom type (a bottom type in a variant position can eliminate a member of the variant).

Basically, it affirms my fear of commitment. Do not tie the recursive knot until you're forced to do so. :-)

The "big recursive knot" issue

I agree with delaying the recursion. Delaying tying the recursive knot is necessary, as closing the recursion limits reuse. I also find that a pattern of "top level" recursive "closure" is the cleaner design, keeping the overall architecture under control.

The downside is the sometimes large number of functions/functors that need to be wrapped into the top level "co-recursive" knot. That could be an issue for some; I have found that it pays off.

This approach can become

This approach can become painful to use in practice (at least in OCaml) when the AST is constructed from a number of mutually recursive datatypes. Then you end up having to parameterise by all of the mutually recursive datatypes.

Painful

Originally in my compiler I used Ocaml polymorphic variants with open recursion (after getting Ocaml fixed so that polymorphic variants were covariant!).

Now in theory, this is beautiful because the type of every operation can be specified exactly, or as close to exactly as one desires - for example, term elimination and introduction. Furthermore there is a huge advantage when coding the process, because many unaffected terms can just be transferred, and terms only affected by structural recursion mapped. Not only is this nice to code, it is efficient, because passing these terms across does not require reconstructing them.

Nice in theory, but utterly untenable in practice. When you have 10 term types in the AST alone, you need 10 parameters... and then you have to provide another 10 fixations to get closed types from the open ones... and you've only just described the inputs to the first algorithm.

As soon as you try to remove or add terms to describe the output of the first algorithm, you have to write huge wads of it all out again. It is true you can group the summands, but doing so requires advance knowledge of what your algorithms do and is fragile.

I never got past a single term split into two: general type terms which included type functions and type matches, and reduced terms (with the type functions and matches reduced away), and in the end we chucked that out because the error messages ended up being 100s of lines (not kidding!).

OK, so now I will mention another approach which helps to solve the problem: get rid of the AST ASAP.

This is what I do. Add a level of indirection. Replace the children in a tree with a synthesised identifier (I use an integer) and then map the integer to the children.

This then requires an extra lookup to do any recursive analysis: but it allows very simple linear analysis and makes it very easy to add extra information with any number of additional maps.

My functional code is instantly lambda lifted, and complex expressions are unravelled into three-address code. The problem of cyclic references is eliminated by the indirection: you can now use multiple maps to aggregate information.
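
A minimal OCaml sketch of the indirection being described (invented names; a real implementation might well use arrays rather than hash tables): children are integer ids, the nodes live in one table, and later phases add information in separate side tables instead of in the nodes.

(* hypothetical sketch: AST nodes referenced by synthesised integer ids *)
type node_id = int

type node =
  | Num of int
  | Add of node_id * node_id          (* children are ids, not nested nodes *)
  | Call of node_id * node_id list

let nodes : (node_id, node) Hashtbl.t = Hashtbl.create 1024
let types : (node_id, string) Hashtbl.t = Hashtbl.create 1024   (* side table *)

let next_id = ref 0

let add_node n =
  let id = !next_id in
  incr next_id;
  Hashtbl.add nodes id n;
  id

(* a later pass records its results without touching the nodes at all *)
let record_type id t = Hashtbl.replace types id t
let type_of id = Hashtbl.find_opt types id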

For example, the type of a function can be calculated with a partial analysis without fully binding the function: in Felix I calculate return types by examining the returns of branches and backtracking to find the types of any variables or functions used: if I hit a recursion I just throw that case out and unify the rest (there is always a non-recursive branch or you'd have an infinite loop).

I have to emphasise this isn't a general solution, since as others note it is heavily dependent on the language syntax, semantics, and phasing of calculations. It is particularly difficult in Felix because it uses setwise scoping (everything is mutually recursive without any need for forward declarations) and because it supports overloading as well as ugly things like a typeof operator.

To be sure, whilst the typing problem took several years to get to its current half-correct status, a much harder problem still remains: managing optimisation, particularly inlining. Here the typing is no big deal: the problem is efficiently inlining almost everything without duplicating any steps (for compiler performance) and without allowing recursive calls to lead to infinite expansions. I don't know how to do this. I believe this is a much harder problem. It is, however, quite similar, because one is constantly and simultaneously expanding and reducing terms.

Avoiding non-termination

.. without allowing recursive calls to lead to infinite expansions. I don't know how to do this. I believe this is a much harder problem.

There are several ways to avoid infinite expansions. The obvious one is to never inline anything that is recursive, but you already said you wanted to do that. The GHC inliner and COSTA identify groups of strongly connected components, take one function out of each group and hope that the remaining part is not cyclic. If so, you can freely inline the other functions. More details (and heuristics) are given for GHC (Section 3) and for COSTA (Section 6).

If this isn't strong enough there are other options, but performance will suffer compared to the small and local approaches outlined above. The basic idea is that you keep track of (some of) the expressions you've already seen, and if your current expression in hand is related (in a certain way) to any of the previous expressions you need to stop. If I still haven't scared you off, keep reading.

The relation between the terms needs to be a well-quasi-order (wqo), which is sufficient to guarantee that you cannot create an infinite chain of terms without at least two of the terms in it relating to each other. The homeomorphic embedding is a pretty standard one, but there are plenty of small variations on this relation. Ilya Klyuchnikov defines a relation which is not a wqo on arbitrary terms, but which is a wqo on the terms his supercompiler produces. Neil Mitchell uses a different relation.
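
For illustration, here is a small OCaml sketch of the homeomorphic embedding on a toy term type (the standard coupling/diving definition, not code from any of the systems mentioned):

(* hypothetical sketch of the homeomorphic embedding, "t1 is embedded in t2" *)
type term = Var of string | App of string * term list

let rec embedded t1 t2 = coupled t1 t2 || dived t1 t2

(* coupling: the roots match and the arguments embed pairwise *)
and coupled t1 t2 =
  match t1, t2 with
  | Var _, Var _ -> true
  | App (f, xs), App (g, ys) ->
      f = g && List.length xs = List.length ys
      && List.for_all2 embedded xs ys
  | _ -> false

(* diving: t1 is embedded in some strict subterm of t2 *)
and dived t1 t2 =
  match t2 with
  | Var _ -> false
  | App (_, ys) -> List.exists (embedded t1) ys

The whistle blows (stop or generalize) when some previously seen expression is embedded in the expression currently in hand.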

The presence of types should not cause a problem for the homeomorphic embedding; the (safe) sledgehammer approach is to say that all types are related to each other (a more refined version is to do structural comparison on types and consider all type variables equal, so [[a]] is embedded in Maybe [[b]], but not in [a]). The trouble is performing generalization on terms in the presence of types, but luckily that has already been solved.

Polymorphic variants and map/fold objects

In the OCaml world, we have two disjoint tools to deal with points 1. and 2. on one side, and point 3. on the other side.

For points 1. and 2., we may use "polymorphic variants". They are "open" sum types that behave quite nicely in such a setting: it is easy to specify that a function accepts or returns only a subset of a summation. See Code reuse through polymorphic variants by Jacques Garrigue.
The drawback of polymorphic variants is that, due to their open nature, they do not cope very well with unconstrained inference and tend to generate ugly error messages. It is recommended to use type annotations generously when using such open types.
That said, though I know some users of polymorphic variants for that purpose, a lot of people still prefer to define lots of different (closed) algebraic datatypes.
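
A small sketch of the subsetting that polymorphic variants give you (toy types, not taken from Garrigue's paper): a pass can state in its type that it only ever sees, or produces, part of the summation.

(* hypothetical sketch: the desugared subset is its own type, yet still an expr *)
type expr =
  [ `Num of int | `Plus of expr * expr | `If of expr * expr * expr ]

type core = [ `Num of int | `Plus of core * core ]   (* conditionals compiled away *)

let rec eval : core -> int = function
  | `Num n -> n
  | `Plus (a, b) -> eval a + eval b

(* every core term is still an expr, so generic passes over expr accept it *)
let as_expr (e : core) : expr = (e :> expr)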

For point 3., there is a very nice technique that, to my knowledge, has not really been publicized, and that originates from Nicolas Pouillard (at least that's the way I learnt it), and that I name "map/fold objects". The idea is to generate a "map class" that does the obvious base mapping work, with one method per summand, and then define a subclass for each particular traversal you need. You only need to redefine the behavior on the summands that do not behave like "the obvious traversal".
This can be used for folds also, and such "map" and "fold" objects can be automatically generated by metaprograms parsing the datatype definition -- this requires a well-defined type at some point, but may be used on polymorphic variants. It is widely used in Camlp4, an OCaml preprocessor that has to work on large heterogeneous ASTs. This is badly documented on the Camlp4 wiki.
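
A tiny hand-written OCaml sketch of such a map object on a toy datatype (the generated Camlp4 versions are of course larger):

(* hypothetical sketch of a "map object" *)
type expr = Num of int | Neg of expr | Plus of expr * expr

class map_expr = object (self)
  (* one method doing the obvious structural traversal *)
  method expr : expr -> expr = function
    | Num n -> Num n
    | Neg e -> Neg (self#expr e)
    | Plus (a, b) -> Plus (self#expr a, self#expr b)
end

(* a particular traversal only overrides the interesting case *)
let simplify_double_neg = object (self)
  inherit map_expr as super
  method! expr = function
    | Neg (Neg e) -> self#expr e
    | e -> super#expr e
end

(* simplify_double_neg#expr (Neg (Neg (Num 1))) evaluates to Num 1 *)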

Visitors in OO

Your "map/fold objects" technique sounds very much like the standard OO Visitor design pattern. It's typical to define a DefaultVisitor for a particular structure that just performs the "obvious" traversal, so clients just need to override the behavior at nodes of interest, and can also delegate to the super-class to continue the default traversal.

It's fairly straightforward to extend this technique to more ad hoc compositions of unrelated objects (e.g., trees with unanticipated structure). And particularly in a language with parametric polymorphism ("generics" in the JVM parlance) and traits or multiple inheritance, the resulting code can be kept reasonably clean, modular and well-typed.

On the other hand, I don't think this is a complete solution to the original challenge question. A combination of two-level types per Neel's answer and open visitors is probably as close as I know how to get, but I think the challenge remains open.

I wonder whether any of the "XML processing type systems" that were trendy for a while would have anything to say about this? The job of building/traversing trees of known schema but unknown structure seems like a good match...

Maps and generic programming as solutions

It is indeed quite common during compilation to repeatedly traverse an AST doing operations only on some of its nodes (distinguished by a type or a particular data constructor). That's why compiler construction has been not only the biggest consumer of generic programming libraries but also a notable producer. The best examples are the venerable Stratego and Strafunski (the latter even has its own Wikipedia page). The most recent example is Alloy:

http://twistedsquare.com/Alloy.pdf

From the abstract: "We describe use cases for a generic system derived from our work on a nanopass compiler, where efficiency is a real concern, and detail a new generics approach (Alloy) that we have developed in Haskell to allow our compiler passes to traverse the abstract syntax tree quickly. We benchmark our approach against several other Haskell generics approaches and statistically analyse the results, finding that Alloy is fastest on heterogeneously-typed trees."

The Alloy library has an advantage: if it can be statically determined that a sub-tree of an AST does not have nodes of interest, the traversal algorithm will skip the entire subtree.

shap wrote:

Solving the problem at a high cost of indirections, e.g. by building a map of some sort, is not a production-grade solution; we're looking to compile at a rate very close to 100,000 lines of source code per second.

How is it determined that "a map of some sort" gives an unacceptably high cost? I'd like to see a benchmark that shows that the compilation speed is too low specifically because of map indirections (rather than because of the high cost of type inference or points-to analysis, for example). It is not at all a priori clear that rebuilding an AST (which happens as you add new data to an AST) is so much better than leaving the original AST alone and building a separate map or hash table. The allocation rate is comparable. Furthermore, the pure functional maps of OCaml and Haskell are well optimized. Building a separate map has quite an advantage in making the code and analyses modular and maintainable. One can add any number of analyses after the fact without changing the existing AST or existing code.

As one data point, OCaml, which is a notably fast compiler, uses pure functional maps and sets all over the place. For example, the compiler environment, used extensively during type-checking and code generation, is a large collection of maps:

type t = {
  values: (Path.t * value_description) Ident.tbl;
  annotations: (Path.t * Annot.ident) Ident.tbl;
  constrs: constructor_description Ident.tbl;
  labels: label_description Ident.tbl;
  types: (Path.t * type_declaration) Ident.tbl;
  modules: (Path.t * module_type) Ident.tbl;
  modtypes: (Path.t * modtype_declaration) Ident.tbl;
  components: (Path.t * module_components) Ident.tbl;
  classes: (Path.t * class_declaration) Ident.tbl;
  cltypes: (Path.t * cltype_declaration) Ident.tbl;
  summary: summary
}

where Ident.tbl is a pure-functional map (a balanced tree) with (qualified, in general) identifiers as keys.

Mostly-structural Recursion

A nice solution for writing functions which mostly proceed by structural recursion is given in the paper "A Pattern for Almost Compositional Functions". In short, write once a pure structural recursion function in open style, and then define your function by giving the interesting cases and using the helper for the default. When there is mutual recursion between multiple types it's necessary to work with records of multiple types (or merge the types into a single indexed type).

I don't understand what you mean by union tag dispatch, but it seems to be a performance concern. Do you have benchmarks to substantiate this? I would be quite surprised to see much of an effect from tag-checking - e.g., a large performance loss when adding an unused constructor and case alternative to a single-constructor type in some test program. It depends on the compilation strategy, but branch prediction should be able to handle highly-biased cases pretty well.

If the concern is instead expressing the fact that only some constructors appear in some places (and avoiding having to write unreachable default cases) then OCaml's polymorphic variants (mentioned above) are the only nice solution I know. GADTs with phantom type-level boolean indices can also express subsets with no runtime cost, but the notation is not remotely nice.

Unsolution - Just keep it simple

I ran into the same problem writing the Hi compiler. I can't say I formulated a solution, the only thing which came to my mind was to: 1) place all AST nodes in one type AST, 2) write a set of higher-order transforming functions on them.

Regarding 1. I don't like clutter. It sometimes seems a good idea to specialize different kinds of AST nodes to different types, but after reading some compilers it seemed to me that you end up with unmaintainable, unreadable code where it isn't clear from reading the code what certain functions do, and one just keeps on expressing the same transformation code for each specialized type. Types really are a sliding scale; by combining everything in one type (basically moving in the untyped direction), only a few higher-order functions are needed.

Regarding 2. By keeping everything in one AST type, and considering that most transforms boil down to an interplay between analytical (top-to-bottom) and synthesized (bottom-to-top) information while rewriting the AST, I captured that notion in one higher-order type called transform, and wrote one general function 'transform_child' which handles the child cases of a transform. The transform function is subsequently specialized to functions which keep the AST, or the information, fixed and are therefore used only to gather information, or to rewrite the AST with given information.

Combining one and two makes it possible to write down very short algorithms/transforms. Some of the passes are shorter than a dozen lines.

It is a very direct solution which looks only at what one needs abstractly to define transforms. I.e., the need to handle only specific AST nodes during a rewrite and have all other nodes handled generically by using a higher-order function which knows how to apply a transforming function to its children.

Some code to clarify the ideas above (maybe some typos since I work on another computer):

The AST is one big switch:

type ast = 
     [ comment         position text | ...
     | expr_number     position text | ...
     | expr_ite        position ast ast ast | ...
     | expr_case       position (list ast) | ...
     | ... ]

A transform rewrites an AST while passing around information.

type transform = \a => a -> ast -> (a, ast)

We want a default handler for each transform handling child cases:

def transform_child: transform a -> transform a =
   [ f, c ->
       [ ...
       | expr_ite p e0 e1 e2 -> // apply the transform to the children
             (c,e0) = f c e0;   // while passing around the context
             (c,e1) = f c e1;   // in a fixed order.
             (c,e2) = f c e2;
             (c, expr_ite p e0 e1 e2)
       | expr_case p ee ->
          (c, ee) = transform_list f c ee;
          (c, expr_case p ee)
       | e -> (c, e) // default is to do nothing 
       ]
    ]

Some rewrites just reshape the AST:

type reshape = \a => a -> ast -> ast 

def reshape_child: reshape a -> reshape a = 
// expressed in terms of transform_child

An example

def ast_substitute: ast -> ast -> ast -> ast =
    ast_sub = fix 
       [ ast_sub, ee, e2 ->
           (e0, e1) = ee;
           if e0 == e2 then e1
           else (reshape_child ast_sub) ee e2];
    [ e0, e1, e2 -> ast_sub (e0, e1) e2 ]

instance Sub ast where (
    def substitute: ast -> ast -> ast -> ast =
        ast_substitute
)

Regarding your requirements: When it comes to cyclic data, Hi is pure, so it doesn't have any. Regarding speed, yeah well, everything is linear in the AST [though most naive algorithms tend to copy too much; but that is totally under your control], and you can gain somewhat by specializing the specific transforms. Regarding the extra fields necessary: I just, well, don't. There are approaches which solve the expression problem and more, but since everything is under my control, I didn't need them.

It isn't the solution I guess you asked for since it doesn't address most of your concerns, but the resulting code is short, probably fast, and maintainable.

A transform is just a specialized case: less general than a fold, more general than just working with attribute grammars since it can rewrite the AST (though information is passed around in a fixed order). The key is to define a higher-order handler abstractly on the children, so each function can be expressed as the union of handling the interesting nodes explicitly and the rest generically.

(In another view: it is a soft 'hack' to work around structural recursion and the expression problem. Inserting a new node in the AST mostly means adding one new case to the AST type and one new case to 'transform_child', after which most algorithms can be left unchanged since the new node will be handled generically.)

[ In a near Palin experience, I dubbed this an 'unsolution.' Since after mesmerizing somewhat and writing a naive compiler, it just turns out, for me, that there just isn't a lot of information one really wants to store in the AST. You might want to reconsider your design if that is not the case. ]

Addendum

I reread my post and thought it wasn't too clear about how transforms are used. An example, replacing each conditional 'if c then e0 else e1' with the equivalent term '[ true -> e0 | _ -> e1 ] c' with a transform which only rewrites the AST.

def remove_ite: reshape unit =
    [ u, expr_ite p c e0 e1 ->
         alt0 = // some code which builds true -> e0;
         alt1 = // some code which builds _    -> e1;
         e    = expr_application p (expr_case p [alt0; alt1]) c;
              remove_ite u e
    | u, e -> reshape_child remove_ite u e ]

A lot of my code looks like the above, it is a very abstract robust approach to writing a compiler.

(Note that it is surprisingly close to neelk's contribution. I just made the decision not to have generic information in the AST, it's a separation of concerns issue I guess. In my view, the AST by definition already holds, or is elaborated to hold, exactly the information to be able to compile - all extra information you are interested in should be reconstructable from it and, if necessary, passed to a transforming function.)

I'm confused by problem 1.

I'm confused by problem 1. Do you really have a cyclic data structure? How can you propose to do any operations on it while maintaining sharing?

If what you have is really a data structure that contains cycles, but works via indirections, then I don't see the typing issue.

If what you have is really a data structure full of refs, then I see the typing issue, but worry that you've got the wrong approach/a premature optimization.

But more details would be helpful.

For problem two, if a subparse function returns an AST node of a known type, then don't return the constructed node, but just the tuple of its arguments -- you can then pack it into a node as you pass the information further up.

For problem three, two-level types/compos/uniplate/multiplate are all viable alternatives (but, of course, two-level types are more generally powerful as well).

See also this post and related discussion.

Yes, really cyclic...

But now that I think of it, all of the fields in the AST divide precisely into two categories:

  1. Downward pointers. These constitute a strict tree.

  2. Memoization pointers: e.g. the symbol resolution pass updates the defn/decl pointer fields during its traversal.

All of the memoization pointers are nullable in any case. So I think that aspect of my "part 1" can be disregarded.

Attribute grammars

Attribute grammars address the problem you're running into quite nicely.

Phases - or different passes over the same tree - are easily described simply as different attributes that are computed over the tree. In fact, AGs eliminate the need to think of these as "phases" at all, it's just semantic information about the node the attribute is being computed on. The AG evaluator takes care of the details, of both the evaluation and the representation of this tree.

This also conveniently eliminates the complexity you describe later involving "less populated and more populated" ASTs (though, to do local rewrites, you might need an AG feature called "forwarding")

I'd love to recommend our AG specification language, Silver, but we fail badly on your performance requirement (we haven't really done any performance work at all, yet.) I'm not as intimately familiar with other AG tools as I should be, but to my knowledge, they all suffer from being too slow, too constrained, or too thin of a wrapper around a host language to be pleasant to work with (in my opinion), or several of these. If the performance requirements are serious JastAdd might be the best offering... it may still not be able to meet them, though. (Is 100kloc/s serious? Even my quick test of javac puts it at only a few thousand... If random blog posts from a search are to be believed, even Go only gets 18kloc/s.)

Re: everyone's confusion about cyclic data structures: My guess is that his symbol resolver is placing a reference to the declaration AST node in the referencing AST nodes. (In the AG literature, this is called reference attributes.) Reconstructing the trees, as many have been suggesting, would mean having to redo symbol resolution, or at least something similar, to get those references updated to the newly reconstructed AST nodes.

Not sure how this helps

Years ago I built an attribute-grammar based parser generator, so I'm familiar with how these work, and I don't see how they help. In a practical AG system you have both synthesized attributes (computed bottom-up) and inherited attributes (computed top-down). In fact you usually have interdependencies between the two kinds. The attribute-associated fields have to be present in the AST structure when the AST node is allocated, and must be given values at allocation time. These values are then overwritten by the AG traversal pass.

But the same root problem exists: you need a default initializer for the attribute slots long before the correct value is known.

Oh. Maybe you're thinking about lazy attribute grammars (presumably with dynamic cyclic dependency detection and resolution)? The thing you have going for you here is that the attribute value construction can be performed lazily. This means that every attribute access is already wrapped in a get() pattern that is machine-generated, and the union dispatch can be hidden inside the machine-generated get() function. Is something like this what you have in mind?

That's an intriguing approach that I had not considered, and it bears some thinking on. Thanks!

Maybe you're thinking about

Maybe you're thinking about lazy attribute grammars...

Sorry, yes, this is exactly what I was thinking.

Lazy attribute grammars are perfect for this

I agree that lazy attribute grammars work well for this problem. I consider the AST to be fixed at construction time. Attributes are essentially memoised functions from AST nodes to attribute values. The laziness (or equivalently, demand-driven evaluation) takes care of dependencies between attributes, so you don't need to explicitly schedule AST traversals.

In our approach, the attribute values are not stored in the nodes. This evaluation method is also used in JastAdd, except that JastAdd is a generator, so they can add fields for the attributes (and methods, actually) to the AST classes at generation time.
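
A hedged OCaml sketch of that evaluation scheme (toy node type and invented names; the real machinery in systems like JastAdd is of course more elaborate): an attribute is a memoised function keyed on node identity, stored outside the nodes and evaluated only on demand.

(* hypothetical sketch of demand-driven, memoised attributes *)
type node = { id : int; label : string; children : node list }

(* turn an open-recursive definition into a cached attribute *)
let attribute (define : (node -> 'a) -> node -> 'a) : node -> 'a =
  let cache = Hashtbl.create 64 in
  let rec attr n =
    match Hashtbl.find_opt cache n.id with
    | Some v -> v
    | None ->
        let v = define attr n in          (* evaluated only when demanded *)
        Hashtbl.add cache n.id v;
        v
  in
  attr

(* example attribute: subtree height, with no explicitly scheduled traversal *)
let height =
  attribute (fun height n ->
      List.fold_left (fun acc c -> max acc (1 + height c)) 0 n.children)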

We use this approach in the attribute grammar part of our Kiama language processing library for Scala. See this page for a short explanation and simple example. It also extends nicely to fixed point evaluation of circular attribute definitions.

We have some experience with building smallish language processors this way, but can't comment yet on performance when it's used to build large compilers.

data structure games for speed?

shap: It seems that we cannot build a series of successively extended AST types for each phase, both because the data structure is recursive and cyclic and because later passes rely on being able to re-run the various decorating phases [...] How can we exploit strong static typing without making this code unmaintainably ugly?

Later you say you want avoid high cost of indirections, say from maps. This sounds vaguely like the sort of problem I tackle now and then, but seldom in programming languages. Can you explain more? Note I consider that question rude, so I apologize. (Having folks say "tell me more" is irritating.) I might be able to help with graph representation mechanics, though a lot of your PL detail eludes me. It would help to know your implementation language. I've been assuming it's C or something similar.

Maps are not automatic performance losers, despite adding more cache line misses and the space cost of references instead of embedding info inline. In the absence of at least one indirection, problems often get very inflexible and hard to resolve, from fragility. You might count up cache line accesses in different approaches, and find the same number when using maps. But you might need a map written specifically to minimize cache line footprint in context.

What problem do cycles cause? Difficulty in ending recursion? Preserving cycles over rewrites? I don't want to be rude and say what you already know (like common techniques for handling cycles in recursive printing, for example, via maps or marking.)

Strong static typing in the language compiled, or in your implementation language, or in both? I've noticed strong typing can sometimes explode into ugliness caused by volume of code. (I can't ever make strongly typed constness come out small in a standard library.) Is that what you mean?

Your focus sounds like performance. Then try: 1) make data used together adjacent; 2) don't move old data to fresh copies; 3) pre-allocate expected cases; 4) get more concrete closer to leaves of execution call trees. Speed is often specific, which adds complexity. The most easily maintained code may be abstract and general, which isn't quite as fast.

If possible, make it possible to switch between simple-but-dumb and complex-but-clever versions of a subsystem. If a more complex version is not actually faster despite being clever, don't use it. If you can't make this comparison, you don't know if complexity added is necessary.

Indirection really not good here

"Can you explain more?" is never rude. If the question isn't clear, it needs to be refined to have a productive conversation. If I'm not willing to do that, I shouldn't have wasted everyone's time by asking!

In this case, the strong static typing is in both the compiler source language and the target language, but the language of interest is the source language used to implement the compiler.

Kaminski-san is correct about the source of cyclic references, and per my response to this comment, I now think I see how that part of the problem can be eliminated.

To explain the problem with map:

There is a type field in every AST node. The value of this field is computed by a separate pass, so we need a "neutral" initializer that can be used when the type field is initialized at AST node instantiation time. The only reason for a map in this situation is to move the field into a second data structure in order to defer initialization.

Because the relationship between type fields and ASTs is 1:1, the introduction of a map imposes log n overhead on every type field access, which is otherwise a constant-time operation. Access to the type field is done pervasively. The added marginal cost imposed by a typical map implementation is therefore bounded below by ((k+1)n log(n))D, where D is the L2 cache miss cost (interior index nodes in the map will not remain cached in L2 at this scale), n is the number of AST nodes, and k is the average number of input types that are accessed to compute each result type during inference. k here is approximately 3, but n here is typically on the order of hundreds of thousands to millions. The killer here is the L2 cache miss latency, which on modern implementations is in the high hundreds of cycles if you have an L3 cache (which may be absent on consumer-grade processors). The added cost imposed by the map (or any other comparably scalable indirection) is unacceptable in both relative and absolute terms. It's quite literally naked-eye perceptible.

I haven't collected the numbers you propose. I agree that would be a fine way to examine this if practical experience didn't already tell me that a map isn't a good approach here.

As to added complexity, I don't think that's at issue here. If we don't add a map, then we either need a canary value to initialize these fields (and some reason to think we won't access it by mistake) or we need to "lift" the canary value into the type in order to ensure statically that our access pattern actually performs the canary check at all required locations.

thanks for map detail

Thanks for a detailed breakdown. I also avoid log n access time maps. I try to make every operation constant time when possible. When you say "typical map" it sounds like you're describing a tree. For example, I understand std::map in C++ is a tree, so I never use it (unless following orders). Also, I think some functional languages use tree-based maps to avoid updating old nodes, since tree alteration can be done by patching the path of change, so the new graph simply doesn't refer to old nodes that needed a change.

But even if a hashmap was constant time (because flat and contiguous -- either logically or physically) it sounds like you don't want one because it moves 1:1 info somewhere else unnecessarily. That would only be a good idea if you wanted one side immutable while the other side of the relation varied as different passes of code associated new info each time.

Yeah, cache line misses are very costly. I think of them as "about 100 cycles" times some constant factor that may vary. But sometimes they are necessary. Large hashmaps normally incur at least one miss per lookup; but if lookup is necessary, there's no choice. Also, I use doubly linked lists when I want constant time removal, even though this incurs a miss on both sides -- it's still better than non-constant time.

I'm not trying to talk you into using hashmaps. I meant to suggest when a cache line miss is unavoidable, because of how other things are organized, sometimes the extra data is better organized in a hashmap for lookup, if you can globally optimize how things lay out in relations. But if you're using tree maps, I would try to talk you out of using them.

Possible solution

Ted Kaminski's post suggests a solution:

Instead of using ast.type at use occurrences, use typeOf(ast), which is a procedure that internally either checks for a canary value or performs a union dispatch on maybe (or some comparable thing), and raises an exception on use of an uncomputed value. With a suitable postfix operator, this serves to get the clutter of the dynamic check out of the way, which is the major concern (due to maintainability). A nice part of this approach is that the type system will help you detect places where you have failed to use the convention.
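
A minimal sketch of the convention (OCaml, with invented field names): the type field is only ever reached through type_of, which performs the dynamic check.

(* hypothetical sketch of the typeOf(ast) convention *)
exception Uncomputed_type

type ty = TInt | TBool | TFn of ty * ty

type ast = {
  kind : string;
  children : ast list;
  mutable ty : ty option;     (* None is the canary until inference has run *)
}

(* use sites call type_of rather than reading the field directly *)
let type_of (a : ast) : ty =
  match a.ty with
  | Some t -> t
  | None -> raise Uncomputed_type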

The bad part of this is that the check is dynamic. While I can see (in abstract) how to go about discharging an exception freedom proof in this case, expressing that constraint in a type system lacking dependent types is something I don't see how to do. Expressing it with dependent types would be a pretty laborious business.

So this seems to be one of those cases where the elimination of the dynamic check is more tied to program behavior than static type, and therefore a case where a language-embedded proof system would be the way to go if you cared enough about the cost of check or the consequences of possible run-time exceptions.

In this case, my feeling is that it isn't that hard to check (manually) that the type inference pass assigns types to all suitable AST node types that appear in the tree (that is: the visit is complete), and also not that hard to do coverage checks on the inference implementation. Better still, it isn't that hard to build a sanity check walker that tests the consistency of the result tree during software acceptance qualification - or even in production. After which one can probably drop the canary checks with acceptable de practico confidence.

While I can see (in

While I can see (in abstract) how to go about discharging an exception freedom proof in this case, expressing that constraint in a type system lacking dependent types is something I don't see how to do.

One thought: assume you support linear/unique types and strong type updates, at least on phantom types.

The AST is a linear/unique type, and you use phantom types to track the 'maybe' state. When the value is resolved, you can update the phantom type in-place which will propagate up the tree. This being purely a type change operation, it has no runtime cost.

This is really just simulating typestate, which seems closer to what you want.
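
Without linearity, one can at least sketch the phantom-index half of the idea in plain OCaml (invented names; note that this copies the node rather than performing the in-place strong update described above, so it is strictly weaker):

(* hypothetical sketch: a phantom parameter recording which phases have run *)
type ty = TInt | TBool

type unchecked
type checked

type 'phase ast = { kind : string; ty : ty option }

let parse (kind : string) : unchecked ast = { kind; ty = None }

(* only the checker can manufacture a [checked ast] ... *)
let check (a : unchecked ast) : checked ast =
  { kind = a.kind; ty = Some TInt (* real inference would go here *) }

(* ... and only passes holding a [checked ast] may demand the type *)
let type_of (a : checked ast) : ty =
  match a.ty with Some t -> t | None -> assert false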

Perhaps so...

Perhaps so. But empirical experience with linear types makes me very reluctant to introduce it as a language feature. BitC is aimed at non-research users. As such, there is a class of ideas that are worked out from a formal perspective but still pretty rough from a presentation to users (and therefore utility) perspective that I try to avoid. It is an unfortunate truth that this class of "engineering for utility" work is not something that the current research funding paradigm is effective at funding. The current tech transfer funding paradigm, for its part, is bad at recognizing, examining, and (when appropriate) exploiting the resulting opportunity gaps. It's part of why "deep" research results with "obvious" benefit are hard to translate.

shap

Mozilla's Rust language has

Mozilla's Rust language has adopted typestate for reasoning about these sorts of properties. That's probably the better option here.

Physical significance of linear typed objects: Fan out

Exposing linear or unique types to people writing BitC code doesn't seem too horrible to me. People working at that level should be somewhat comfortable with the idea of using particular values in limited ways, because hardware has fanout limits. If you want to send some computed bits to a million different places, you pay for that in hardware, and the same can just as easily apply in low-level code.