
The IO Monad is 45 years old

Oleg Kiselyov sent a message to haskell-cafe today titled "The IO Monad is 45 years old". I thought LtU readers might like it.

The AST Typing Problem

A few weeks ago I was visiting Mark Jones, and feeling particularly foolish I asked him what people do about "the AST typing problem" in a language like Haskell or SML. To my great surprise, he answered that he knew of no clean solutions. I'm curious whether a bunch of people with lots of PL-fu can identify some solution that we have failed to see.

The "AST typing problem" is a problem in static typing that consists of three parts:

  1. We intend to build an AST in the usual way. New information is added by several phases - most notably by the symbol resolver and the type checker - but strong typing requires us either to provide dummy initializers for fields belonging to later phases or to make heavy use of "option" (or equivalent), with the resulting code cruft. It seems that we cannot build a series of successively extended AST types, one per phase, both because the data structure is recursive and cyclic and because later passes rely on being able to re-run the various decorating phases to re-generate (e.g. by type inference) information on newly created AST nodes. How can we exploit strong static typing without making this code unmaintainably ugly? (See the first sketch after this list.)

  2. Within the parser, we get tree fragments from sub-parses and compose them as we work our way up the tree. A given sub-parse usually returns an AST node of known type, or in a few cases (e.g. expression) one of a very short list of possible node types. It seems that we should be able to express the fact that there is, in most cases, no need to perform union tag dispatch on such nodes, but there does not seem to be any evident way to do so. Can this be expressed in Haskell or ML in human-usable form? (See the second sketch after this list.)

  3. Each pass of the compiler essentially proceeds in recursive-descent form over the ASTs. In a given pass, many node types receive the same treatment (typically an oblivious recurse), but most safe languages provide no means to handle more than one union leg in the dispatch syntax. For the remaining AST types (the ones we handle in detail), each child generally has a well-known type or must be one of just a few node types; we would like to exploit this knowledge to avoid unnecessary union dispatch. This is, in some sense, the dual of part [2]. Once again, how can this be expressed in Haskell or ML in human-usable form? (The second sketch below also bears on this.)
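
To make part [1] concrete, here is a minimal Haskell sketch of one shape that is sometimes suggested: index the AST by compiler phase and let a type family decide what each annotation field holds in that phase, so that later passes see real data rather than dummy initializers or "option" plumbing. All the names here (TypeAnn, Parsed, TypeChecked, and the toy type checker that pretends everything is an Int) are illustrative assumptions, not taken from any existing compiler.

    {-# LANGUAGE TypeFamilies, EmptyDataDecls #-}

    module PhaseIndexedAst where

    -- Phase markers used only as type-level indices.
    data Parsed        -- after parsing: no type information yet
    data TypeChecked   -- after type checking: every node carries a type

    -- What a "type annotation" field means in each phase.
    type family TypeAnn phase
    type instance TypeAnn Parsed      = ()    -- nothing known yet
    type instance TypeAnn TypeChecked = Ty    -- a real type

    data Ty = TyInt | TyFun Ty Ty

    -- One AST definition shared by all phases; the annotation field
    -- changes meaning with the phase index.
    data Expr phase
      = Var (TypeAnn phase) String
      | Lit (TypeAnn phase) Int
      | App (TypeAnn phase) (Expr phase) (Expr phase)

    -- The type checker becomes a total function from one phase to the
    -- next, with no Maybe fields left over for later passes to unwrap.
    annotate :: Expr Parsed -> Expr TypeChecked
    annotate e = case e of
      Lit () n   -> Lit TyInt n
      Var () x   -> Var TyInt x                      -- toy: everything is Int
      App () f a -> App TyInt (annotate f) (annotate a)

Whether this meets the requirements in the problem statement is exactly the question: it handles the phase evolution, but re-running a decorating phase on newly created nodes still means producing nodes at the earlier index, and the single shared definition does nothing for the dispatch concerns in [2] and [3].
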
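The second sketch is aimed at the "statically known node category" aspect of parts [2] and [3]: a GADT indexed by syntactic category lets a sub-parser promise in its result type that it returns an expression, and lets a pass recurse into children at their precise category without union tag dispatch. Again, every name is an illustrative assumption, and this does not by itself remove the boilerplate of the "oblivious recurse" cases.

    {-# LANGUAGE GADTs, DataKinds, KindSignatures #-}

    module CategoryIndexedAst where

    import Text.Read (readMaybe)

    data Cat = ExprCat | StmtCat

    -- Each constructor records which syntactic category it builds and
    -- which categories its children must belong to.
    data Node :: Cat -> * where
      Lit    :: Int -> Node 'ExprCat
      Add    :: Node 'ExprCat -> Node 'ExprCat -> Node 'ExprCat
      Assign :: String -> Node 'ExprCat -> Node 'StmtCat
      Seq    :: Node 'StmtCat -> Node 'StmtCat -> Node 'StmtCat

    -- Part [2]: the sub-parse for an expression can be given this type,
    -- so callers know statically they hold an expression, not "some node".
    parseExpr :: String -> Maybe (Node 'ExprCat)
    parseExpr s = Lit <$> readMaybe s    -- toy parser: integer literals only

    -- Part [3]: because Assign's child is known to be an expression, this
    -- pass recurses into children at their exact category; no run-time
    -- check is needed to rule out statement nodes inside expressions.
    constFold :: Node c -> Node c
    constFold n = case n of
      Add (Lit a) (Lit b) -> Lit (a + b)
      Add a b             -> Add (constFold a) (constFold b)
      Lit i               -> Lit i
      Assign x e          -> Assign x (constFold e)
      Seq a b             -> Seq (constFold a) (constFold b)
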

Note that I'm looking for techniques here that are practical in a serious, commercial-grade compiler. Solving the problem at a high cost in indirections, e.g. by building a map of some sort, is not a production-grade solution; we're looking to compile at a rate very close to 100,000 lines of source code per second. Similarly, solving the problem by unifying the symbol resolution and type inference passes would compromise maintainability.

It's hard to imagine that this isn't well-trodden ground, but if Mark doesn't know, and I don't see it, it seems like it's at least a question worth posing. Dependent types can certainly handle some of this problem, but I'm really looking for clarification on how to do this in non-dependent type systems.

I should perhaps add: the first issue may not be apparent in current Haskell or ML compilers. The data structure presumed here is unnatural in a functional language, so implementers building compilers in these languages have likely chosen other ways to handle the evolution of AST structures across phases. I'm interested in those solutions, but I would expect (naively) that they fail the compile-speed goal due to memory indirection overhead. I would be very interested to learn otherwise!