Static typing and direct AST manipulation

Is this possible: to have static typing, meaning all types are determined at compile time, and at the same time direct read/write access to the abstract syntax tree of the program?

Maybe this is possible if type definitions are not part of the AST.

Maybe I am missing or confusing something here...

Let me ask a more practical question: how would ANSI C, as an example, have to be extended in order to provide direct read and write access to the abstract syntax tree at runtime, while not losing its compile-time type checking?

edit: I would like to clarify that what I mean by "AST" is not necessarily a data structure representing the entire program's contents; it can also be seen as a typed representation of the program's entities, e.g. the language exposes functions etc. as editable values. So the question is: if the program entities (is there a better word?) may be manipulated at runtime, is it still possible to type-check the program at compile time? Or what restrictions must be applied to the access in order not to lose the compile-time checkability?

Power

I think it depends mostly on how powerful the language is and how much control you want. For example, I've been doing some metaprogramming in Coq recently, which is possible because Coq's type system is powerful enough to express proofs of type-correctness.

For example, let's say 'AST' is our AST datatype and 'eval' is an interpreter. We could give 'eval' the following type:

Definition eval : AST -> option {t : Type & t}

This says we take in an AST and give out an option/maybe value, ie. "None" or "Some x", where "x" is a dependent pair containing a type and a value of that type. The 1st and 2nd elements of a dependent pair can be extracted using "projT1" and "projT2", which lets me write the following dependent functions:

(* Extract the type, defaulting to unit for "None" *)
Definition extractType (x : option {t : Type & t}) : Type
        := match x with
               | None    => unit
               | Some x' => projT1 x'
           end.

(* Extract the value, defaulting to unit for "None" *)
Definition extractValue (x : option {t : Type & t}) : extractType x
        := match x as x0 return extractType x0 with
               | None    => tt
               | Some x' => projT2 x'
           end.

I can compose these quite easily:

Definition runExtract (x : AST) : extractType (eval x)
        := extractValue (eval x).

runExtract is perfectly valid anywhere in my program, as long as its return type unifies with the type required at that point (ie. it type-checks). This requires that 'extractType (eval x)' can be computed statically; in other words, the type of the AST's value must be static, but its value can still be dynamic. For example, if we compute the type using lazy evaluation, the value will be left unevaluated.

I'm not sure how much of this is lost as we reduce the power of the language; the formulation above makes heavy use of dependent types, for example. Note that this is actually a common pattern for implementing DSLs in dependently-typed languages, although I've not seen it used to self-host (ie. where AST can represent any Coq term); doing so seems to be undecidable. I've been working around this by limiting recursion depth, so Coq-with-depth-N+1 can evaluate ASTs for Coq-with-depth-N.

I am not sure I understand

I am not sure I understand completely what you are saying. Would you mind rephrasing some of your points?

I'll try. The essence is

I'll try. The essence is that an AST node can be evaluated to produce some unknown value of some unknown type (or evaluation may fail, which we can handle with option types[1] which I'll ignore for now).

Since the type is unknown, we must return it along with the value, hence 'eval' needs to return a pair.

Putting a type into a pair and returning it from a function requires that our language treats types as 'first-class', ie. as regular values which can be computed and passed around. We can't do this in Java or Haskell-98, for example.

Next we need to work out what the type of this pair should be, ie. what A and B should be in (A * B). We know that one of the elements will be a type, so we'll put this first to get (Type * B).

We don't statically know what B is, but we *will* know it dynamically: it will be the first element of the pair.

We can represent a relationship like this by using a more general pair type called a dependent pair[2]. Dependent pairs have the form {x : A & f x}, which can be read as "a value we'll call 'x' of type A, along with a value of type 'f x'". Here 'f' is just some function, so we can calculate our types from runtime values.

In our case we want A to be Type and f to be the identity function. This gives us {x : Type & x} which is the type of pairs where the first element is the type of the second element. This is exactly what we want to return from eval.

Once we've called eval, we want to be able to pull out the second element (the value the AST evaluates to) and use it in our program. I did this with "projT1" and "projT2" above, which are just the dependent-pair equivalents of "fst" and "snd".

The thing is, we only want to pass around this extracted value in a type-safe way, which requires us to know its type at compile time. How can we find out its type? We can look at the first element of the dependent pair. But the dependent pair is calculated from an AST node at runtime, so how can we know what it is at compile time?

The only way we can know is if it is always the same. For example, I can write the following function which evaluates an AST and adds 5 to it:

evalThenAddFive x = (runExtract x) + 5

Now, what is the type of this function? It clearly returns an Int, but what is the type of x? It is *not* just an AST, since that wouldn't be type-safe. The type of x is 'an AST which evaluates to an Int'.

In Coq I can write such a type by adding an extra argument which proves that x evaluates to an Int, like this:

Definition evalThenAddFive (x : AST) (p : extractType (eval x) = Int) : Int
        := (match p in (_ = t) return t with
                | eq_refl => runExtract x
            end) + 5.

Note that we never mention the *value* which x evaluates to, so that can be completely dynamic. We only require that it's always an Int. Also note that the reason we returned the type from eval was originally because we couldn't work it out beforehand. That's still the case *in general*, but having access to the type allows us to write proof obligations like p above.

[1] http://en.wikipedia.org/wiki/Option_type
[2] http://en.wikipedia.org/wiki/Dependent_type#Dependent_pair_type

In the case of C, the

In the case of C, the language has basically no existing support for type-safe code rewriting. Adding it would require the following:
1) A representation of ASTs, along with a library of functions for creating, manipulating, and traversing them (a minimal sketch follows the list).
2) A theorem-prover with a model of C's type system and the AST representation.
3) A compiler for turning ASTs into machine code.
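
As a rough illustration of (1), here's a minimal sketch of what an AST representation might look like in C; all names here are hypothetical:

/* A tagged-union AST for a small C-like expression language. */
typedef enum { AST_INT_LIT, AST_VAR, AST_ADD, AST_CALL } ast_kind;

typedef struct ast {
    ast_kind kind;
    union {
        int int_lit;                              /* AST_INT_LIT */
        const char *var_name;                     /* AST_VAR */
        struct { struct ast *lhs, *rhs; } add;    /* AST_ADD */
        struct { struct ast *fn;
                 struct ast **args;
                 int n_args; } call;              /* AST_CALL */
    } u;
} ast;

/* The library part would provide constructors, traversals, etc.: */
ast *ast_add(ast *lhs, ast *rhs);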

Given the above, we can generate programs in AST form by using the AST library. We can then use the theorem prover to show that our AST program has the C type we want. We can then compile it with the AST compiler, load it into memory and jump to it (this may require some indirection, depending on the machine architecture and OS; eg. x86 Linux doesn't allow executing data).

The thing is, C's type system is incredibly weak, so there's not much point going to all this effort just to maintain some flimsy semblance of safety. It's much easier to just dump code into RAM and jump to it, safety be damned.

If a language has a powerful type system, like Haskell for example, then the safety guarantees may be worth the effort of dumping ASTs through a compiler.

If the language has a really powerful type system, like Coq, we can push those safety guarantees right back into the AST-generating functions, to guarantee that ill-typed ASTs can never be generated in the first place.

"x86 Linux doesn't allow

"x86 Linux doesn't allow executing data"

Actually you can change the protection status of a region using the mprotect() library function (perhaps that's what you meant by "indirection", but usually people use that term for a jump table or similar).
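
For instance, here's a minimal sketch of the "copy code into memory, make it executable, jump to it" dance (error handling omitted; the bytes encode "mov eax, 42; ret" on x86-64):

#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    /* Machine code for "mov eax, 42; ret" (x86-64, System V ABI). */
    static const uint8_t code[] = { 0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3 };

    /* Get a writable page and copy the code into it... */
    void *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    memcpy(buf, code, sizeof code);

    /* ...then flip its protection from writable to executable. */
    mprotect(buf, 4096, PROT_READ | PROT_EXEC);

    int (*fn)(void) = (int (*)(void))buf;
    return fn();    /* exits with status 42 */
}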

I don't know why you'd want

I don't know why you'd want to mutate the AST at run time. I'm not aware of any language that allows this, statically typed or not (except maybe Perl 6, I'm not sure). You *can* however do Lisp style macros in a statically typed language. Check out Scala for example. Also check out MetaOCaml for a language extension that lets you generate new code at run-time.

I don't know why you want that either.

Your question is weirdly interesting, so I want to know the train of thought that led you to it. Can you articulate the motive prompting the idea? (It's okay if it's hazy and hypothetical; a lot of useful reasoning starts that way before careful sorting.)

If the program entities (is there a better word?)

Term "entity" is sufficiently vague you can use it to refer to anything. So it's good as an all-inclusive bucket catching everything.

may be manipulated at runtime, is it still possible to type-check the program at compile-time?

That sounds like, "If I dynamically alter at runtime what I statically checked at compile time, is the static check still valid?" Only by luck, I would think, since you ought to be able to violate any invariant established at compile time, making the static check no longer true.

Let me ask a more practical question: how would ANSI C, as an example, have to be extended in order to provide direct read and write access to the abstract syntax tree at runtime, while not losing its compile-time type checking?

In your scenario, does altering the AST affect the running code, immediately or later? When does new code corresponding to the new AST get generated and run? It's a desire to write the AST at runtime that puzzles me. Is it for exploratory programming? Would generating new source to run later be enough for experiments?

I plan to rewrite C only to change when things happen, so async calls can occur, structured as yielding to a trampoline at call time. But the idea is to alter only when something happens, which C doesn't really care about, as opposed to what happens, which ought to stay the same. I only want to end up with another encoding of the same AST, so the goal is unchanged invariants when timing doesn't matter.

My transformation is narrow and simple, while yours sounds arbitrarily complex, if you allow any runtime rewrite. All I have to do is find all calls to functions that yield, then heap allocate a refcounted parameter block for formal parameters and return to the trampoline dispatcher to make the actual downcall, with the rest of the function turned into a separate function called as the continuation (also provided with the downcall info).

[A function that yields is one that blocks, makes an async call, takes too long, or calls another function that yields. The resulting "yield propagation" is similar to const propagation, so you could infer most of it from global analysis, provided the base cases are clearly declared for blocking, long-running, or async-calling.]
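
To make the shape of that rewrite concrete, here's a minimal sketch (all names and the dispatcher interface are invented for illustration) of what it might produce for int f(int a) { int r = g(a); return r + 1; }, where g yields:

/* Heap-allocated, refcounted parameter block for f's formals. */
typedef struct frame {
    int refcount;
    int a;          /* f's formal parameter */
    int g_result;   /* filled in by the trampoline once g completes */
} frame;

/* The rest of f after the call to g, split out as a continuation. */
int f_cont(frame *fr) {
    int r = fr->g_result;
    /* decrement fr->refcount and free if zero (omitted) */
    return r + 1;
}

/* The head of f: package the formals, then return to the dispatcher,
   which makes the actual downcall to g and later invokes f_cont. */
void f_start(frame *fr, void (*dispatch)(frame *, int (*)(frame *))) {
    dispatch(fr, f_cont);
}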

If you tell folks what you want to happen, they can make better suggestions. So far it sounds like you want to destabilize something already painfully fragile, in the case of C. You must have a payoff justifying some risk.

"If I dynamically alter at

"If I dynamically alter at runtime what I statically checked at compile time, is the static check still valid?" Only by luck, I would think, since you ought to be able to violate any invariant established at compile time, making the static check no longer true.

Compile-time invariants and static checks are types, so we should be replace parts of a program if they have the same type. Note that this notion of type may be stronger than that of the language; it depends on how far we want to go.

In the specific case of C, its type system is incredibly weak, so we can get away with changing a lot of things in a type-safe way. For example, many functions take a pointer and a length; there is no static/type information linking the two, so it's perfectly type-safe to alter the length value to any unsigned int we like, whenever we like. The code will clearly break, but it is still type-safe.
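
Concretely, as a minimal sketch of that point:

#include <string.h>

/* The type system sees only (char *, size_t); nothing ties len to
   the actual size of buf. */
void zero(char *buf, size_t len) {
    memset(buf, 0, len);
}

int main(void) {
    char buf[16];
    zero(buf, 16);      /* correct */
    zero(buf, 4096);    /* equally well-typed, but overruns the buffer */
    return 0;
}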

Trapped in a twisty maze of types.

I infer you find C hopeless as a tool because its weak types cannot statically catch invalid integer values, for example ones exceeding the correct limits on space safe to write in a buffer. (Unless you fold the buffer size into a struct type and then specialize functions to take such structs as arguments; see the sketch below.)
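
A minimal sketch of that folding trick, with hypothetical names:

struct buf16 { char bytes[16]; };

/* A function taking struct buf16 * can only be handed a 16-byte
   buffer, so the bound below is pinned down by the type. */
void fill16(struct buf16 *b) {
    for (int i = 0; i < 16; i++)
        b->bytes[i] = 'x';
}

Every distinct buffer size then needs its own struct type and its own specialized functions, which is much of what makes this style irritating in practice (cf. the Pascal-string irritation mentioned below).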

Compile-time invariants and static checks are types, so we should be [able to] replace parts of a program if they have the same type.

(My mind inserts able to in that spot so automatically, I must have read it ten times before seeing those words were not there already, until I stared at it. Funny thing.)

That sounds wrong to me. We should be able to substitute expressions with the same value, not any instance of the same type with a different value. What you said would only be true if each unique value had its own unique type, right?

Here I was going to go into Pascal strings with a specific size folded into type, but thought better of it. (They are really irritating.) Anyway, this strikes me as indirectly related.

You seem to be alluding to arbitrary self-modifying code.

You seem to be alluding to some kind of provisions (via library support or native language constructs) to allow for arbitrary self-modifying code.

I can't make any judgment a priori (one way or the other) without more info on the use cases you have in mind, but I'm rather intrigued by your, say, "insistence" on attempting it at the AST level in particular.

Of course, self-modifying code is pretty common these days, if only in the case of managed environments (Java, .NET, etc), but it's usually done in a "disciplined" way (well, if I dare say) via runtime support and reflection capabilities over runtime "entities" that have been designed to support it in a (relatively) type-safe way, precisely - e.g., app domains, assemblies, modules, reified runtime types, etc, in .NET (cf. System.Reflection, ...).

But also, and perhaps more importantly, when that's done, it's AFAIK only allowed within the controlled bounds of one *major* assumption - not random at all - and kept in check by the same language runtime:

that it is only "okay" to *augment* the computational facilities of the running program (i.e., compiling *new* types and function/method bodies), as opposed to tampering with or unloading the pre-compiled modules that were loaded at application start (not allowed)**.

In short, my perplexity/curiosity joins quite naturally with McCusker's, as I concur.

'HTH.

**I personally interpret it, in such a managed runtime environment, by analogy with the Church-Turing thesis, as:

once a new theorem (~ a new runtime type) is derived/learned (~ loaded) thanks to a pre-existing theory (~ the pre-loaded modules), you can't quite afford to change it or to "unlearn" it, or at least not before you're willing to discard it, along with everything else that allowed you to introduce it (~ application shutdown) and to put it to work to possibly build others, etc. (if only to keep consistency)

Perplexity can inspire occasionally.

Perhaps tkirchner is done with this topic. Your closing paragraph rings true. Unlearning code is dangerous if something still depends on it. If a language can't see all references to code, changing it isn't safe, unless the new version is equivalent. (JIT compiler output aims to be the same thing, but optimized.)

Of course, self-modifying code is pretty common these days, if only in the case of managed environments ...

Modifying state just has to be worth the risk of variability, even if it's code. As a feature, changing code is a side effect with potential non-portable quality. Depending on it may narrow deployment options. I'm not picking on managed runtimes — depending on anything narrows options. If you don't depend on modifying code, another way might work as well without giving you data races in code space.

If you squint, you can view most computation as bytes moving around, in a kind of tower-of-hanoi game whose disks are different parts of memory hierarchy, with secondary storage at the bottom and registers at the top. The interesting changes only occur in the white hot tip as the result of CPU instructions, and everywhere else bytes are just flowing around in i/o, messages, and caching layers.

As a metaphor, that's a useful reductionist view when you want to say, "Well, the only thing happening is XYZ," so you can focus on consequences of moving bytes around, and getting such movement serviced by CPU. If no one can ever observe a result directly or indirectly, it doesn't need to occur, which applies to more than one level of the tower. A C compiler might ignore code with undefined effect, and schedulers can terminate processes whose main output is unobservable.

As soon as your programming language includes any concept of process management, what happens in interaction with the host operating system gets harder to define, because it sounds less and less like PL and more like bytes moving around in i/o and scheduling systems. Which makes sense given the tower-of-hanoi computation model.
