Closures without function pointers

I am working on compiling a dynamic, collection-oriented, (mostly) first-order language onto hardware which does not efficiently support function pointers (or much indirection of any sort). This would be easy if my language were fully first-order, but the language has a limited set of primitive higher-order operators (such as map, reduce, and scan). I avoid the creation of function values by specializing these higher order operators on each distinct function argument they receive. So, map(sqrt, ...) is compiled separately from map(cos, ...) and so forth. In my current translation/compilation scheme I have a template for each operator, such as:
map_template(f, arg_type, result_type) =
   emit_fn(arg)
      z = allocate result_type[size(arg)]
      for i = 1 to size(arg):
         z[i] = INLINE(f, arg[i]) /* copy the body of f here, replacing its input variable with arg[i] */
      return z
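To make the scheme concrete, here's a rough Python sketch of the idea (the names map_template/specialize_map, the naive textual substitution, and the use of exec are hypothetical stand-ins for the real compiler and GPU code emission):

MAP_TEMPLATE = """
def {name}(arg):
    z = [None] * len(arg)
    for i in range(len(arg)):
        z[i] = {body}    # INLINE(f, arg[i]): f's body with its parameter replaced by arg[i]
    return z
"""

def specialize_map(fn_name, param, body_expr):
    # Naive textual inlining; a real compiler would substitute into an AST.
    inlined = body_expr.replace(param, "arg[i]")
    src = MAP_TEMPLATE.format(name="map_" + fn_name, body=inlined)
    ns = {}
    exec(src, ns)    # stand-in for emitting and loading low-level code
    return ns["map_" + fn_name]

map_sqrt = specialize_map("sqrt", "x", "x ** 0.5")
print(map_sqrt([1.0, 4.0, 9.0]))    # [1.0, 2.0, 3.0]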
I run into trouble, however, when I try to pass a closure to a higher order operator. I can't simply inline the function portion of the closure, since it will be missing its closed-over arguments. Do the closure parameters need to be passed into the emitted function? It seems that I need to distinguish ordinary naked functions from closures in some way-- though I'm not sure what a rigorous way to do this might be. Can someone suggest some reading which will either give me ideas worth copying or help me clarify my thinking?
edit: For clarification, compilation happens at runtime, so the set of possible first-order functions is unbounded.
Thanks,
Alex

Defunctionalization

The general technique is known as defunctionalization, and the respective Wikipedia page contains some links, in particular to John Reynolds's original 1972 article.

Edit: Defunctionalization as such is somewhat different from what you describe, but I think the difference mainly amounts to inlining and partially evaluating the apply function.
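For reference, a minimal sketch of the classic transformation (in Python, with made-up tags): each function value becomes a data constructor carrying its free variables, and a single first-order apply dispatches on the tag.

from dataclasses import dataclass

@dataclass
class Sqrt:              # represents  lambda x: x ** 0.5
    pass

@dataclass
class AddN:              # represents  lambda x: x + n, closing over n
    n: float

def apply(f, x):
    # The single first-order dispatcher that replaces all indirect calls.
    if isinstance(f, Sqrt):
        return x ** 0.5
    if isinstance(f, AddN):
        return x + f.n
    raise ValueError("unknown function tag")

def map_(f, xs):
    return [apply(f, x) for x in xs]

print(map_(Sqrt(), [1.0, 4.0]))      # [1.0, 2.0]
print(map_(AddN(10.0), [1.0, 4.0]))  # [11.0, 14.0]

Inlining apply into map_ and partially evaluating it for a known tag gives back the kind of specialized code described in the original post.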

Not a closed set of functions

The trouble with defunctionalization (aside from resulting in extremely large function bodies) is that I don't have a known finite set of functions. Compilation is initiated from within an interpreter, which means I need a scheme that can handle previously unseen functions.

However, you do have a known

However, you do have a known finite set of functions for each call site: that's how you're able to inline cos in map(cos ...), and end up with a specialised map-cos function.

I've seen the defunctionalized light

Ok, I think I see what Andreas means by "the difference mainly amounts to inlining and partially evaluating the apply function".
If I wanted to recast what I'm doing as performing defunctionalization, then I can pretend to have a separate apply function for every higher-order operator.
So, the following code:
  f(x) = sqrt(x)
  g(x) = map(f,x)
is defunctionalized into
  f = F
  g(x) = apply_map(f, x)
  apply_map(tag, arg) = match tag with F -> /* implementation of map(f,arg) */

I then perform a trivial control flow analysis to get the unique tag being passed to each instance of apply_map (and apply_reduce, etc..), and specialize that call site to get:
  g(x) = apply_map_f(x)
  apply_map_f(arg) = /* implementation of map(f,arg) */
Now, if f is actually a closure, I just attach the closure data to the function tag.
  f = F (c_arg1, c_arg2, ...)
and ultimately end up with
  apply_map_f(c_arg1, c_arg2, ..., arg)

Not that light

>I then perform a trivial control flow analysis to get the unique tag being passed to each instance of apply_map (and apply_reduce, etc..), and specialize that call site to get

Hmm, your control flow analysis might not be that trivial. As I understand it, defunctionalization solves the generic case where it is not known a priori what argument is given to map. It's an either/or choice: either you solve the generic case by passing around what is essentially a closure via defunctionalization (and you might want to consider whether defunctionalization is actually such a good approach for that generic case), or you don't, and you might as well just specialize by constant propagation.

I've been assuming that,

I've been assuming that, like APL, the OP's language restricts how functions are used, so that they're not quite first-order, and the flow analysis becomes trivial.

Should be Doable

I run into trouble, however, when I try to pass a closure to a higher order operator. I can't simply inline the function portion of the closure, since it will be missing its closed-over arguments. Do the closure parameters need to be passed into the emitted function? It seems that I need to distinguish ordinary naked functions from closures in some way-- though I'm not sure what a rigorous way to do this might be. Can someone suggest some reading which will either give me ideas worth copying or help me clarify my thinking?

To be honest, I think the answer is quite trivial: It is always possible to specialize a function with a constant by substitution.

In your case, where I assume that you always must pass a constant denoting a function to a higher-order function, you don't need to think about the general case where the function argument may be unknown at compile time - which makes it a lot easier.

And indeed, yes, you need provisions to deal with possible extra closure arguments; as you saw, you can't just let them disappear. I think you could achieve that by making those arguments explicit in the syntax, for example by letting the declaration read map_template(f(args), arg_type, result_type), where args receives the closure parameters.
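Something like the following Python sketch (hypothetical names again, with exec standing in for code emission): the emitted function takes the captured variables as ordinary leading parameters, so the inlined body can still refer to them.

MAP_CLOSURE_TEMPLATE = """
def {name}({captured}, arg):
    z = [None] * len(arg)
    for i in range(len(arg)):
        z[i] = {body}    # inlined body; may mention the captured names
    return z
"""

def specialize_map(fn_name, captured, param, body_expr):
    inlined = body_expr.replace(param, "arg[i]")    # naive textual inlining
    src = MAP_CLOSURE_TEMPLATE.format(
        name="map_" + fn_name, captured=", ".join(captured), body=inlined)
    ns = {}
    exec(src, ns)
    return ns["map_" + fn_name]

# scale(y) = c * y, closing over c: the emitted function takes c explicitly.
map_scale = specialize_map("scale", ["c"], "y", "c * y")
print(map_scale(10, [1, 2, 3]))    # [10, 20, 30]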

What I don't understand is that you build closures in a first-order language (without function pointers) in the first place. How do you do that? Do you Curry some function applications, provide a partial list of constant arguments, or do you have lexical scopes where variables may be bound? Maybe you should give a concrete example where you run into problems.

[ I mean, are you not sure you are trying to implement higher-order programming? In that case, you could try the defunctionalization approaches mentioned in the other posts. ]

Not quite higher order

> It is always possible to specialize a function with a constant by substitution.

Do you mean something beyond the specialization I'm already doing?

> For example, by letting the declaration read map_template(f(args), arg_type, result_type) where args receive the closure parameters.

This is the only solution I've come up with so far-- but I felt somewhat stuck in my thinking and wanted to see if there were better alternatives.

What I don't understand is that you build closures in a first-order language (without function pointers) in the first place. How do you do that? Do you Curry some function applications, provide a partial list of constant arguments, or do you have lexical scopes where variables may be bound? Maybe you should give a concrete example where you run into problems.

I end up with closures both from lambda lifting of nested functions and from explicitly partially applying a function.
Example:
vector_index(x,indices) =
   g(i) = x[i]
   map(g, indices)

The expression map(g, indices) would get translated via the map template into a low-level function which is called with the vector indices.
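Concretely, after lambda lifting the example looks roughly like this (a Python sketch, not the actual generated code; in the real output g_lifted's body would be inlined into the loop):

def g_lifted(x, i):          # was: g(i) = x[i], closing over x
    return x[i]

def map_g(x, indices):       # specialized map(g, ...); x is passed along as a closure argument
    return [g_lifted(x, i) for i in indices]

def vector_index(x, indices):
    return map_g(x, indices)

print(vector_index([5, 6, 7, 8], [1, 3]))    # [6, 8]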

> I mean, are you not sure you are trying to implement higher-order programming?

It's really an awkward mix of first and higher order. The programmer is free to create function values and pass them about. The only restriction is that function values must all be "statically" determinable after exhaustive inlining and constant propagation. I guess this scheme might seem poorly motivated (and my notion of static confusing), so it's worth noting that the target architecture is a graphics processor and compilation happens at run-time. An example of a function which won't be accepted:

f(b, x) =
  g = if b then sin else cos
  map(g, x)

> In that case, you could try the defunctionalization approaches mentioned in the other posts.

There are two problems with using defunctionalization here.
1) The set of possible functions is not statically known-- I'm compiling at runtime and more functions may be defined later.
2) Even if I knew all the possible functions, code bloat and compilation time would become a serious problem.

I see, mostly


>> It is always possible to specialize a function with a constant by substitution.

> Do you mean something beyond the specialization I'm already doing?

No, I mean exactly the specialization you are doing. You allow higher-order programming with the restriction that you only specialize/propagate constant symbols but not variable symbols. This is a well-known technique. Good: very fast execution of specialized higher-order functions. Bad: possible exponential blow-up in code. (Imagine specializing compose(f, g, x) = f(g(x)) for sin and cos pairs, you end up with sin/sin, sin/cos, cos/sin and cos/cos specializations.)


> I end up with closures both from lambda lifting of nested functions and from explicitly partially applying a function.
> Example:

vector_index(x,indices) =
   g(i) = x[i]
   map(g, indices)

> The expression map(g, indices) would get translated via the map template into a low-level function which is called with the vector indices.

Uh, I have serious problems reading this example. What variable is i? Do you loop over the elements of x? To what is g(i) bound, to the i-th element of x? (My best estimate at interpreting that blob now is that i is universally quantified and you declaratively define g. Which is, I mean, like, wow, almost logic programming. g = lambda i -> x[i]?)


> It's really an awkward mix of first and higher order. The programmer is free to create function values and pass them about. The only restriction is that function values must all be "statically" determinable after exhaustive inlining and constant propagation. I guess this scheme might seem poorly motivated (and my notion of static confusing), so it's worth noting that the target architecture is a graphics processor and compilation happens at run-time. An example of a function which won't be accepted:


f(b, x) =
  g = if b then sin else cos
  map(g, x)

No, I think I understand the motivation. But, as you have pointed out, constant propagation will only get you so far. At some point, you'll need to define precisely where and when you can propagate. (Which is easy, I guess: you can only propagate constant functions, so any higher-order function, when applied, should only accept function definitions as arguments.)


> There are two problems with using defunctionalization here.
> 1) The set of possible functions is not statically known-- I'm compiling at runtime and more functions may be defined later.
> 2) Even if I knew all the possible functions, code bloat and compilation time would become a serious problem.

To be honest, I don't expect those to be problems for getting defunctionalization to work. One way of looking at (part of) defunctionalization is that it just specializes with a specific apply symbol which switches between all possible functions, where the generic case also switches on closure arguments. If I don't consider closures for a moment, I expect that adding functions to apply at runtime might not be as big a problem as you think?

[ I just read your other response. Yah, we seem to agree. But, to be honest, if you're going to consider defunctionalization, you're going in the direction of allowing generic higher-order programming, and then I guess there are better approaches than defunctionalization. I.e., just bite the bullet and implement a lambda interpreter. ]

Bad: possible exponential

> Bad: possible exponential blow-up in code. (Imagine specializing compose(f, g, x) = f(g(x)) for sin and cos pairs, you end up with sin/sin, sin/cos, cos/sin and cos/cos specializations.)

True, though I'm hoping that since I only specialize function arguments the programmer actually uses, the number of specializations will stay manageable.

> Uh, I have serious problems reading this example. What variable is i? Do you loop over the elements of x? To what is g(i) bound, to the i-th element of x? ...g = lambda i -> x[i]?)

Sorry, that was sloppy terseness on my part. You're right about the definition of g, it's just "g = lambda i -> x[i]". So the code "map(g, indices)" simply returns a subset of "x" at the locations specified in "indices".

> there are better approaches than defunctionalization. I.e., just bite the bullet and implement a lambda interpreter.

I wish I could, but my execution environment makes it difficult. I'm compiling to NVidia graphics cards and even on the newest models where dynamic memory allocation and function indirections are possible the performance hit is terrible. How do you implement a lambda interpreter without function pointers or dynamic memory allocation?

Looks Hard

> I wish I could, but my execution environment makes it difficult. I'm compiling to NVidia graphics cards and even on the newest models where dynamic memory allocation and function indirections are possible the performance hit is terrible. How do you implement a lambda interpreter without function pointers or dynamic memory allocation?

Well, it's quite hard to see for me at the moment what code remains on the CPU, and what code goes to the GPU. It looked like the GPU has branching, so you could implement function pointers by a specialized routine which branches on a label instead of a function pointer. Dynamic memory management, I don't know. But, hey, all dynamic memory management is static management with a garbage collector on top, so that might be solvable too.

[ I am also puzzling myself with a lambda-calculus interpreter which just needs two stacks, one for stack frames, one for values with garbage collection, but that is a bit premature I think. ]

Having said that, I was thinking that a simple Haskell-like approach where GPU coding is captured in a separate monad must be doable. Thus it is always theoretically possible, but such an approach is probably too slow for your requirements.

But take it up with Shap, he has tons more experience in this than I do.

Runtime compilation ought to be quick

>Well, it's quite hard to see for me at the moment what code remains on the CPU, and what code goes to the GPU.

GPU code is (selectively) synthesized from a combination of an array operator (map, reduce, scan, etc...) and its known function argument. The function argument is then inlined into a template GPU program.

>It looked like the GPU has branching, so you could implement function pointers by a specialized routine which branches on a label instead of a function pointer.

That's true-- I think that's the defunctionalization sledgehammer approach. Compile all your function bodies into a single switch statement-- if the user defines a new function, recompile this gigantic apply function. I believe this approach would work but the compilation costs would be extreme.

Depends on the shader model

Depends on the shader model you are compiling for. For a basic pixel/vertex/geometry shader combination, compilation is fairly fast, at least for small amounts of code. For any CUDA/OpenCL/Compute Shader, compilation is dirt slow...even compiling once at run-time is probably too much. It's not clear to me if the new linking features coming out in CUDA/DirectCompute alleviate this problem when source compilation isn't needed (though they are language specific).

Compile times

>compilation is fairly fast, at least for small amounts of code.

I'm questioning how small you can keep the code-- any use of a function value in a GPU computation needs to inline the body of every function in your program into a giant switch!
Furthermore, if you want to create an interactive system you would need to recompile every time the user adds a new function.

I guess if you're only supporting non-interactive compilation and are OK with a loss of modularity, you can do CFA to limit the number of potential functions flowing to a particular argument.

>For any CUDA/OpenCL/Compute Shader, compilation is dirt slow...even compiling once at run-time is probably too much.

Compile times with CUDA depend on how and what you're compiling. My project-mate and I chose to directly emit PTX (GPU pseudo-assembly) and the combined time for both our compiler and NVidia's JIT has been in the 10s of milliseconds (for smallish but still useful benchmarks). If you try to use the nvcc toolchain directly (like Copperhead or Nikola) you can expect to wait a second. Even worse, if your system uses C++ templates to parameterize CUDA kernels (or uses a templated library like Thrust) then compile times can really drag. There are, however, many advantages to reusing an existing compiler so I don't necessarily advocate going the PTX route.

DX11 supports dynamic HLSL

DX11 supports dynamic HLSL compilation directly (even for compute shaders), but it can be very, very slow when the compiler needs to do heavy-duty optimization. The problem appears to arise when there are small loops that can be unrolled in huge ways. I'm not sure if this overhead is mandatory, slightly beneficial, or hugely beneficial. I asked the devs but they couldn't give me anything that could be considered a final answer. Dynamic compilation of shaders is now discouraged even if still supported, unfortunately.

There is still a lot of work to be done on fast compilation for GPUs. Your PTX approach sounds very nice, especially since you aren't using any of the high level features of the language, but I wonder what the trade offs are of skipping the NVIDIA tool chain.

Outside of GPUs, Microsoft has better support for dynamic compilation than anyone, thanks to the Dynamic Language Runtime. This means Bling can generate fairly efficient (unabstracted, non-allocating) code to maintain the GPU shaders.

Compilation tradeoffs

>but I wonder what the trade offs are of skipping the NVIDIA tool chain

We have ended up reinventing nvcc in a trimmed-down, less featureful (and sometimes buggier) form. This includes maintaining a C-like kernel templating language, an internal representation of PTX, and a simple compiler between the two. If dynamic compilation of CUDA C code were sufficiently fast I would happily switch to using that instead.

Exponential code explosives

It seems to me that particular problem could be resolved with a little hotspot analysis. This would only specialize long-lived, curried functions that are used often. And the specialized functions could still be garbage collected.

Alternatively (or additionally) you could favor a 'specialize' tag to support programmer annotations. specialize f x = f x. There is no functional semantics to the specialize tag, but it tells the runtime to generate a new compiled function. In this sense, it is up there with 'par' and 'seq' in Haskell.

Aside: I've been leaning ever more towards favoring those 'identity' tags as a basis for program annotation, to the point that I contemplate even turning comments into a comment tag in my ASTs: comment c v = v. I've come to think of pragmas as a little ugly, in comparison. It might be interesting to provide the functions behind these annotations via pluggable modules.

Presumably, we could also tag 'inline' and 'noinline' if we wished it. inline f = f. noinline f = f. These would compose. To inline a specialized function:

z[i] ← inline $ specialize f arg[i]

Though I'm not sure this would make sense unless you'll be applying additional arguments to f, and it's all a bit hand-wavy until you add compiler support for the annotations and hammer out the details.
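In Python terms the idea would look something like the sketch below (the names and the host language are just for illustration; a compiler would have to recognize the tags to do anything useful with them):

def specialize(f, x):
    return f(x)    # semantically just application; a hint to compile f specialized here

def inline(v):
    return v       # identity; hint to inline the tagged call

def noinline(v):
    return v       # identity; hint to keep the call out of line

def comment(c, v):
    return v       # identity on v; the comment string c lives only in the AST

f = lambda v: v * v
arg = [1, 2, 3]
z = [None] * len(arg)
for i in range(len(arg)):
    z[i] = inline(specialize(f, arg[i]))    # mirrors: z[i] <- inline $ specialize f arg[i]
print(z)    # [1, 4, 9]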

True, but may not be relevant

To be honest, I think the answer is quite trivial: It is always possible to specialize a function with a constant by substitution.

While this statement is true, there are two problems with it in the context of what Alex is trying to do:

  1. Extensive practical experience in the CLR world has demonstrated that template-style expansion over value types (i.e. unboxed types) leads to an explosive amount of code. Alex's statement that indirection on this machine is problematic leads me to infer - or at least ask - whether he may be relying heavily on unboxing in his context.

  2. Closure pointers are the result of run-time allocation. They are not compile-time pointers. In fact, they are not constants at all.

All that being said, I'm very surprised at the proposition that a sensible architecture can exist without adequate support for function pointers (or more precisely, for indirect control transfer), because every architecture I know about implements some form of RETURN instruction. While it may be true that there is no straightforward indirect call or indirect jump instruction, clever use of stack+RETURN should suffice in this case.

Alex: if you contact me offline (shap at eros-os.org), I'd be happy to discuss implementation techniques with you. There are a range of techniques here, and I looked at a number of them in the context of the BitC implementation. The right choice depends on both your target architectures and your interoperability requirements.

Oh. I should perhaps add that we rely heavily on a variant of defunctionalization in BitC: we implement polymorphism by concretely instantiating each procedure/value at each type where it is used. For kernel-level code the ability to do that is a hard requirement, but we've long since concluded that in many practical implementations we're going to have to spread that instantiation between compile-, link-, and run-time. Thankfully, we have some [perhaps excessively] clever ideas about how to achieve that in a type-safe, low-overhead way.

Beg to Differ

> Closure pointers are the result of run-time allocation. They are not compile-time pointers. In fact, they are not constants at all

Interesting. If I take graph rewrite semantics for a lambda-calculus interpreter, then closures correspond to thunks: a call f(x,y) corresponds to pushing a closure [F, x, y] onto a heap (or stack), where F is the tag for f and x and y are constant values, and then evaluating that. There's no reason not to see that as a ternary constant (symbol) which is evaluated by some abstract machine which knows the meaning of F, i.e. the procedure f.

When moving from the abstract machine to a concrete machine, I think the observation is that people just see that, hey, it is simply faster to push x, push y, and push the pointer to f onto a stack, call that, and pop off the frame, instead of implementing an abstract machine which allocates such a thunk on the heap and somewhat awkwardly only does symbolic manipulation.

But it's a fine distinction in semantics: Is evaluation manipulating constant symbols, or do symbols have dynamic meaning? It just depends where you place a divider in your head, I guess.

All in all, I think I don't agree with you there.

[ Note that this also holds true if one thinks of closures as tuples which capture part of an environment in a function which should be evaluated. I.e., we determined f and x, thus construct [F, x] and await the value for y. ]

[ I guess in the end it also just matters which glasses you look at the code through. In the lambda calculus any fixed value is a constant, whereas in C, anything compile-time determinable is a constant. It's just a mix-up of terminology. ]

Perhaps I am missing something obvious

Generally, when people start talking about closures, they are doing so in the context of languages having first-class procedures. In such languages, closure formation is more than symbol manipulation. Given:

(def f (lambda (x)
         (lambda (y)
           (+ x y))))
(f 5)

The result returned by (f 5) must record the value of x, which is not knowable at compile time. It must do so for an indeterminate time. We cannot usually tell how many times f will be called, so we cannot (in general) preallocate the closure itself at compile time. The closure must be allocated at runtime, and if procedures are first class it cannot reside on the stack.

Which brings me back to my original statement: closure pointers are the result of run-time allocation; they are not compile-time constants at all.

There are several important and useful special cases:

  • Closures consisting exclusively of the capture of global variables can be entirely elided.

  • Closures whose associated procedure does not escape can be stack-allocated.

  • In specific cases we may be able to determine some of the things that we can't determine in general, and in those cases several of my statements do not apply.

Your proposal to view a closure as a ternary constant presumes that you can know at compile time all possible values of x and y. For very special cases I can imagine enumerating them, but it's hard to imagine that this would be useful in many real cases.

Just Different Terminology

My original remark was that you can always specialize functions by substituting a constant.

With constant and function, I mean either 5 and f in your example above, or cos and map in the original post.

I think a better phrasing would have been: You can always specialize function arguments by substituting a constant argument if that constant is statically determinable.

On a slightly different level: What I think I see in the defunctionalization approach is that the procedure essentially 'twists' higher-order functional programs into first-order (interpretative) programs by evaluating closures, expressed as data, with an apply function when necessary. Given that, one can see closures as records (or, in LC terms, constants) which are mapped to functions (i.e., their meaning) with an apply operator. [ In hindsight, I probably shouldn't call a record a constant. ]

This has little to do with the operational interpretation of a closure as a run-time decorated pointer. Hence the misunderstanding.

Your interpretation makes better sense in traditional compiler building terminology.

You'd probably need lambda

You'd probably need lambda lifting for closures generated on the GPU, but you certainly could specialize any closures passed into the snippet of code that is to eventually run on the GPU. As noted in the OP, this compilation can occur at 'run-time'.

I guess I'm feeling a little confused. What, exactly, would you call the time of run-time compilation? Is it run-time or compile-time? or justin-time?

A greater problem would be semantic mutability. If a closure encapsulates mutable references it may be impractical to replicate that closure to the GPU. But if our values and variables are immutable, then we can often specialize them as they become known - and 'constant' can refer just as easily to immutable records and closures as to values.

What you can do with a

The only thing you can do with a function pointer is call it: f(x). There are a finite number of different functions in your program. So what you do is this: replace a call through a function pointer f(x) with call(f,x). Then write call like this:

function call(f,x){
  switch(f){
    case FUNCTION1: return function1(x);
    case FUNCTION2: return function2(x);
    ...
  }
}

Defunctionalization?

This is the same as defunctionalization, right? Except for closures I would need to have both a function tag and a data structure containing the implicit arguments.

Anyhow, I think this doesn't really work for me since compilation isn't happening all at once but rather dynamically, which allows for new user defined functions to appear.

Don't forget link-time

Anyhow, I think this doesn't really work for me since compilation isn't happening all at once but rather dynamically, which allows for new user defined functions to appear.

Thinking about link-time has been a big help for me. At link-time, you always have a statically known set of definitions. Once you bring dynamic compilation into the picture, your link-time happens (also) at runtime (in the dynamic linking loader). But that doesn't change the fact that stuff at link-time is still static, and allows you to do various optimizations. It's just that with dynamic linking, you might have to invalidate earlier assumptions, and recompile things just-in-(link)time.

Loss of code caching?

recompile things just-in-(link)time.

Doesn't that mean that I would lose the ability to cache previously compiled code? If the user loads or defines a new function and then uses it with some higher order operator, I would have to completely recompile the defunctionalized implementation of that operator-- duplicating most of the work of the previous compilation.

No


> Doesn't that mean that I would lose the ability to cache previously compiled code?

No, defunctionalization means you actually don't need to specialize higher-order functions; the higher-order function is made generic by being able to switch between functions with an apply operator. (Ok, you need to read defunctionalization with a bit of a skewed eye for this.)
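A sketch of what I mean, in Python with hypothetical names (on the actual hardware the table would be the switch-style apply discussed elsewhere in the thread): defining a new function only extends the dispatch table; the compiled map is untouched.

FUNCTION_TABLE = {}    # tag -> first-order implementation

def define(tag, impl):
    FUNCTION_TABLE[tag] = impl    # "recompiling apply" amounts to adding an entry

def apply(closure, x):
    tag, env = closure
    return FUNCTION_TABLE[tag](env, x)

def map_generic(closure, xs):    # compiled once, never respecialized
    return [apply(closure, x) for x in xs]

define("SQRT", lambda env, x: x ** 0.5)
print(map_generic(("SQRT", ()), [1.0, 4.0]))        # [1.0, 2.0]

define("ADD_N", lambda env, x: x + env[0])          # a later, previously unseen closure
print(map_generic(("ADD_N", (10.0,)), [1.0, 4.0]))  # [11.0, 14.0]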

Recompilation of apply?

the higher-order function is made generic by being able to switch between functions with an apply operator.

Even if I didn't specialize and inline 'apply' into the body of a higher order operator (which in my case I must), wouldn't I still have to recompile 'apply' every time a new function gets defined?

In that case there is no

In that case there is no other solution but recompiling certain functions. For example if the user defined:

function foo(f){ f() }

What are you going to compile that to? If the user defines a new function bar, he can pass that to foo. But at the time of the definition of foo there was no bar.

That said, you are going to compile down to hardware, but not all functions are known at compile time?!

Perhaps he meant that not

Perhaps he meant that not all functions are known statically, but are known dynamically on a general-purpose CPU that masters the hardware?

Yep

Yes, that's what I meant. An interpreter initiates type specialization and compilation at runtime. This messes with my usual notions of static and dynamic. I think it makes more sense for me to think about the distinct phases as (1) static (2) compile-time (3) dynamic.

This is exactly where I'm at

A more accurate way to describe it: master compile time, master run time/slave compile time, slave run time. Dynamic compilation and code generation allows you to blur master compile/run time, especially if you are generating/compiling code for the slave. But this distinction is very common... GPU (DirectX/OpenGL/OpenCL/CUDA/...) shader APIs allow you to submit shader code dynamically to be compiled for the GPU.

In my case, all of my functional code only needs to execute on the mastering CPU, and the result is basically a functionless expression tree that can then easily be translated into HLSL. The blowup in loops and inlined procedures is significant, but at least with older versions of HLSL, these would have been blown up anyways (GPUs didn't support real procedures, indirection, or even dynamic looping very well). However, with newer GPU hardware that does support real procedures, indirection, and dynamic looping, I'll probably have to modify this approach.

Stages

I like your description of the stages-- they are clearer than the trio I used before.

I'm also trying to end with a "functionless" representation (necessary for efficient execution even on newer GPUs), but I have to do a lot of work to transform from a more expressive source language into this final form.

I think all projects trying to run high-level code on graphics processors (Copperhead, Nikola, etc...) are forced to "concretize" code into a language without the usual abstractions (functions, polymorphism, etc...).

Yep, abstractions are fine

Yep, abstractions are fine as long as they can melt away at run time. This is why I think GPUs are an excellent place for meta-programming to thrive...they really have a purpose there.

Staged computation in Bling

Here is a solution that works for me in Bling for GPU shaders (which is very much hardware that doesn't support indirection very well): write code that generates the code that will run on the hardware. Bling dresses up expression trees with strongly typed abstractions, which is sort of painful to do in C# and would be a non-issue in a dynamically typed FPL like Clojure. But the result is that its easy to compose, manipulate, and otherwise build intricate computations easily using a real language (C#) rather than a stripped down shader language. Another benefit is that you can make integration between staged and non-staged code more seamless; e.g., by transparently refreshing global parameter shader registers via CPU-hosted code to interface with generated shader code.
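Not Bling's actual API, but roughly the flavor of it in a Python sketch: host-language code builds an expression tree by ordinary function calls and operator overloading, and a later pass prints it as target source (a C-like string standing in for HLSL), so no function values survive into the generated code.

class Expr:
    def __add__(self, other): return Op("+", self, wrap(other))
    def __mul__(self, other): return Op("*", self, wrap(other))

class Var(Expr):
    def __init__(self, name): self.name = name
    def emit(self): return self.name

class Const(Expr):
    def __init__(self, value): self.value = value
    def emit(self): return repr(self.value)

class Op(Expr):
    def __init__(self, op, a, b): self.op, self.a, self.b = op, a, b
    def emit(self): return f"({self.a.emit()} {self.op} {self.b.emit()})"

def wrap(v):
    return v if isinstance(v, Expr) else Const(v)

# Ordinary host-language functions melt away: calling them just builds a tree.
def brighten(pixel, amount):
    return pixel * 1.0 + amount

body = brighten(Var("src[i]"), 0.25)
print(f"dst[i] = {body.emit()};")    # dst[i] = ((src[i] * 1.0) + 0.25);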

You can use the full power of your FPL in manipulating expression trees, but it definitely becomes more difficult when you actually need indirection to occur on the hardware. I've done a bit of this by basically inlining if statements in the generated code, though care must be taken not to explode jump tables. I've found that I haven't needed much indirection on the hardware so far, though I haven't explored the technique fully yet.

Expressiveness?

Bling looks very cool. I wish I had known about it when I was (in a former life) coding in C#.

I'm curious about the expressiveness of the expression sub-language. Is it restricted to scalar computations, or can you use aggregate operators within a shader?

If you mean aggregating

If you mean aggregating results on the shader, Bling doesn't really support that. You could do some hack with a geometry shader, but it's too dirty to talk about. Right now Bling's GPU support is purely for the purpose of writing graphics shaders.