## Type Classes as Objects and Implicits

Type Classes as Objects and Implicits

Type classes were originally developed in Haskell as a disciplined alternative to ad-hoc polymorphism. Type classes have been shown to provide a type-safe solution to important challenges in software engineering and programming languages such as, for example, retroactive extension of programs. They are also recognized as a good mechanism for concept-based generic programming and, more recently, have evolved into a mechanism for type-level computation. This paper presents a lightweight approach to type classes in object-oriented (OO) languages with generics using the CONCEPT pattern and implicits (a type-directed implicit parameter passing mechanism).

This paper also shows how Scalaâ€™s type system conspires with implicits to enable, and even surpass, many common extensions of the Haskell type class system, making Scala ideally suited for generic programming in the large.

Martin Odersky and team's design decisions around how to do type classes in a unified OO and FP language continue to bear fascinating fruit. Implicits look less and less like "poor man's type classes," and more and more like an improvement upon type classes, in my opinion given a quick read of this paper.

## Comment viewing options

### Implicits

Scala's approach gives the programmer valuable flexibility in choosing instances: you can be explicit (instances are objects and can be passed around freely), or implicit (a la Haskell), or mix and match.

Scala seems to be missing the backtracking logic solver that gives Haskell type classes their expressive power. I don't see a Scala equivalent of a definition like instance (VectorSpace(v,s),Field(s))=>VectorField(v,s) where ...; you have to fall back to explicitly generating one where it's needed, right? Someone please correct me if I'm mistaken.

I believe the preferable solution is typeclasses as record types, instances as records, scoped implicit parameters for propagating them around, associated types as record members, and a backtracking logic solver for instantiation. But that just solves the immediate problem; on top of that I'd like a dependent type system with staging to enable type dependency on values (e.g. arrays-with-length), and a higher-order logic solver so simple logic-programming idioms like let zip(as,bs)=f(x) in ... can be expressed and translated via instantiation into efficient runtime code. And a total (non-partial) function subset of the language for reasoning with proofs as programs.

### You can do that in Scala

You can say: implicit def vectorField[V,S](implicit vs: VectorSpace[V,S], f: Field[S]) = new VectorField[V,S] { ... }.

This works pretty well unless Scala can't infer the type parameters V and S at the site where the implicit VectorField is needed. This used to happen a lot with higher-kinded types - Scala could not infer these at all prior to 2.8.

### Generics

Sounds great, although I have little experience with the advanced topics you allude to.

I wonder though, if the recent advances in generics-based approaches (type constructors, constraints) could be used to reap some of the same benefits? I myself have a much easier time understanding their mechanics than those of the "implicit approaches" like Scala's and Haskell's.

### Don't ask for much, do you?

a dependent type system with staging to enable type dependency on values (e.g. arrays-with-length)

By "staging" here I assume you mean something roughly like MetaOCaml, where compilation blends with partial evaluation, and types at one stage can depend on values known in previous stages.

If I'm wrong disregard this message; if not, this doesn't actually require a true dependent type system, does it? Any computation producing a value that determines a type in a later stage can be simply lifted wholesale to the type level. So long as there are no cycles in the dependencies, it seems that a few mechanical translations would be sufficient to eliminate all functions from terms to types.

There's talk about enabling this sort of thing in GHC Haskell itself, but since GHC's type level is already Turing-equivalent, a preprocessor is actually sufficient.

Not that implementing such a system is easy, but it seems less of a burden (especially for the eventual programmer) than full dependent typing.

### Staging and dependent types

Regarding staging, I'm thinking of something more general. A fully dependent type system must mix evaluation with typechecking, so we need constructs to ensure that evaluation is sufficiently expressive, yet not divergent in the presence of recursion. For example, if typechecking depends on the value of a recursively-defined function fac(x), we want the ability to recognize fac(3)==6 and fac(x)==fac(x), and not diverge when reasoning about fac(x).

Prolog provides a sort of "conditional, instantiation-dependent staging" via constructs like atom(x) and compound(x), which the programmer can use when writing recursive functions to enable their computations to unwind when sufficiently definite values are present, but avoid divergence when reasoning abstractly.

True dependent types would be preferable to the solution that has evolved in both Haskell and C++, where the type level is Turing-complete yet remains a separate and bizarre computational realm where you can't directly reason about numbers and strings but need to reconstruct them from inductive data types at terrible cost in complexity.

### I believe the preferable

I believe the preferable solution is typeclasses as record types, instances as records, scoped implicit parameters for propagating them around, associated types as record members, and a backtracking logic solver for instantiation.

That's exactly what Scala does. Backtracking is provided for implicits that take themselves implicit arguments.

But that just solves the immediate problem; on top of that I'd like ...

You need another language/more language research for that, I am afraid.

### BER-MetaOCaml

Tim, have you had an opportunity to look at BER-MetaOCaml and, if so, are there any particular lessons to draw from it?

### MetaOCaml

I think BER-MetaOCaml is a great translation of the LISP quasiquote/unquote/eval mechanism into a well-typed, well-scoped language setting.

My criticisms, perhaps very naive:

* The simple "power" example shows that this approach carries high syntactic and mental overhead. Instead, given a library that supplies the function "power(n,x)", how about letting me write "square=optimize(\x->power(2,x))" and have either the compiler or runtime generate an optimized function by specializing its code by the variables that are known? Thus, no need for brackets, and still no possibility of compiler/optimizer errors: if "power" was originally valid, then it remains valid when we specialize for n=2.

* This approach to code generation forces the programmer to very explicitly separate statically-known and dynamically-known data. I can write a general "makePower" code-generator, and then use it to expose "power(x,n)", "square(x)", and "twoToThePowerOf(n)". But I'd rather write "power(x,2)" everywhere and have the compiler specialize it for n=2 right there. This sort of example extends all the way to e.g. writing a DSL parser and interpretter that is then specialized into an efficient compiler via the Futamura projection, without syntactic bracketing.

### Lightweight Multi-Stage Programming

Tim,

I think I understand your point... it essentially sounds to me like you want to "take partial evaluation seriously." if I understand correctly, the ongoing challenge there is integrating that with some notion of separate compilation, or maybe pushing aspects of the partial evaluation off to JIT time, as I believe you can do with LLVM; see Partial Evaluation with LLVM.

The recent story Lightweight Modular Staging: A Pragmatic Approach to Runtime Code Generation and Compiled DSLs also sounds relevant: "We argue that lightweight modular staging enables a form of language virtualization, i.e. allows to go from a pure-library embedded language to one that is practically equivalent to a standalone implementation with only modest effort."

### Explicit staging / partial evaluation

I don't really go in for requiring the programmer to explicitly stage computations either. We just need smarter compilers. It occurs to me that if more language implementations would have specialization optimizations, or better inlining, then there wouldn't be a need for research into staged languages.

I believe inlining is strictly more powerful than specialization, when code size is not a problem. The research question is how to design a compilation system with good heuristics that can perform whole-program optimization, employing specialization, inlining, and partial evaluation in the right circumstances.

### JIT it

Recent thread(s?) on the state of the art in JIT technology are relevant here. A glance in that direction also reveals one other topic that should be added to the list of research questions: How to maintain determinism and predictability? In other words, in an increasingly smart system, how can I be sure that subtle changes to the program or (worse) to the input data will not lead to dramatic changes in performance? Offline compilers have the advantage over JITs here, but it's still an issue, and offline compilers give up a lot when it comes to inlining and partial evaluation.

### For now

The growing complexity of runtime systems is really worrying to me. The amount of nondeterminism is out of control. You quite rightly put your finger on one of the big issues: predictability of performance, and the "chaotic" nature that performance is beginning to take on.

The burning question is: how much of this cost (complexity and nondeterminism) is required to get the benefits (ease of programming, safety, ease of debugging, dynamic introspection, portability, and peak performance). No doubt the trend has been towards greater dynamism, leaving a larger and larger chasm with legacy compiled languages like C and C++. I think this has led to a perceived dichotomy between "nice" languages like Python, Java, and Scala, which are mostly dynamically compiled and "not nice" languages like C, C++, etc which are entirely statically compiled. I don't think this dichotomy exists.

There was a lot of research into doing lots of static, often whole-program, optimizations for object-oriented languages in the mid to late 1990s, e.g. the Whirlwind and Vortex optimizing compilers. Languages like Cecil and Dylan had small niches back then, and their principal implementors all ended up migrating to working Java runtimes. Somehow Java laid a thick, dense pheromone trail that attracted too many researchers to follow.

This is unfortunate. First, because nowadays the size of memories and the speed of processors makes static, whole-program optimization absolutely viable, even without any new algorithmic advances. (Even while the runtime-optimization folks would claim that they can use all of this extra space and time to house their giant runtime systems and "fancy" runtime optimizations which are really just restricted versions of classically known optimizations, fed with profiling data). And second, because the processors and memory systems which our programs are running on are more nondeterministic than ever, and if anything, software should be endeavoring to tame this nondeterminism and give programmers and their programs the best, most predictable performance.

### The amount of nondeterminism

The amount of nondeterminism is out of control.

This is somewhat methodological, though. Nondeterminism can always be made more deterministic, methodologically. This is the very essence of architectural design. What you're really pissed off about, I'm guessing, is that we are going back to the 70s and 80s with lots of different hardware for CPUs. As Gerard Holzmann points out, it is not just performance, but error isolation and divergence as well... a huge reason why the stack machine has remained a dominant figure in computer architecture.

First, because nowadays the size of memories and the speed of processors makes static, whole-program optimization absolutely viable, even without any new algorithmic advances.

I think you've got it wrong. Nowadays, differences in architecture makes optimization of even basic algorithms dependent on architecture (which means hardwiring a language or runtime for one strategy to implement the algorithmic 'concept' may be folly). Some need experts in 'cache-oblivious' and others don't. The major change is that communication is becoming the bottleneck, or at least an equal partner with issues like exploiting instruction-level parallelism.

### This is somewhat

This is somewhat methodological, though. Nondeterminism can always be made more deterministic, methodologically.

I don't think we're talking about the same things. Managed runtimes add several new sources of nondeterminism to your program's behavior. E.g. the runtime system may not use deterministic mechanisms to profile (i.e. timer-based sampling like Jikes RVM). Yet even with profiling counters, they may lose updates in multi-threaded applications, making compilation triggers, inlining decisions, code layout, and even register allocation nondeterministic. Even in a single-threaded app, compilation may be performed on a background thread, making the actual counter values potentially vary when compilation actually begins. Placement of objects in memory can affect cache and virtual memory system behavior, and are also influenced by alignment settings of allocators, GC algorithms, triggering heuristics, and nondeterminism from multi-threaded execution.

Not to mention the hundreds of tuning parameters that may or may not be under the control of the programmer or person running the application, but still may have a large effect on performance, such as TLAB size, young gen size, parameters for heap resizing heuristics, GC parameters such as mark stack depth, a gaggle of compiler optimization flags and parameters, tuning parameters for locking implementations, etc.

What you're really pissed off about, I'm guessing, is that we are going back to the 70s and 80s with lots of different hardware for CPUs. As Gerard Holzmann points out, it is not just performance, but error isolation and divergence as well... a huge reason why the stack machine has remained a dominant figure in computer architecture.

Your comment doesn't make sense to me. Do you mean hardware "in" CPUs, i.e. the amount of logic inside? What do stack machines have to do with anything anyway? Modern CPUs are Von Neumann machines. But it doesn't really matter. That's not what pisses me off at all. Our programming systems and runtimes have radically more complex and unpredictable, but what actual benefit do we programmers reap from this that is not possible with a less complex (read: statically-compiled) system? We're accepting a level of complexity in our software systems that IMO mostly incidental and _not_ inherent in what we are ultimately trying to accomplish.

I think you've got it wrong. Nowadays, differences in architecture makes optimization of even basic algorithms dependent on architecture (which means hardwiring a language or runtime for one strategy to implement the algorithmic 'concept' may be folly). Some need experts in 'cache-oblivious' and others don't. The major change is that communication is becoming the bottleneck, or at least an equal partner with issues like exploiting instruction-level parallelism.

I'm not sure where you are going. Differences in architecture have always influenced optimization, even of basic algorithms. The difference is that nowadays, applications' profiles are not distributed in the same way they used to be. People don't run SPEC kernels on their desktop system every day. They run big apps, and servers run big apps built on big middleware, built on big VMs. There is a vast amount of "slightly warm" code and dozens of layers of complexity. This huge pile of warm code and huge pile of "slightly warm" objects puts a lot of stress on the memory system, which is less dependent on architecture idiosyncracies like whether the CPU supports the XFERE2 special bit-molesting instruction. Again, I have a feeling that all this software complexity is incidental, and NOT inherent in the problems that people are trying to solve with their software.

But whatever, this has wandered way off topic and I end my rant here.
[edit: misquoted parent]

### Amen

Not to mention the hundreds of tuning parameters that may or may not be under the control of the programmer or person running the application, but still may have a large effect on performance, such as TLAB size, young gen size, parameters for heap resizing heuristics, GC parameters such as mark stack depth, a gaggle of compiler optimization flags and parameters, tuning parameters for locking implementations, etc.

You speak like someone who's been there and has the scars to prove it. My sympathies. ;)

I'm not so convinced that we'd be better off throwing out the whole stack, but I certainly see where you're coming from.

### Development effort

The bulk of this complexity you describe was not created to solve the problems that people are trying to solve with software. The bulk of the complexity was created simply to help solve the problem of creating any working software at all. The piles of VMs, virtualization services, app servers, libraries, frameworks and managed runtimes are a lot of what has enabled enormous amounts of software to be created this last decade or so using a rapidly increasing population of developers of seemingly ever-decreasing average skill level. There simply aren't enough skilled developers in the world to build the sort of deterministic systems you describe in the quantities that are currently being demanded, and no one has any real idea of how to train them in quantity.

In short, the complexity was designed to optimize at the level of economies, not instruction sets.

### I don't think we're talking

I don't think we're talking about the same things. [...] Your comment doesn't make sense to me. [...] I'm not sure where you are going. [...] But whatever, this has wandered way off topic and I end my rant here.

I had lumped your reply with your other one about staging, since you seemed interested in alternative approaches.

### I really wonder about the benefits of JITting

The previous reply stated it more clearly. But, in the end, if a machine JITs something, doesn't it just end up with just more code than a good optimizing compiler would have delivered compile time?

Then again, I once read the ML (OCaml?) adagium just seems to be: Just don't optimize anything. I.e., for them it seems to be possible to compile to efficient code trivially. Anyone knows if they actually optimize?

### Benefits of JIT

The previous reply stated it more clearly. But, in the end, if a machine JITs something, doesn't it just end up with just more code than a good optimizing compiler would have delivered compile time?

Not necessarily. As mentioned above, inlining and specialization are, among other things, space tradeoffs. And code bloat can directly impact performance: instruction cache pressure can become a real issue. So a JIT that allows much more selective inlining and specialization can produce considerably less code than a "good optimizing [offline] compiler," at least in some cases.

Then again, I once read the ML (OCaml?) adagium just seems to be: Just don't optimize anything. I.e., for them it seems to be possible to compile to efficient code trivially. Anyone knows if they actually optimize?

### Okay... What about inline?

This actually throws up another question: Why did they get rid of the inline behavior in GCC/C? I.e. instead of 'it must,' it is an, 'it may' be inlined. Do they think the compiler is better at inlining statements than the programmer, and if so, why?

I mean, with an 'it must' semantics, I can compile a program, run a profiler on it, and add inline statements. With an 'it may' semantics, all that is gone.

Apparently, a good optimizing compiler just knows better than a programmer where to inline anyway?

### Why did they get rid of the

Why did they get rid of the inline behavior in GCC/C? I.e. instead of 'it must,' it is an, 'it may' be inlined. Do they think the compiler is better at inlining statements than the programmer, and if so, why?

I asked this exact question when I was in undergrad (2004?) to Dan Berlin (then IBM, now Google) and Andrew Pinski (Sony), both of whom worked on gcc. As I recall, the answer was effectively just that. But the other side to this is that 'inlining' is really just a specific case of interprocedural optimization. You can always force gcc to inline functions in a source file using compiler switches, but C keywords do not matter.

Apparently, a good optimizing compiler just knows better than a programmer where to inline anyway?

Good is a function of knowledge about what the best possible decisions are. Static compilers can have a bit more freedom in making decisions (it is not the end of the world if there is a slow pass), but JIT compilers can use dynamic information obtained at runtime to guide where to focus on optimizing.

### It occurs to me that if more

It occurs to me that if more language implementations would have specialization optimizations, or better inlining, then there wouldn't be a need for research into staged languages.

No. Staging is one way you can accomplish what Jose Meseguer calls "parallelism under your feet" - it's the sort of parallelism most people don't think about but is common in "active libraries" that rewrite their procedure bodies during compilation. Whether you are using staging or taking advantage of a Replacement rewrite rule, you need a way to specialize code in a portable way.

### Optimize

Yes, writing "square=optimize(\x->power(2,x))" is almost exactly what we have in our Haskell implementation. It specializes the function at runtime and then JITs it.

### Hmmm...

Tim, I thought that was what you were developing! :-)

### Dependent types and multi-stage programming

There is a paper that describes combining the two here.

I was wondering, if you had a total functional language which supported meta-programming with the same core/kernel language. Then maybe you could introduce partiality through meta-programming?

A small dependently typed language with meta-programming support using the same kernel language or multi-staging capabilities would probably be a C++ killer.

### Then maybe you could

Then maybe you could introduce partiality through meta-programming?

You may want to look at Manuel Clavel's Ph.D. thesis under Jose Meseguer. Effectively Clavel shows how you can build a logical metaprogramming layer that is Turing-equivalent.

A small dependently typed language with meta-programming support using the same kernel language or multi-staging capabilities would probably be a C++ killer.

With enough engineering resources and time, it is a possibility. For now, a prototype does exist and it is called Maude. C++ is backed by over 20 years of engineering resources, plus an extra 15 years if you count C. 35 years of manpower scattered across IBM, Microsoft, Intel, AMD, Apple, etc. is a lot of bodies. Not to mention all the domain-specific programmers optimizing against those compilers techniques for turning common C++ static production patterns into efficient programs (e.g., encoding tricks like using pointer-to-member functions to simulate goto).

### Implicit values

Scala's implicits allow you to specify a value for each type to use as the value of an implicit parameter of that type. Does it makes sense to generalize this to specifying implicit values for arbitrary predicates? Perhaps this can be used to put something similar to Scala's implicits in dynamically typed languages.

For example:

 def foo(s = implicit is_string, n = implicit is_int) { for(i in 0..n) print s }

 // specify two implicit values: implicit 5 implicit "bar" 

foo() // prints bar five times 

### Predicate-satisfying implicit values

I see two problems with that idea :

1) The obvious implementation is to try each implicit for each predicate. It doesn't scale well.

2) In the Scala paper, implicits are used in a way similar to haskell type classes, to guarantee properties that seems hard to test dynamically : comparison functions, neutral element of a monoid, etc. How do you dynamically test that something is a comparison function suitable for the given collection ?

I could see an answer to the second question : you could define, instead of a runtime contract, a kind of interface for comparison functions. For example, you could say that a comparison function is an object with two method "compare" and "has_intended_type" (wich, for an integer comparison, would be the "is_int" function). Predicates based on (nested) interface/instance checks could be practical.

### I had already given the

I had already given the second problem some thought. The (as far as I know unsolved) problem of writing a generic sum :: Monoid a => [a] -> a function in a dynamically typed language is a good example (without explicitly passing the monoid operations to sum). The problem is sum []: what does it return? One solution that came to my mind is dynamically checked types, like contracts. The list passed to sum would be a runtime structure containing the actual list data as well as a type description of the elements of the list, for example aList.elementType would return int. Sum would then be able to get the right implicit monoid for that type. For example assuming Monoid is a structure with 3 methods: zero, plus(a,b) and isApplicableTo(type). Then sum could be written like this:

def sum(xs, m = implicit m.isInstanceOf(Monoid) and m.isApplicableTo(xs.elementType)):
case xs of
[] -> m.zero
x::r -> m.plus(x,sum(xs,m))

object AddMonoid extends Monoid {
def isApplicableTo(type): type == int
def plus(a,b): a + b
def zero: 0
}

object ConcatMonoid extends Monoid {
def isApplicableTo(type): type == string
def plus(a,b): concat(a,b)
def zero: ""
}

implicit AddMonoid
implicit ConcatMonoid


One problem with this is that to make use of this you need to provide runtime types to all your lists, which means that you have to write a lot of boilerplate. Still I think it is interesting that you can get something like type classes / implicit in a dynamically checked type or contract system.

The first problem you mention could perhaps be solved by something like partial evaluation. I expect that very many implicitly provided objects could relatively easily be ruled out statically.

### Computed Optional Keyword Args Nice Feature for Prototyping This

CL, Scheme (all?) offer keyword args with computed defaults, meaning the default value for a parameter can be computed from args to the left in the argument (or is it only positional required and opt args?).

Anyway, given such parameter passing machinery in place, I imagine a few judicious macros could whip up an interesting prototype of predicate-driven implicits.

FWIW, I'd put a fast type dispatch (typecase in CL and/or hash on typename - in a dynamic language) first. It'll be the most common case anyway, I'd bet.

Additional predicates on the parameter can thus "pay their own way" for refinement on the "privileged" type test. One can always skirt around the "type test first" scheme by type testing against "t" or "Object" or whatever the top of the type hierarchy is.

Finding the closest type match with a user extensible type hierarchy (classes, maybe records if nominally typed, etc.) depends on the type system and language primitives, but when rolling one's own language this isn't going to be real a challenge.

I imagine that for most purposes, type-predicate-determined implicits will cover most cases, while additional component predicates (makes me think of guard clauses in a pattern matching clause, I guess) will nicely handle some odd corner cases in a system. But I'm just guessing, perhaps the non-type predicates could would play a huge "semantic role" for certain problem domains.

Interesting idea, even if the language is not dynamic!

Scott

p.s. Tossing back into the discussion all the diverse language cruft of inheritance, types, usually somewhat odd type hierarchies (splitting immediate and reference types near the top of the hierarchy - because representation matters so much?), traits/interfaces/mixins/etc. schemes....

I'd hope semantics of default-value-determination semantics of such a scheme would be kept *simple* even if *non-ideal*. No having to teach (although it's nice to) partially ordered sets or topological sorting to programmers to explain selection of a default argument value! :-) I guess keeping away from general predicate dispatch in method selection is one step toward keeping things simple.

### Meta-programming is inevitable

Ben L. Titzer wrote:

We just need smarter compilers. It occurs to me that if more language
implementations would have specialization optimizations, or better
inlining, then there wouldn't be a need for research into staged
languages.

The experience does not bear out that conclusion. In fact, in
high-performance computing at least, the consensus, borne by extensive
experience seems to be that we have reached the end of the road for
compiler optimizations.

The first section of the following paper makes that argument elaborately,
with many supporting citations:

I should point out that all the authors except one are from
super-computing centers.

Tim Sweeney wrote:

This approach to code generation forces the programmer to very explicitly
separate statically-known and dynamically-known data. I can write a general
"makePower" code-generator, and then use it to expose "power(x,n)", "square(x)
", and "twoToThePowerOf(n)". But I'd rather write "power(x,2)" everywhere and
have the compiler specialize it for n=2 right there

What you describe was the great hope of partial evaluation. The
compiler would just figure out all the specializations one could
possibly do. It is worth examining what all the people who wrote
papers and books on partial evaluation are doing today. To me the
revelation was the answer I received when I mentioned partial
evaluation as a possible solution to a person who wrote a well-known
book on partial evaluation. His answer (given at a public meeting) was
like this: Ah, indeed over the years we sort of figured out how to
write programs in the way that a partial evaluator can deal with
...''. He is a very smart person, and the his answer is profound.

Black-box solutions are great if they consistently give
good results. Partial evaluation has not reached that level, and
neither has optimization compilation. Sometimes, they work great. A
small, seemingly irrelevant change in the program might suddenly
preclude all optimization. So, the programmers are forced to guess
at the internals of the black box, they are forced to adjust the code
to please the mysterious beast.

The following simple exercises may give one a taste of the complexity of
the problem facing the partial evaluator. Let the function gib(n,(x,y))
be defined as

gib(0,(x,y)) = x
gib(1,(x,y)) = y
gib(n,(x,y)) = gib(n-1,(x,y)) + gib(n-2,(x,y))


The problem is to write an efficient implementation of gib, and
specialize it to a fixed value of n. The specialization must be
efficient. As the first approximation, a solution with no correctness
guarantees is OK. Bonus points are given for solutions that do make
correctness guarantees, in particular, that the generated
(specialized) code shall always compile. That is, the generated code is
assuredly well-formed, well-typed, and all used variable are assuredly
bound.

The great benefit of staging is that we do know what we get, because
we wrote it. We know exactly which computations will be performed at
earlier time.

Staging is far from being fully developed; brackets and escapes are
are the assembly language of staging; they are too low-level. We are
still discovering good abstractions.

### From the above mentioned

From the above mentioned article.

It is well known that manual optimizations degrade portability: the performance of a C or Fortran code on a given platform does not say much about its performance on different architectures.

Um, I know I'm missing something, but the very idea that 1) the "same program" (however defined) *should* 2) "perform the same" (however defined) on 3) different machine - actually, I prefer 'system' - architectures sounds like a, well, reasonably ridiculous idea.

Last I paid any attention (long ago, I'll admit), even HPC's "bread and butter" floating point processing for the Cray machines of yore conformed to no common standard, requiring custom coding for the "HPC gold standard" Cray platform.

So what? It's a trade off made by those who purchased or licensed those machines. Don't like the trade off? Get another machine and get on with life. What's more, this trade off was not "technical." It's just a simple trade off between the currently purchasable options in the good old marketplace.

But wait, one might say - it's the nature of funding for research groups (e.g., politics, influence peddling, military/industrial complex) that they inevitably end up with very heterogeneous systems in their shops (I was briefly at Harvard HUSC and saw this up close).

I'll retort with another old truism from way back: don't try solve fundamental social/organizational problems with hardware and software technology. In other words: the problem is the nature of and constraints on our oh-so-very-social *funding sources*, not that applications don't happen to peg out CPU or paging or I/O or whatever identically on any old machine or OS or compiler combination one throws at it (to me - more or less, a non-problem; or at best, a very, very, very narrow niche problem).

Another personal data-point that makes me feel old, but one of my former boss/founders at an Oracle diagnostics startup back in the early '90s gained his RDBMS/Oracle tuning expertise in working for some government agency or other. There, every time they bid-out for new mini-computers, he led a team that meticulously bench marked their apps and workload on EVERY candidate system with the goal of: as cheaply as possible handling their application throughput on a set of machines at, IIRC, a sustained load of ~70% (however defined, but surely an aggregate given the RDBMS-centric applications - again, IIRC, etc.).

So here, 1) they *did* expend labor required to perform fairly extensive tuning, but here just during the *bidding process* (!), and still 2) in the end, the hardware was selected for the application(s) performance, not vice versa.

I can only view development of programming languages designed to gloss over - via accommodation in the languages' and or compilers' designs - performance differences between diverse system architectures as, pragmatically, very likely a pipe dream out in the real world of constant flux not just in systems architectures but, just as much, even in the very companies who design and sell these systems into the niche HPC market in the first place.

While a nifty idea, even within the HPC niche market, if I were at DOE (let's say), I can't imagine writing too many grant checks in pursuit of this particular goal.

S.

p.s. I can recall my father telling me about IBM's "consent decree" and how now people could actually purchase compilers and other tools from vendors other than the hardware vendors. I imagine this had to be a pretty giant milestone in the pragmatic realization of "portable software applications" - at all (and hardly even an historical inevitability). Heck, by this standard, we've been in portability hog heaven for decades now ;-)

### Today, optimizing for 1 platform is not a goal

scottmcl wrote

Um, I know I'm missing something, but the very idea that 1) the "same program" (however defined) *should* 2) "perform the same" (however defined) on 3) different machine - actually, I prefer 'system' - architectures sounds like a, well, reasonably ridiculous idea.

Last I paid any attention (long ago, I'll admit), even HPC's "bread and butter" floating point processing for the Cray machines of yore conformed to no common standard, requiring custom coding for the "HPC gold standard" Cray platform.

The field has moved since and your information is no longer current.
I'd like to point out this paper:

The ASC-Alliance projects: a case study of large-scale parallel
scientific code development''
Lorin Hochstein, Victor R. Basili.
IEEE Computer, March 2008, pp. 50-58.

Sorry I don't know if it is available for free.
The authors talked to the developers and examined the code of the most
computer-intensive projects: five large-scale ASC-alliance projects
(simulating neutron stars, break-up in materials) sponsored by DoE.
Surprisingly, achieving peak performance on one particular platform
was not a goal: "Since one of the project goals is to develop
algorithms that will last across many machine lifetimes, it's not seen
as productive to try to maximize the performance on any particular
platform. Instead, code changes are made that will improve the
performance on a wide range of platforms."

### Thanks for the update! Hard

Thanks for the update! Hard problem to tackle.

S.

### Are you saying that you want

Are you saying that you want to automatically derive a linear time version of gib? There are techiques that can do this (keywords "static incrementalization" and "tupling"). But is this a fair requirement? After all if you stage gib naively you don't get linear time specialization. So in my opinion this is not a fair comparison: you are requiring that the partial evaluator comes up with a new algorithm whereas with staging you leave that to the programmer.

### Why stop at linear time?

This problem can be solved in O(1) additions and multiplications.

### I can see how it could be

I can see how it could be done in O(log n) additions and multiplications. But how in O(1)?

### You probably have the right idea.

...or if not, you are rather close. I never said how long it would take to specialize gib to any single n, only how long the specialized function would take to calculate its result for any given (x,y).

### I do not see how you could

I do not see how you could make that use O(1) multiplications and additions. In fact I think that is impossible. gib(n,(1,1)) grows exponentially in n. To make a number exponential in n you need at least O(log n) multiplications.

Edit: oh I see, you are not counting specialization time. But I think Olegs point was that specialization should not take exponential time.

### Specialized functions versus the process of specialization

Actually, the efficiency of the specialized function is what Oleg had in mind, and the solution I have in mind consists of two multiplications and one addition in all cases. The act of specialization would take O(log n) additions and multiplications.

### It's a generalized form of Fibonacci, No?

So, like every linear recurrence it should have a closed form. Therefore, O(1).

Am I right?

[ Actually, by looking at the golden spiral, I'ld expect it to be x*y times the closed form of fib. (Nah, close but no dice. Though I'ld expect the spiral just to be a scaled version of Fib.) ]

### Not quite

The closed form requires O(1) additions, multiplications, and exponentiations. Unfortunately, direct evaluation via floating point numbers will quickly overflow the significand. It's useful for quickly calculating the first few digits of the result, but doesn't produce an exact solution.

The solution I have in mind involves matrix multiplication. Or rather, it's easiest to understand why it works via matrix multiplication

### Partial evaluation is evitable

Oleg, is the following a fair summary of the problems?

Partial evaluation:
* When a compiler attempts to partially-evaluate code, it can't make effective static/dynamic tradeoffs nor reason about whether attempting to inline recursive functions will result in termination.

Staging:
* When a programmer uses explicit staging, he has precise control over static-versus-dynamic tradeoffs and inlining.
* But explicitly staged code using brackets/escapes/eval contains a high degree of syntactic baggage.
* When staging is used to generate specialized versions of an underlying function, underlying properties can be lost in the translation to ad-hoc code, e.g. now we need to define square(k) in terms of a code generator, and we aren't guaranteed square(k)==power(2,k) "by construction".

Naively speculating again, it seems like explicit logic-language style programming constructs would enable us to achieve, in a single definition, a "power" function that handles the general case power(n,k), is guaranteed to be specialized at full performance for e.g. power(c,k) where c is a constant and 0<=c<=100, and its consistency is guaranteed by construction, and will always be correct and divergence-free.

# Original function.
power(n:int,k:int):int=
if(n==0) 1
else n*power(n-1,k)

# New function with staging annotations.
power(n:int,k:int)=
if(n==0) 1
else n*power(n-1,k)[#inlineIf(constant[n],n>=0,n<=100)]


Here, the compiler would guarantee inlining of the function call under the specified conditions. And the definition of a*b would guarantee an immediate result when its inputs are constant, etc. And we could perhaps cause dynamic compilation of specialized code by writing something like:

specialize(f(k:int)=power(n,k))

Here, n is dynamically known at runtime but k is not, and the code-generator would precisely followed the programmer's (and library's) staging annotations, thus guaranteeing predictable behavior.

Would that get us away from the "assembly language of staging"?

I can imagine writing (and reading!) practical dynamic-compilation examples (e.g. the simple calculator / expression-evaluator) more easily with this sort of "code with annotations" style than with brackets and escapes.

### The extent of meta-problems

Neil Jones has a good example, of specializing the Ackermann function

ack(m,n) =
if m=0 then n+1 else
if n=0 then ack(m-1,1) else
ack(m-1,ack(m,n-1))


to m=2. If we naively substitute m=2 and propagate the constant, we
obtain

ack_2(n) =
if n=0 then ack(1,1) else
ack(1,ack(2,n-1))


The occurrence of ack(2,n-1) is particularly troublesome: if the
partial evaluator goes blindly ahead with unrolling that too, it won't
terminate. So, the specializer has to know' when to stop, which is a
hard problem. BTW, that is the reason Neil Jones, for example, got
interested in termination checking. Suppose our specializer is smart
and stops after one unrolling of ack. So we ended up with the original
two-argument ack and a particular instance ack_2, a simple
wrappper over the original ack. That is barely an improvement. Neil
Jones shows what a better specializer could do:

ack_2(n) =  if n=0 then ack_1(1) else  ack_1(ack_2(n-1))
ack_1(n) =  if n=0 then ack_0(1) else  ack_0(ack_1(n-1))
ack_0(n) =  n+1


There is no longer any need to invoke the original two-argument ack;
all the work is done by specialized acks. Something strange has
happened: the specializer has generated two new recursive functions.
The control flow is now different: instead of ack calling itself,
ack_2 calling itself and ack_1, the latter calling itself and ack_0.
As Neil Jones says, that requires "polyvariant specialisation".

The example well illustrates that a specializer often needs to
generate new bindings -- sometimes even recursive bindings. It is here
where things can go wrong: even if we generate AST rather than
text strings, it is disturbingly easy to generate code where variables
are used before they are defined. Our papers show many examples of
that. Statically assuring that all introduced variables are properly
bound is hard; we've been working on this for years.

Here is a second example, of specializing

f a x y = cos(a) * x + sin(a) * y

to a = pi/4. If we propagate the constant
pi/4 and evaluate sin and cos, we obtain

fpi4 x y = 0.7071067811865476 * x + 0.7071067811865475 * y

if we look closely, we see that the two floating-point numbers are not
equal. That is not the artifact of printing them out. If we ask GHCi
to evaluate "sin (pi/4) == cos (pi/4)", it returns False. OCaml
concurs. Therefore, there is nothing our specializer could do. A
better specializer will reason symbolically rather than numerically;
it would know that sin(pi/4) and cos(pi/4) are mathematically
equal. So, a better specialized code will be

fpi4 x y = 0.7071067811865476 * (x + y)

We saved one multiplication. Everyone probably recognized that
f a x y` comes from the area of FFT; saving one
multiplication can mean a great
deal if we are doing FFT of long sequences on embedded systems. So, a
good specializer should need a lot of domain-specific knowledge (e.g.,
trigonometry). That means it is hopeless to wish for a one
specializer, or optimizing compiler, to rule them all. Partial
evaluators or specializers cannot be treated as black-box. They should
be frameworks, to let domain experts put their knowledge. Since these
frameworks are aimed at domain experts rather than professional
meta-programmers, correctness guarantees become much more important
then.

### Taking care of bindings with domain-dependent sum

There is no longer any need to invoke the original two-argument ack;
all the work is done by specialized acks. Something strange has
happened: the specializer has generated two new recursive functions.
The control flow is now different: instead of ack calling itself,
ack_2 calling itself and ack_1, the latter calling itself and ack_0.
As Neil Jones says, that requires "polyvariant specialisation".

The example well illustrates that a specializer often needs to
generate new bindings -- sometimes even recursive bindings. It is here
where things can go wrong: even if we generate AST rather than
text strings, it is disturbingly easy to generate code where variables
are used before they are defined. Our papers show many examples of
that. Statically assuring that all introduced variables are properly
bound is hard; we've been working on this for years.

Are you familiar with John Launchbury's projection factorisations in partial evaluation? The approach that he advocates seems to deal with this polyvariant specialization example quite well by translating the types of specialized functions to dependent sums defined by the static projection of a function's arguments. I'm still working through it, but I'd be curious to know if you're familiar with the approach and what you think of it.

### Lecture by Ralf LÃ¤mmel on Type Classes in Haskell

Many of you will find this educational and obviously relevant to this thread... C

### Meta-programming at ICFP 2010

The ICFP festivities at then end of September include a tutorial

Camlp4 and Template Haskell (Nicolas Pouillard, Jake Donham)

several presentations at the ML 2010 workshop

and even a Metaprogramming for OCaml BOF. We can talk through the night.

### More Metaprogramming at ICFP

Also in metaprogramming at ICFP: Fortifying Macros, by Ryan Culpepper, and also Niel Mitchell's superoptimzer paper may be of interest.

### Just a curiosity of mine,

Just a curiosity of mine, and I don't mean to over simplify these interesting efforts, but am I totally remiss in being oddly reminded of Common Lisp's good old "compiler-macros" ?

S.