## Request for feedback: Epoch Programming Language

Hi all!

I've been reading LtU for a couple of years now, and although I must confess that a fair amount of the highly specific research goes over my head, it's been an invaluable resource for furthering my own explorations in PLT, and deeply thought-provoking even when the finer points may escape me. As such, I've come to respect the LtU community very highly, and I strongly value the opinions of those who have a deeper grasp of this subject than I do. I believe that immersion is the best way to learn - sink or swim, if you will.

In my own career I've done a fair amount of work on DSLs and embedded DSLs for various purposes, most recently games development; this has been a fascinating brush with PLT and the real-life implications of language design and implementation decisions. However, I consider it just that - a light contact. General-purpose language development is a completely different ball game, and despite my decent amount of experience in the DSL realm, I'm still a total neophyte when it comes to building a true general-purpose language.

That brings me up to Epoch, my language-in-the-making. Of all the hobby projects - and professional tasks - I've engaged in over the years, this is easily the most exciting, fascinating, and challenging. I hope I can glean some serious education from the LtU community as I present my pride and joy!

Epoch - A General Applications Language for Asymmetric Multiprocessing
I didn't just set out to make a language for the fun of it, although it has been an immensely enjoyable ride. The primary goal of Epoch is to harness the opportunities presented by modern asymmetric multiprocessing hardware. Symmetric multiprocessing (SMP) is becoming increasingly commonplace in the form of multi-core processors, and this proliferation shows no signs of slowing any time soon. However, asymmetric multiprocessing (AMP) - the presence of non-homogeneous computation devices within a single computer - has actually been around even longer, and remains one of the most untapped computation resources we have outside of the games industry.

AMP features heavily in entertainment software, in three forms: the primary CPU for main game logic, the GPU for video rendering and effects, and dedicated sound card hardware for accelerated audio effects. In recent years, the introduction of dedicated physics accelerators has added another dimension to the puzzle, and the explosion of specialized processors such as Cell, Tesla, and Larrabee promises to complicate matters even further.

The key observation of the Epoch language project is that, by and large, the implementation of interesting algorithms should remain independent of the operational details of the specific hardware elements involved. We can divide the processing hardware common in AMP situations into a few families, and identify effective means of representing algorithms for each family. In this way, programmers need only understand the family of processor they intend to work with and learn how to express algorithms for that family - there is no need to learn a half-dozen machine- or architecture-specific instruction sets or languages in order to tap all of the available hardware.

Another key attribute of Epoch is its ability to dynamically react to available hardware and relocate running code according to the physical resources at hand. For instance, the VM may notice the presence of a supported GPGPU system such as an OpenCL-compatible device, and offload specially tagged code from the main CPU to that GPGPU for better performance. It is the responsibility of the programmer to tag his code correctly, but once this is done, he can be assured that his program will flexibly take advantage of any and all resources available on whatever hardware the program runs. This can even be extended to generic SMP via multithreading, or distributed computing across local or wide-area networks, given proper infrastructure in the VM itself.

How is this all relevant to PLT?
Although Epoch's focus, from a practical standpoint, is multiprocessing/concurrency in various forms, I don't intend to just write another minor variant of C. Instead, I'm drawn to the host of modern languages (and even older ones!) with rich feature sets and philosophies, that make development of nontrivial software much more pleasant and robust.

As such, my real goal is to create a language that is both accessible to the current generation of systems and applications programmers who are accustomed to C (and to a lesser extent C++) for their work, and simultaneously appealing to those programmers with experience in much more expressive and powerful languages, notably Common Lisp, Scheme, ML, Haskell, Ruby, and so on.

This is where LtU comes in, in a big way. My own grounding in PLT is purely that of a curious autodidact; I have no formal training in PLT or indeed in programming in general. Although I strive to understand and internalize the theory and best-practices of formal and academic programming and PLT, I can only be honest and say that I know there must be significant gaps in my knowledge.

My goal with posting this here is not to publicize or evangelize for Epoch, but to increase my own understanding both of the theoretical foundations upon which I am working, and of the requirements and desires of real-world programmers who may one day find Epoch an appealing toolset upon which to develop their own work.

A quick overview of Epoch
In a nutshell, I can describe Epoch as follows. Please note that a lot of this is just planned for the future; I'm only just now working my way up to the 11th iteration of the core compiler and VM, so there are many areas that still need shoring up or even first implementations.

• Statically, strongly typed
• Syntactically simple and consistent, although not quite to a Lisp extreme
• Statically, lexically scoped
• Supports both procedural/imperative and purely functional paradigms, via tagging of "pure" code
• Designed to marshal readily to C APIs/ABIs
• Supports object-oriented paradigms with rich "access control" semantics and named subtyping
• Rich support for algebraic datatypes and type inference
• Value semantics for variables with optional reference semantics
• "Null" references are strictly forbidden; algebraic datatypes and Empty/Nothing types are required for expressing optional references or optional function parameters
• Supports closures and higher-order functions
• Fully garbage collected
• Intended to be execution-model agnostic: i.e. can run under a VM, native code, etc.
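The "no null references" bullet may be clearer with a small sketch. Epoch's concrete syntax for this isn't shown here, so as an analogy in Python, an optional value can be modeled as an explicit algebraic-style type rather than a nullable reference (the `Just`/`Nothing`/`find_user` names are my own illustrations, not Epoch's):

```python
from dataclasses import dataclass
from typing import Generic, TypeVar, Union

T = TypeVar("T")

@dataclass
class Just(Generic[T]):
    value: T

@dataclass
class Nothing:
    pass

# An optional value is a tagged union, not a nullable pointer.
Maybe = Union[Just[T], Nothing]

def find_user(name: str) -> "Maybe[int]":
    # The caller is forced to handle the empty case explicitly;
    # there is no null that can slip through unchecked.
    users = {"alice": 1, "bob": 2}
    return Just(users[name]) if name in users else Nothing()

result = find_user("carol")
if isinstance(result, Just):
    print("id:", result.value)
else:
    print("no such user")
```

The point of forbidding null is visible here: the type system makes the "might be absent" case impossible to ignore at the use site.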

And of course I'm sure there are more details I've forgotten!

The Epoch Language Project on Google Code
The full project can be found at epoch-language.googlecode.com. Please feel free to roam around the source repository, issue tracker, and wiki; if you feel exceedingly generous, download Release 10 of the SDK and play around a bit! Everyone is welcome to report issues or other feedback.

Again my goal here is to further my own education and understanding a bit, and I believe one of the best ways to do that is to open my work up for criticism by people who know more than I do!

Thanks for taking the time to check this out, and I look forward to hearing from LtU.

### Sounds interesting

Could you show how to adapt a simple program to different heterogeneous physical architectures?

### Certainly!

My first stab at this was to use what I referred to as language extensions which would take an appropriately annotated block of code and pass it to a cross-compiler for generating native code to execute on different hardware types. For instance, I implemented support for GPGPU systems via CUDA using syntax like this:

```
multiply : (integer(in), integer(factor)) -> (integer(out, 0))
{
    cuda
    {
        out = in * factor
    }
}
```

Obviously for a simple situation this makes little sense, but the syntax is the same for less trivial applications. You just write code in Epoch inside a special "block", and that code magically runs on the hardware type you selected. The plan is to replace specific APIs like CUDA/OpenCL with a more generic "GPGPU" tag that would detect available hardware and do the cross-compile transparently at execution time; think JIT compilation.

As of Release 11, I'll be introducing a new mechanism known as function tagging. This is a uniform, simple syntax for adding semantic annotations to any function. Along with a handful of other improvements to the language, this will eventually permit syntax like the following:

```
multiply : (integer(in), integer(factor)) -> (in * factor) [gpgpu]
```

Along with richer type inference systems and polymorphism, someday I hope to pare that down even further:

```
multiply : (var(in), var(factor)) -> (in * factor) [gpgpu]
```

Note that I can omit the function body block as well as the return type in the first case, and even the parameter types in the second case.

A common approach to writing code against a GPU or Cell-style architecture is the kernel, wherein you write a general algorithm (with minimal branching and state, since those are expensive on such architectures) that does all the interesting work, then you throw massive datasets at the hardware. Ideally the kernel does the same thing (or very similar things) to every data element, which capitalizes on the high-throughput capacities of such hardware.

This can be thought of in perhaps more familiar terms as simply writing an appropriate function and then applying it across a dataset via map/fold/etc. In Epoch, you would write the kernel as a standard function, then map it across your data set with the built-in map function. If your kernel is tagged as needing to run on a GPU, and such hardware is available at runtime, the code will automatically be marshaled over to that hardware, the processing performed, and everything copied back to the host program for continued use (ideally with lazy evaluation semantics where possible, since GPU->CPU memory copies are expensive).
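As a rough analogy (in Python rather than Epoch, and with no actual GPU offload - `scale` and `run_kernel` are invented names), the programming model described above amounts to writing an ordinary per-element function and mapping it over the data, with the runtime rather than the programmer deciding where the map executes:

```python
# A "kernel" is just an ordinary per-element function with minimal
# branching and no shared mutable state.
def scale(x: int, factor: int = 3) -> int:
    return x * factor

def run_kernel(kernel, data):
    # Stand-in for the runtime dispatch Epoch proposes: a real
    # implementation would marshal `data` to a GPU here when one is
    # available and the kernel is tagged [gpgpu]; this sketch just
    # maps on the CPU.
    return [kernel(x) for x in data]

print(run_kernel(scale, [1, 2, 3, 4]))  # → [3, 6, 9, 12]
```

The programmer's contract stays the same either way: a pure per-element function in, a transformed dataset out.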

Unfortunately most algorithms that really work nicely on GPUs are nontrivial and hard to summarize in a short post, but you can get a good idea of what's out there by looking for OpenCL or CUDA example programs (they're easy enough to read even without knowing the DSLs themselves). The kicker with Epoch is that you use a uniform language instead of a DSL to implement your algorithm, and of course relocation to different hardware families is more or less transparent.

### Bling does something similar

Bling does some similar things. We use standard C# with a system of parallel types and normal expression semantics. We've successfully applied the approach to basic shaders (mainly pixel and vertex), but not GPGPU yet. All the code running in C# land basically builds expression trees (the code still looks like normal C# code via fancy operator overloading), and whether code runs on the GPU, and whether it runs in a pixel or vertex shader, is based on the context in which the expressions are used. We try to push everything up as far as possible, so expressions that don't change per pixel or per vertex are computed on the CPU once per shader invocation, expressions that depend on per-vertex data execute in the vertex shader, and expressions that depend on per-pixel data execute in the pixel shader.

GPGPU kernels are a bit more difficult to support (whether via CUDA, DirectCompute, or OpenCL) since the programming model for GPGPU is so idiosyncratic. The functions are easy enough to write, but the awkward kernel hand-offs and the memory model that GPGPU languages enforce make it very difficult to automatically schedule where expressions should run. I would like to know how you are dealing with that; if you are targeting CUDA, you should have run into it already. How does Epoch compare to something like BSGPU (published at Siggraph 2009, I think)?

### Indeed an interesting problem

There's definitely a lot of unexplored territory in the realm of smooth interop between GPU-side and CPU-side code. My personal answer may be a bit of a cop-out, but it seems to be the most pragmatic approach for the time being: let the individual end-programmer control as much as possible.

The idea of tagging functions as [gpgpu] or running select statements inside a gpgpu { } code block is to explicitly delimit where things are actually invoked and marshaled back and forth. My key observation here is that different GPGPU kernel implementations - even of similar algorithms - can have highly differing behaviours when it comes to when and how data is copied between the two realms, and even when and how the computation itself is performed.

By way of cheating, then, I leave the details of this largely up to the kernel implementer. Rather than trying to optimize the interplay of buses, memory models, execution latency, and all that automatically, I leave things open for rearrangement by the programmer. In this way it is possible to achieve dramatic differences in performance by simple tweaks of the raw Epoch code, e.g. manually hoisting constant or rarely-changing expressions outside of a gpgpu { } code block and passing their values in manually via marshaled data.
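The manual hoisting mentioned above can be made concrete with a small sketch (Python standing in for Epoch; the kernel body and names are made-up examples). The idea is the same one the post describes: compute loop-invariant values once on the host side, and pass only their results into the hot region:

```python
import math

def kernel_unhoisted(data):
    # Naive version: the loop-invariant expression is recomputed for
    # every element; in a gpgpu { } block it would also be recomputed
    # (or re-marshaled) inside the offloaded region.
    return [x * (math.sqrt(2.0) * 0.5) for x in data]

def kernel_hoisted(data):
    # Hoisted version: compute the invariant once on the "CPU side"
    # and pass only its value into the per-element work.
    scale = math.sqrt(2.0) * 0.5
    return [x * scale for x in data]

data = [1.0, 2.0, 3.0]
assert kernel_unhoisted(data) == kernel_hoisted(data)
```

On a CPU a decent compiler does this automatically; the post's point is that across a CPU/GPU marshaling boundary, leaving the rearrangement to the programmer is currently the pragmatic choice.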

(Another way to say this that might be more honest: I don't personally feel smart enough to attempt to do this stuff automatically, at least not from day one!)

I think the field is fertile for research into how to do all this automatically at some point in the future; I frankly don't see us backing away from AMP any time soon. It's just too promising a model. But for now, my personal opinion is that we need to toy with things manually for a while until we develop a better intuition for how all this stuff interacts, so that we can start working on sophisticated automatic code improvement.

I see it as really strongly parallel to the development of optimizing compilers for superscalar CPUs. For a few years there it was a black art to do things cleverly and optimally on a superscalar architecture, and then at some point we crossed that threshold where enough people had enough good understanding and intuition about those issues that building a compiler to do it automatically became practical. And nowadays it's rare to hear of anyone actually getting gains by, say, writing inline assembly language into their C++ apps. In a nutshell, I think we'll see a similar evolution happen in the GPGPU space.

Oh, I tried to find the SIGGRAPH paper you mentioned but didn't have any luck; unfortunately I don't have access to the full proceedings at the moment so I'm not really able to provide a good comparison to their methods, sorry.

### You still didn't mention how

You still didn't mention how you would do kernel hand offs. I got the name wrong, here is the name of the paper:

BSGP: Bulk-Synchronous GPU Programming, here is an LTU discussion with link:
http://lambda-the-ultimate.org/node/3709

Even if you are scheduling manually, you would still have lots of challenges to face. Writing a non-trivial CUDA kernel is very difficult: you have to manually chop up your data, manually create buffers to transfer results between kernels, and so on. Right now, it's much more difficult than functional origami.

### Not my forte I'm afraid!

Composition of kernels into nontrivial algorithms isn't an area where I have any substantial experience yet, unfortunately. As I roughly alluded to above, my general plan of attack (however naive) has been to build the simplest thing I can first, and then incrementally add complexity and semantic richness on the Epoch side as my own intuition about what's going on under the hood develops.

I've run into the difficulties mentioned here before, and the motivation for the BSGP paper (thanks for the link btw) is eerily familiar. I'm afraid though that I haven't had the chance to truly look for solutions to these issues yet. The BSGP stuff looks highly promising as it offers a similar notion to what I'd like to accomplish with Epoch; at the very least there's some valuable tools in there for making things a bit more potent on my end.

In any case... as I said off the bat, I'm not really here because I've got it all figured out (much as I wish that were the case!). Rather, I'm here hoping for exactly what you've provided in your responses - opportunities to find things that I don't know about, so I can go learn them.

### Leveraging GPUs involves very

Leveraging GPUs involves very intense trade-offs. Before even getting to the issue of programmability, getting data on and off the GPU is very expensive, so it only makes sense to run fairly large kernels there. Your algorithm must be tailored to the GPU and you have to have the right problem (e.g., lots of matrix multiplies); otherwise you won't see any benefit and your performance will suffer badly. Once you have the right problem, you need to write your GPGPU code, which requires a lot of understanding of the architecture, along with intensive tuning, because you almost never get it right on the first try. Seeing as the laws of physics are against you, you have to consider Epoch's scope carefully so that your goals stay reasonable.
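The transfer-cost argument lends itself to back-of-the-envelope arithmetic. The numbers below are purely illustrative assumptions (a 10x GPU throughput advantage and a few GB/s of bus bandwidth), not measurements of any real hardware:

```python
# Illustrative model: offloading pays off only when the compute time
# saved exceeds the time spent copying data over the bus.
BUS_GBPS = 4.0      # assumed PCIe-class bandwidth, GB/s
CPU_GFLOPS = 20.0   # assumed CPU throughput
GPU_GFLOPS = 200.0  # assumed GPU throughput (10x the CPU)

def offload_wins(gigabytes: float, gigaflops: float) -> bool:
    transfer = 2 * gigabytes / BUS_GBPS  # copy data in and results out
    cpu_time = gigaflops / CPU_GFLOPS
    gpu_time = gigaflops / GPU_GFLOPS + transfer
    return gpu_time < cpu_time

# A tiny kernel over a big buffer loses; a compute-heavy one wins.
print(offload_wins(1.0, 1.0))    # → False
print(offload_wins(1.0, 100.0))  # → True
```

However crude, this captures the point above: with transfer costs fixed by physics, only sufficiently compute-dense problems belong on the GPU at all.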

The authors of BSGP are graphics guys who are getting into languages. They understand the nuances of GPU programming, but maybe don't have the language background. Perhaps it would actually take both backgrounds to crack open GPGPU, but the pesky problem of limited problem scope would remain.

### OpenCL looks very

OpenCL looks very interesting. I would even say that a new programming language should consider building GPU computing support tailored for OpenCL right into the language.

### OpenCL ~= CUDA

As far as I'm aware, OpenCL and CUDA are basically just different syntactic approaches to much the same problem: how to write general-purpose algorithms in a low-level format that's suitable for execution on dedicated graphics hardware. Although OpenCL, especially with the 1.1 standard, has branched out a bit beyond the scope of CUDA in a few areas (notably "host"-side multithreading awareness and so on), they remain largely similar in terms of capabilities and focus.

My goal is actually to support both OpenCL and CUDA at some point in Epoch, via what will likely be a general-purpose compiler that targets C99-like languages (which include OpenCL and CUDA). The abstract side of scheduling and kernel data transformations/hand-offs (as Sean has been alluding to) is likely to remain similar enough across targets that most of the transformations and code improvement can be done at the AST level; "platform"-specific optimizations can then be done at the lower level of emitting the actual OpenCL/CUDA code, as well as possibly architecture-specific optimizations depending on the final target hardware/shader level (much the same as we optimize for CPU feature levels now).
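A minimal sketch of that AST-level approach might look like the following (all names and the tiny expression language are hypothetical, and real OpenCL/CUDA kernels need far more boilerplate than shown - this only illustrates sharing one expression core across two C99-like backends):

```python
from dataclasses import dataclass

# A toy expression AST shared by every backend; target-independent
# transformations would operate at this level.
@dataclass
class Var:
    name: str

@dataclass
class Mul:
    left: object
    right: object

def emit_c99(expr) -> str:
    # OpenCL and CUDA are both C99-like, so a single emitter covers
    # the expression core; per-target differences live in the wrapper.
    if isinstance(expr, Var):
        return expr.name
    if isinstance(expr, Mul):
        return f"({emit_c99(expr.left)} * {emit_c99(expr.right)})"
    raise TypeError(expr)

def emit_kernel(name, params, body, target):
    qualifier = {"opencl": "__kernel void", "cuda": "__global__ void"}[target]
    args = ", ".join(f"int {p}" for p in params + ["*out"])
    return f"{qualifier} {name}({args}) {{ *out = {emit_c99(body)}; }}"

ast = Mul(Var("in"), Var("factor"))
print(emit_kernel("multiply", ["in", "factor"], ast, "opencl"))
print(emit_kernel("multiply", ["in", "factor"], ast, "cuda"))
```

The design choice mirrored here is the one from the paragraph above: keep the interesting work target-agnostic, and push target-specific knowledge to the last emission step.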

Another juicy possibility is integrating OpenMP support directly into the language, for things like distributed/parallel for, map, reduce, and so on.
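By analogy (Python's `concurrent.futures` standing in for what OpenMP-style language support might feel like; `square` is an invented example), the parallel-map idea is roughly this - the programming model stays a plain map, and only the executor changes when parallelism is wanted:

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    # Per-element work; an OpenMP-style `parallel for`/`map` would fan
    # this out across cores without hand-managed threads.
    return x * x

# Swapping a serial map for a pooled one leaves the call site intact.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(square, range(8)))

print(results)  # → [0, 1, 4, 9, 16, 25, 36, 49]
```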

### CUDA is generally faster

CUDA is generally faster than OpenCL and DirectCompute because it evolves much more quickly, in sync with NVIDIA hardware. OpenCL is probably about one or two generations behind. Apple made a bet on OpenCL and it doesn't seem to have panned out; the programmability benefits aren't really there, while anyone who really cares about performance is doing GPGPU on NVIDIA hardware (AMD is rather left out right now).

### Agreed

A practical benefit to my lazy/ignorant/naive approach to GPGPU support in Epoch is that it keeps a lot of the tough stuff - like deciding which algorithms to even target towards the GPU, as well as much of the implementation detail - in the hands of the end programmer. I like this for a few reasons: it keeps me from having to promise too much up front, it leaves room for experimentation, and it doesn't close over the capabilities of the language framework in such a way as to preclude building more powerful GPGPU/etc. abstractions into the language itself in the future.

In an ideal world, I'd love to team up with someone who does have the GPGPU chops to tackle some of these issues, and work alongside them to create the linguistic abstractions that make the most sense. To me it seems like the most pragmatic way to do that is to produce a sort of "teaser" as it were, with the minimalistic prototype of Epoch showing some of the possibilities of what can be done - and then saying "hey, look, if we combine this really nice language with some cool automatic kernel transformations ala BSGP, think of the possibilities."

Of course there remains a significant chance that there's a far better approach, in which case I'm perfectly open to alternative suggestions!

### debugwritestring? Ouch!

If you really want to attract programmers at some point, you'd better use terser names, such as 'log'.
I'm sorry to dwell on such a minor detail, but it really hurt my eyes...

### Terseness is not a virtue.

Terseness is not a virtue.

### I think this needs some qualification.

If the goal is to make code as clear and readable as possible, then striving for economy of keystrokes will often not get you there. For example, saying #define HPD 24 instead of #define HOURS_PER_DAY 24 will hinder readability. HPD could mean a lot of things, and a Google search doesn't even list "hours per day" on the first page.

However, the complaint was (emphasis added):

but it did really hurt my eyes..

Here, debugwritestring is not only uglier than log, but also makes the routine appear more specific than it (probably) really is. The "debug" part would discourage me from using it, as filling my code with "debug" statements makes it look like I can't trust my program. And writestring? What else are you going to write - an int? A format string? Does Epoch lack adequate support for parametric polymorphism to be able to use "log" for all these things? Or is it just an identifier that could use some shortening?

The advantage of terseness is that it's less to wade through. As long as the short name isn't cryptic, and especially if the short name more fully describes the function's purpose, the short name is often less painful to look at.

### :-)

Yeah, there's some bootstrapping cruft littered across the language and the intrinsic functions/standard library/prelude/whatever you want to call it.

Release 12 should include namespace support, at which point debugwritestring will become debug::log, which can in turn be imported into the global namespace as just log.