## Languages for SIMT Architectures

I think SIMT is going to be added to the traditional CPU, much like SIMD already has been. In any case SIMT already has a foothold in the GPGPU architectures which are SIMT.

SIMT is where a single instruction is executed by multiple threads. It differs from SIMD in that each execution unit has its own register set and control flow, but shares instruction fetch and decode, whereas in SIMD registers and control flow are shared. This needs special care to deal with conditional branches. Generally only if/then/else or switch style conditional branches are allowed and as all instructions are shared by all SIMT units, they have to ignore the instructions in the branch they are not taking. In the case of loops you can have early termination, but all units wait for the last one to exit the loop (effectively instructions still get fed to the units, but they are ignored). This reminds me somewhat of the conditional execution of any instruction in the original ARM instruction set - however I think this was dropped in ARM64 in favour of traditional branches because branch prediction results in faster execution in the (more common) single threaded case.

I am interested in languages and compilers designed to take advantage of SIMT architectures, and that may make efficient programming for such architectures easier for the programmer and compiler technologies that may make things more efficient for the processor to execute. I wanted to start this topic to see what has been done in this field, and collect links to relevant papers. Please post if you know of any, come across any, or have any ideas for how to handle SIMT.

## Comment viewing options

### ISPC

ISPC is a compiler for a GPU-like SPMD programming model, targeting SIMD CPU hardware. There is an accompanying paper from Matt Pharr and Bill Mark.

ISPC borrows some ideas from shading languages like the RenderMan Shading Language (which itself was motivated by the concerns of targeting Pixar's custom SIMD hardware), such as a "uniform" keyword to indicate that data is invariant across the program invocations.

Unlike most GPU SPMD languages (CUDA, OpenCL), ISPC mandates fairly strict rules about how operations across the program invocations are ordered (more or less conforming to the informal notion of "lock-step" evaluation). This leads to a restriction in the control-flow constructs allowed by ISPC, which conforms closely to your informal characterization of SIMT.

In contrast, "true" SIMT in the style of CUDA or OpenCL doesn't constrain control flow this way (both support unconstrained "goto", although CL doesn't mandate support for irreducible loops). These GPU SIMT languages also don't give as strong of guarantees about "happens-before" relationships for program invocations in a group.

Approaching GPU SIMT languages with any assumption about how they are implemented (e.g., the simple notion that when an "if" condition is varying, the processor "executes both sides" and then resumes at the end), will eventually lead to difficult-to-diagnose bugs. These bugs are quite similar to what happens when people make the naive (but understandable) assumption of sequential consistency in languages that do not guarantee it.

### CUDA SIMT

Ultimately SIMT executes a single instruction in all the multiple threads it runs – threads share program memory and fetch / decode / execute logic. When the threads have the same flow – all run the if, nobody runs the else, for example – then they just all run the code in if at full throughput. However, when one or more threads have to run the else, they'll wait for the if threads. When the if threads are done, they'll in turn wait for the else threads. Divergence is handled correctly, but slowly. Deeply nested control structures effectively serialize execution and are not recommended.

### This is kind of my area…

Yes, many people share this naive idea of what the "divergence" behavior is or should be. I myself had such ideas when I was involved in the early days of the CUDA project, but was quickly disabused of them.

If you are up for getting your information from blogs, I'll point you at what I wrote on the subject, based on my experience working on various GPU architecture and compiler projects.

I'll reiterate that the SIMT "divergence" concept is quite similar to memory models. We all have our intuitive assumptions, which tend out to be wrong in practice, and the real guarantees are a subtle combination of architecture, compiler, and language.

### Instruction Order

I get your point now, the branches may not get executed in the order you expect. In fact you may find different cores execute the branches in different orders, or groups of cores use time they would be waiting to execute branches during which the 'other' group of cores would be waiting, effectively splitting the warp in two and getting rid of both wait times.

Reading your blog, the problem is not the SIMT model, but exposing the instructions that break the model. That was really a design mistake, and a SIMT language would have to find a way of making that impossible, possibly by finding an alternative concept to locks that indicate the desired behaviour.

It seems more like the hardware design was missing some logic, such that branch order can only be divergent if there are no locks in the branch. This follows the general rule that optimisations can only be applied where they do not change the semantics of the code.

### Instruction that break the model

Reading your blog, the problem is not the SIMT model, but exposing the instructions that break the model. That was really a design mistake, and a SIMT language would have to find a way of making that impossible, possibly by finding an alternative concept to locks that indicate the desired behaviour.

When you talk about "instructions that break the model," you mean stuff like atomic read-modify-write instructions. For that matter, even basic memory stores (whether to global or shared memory) can end up revealing implementation details in ways that make code non-portable or non-deterministic.

Note that these same operations also create similar problems for MIMD architectures/languages. The only additional wrinkle that SIMT brings to the table is that thread scheduling is typically unfair, and does not guarantee independent forward progress.

As one more note: it isn't just about the two branches of an if getting executed in an unexpected order, it is also about the (surprisingly hard to shake) assumption that execution "re-converges" immediately after that if statement. This means we tend to assume that a load after the if (no matter what thread executes it), can see stores performed by code in the then/else clauses by threads in the same warp/wavefront.

That assumption is not held up by the specification of any of these programming models; they are all clear that one must use a barrier-synchronization instruction both to enforce order of execution, and to guarantee visibility of memory operations.

Even if an architecture automatically provided the assumed ordering property, there is nothing in the specifications to stop a compiler from moving (or tail-duplicating) a load into the if, and breaking the assumed ordering. Again, this is just like how memory models require cooperation between the programmer, compiler and architecture.

I don't want to belabor this point, though. I'd like to post some more papers in this thread when I get a chance. There are definitely quite a few that speak to this "divergence" problem that should be added...

### Barriers

Surely it is the responsibility of the compiler to put barriers in the correct places to preserve the semantics of the programming model. In my opinion the programmer should always be able to assume instructions happen in the order written, and the compiler should insert a barrier to ensure that is the case. This way programs can never be 'wrong', that is the programmer is not surprised by the semantics. Yet if the programmer understands this they can still write optimally fast code by not immediately reading data that depends on the 'if' branches afterwards. In other words the naive programmer should be rewarded with correct but slow code (not incorrect code), and the aware programmer with correct and fast code.

### IMP

I have been working on a programming model that explicitly targets data parallel operations. (Not in the limited sense, think multi-billion node finite element mesh over thousands of processors.) I've been targeting MPI & DAG-based backends, but, come to think of it, SIMT should be a possible translation backend too.

### Re: IMP

The elevator pitch is intriguing. I know that you have already posted once about your work, but maybe there are some new documents that could be of interest to the LtU crowd, and deserve a forum topic?

A side remark: I don't think that "IMP" is a very good name, as it is already used in various compilation and semantics courses to designate toy IMPerative languages.

### IMP documents

It's a project that is very much under development, so the way the ideas look on the outside can still change. The inside is pretty much fixed. For a quick introduction, look at the lecture slides and the "gentle introduction" documents on the home page.

Btw, the "LtU crowd" will probably find this less interesting because I've implemented it as a library rather than a language. That is mostly because new languages have a horrible time getting adopted in scientific computing. However, my ideas could be implemented as a language. But that's way down on my list of things to do.

### But even (some of) your

But even (some of) your documents describe it as an (embedded) DSL — why would that be less interesting on LtU?

By quickly skimming at your code examples, however, it seems you could make your library more usable by actually following the example of embedded DSLs (or maybe different code snippets would look more DSLy).

It's hard to define technically what's different between an embedded DSL and a library; I'd say that looking at existing embedded DSLs and reusing ideas will teach you ways to make your library more usable.

However, you're using C++; while high-performance embedded DSLs are *possible* in C++ (see expression templates, or Boost::Spirit), I can see that implementing them is rather painful. Last I tried Boost::Spirit, using it had quite a few problems, though they might not apply to your scenario, and I hear Spirit V2 addressed many of them.

### Maybe it's time to abandon

Maybe it's time to abandon that DSL terminology. For now it's a library.

At the moment I'm more interested in the ideas of parallel computing than their syntactic expression. But I see that I'm not getting much traction here....

### Hey, I am interested!

Please keep posting; the more the merrier and I am interested.

You're specialized. It takes some work to figure out what you're doing and it is somewhat hidden in a library.

Best of luck, keep up the good work, and keep posting!

### Thanks for the encouragement

I think the shortest I can phrase it is like this:

In scientific computing, a lot of code is really executing a sequential program, it's just that the objects are spread out over thousands of processors. I have come up with a way of coding the algorithm sequentially, while still getting the efficiency in execution of the traditional programming systems.

Clarifying notes:
1. traditional systems are SPMD (single program multiple data) or CSP-ish: you write a single thread/processor/node program with explicit communication/synchronization instructions with the other processes that also run that one code.

2. what makes this stuff tricky is that
2a. the sequential description should not imply synchronized execution: processes can run as far out of sync as their local dependencies allow, and
2b. it is not trivial to go from a global description to the minimal set of local process dependencies. that's where my theoretical innovation comes in.

3. you can not assume shared memory. everything is assumed to be distributed and NUMA.

4. it is not allowed to rely on magic middleware. my system derives the precise communication/synchronization relations. for instance, there is no "directory service" or something you can consult to find where a bit of data resides: if you need it, you should be able to derive it.

### Short and sweet

Theory statement

The innovation lies in recognizing and formalizing the `beta-distribution'. For that, I needed to define distributions opposite to whatever anyone else did, and define the concept of the signature function of a data parallel operation.

### So what does the code look

So what does the code look like that you write? Just normal code, but you have to be mindful of the way it will be compiled to understand performance characteristics?