## Intel's Array Building Blocks (was Rapidmind) : What do the purists and pragmatists think?

Increasing programmer productivity has been a grand challenge since time immemorial. In current times, every program written has to be parallel to take advantage of many core processors. The next great parallel programming language is still waiting to get invented. With Array Building Blocks (ArBB), Intel has taken a library based approach to parallel programming. ArBB uses a mix of clever C++ operator overloading and arguably ugly preprocessor macros like _if, _for, etc to mimic C keywords to let programmers express parallel programs. People who know what they are doing could indeed write portable and scalable programs with ArBB. However, novice programmers could easily trip themselves because of the fragile nature of the macros. So what do language purists out there think? Is it wise to promote such unsafe approaches to parallel programming?

## Comment viewing options

### I don't know about the

I don't know about the multicore side of ArBB -- there are a lot of safer solutions out there (see my recent OoOJava thread) -- but the SIMD one is exciting (as well are other recent Intel CC extensions for SIMD).

SIMDization (SSE/AVX, not SIMT/GPUs) is a domain where, at this point, something is better than nothing. Life could be better -- e.g., NESL -- but, unfortunately, there are no fully integrated solutions. In my case, I want to use SIMD instructions on tree computations: I have no real option and am thus actively experimenting with both a whole-cloth language and (what I can't discuss) more low-level approaches. Conspicuously, Data Parallel Haskell, a high-level solution, performs flattening... but only exploits it for multicore.

### In my case I needed to

In my case I needed to compute the euclidean distance squared between one 16 dimensional vector v and an array of vectors w[i]. This is what I came up with:

__m128 p1 = _mm_load_ps(v);
__declspec(align(16)) float temp[4];

for(int i=0; i<n; i++){
__m128* pnt = (__m128*)w[i];
_mm_prefetch((const char*)pnt+1400, _MM_HINT_T0);
__m128 a = _mm_sub_ps(p1,pnt[0]);
__m128 b = _mm_mul_ps(a,a);

a = _mm_sub_ps(p2,pnt[1]);
a = _mm_mul_ps(a,a);

a = _mm_sub_ps(p3,pnt[2]);
a = _mm_mul_ps(a,a);

a = _mm_sub_ps(p4,pnt[3]);
a = _mm_mul_ps(a,a);
a = _mm_shuffle_ps(b,b,_MM_SHUFFLE(3,2,1,0));
a = _mm_shuffle_ps(b,b,_MM_SHUFFLE(1,0,3,2));
_mm_store_ps(&temp[0], b);
float dist = temp[0];
// do something with the distance
...
}

This is 5x faster than the scalar version. How would you write code that compiles to this (or better ;) in a higher level language? I'm skeptical that you can make a language high level while also being able to express these kinds of low level optimizations.

### Maybe you should try this in

Maybe you should try this in ArBB, and see if the dynamic compiler can get this (or better).

### For something at this level

For something at this level of regularity, I rather write it in matlab and have it automatically vectorized. SIMDization of regular code is largely already handled more clearly with recent Intel CC extensions and automatic prefetching for regular strides is more of a product than research question. I didn't look at the shuffle closely enough to see what's going on for that.

My particular challenge with SIMDization is with more irregular structures. For example, I suspect much of the XML processing out there is SIMDizable; writing it as you did may be possible but a tough sell. The two transformations I'm looking at are lifting fields from a list of children to their parent -- local simdization -- and, more tricky, non-local SIMDization of task-yet-data parallel code (which is attempted by the pure macroscopic flattening transformations of the higher-level languages I mentioned above).

### We have 16 components (v[k]

We have 16 components (v[k] - w[i][k])^2 that we need to add up. I add these up pointwise in groups of 4, and then you're left with an xmm register containing 4 numbers [a b c d] and we need a+b+c+d. It works like this:

First shuffle to [d c b a] and add to the original getting [a+d b+c b+c a+d]. Then shuffle again to [b+c a+d a+d b+c] Then add again getting [a+b+c+d a+b+c+d a+b+c+d a+b+c+d]. This code runs 270% faster than by extracting the xmm register to an array and adding up the numbers one by one. That is, if you SIMDize everything except the addition of the final 4 floats (which requires the shuffling, which a compiler probably wouldn't come up with), then the code spends 70% of its time adding these 4 floats, and only 30% of its time doing 16 subtractions, 16 multiplications and 12 additions.

It is impossible to write this code in matlab efficiently. This calculation is part of an algorithm. The calculation gets called on millions of different arrays w during the execution of the algorithm.

I'm going to try ArBB and see what it comes up with.

What about XML processing do you think is SIMDizable? This seems to me an example of something that is probably not SIMDizable?

### It is impossible to write

It is impossible to write this code in matlab efficiently.

I was proposing that a Matlab-like frontend could be feasibly used to automatically vectorize this, not that Matlab does, similar to Andrew's post below. This is NESL/DPH territory, essentially.

Even for what you wrote, I think there could be better language support. Even if vectors are being used, 1) a better vector notation and and 2) automation of some data munging tasks (like dealing with fixed-size registers when your data has a different size). The new Intel compiler does a bit of both, and, historically, the community seems to have actually regressed on the second point (some architectures had a vector length register that compilers inspected at runtime, so users wouldn't even think of hardcoding such munging).

What about XML processing do you think is SIMDizable? This seems to me an example of something that is probably not SIMDizable?

There are different levels of this. Of most direct note, I heard some of the newer SSE instructions actually were explicitly introduced for string handling. However, my interest is in the trees themselves, particularly on processing webpage layouts, where I've found two basic forms: there is often a computation of a parent node on child nodes, such as mixtures sums, maxes, and mins, and often separate nodes process their respective children in a similar fashion, such as both computating the same mixture of sums, maxes, and mins. While not my particular approach, I don't see an XML extension for NESL/DPH as being too crazy.

### For example, I suspect much

For example, I suspect much of the XML processing out there is SIMDizable

I would suspect that if you had a DTD for the XML document, you could automatically specialize the parser in a way to give dramatic performance boosts. But I would also suspect that XML Accelerator hardware already takes advantage of schema-less optimizations through the use of SIMD.

### HLL

Your skepticism is well placed: the purpose of a high level language would be to avoid needing these kind of low level optimisations.

The actual code that a HLL would allow you to write would be:

for wi in w :
ds = zip([ dsj-wij for dsj,wij in zip(wi,dsj) ])
r = sum([ d*d for d in ds ])


The question of how to optimise that onto SSE should be an issue for the compiler. The language design should simply let the programmer express the problem cleanly.

### Such a compiler is exactly what I'm skeptical of.

Such a compiler is exactly what I'm skeptical of. Delegating the problem to a nonexistent compiler is not really helping my program run in 1 hour instead of 5 :( Although of course I do agree that it would be a good thing if such a compiler existed.

The problem with these high-level-to-efficient-code compilers is that you inevitably run into cases where the compiler is not able to optimize as well as you'd like. In those cases you want to be able to get down to the lowest level and do it yourself. For code working at such a high level it's not at all obvious how you'd integrate manually written pieces of low level code.

I personally prefer the notation sum((v-w[i])**2), as available in e.g. numpy.

### Two questions

So really then you are asking two different questions:
1. How can I solve this problem now, using the available tools?
2. What are the right tools to design that can solve this problem?

For one of these two questions delegating the problem to a non-existant, but otherwise ideal, compiler is of course the wrong answer. For the other question asking what that compiler should look like is of course exactly the right place to start.

Manually scheduling instructions that group the necessary operations together into vectors is not something a HLL would allow. But then again, if it has a sophisticated enough compiler to perform that scheduling then it is not something that is necessary for the programmer to do. In the absence of such a compiler it is indeed necessary to explore more pragmatic approaches.

There is clearly a compiler research question here in how to bundle the operations together into something suitable for a SIMD target. But I'm not sure that see a language design problem here - if the programmer is given imperative control over bundling of data and vectorisation of instructions then it is not clear that there is any improvement over current techniques with either macros / inline-assembly / intrinsics.

The programmer is still responsible for finding a schedule / partition of the data that results in high-performance and the result won't translate easily onto another architecture. Any HLL that provides explicit control over the vectorisation seems to missing the "HL".

### Glasgow, Coconut, Sse compiler run...

You might look into Don Stewart's interesting work in Haskell. For example, he implemented a SIMD optimized Mersenne Twister for Haskell.

GHC can (and will, if passed the appropriate compile-time flags) support MMX and SSE based SIMD, as part of its loop-fusion optimizations. The main issue is that it isn't always clear that the optimizations will be applied - you need to peek at the generated code. It would be nice if we could declaratively annotate/constrain our code with desired optimizations and see warnings or errors when they fail.

You might also look into Coconut, which claims to achieve better performance than C's SimdMath.

### Isn't this missing the challenge?

The question of how to optimise that onto SSE should be an issue for the compiler. The language design should simply let the programmer express the problem cleanly.

I see the HLL challenge as:

1. Provide clean specification of what the algorithm is supposed to do
2. Use reusable optimization bindings for the structure and computation of the algorithm
3. If necessary, continue refining the optimization using manual additions to the optimization chain

These should all be expressible in the same language. This is approximately how Burstall and Goguen, and later Hoare, intended it.

### Don't see the connection

The three challenges that you list are relatively uncontroversial. I suspect that challenge one is universally agreed upon among language designers. Challenge two has some good benefits in terms of reusing well-known results so it is very widespread, although pragmatism wins out. Challenge three is certainly desirable although it tends to be less common than the first two.

Now where I lose your line of argument is that these three challenges should all be expressible in the same language. While I can accept the desirability of these three goals easily I don't see that it follows that one language should cover all three cases for use. In fact I would argue that your second two challenges are suited to a language of graph manipulation. The language that provides the clean specification may, or may not, be defined naturally over graphs - it depends on what the target domain is. In fact depending on that domain graphs may not even be the right internal representation (IR). But there will be some IR that does work for the problem domain, and the second two challenges will be best met by a language specialised to manipulate terms in that IR.

The point that I would make is that there is no need for the language provided to the programmer to coincide with the language that manipulates the provided programs. Unless the programmer is working on a problem similar to the optimisation of their program it seems evident that a language tailored to a clear expression of their problem is not tailored to optimising their program. This is one case where eating your own dog-food can be a problem; if you are actually employed in a factory making food for cats.

### I agree with "something is

I agree with "something is better than nothing" for SIMD. The only other options I can see for writing SIMD are to write straightforward loops and hope that the host compiler would generate SSE, or write the code using SSE intrinsics.