Compiler Technology for Scalable Architectures

Seems like an interesting (if not very in-depth) read.

Our aim is to automatically generate high-quality code that takes advantage of the wide range of heterogeneous parallelism in Scale-Up and Scale-Out architectures. We propose "single source" compiler solutions for heterogeneous memory and computational subsystems using automatically partitioned code and data, as well as software-managed caches for irregular data accesses. We exploit parallelism at all levels, including data- and task-level parallelism as well as SIMD parallelism.

One such heterogeneous platform is the Cell Broadband Engine (TM) (hereafter referred to as Cell), which includes a Power-Architecture processor and eight attached streaming processors with their own memory and DMA engines. In addition, each processor has several SIMD units that can process from 2 double-precision floating-point values up to 16 byte values per instruction.
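The "2 doubles up to 16 bytes" range follows from the register width. Here is a minimal sketch of the arithmetic, assuming 128-bit (16-byte) SIMD registers; the names `SIMD_REGISTER_BYTES` and `simd_lanes` are made up for illustration and are not part of any Cell toolchain.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: one SIMD register is 128 bits (16 bytes) wide. */
enum { SIMD_REGISTER_BYTES = 16 };

/* Hypothetical helper: how many elements of the given size fit in one
 * register, i.e. how many values a single SIMD instruction can process. */
static size_t simd_lanes(size_t element_bytes) {
    assert(element_bytes > 0 && element_bytes <= (size_t)SIMD_REGISTER_BYTES);
    return (size_t)SIMD_REGISTER_BYTES / element_bytes;
}
```

With an 8-byte `double`, `simd_lanes(sizeof(double))` gives 2, and `simd_lanes(1)` gives 16, which matches the figures quoted above.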

We propose techniques that include compiler optimizations that partition data and code to run on the multiple heterogeneous processor elements in the system, automatic generation of SIMD code, and other specialized optimizations for the processor elements in the Cell architecture. Measurements indicate that significant speedups are achieved with a high level of support from the compiler.

Don't know if this is where you just came from, but there's an article on Ars Technica about this (or something closely related) that was just posted to Slashdot.

SIMD-capable compilers?

Are there currently any compilers that generate SIMD code? I mean compilers that take an ordinary loop working on an array of bytes and transform it into SIMD instructions. I'm pretty sure my C++ compiler can't. Does anybody know about Intel's compilers?



Maybe my question wasn't precise enough: Is there a compiler out there that can transform code like

BYTE *src1, *src2, *dst;
for (size_t index=0; index != len; index++)
    dst[index] = src1[index]+src2[index];

into a loop using SIMD instructions that process 8 or 16 bytes at a time? Possibly with saturation?

According to your link, VC++ can only use single SSE instructions to optimize single-precision FP or int64 arithmetic. But that's strictly single instruction, single data, as far as I can see.
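To pin down what "with saturation" would mean for the loop above, here is a scalar reference sketch in C99. It swaps `BYTE` for `uint8_t` and assumes unsigned bytes; the function names are made up for illustration. It only defines the behaviour a vectorized version would have to reproduce. Whether a given compiler turns it into packed saturating adds (SSE2's PADDUSB, for instance) depends on the compiler, version and flags, and is not claimed here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Saturating unsigned byte add: sums above 255 clamp to 255 instead of
 * wrapping around modulo 256. */
static inline uint8_t add_sat_u8(uint8_t a, uint8_t b) {
    unsigned sum = (unsigned)a + (unsigned)b;
    return (uint8_t)(sum > UINT8_MAX ? UINT8_MAX : sum);
}

/* Element-wise saturating add of two byte arrays.
 * `restrict` promises the compiler that the arrays do not overlap, which
 * removes one common obstacle to vectorization. */
void add_bytes_saturating(uint8_t *restrict dst,
                          const uint8_t *restrict src1,
                          const uint8_t *restrict src2,
                          size_t len) {
    assert(len == 0 || (dst != NULL && src1 != NULL && src2 != NULL));
    for (size_t i = 0; i < len; i++)
        dst[i] = add_sat_u8(src1[i], src2[i]);
}
```

With inputs 200 and 100 the result is 255, where the plain `+` in the original loop would wrap around to 44.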

Automatic vectorization

I think Intel's C++ compiler does a little automatic vectorization. And then there is of course VectorC, whose raison d'être is automatic vectorization.


GCC versions 4.0 and later supposedly do, through the -ftree-vectorize flag: get the details here.
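For anyone who wants to try it, here is a sketch of the byte-add loop from the question in a form that tends to be friendlier to an auto-vectorizer: fixed-width types, a simple counted loop, and `restrict` to rule out aliasing. The function name and the file name in the comment are made up, and whether the loop actually gets vectorized depends on the GCC version and the target, so treat this as something to experiment with rather than a guarantee.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Possible build line (file name is hypothetical):
 *   gcc -std=c99 -O2 -ftree-vectorize -c add_bytes.c
 */

/* Element-wise wrapping add of two byte arrays (result is modulo 256),
 * matching the semantics of the original loop. */
void add_bytes_wrapping(uint8_t *restrict dst,
                        const uint8_t *restrict src1,
                        const uint8_t *restrict src2,
                        size_t len) {
    assert(len == 0 || (dst != NULL && src1 != NULL && src2 != NULL));
    for (size_t i = 0; i < len; i++)
        dst[i] = (uint8_t)(src1[i] + src2[i]);
}
```

Inspecting the generated assembly is the most direct way to check whether the compiler emitted SIMD instructions for the loop.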