## Coconut : Haskell code assembly on the cell processor

This doesnt seem to have ever made it to ltu.

The abstract:

Coconut is a developing system for high-assurance, high-performance
software. It was used to develop a library of special functions for
the Cell BE processor, which is distributed in the Cell BE SDK 3.0 as
MASS. Average performance is 4X better than the alternative
hand-tuned C library, SimdMath.

Coconut has been successful where patterns of efficient
hardware-specific computation can be captured as higher-order
functions and encoded in a Domain Specific Language embedded in
Haskell. Patterns include efficient control structures not
expressible in C, e.g., the MultiLoop, and efficient uses of SIMD
instructions which require significant compile-time computation for
pattern specialization. Some patterns interact with a novel
instruction scheduler called Explicitly Staged Software Pipelining,
based on a min-cut approach, which outperforms SWING modulo scheduling
in our tests.

A less developed aspect of Coconut is the parallel production of
proofs of correctness along with executables. Current work aims to
prove only limited properties about programs---the ones most likely to
be broken---creative use of SIMD instructions, and parallelization.
Coconut intermediate code is represented as nested code (hyper)graphs.
At the lowest level, we transform acyclic loop bodies to remove the
effect of SIMDization, and produce machine and/or human readable
specifications. This has been used to verify opaque patterns of
optimizing linear algebra for SIMD processors.

Such code graphs are embedded in higher levels containing control
flow, first single-threaded control flow optimized for ILP, and then
parallel control-flow, optimized to hide communication latency. At
this level control flow is restricted to allow peak utilization of
multi-core hardware, but enable efficient compile-time verification of
soundness. Soundness, in this context, means that the parallelized
program can be transformed into a code graph without synchronizing
control flow, because every execution can be shown to produce the same
result. Think of it as reducing the parallel debugging effort to the
single-threaded debugging problem by eliminating the non-determinism
inherent in parallel code. I will give a formal language description
of the language, and the O(n) algorithm which proves soundness and
produces the equivalent single threaded'' code graph.

## Comment viewing options

### Previous mentions

While this exact work has indeed not been posted on LtU, some of the preceeding work has - see the post Declarative Assembler and my comments there.