Accelerator: simplified programming of graphics processing units for general-purpose uses via data-parallelism

Accelerator: simplified programming of graphics processing units for general-purpose uses via data-parallelism. David Tarditi, Sidd Puri, and Jose Oglesby.

GPUs are difficult to program for general-purpose uses. Programmers must learn graphics APIs and convert their applications to use graphics pipeline operations. We describe Accelerator, a system that simplifies the programming of GPUs for general-purpose uses. Accelerator provides a high-level data-parallel programming model as a library that is available from a conventional imperative programming language. The library translates the data-parallel operations on-the-fly to optimized GPU pixel shader code and API calls.

A library provides programmers with a new type of array, a data-parallel array. Data-parallel arrays differ from conventional arrays in two ways. First, the only operations available on them are aggregate operations over entire input arrays. The operations are a subset of those found in languages like APL. They include element-wise arithmetic and comparison operators, reductions to compute min, max, product, and sum, and transformations on entire arrays. Second, the data-parallel arrays are functional: each operation produces an entirely new data-parallel array.
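
To make the programming model concrete, here is a minimal sketch, in Haskell, of a purely functional whole-array interface in the spirit described above. The names (DArray, dmap, dzip, dsum) are invented for illustration and are not the Accelerator API, which is a .NET library; the real system builds an expression graph and compiles it on the fly to pixel-shader code rather than operating on lists.

    -- Illustrative only: aggregate operations over entire arrays, each
    -- producing a new array, as in the data-parallel model described above.
    newtype DArray a = DArray [a]   -- stand-in; a real implementation keeps the data on the GPU

    fromList :: [a] -> DArray a
    fromList = DArray

    toList :: DArray a -> [a]
    toList (DArray xs) = xs

    -- Element-wise operations over whole arrays.
    dmap :: (a -> b) -> DArray a -> DArray b
    dmap f (DArray xs) = DArray (map f xs)

    dzip :: (a -> b -> c) -> DArray a -> DArray b -> DArray c
    dzip f (DArray xs) (DArray ys) = DArray (zipWith f xs ys)

    -- A reduction over the whole array.
    dsum :: Num a => DArray a -> a
    dsum (DArray xs) = sum xs

    -- Example: a dot product, expressed only with aggregate operations.
    dot :: Num a => DArray a -> DArray a -> a
    dot xs ys = dsum (dzip (*) xs ys)

    main :: IO ()
    main = print (dot (fromList [1, 2, 3]) (fromList [4, 5, 6]))   -- 32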

GPGPU

GPGPU.org is to general-purpose processing on graphics hardware what LtU is to programming language theory. Worth a visit even for those who are only casually interested. The most interesting thing about GPUs is not that they can do some operations faster than mainstream CPUs; it is that their speed is increasing at a faster rate than that of CPUs!

Some other programming language links are:
Cg (a graphics language with C-like syntax)
BrookGPU (an older extension of C/C++ which also provides specialized arrays)

An interesting article linked by GPGPU

Funny that, I've just finished reading an interesting article that I found through GPGPU:
http://www.gpgpu.org/cgi-bin/blosxom.cgi/2005/08/22#goeddekeDoubleFEM

I always thought of computation on GPUs as limited due to the reduced FP precision, but in this paper they use an iterative algorithm to attain precision equivalent to a CPU's, and they still get a 2x improvement; not bad.
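
For readers wondering what an iterative algorithm that recovers CPU-level precision can look like, below is a toy sketch of mixed-precision iterative refinement. It is my own illustration, not code from the linked paper: the correction step runs in single precision (as a GPU might), while residuals are accumulated in double precision.

    -- Solve a*x = b. Each correction is computed in single precision (Float),
    -- but the residual b - a*x is computed in double precision, so the result
    -- converges toward full double-precision accuracy.
    refine :: Double -> Double -> Double -> Int -> Double
    refine _ _ x 0 = x
    refine a b x n =
      let r  = b - a * x                                          -- residual in double
          dx = realToFrac (realToFrac r / realToFrac a :: Float)  -- correction in single
      in refine a b (x + dx) (n - 1)

    main :: IO ()
    main = print (refine 3.0 1.0 0.0 5)   -- approaches 1/3 to double precision

For a full linear system the same pattern applies, with a low-precision solve in the inner step and a high-precision residual outside it.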

GPU trend

I don't follow GPGPU so closely any more, but there was talk (prediction?) that GPUs would soon have good support for floating point (single or double... I don't remember). Back then I was most interested in an integrated random number generator because of financial contract pricing... a random number generator on an ultra-fast GPU, combined with an easy-to-use language, will do wonders for Monte Carlo pricing models!

I also heard that a company is working on a 'Physics Processing Unit.' Since it is also marketed towards gamers, the price should be affordable. Perhaps APL-like functionality will make a comeback to allow an easy interface with these cards.
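
On the Monte Carlo point above: the pricing loop is exactly the kind of whole-array pipeline a data-parallel GPU library could accelerate. Below is a toy Haskell sketch of a European call price under Black-Scholes dynamics; it is my own construction (it assumes the random package, and the names and parameters are invented for illustration), shown only to make the shape of the computation visible: generate an array of random draws, map the payoff over it, and reduce with a sum.

    import System.Random (mkStdGen, randomRs)

    -- Box-Muller: turn pairs of uniforms into standard normal draws.
    normals :: Int -> [Double]
    normals seed = go (randomRs (1e-12, 1.0) (mkStdGen seed))
      where
        go (u1:u2:rest) = sqrt (-2 * log u1) * cos (2 * pi * u2) : go rest
        go _            = []

    -- Monte Carlo price of a European call: element-wise maps over a whole
    -- array of random draws, followed by a single sum reduction.
    callPrice :: Double -> Double -> Double -> Double -> Double -> Int -> Double
    callPrice s0 k r sigma t n =
      exp (-r * t) * sum (map payoff (take n (normals 42))) / fromIntegral n
      where
        payoff z = max 0 (s0 * exp ((r - 0.5 * sigma * sigma) * t + sigma * sqrt t * z) - k)

    main :: IO ()
    main = print (callPrice 100 100 0.05 0.2 1.0 100000)   -- roughly 10.45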

Depends what you call "good"

NVIDIA and the new ATI cards have 32-bit (so single-precision) FP, but I don't expect double-precision support any time soon on GPUs: graphics doesn't need that kind of precision, and scientific users represent only a small income source for hardware makers compared to the vast number of gamers.
Whether it is good enough compared to a CPU (x86 has up to 80-bit FP) depends on the application. Sometimes, as in the article I linked, an iterative solution can increase the precision while still showing an improvement over a pure CPU-based solution.

I don't know what the precision of the PPU is. I expect it to support double precision, as physics calculations may need it, but calculation in double precision could be much slower than in single precision: I think that this is the case for the Cell.

The question is: Do we need l

The question is: Do we need languages with specific constructs to make the most of this technology, or are libraries enough?

Leading the witness

I just skimmed the PDF, but I think it's very neat that they could approach the performance of the hand-written version, at least in some of the cases. I think this may find some use.

But you're obviously right. Libraries are just interim solutions until new (maybe not even so specific) language constructs arrive that allow the compiler to target CPU + GPU + Cell + what-have-you effectively. Usually the people interested in this are truly power-hungry, so they'll keep pushing until they somehow surpass the hand-optimized code in performance.

What about both?

In Haskell at least, big libraries are often expressed as combinator libraries, which make little embedded domain-specific languages. Slap in a monad to control the effects (easy... just need a couple of Olegs of mind power), and you've got the best of both worlds.
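
To make the combinator-library idea concrete, here is a toy deep embedding of whole-array expressions: ordinary Haskell functions act as the surface syntax, and the library gets an expression tree it could compile to shader code behind the scenes. All names here are invented for illustration; this is not any existing library's API.

    -- A tiny embedded DSL for array expressions. The user writes ordinary
    -- Haskell combinators; the library inspects the Exp tree it receives.
    data Exp
      = Input String                 -- a named input array
      | Lit   Double
      | Add   Exp Exp                -- element-wise arithmetic
      | Mul   Exp Exp
      | Map   (Exp -> Exp) Exp       -- element-wise function over an array
      | ZipW  (Exp -> Exp -> Exp) Exp Exp
      | Sum   Exp                    -- reduction

    -- Combinators compose into bigger programs.
    dot :: Exp -> Exp -> Exp
    dot xs ys = Sum (ZipW Mul xs ys)

    saxpy :: Exp -> Exp -> Exp -> Exp
    saxpy a xs ys = ZipW Add (Map (Mul a) xs) ys

The "slap in a monad" step would then sequence the effectful parts, such as uploading inputs and reading results back from the device, while the combinators above stay pure.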

I expected this answer. But

I expected this answer.

But let's talk specifics...

We need chips designed for languages...

...not vice-versa. Imagine a chip designed around LLVM as an example.

Every chip is in effect a hardware interpreter for its own assembly language. But hardware guys seldom think like programmers. Their view of programming comes from awful HDLs and half-baked MATLAB. So we get blazing-fast, tiered-cache super-chips that use GOTOs. Maybe they must, but you see the point.

Array programming is familiar and very, very useful to me. It is nice stuff and I certainly wish more chips had direct support. I do not view GPUs as the way to get it.

What I'm saying is that we shouldn't "make the most" of whatever the (commercial) hardware guys drop from the sky. We should invent our own chips.

I agree...

It's also easier than you think to do it yourself. FPGA dev kits are relatively cheap. Altera sells a student "University Program" kit for $150 (when I bought mine a few years ago) that comes with everything you need. I think now I would prefer Xilinx (and it might be a good place to start for someone new), but we used the same boards in my senior design class, so I bought what I felt comfortable with.

I mention this because it's a personal project of mine to design a microprocessor of the sort you mention. Of course, I still have a bunch of theory to learn, and no time, so "one day" is probably not soon...

Anyway, a while back there was a discussion here about an Erlang processor on an FPGA. Apparently, while running at only 20MHz, it performed comparably to a 500MHz UltraSPARC.

Oh yeah, I know HDLs are atrocious, but I spent half of my time in my senior year writing a DSL compiler that would take my state machines and RTN specifications and output synthesizable VHDL.

If I were to do it again, I might try Confluence. Although the temptation (and fun) of rolling my own will always be there :).

Already done.

- RISC CPUs were designed by looking at the output of C and Fortran compilers.

The RISC CPU won in the sense that every new CPU design is RISC, but it still failed to beat the installed base of the x86 on the desktop.

- The ARM CPU has an extension, Jazelle, which was made to accelerate the execution of Java programs.

So it has still been done in a few cases in the embedded space, where compatibility doesn't matter, but as the x86 has shown, in a market where compatibility rules, no different ISA has any chance.

Don't stop work at Kitty Hawk...

...just because everyone still rides in horse buggies.

Previously discussed.

Microsoft

That is because Microsoft never bothered to move to anything else. (They made some half-hearted efforts with NT4, but nobody ran the PPC, MIPS, or Alpha ports of it.)

If Microsoft had done as Apple did, they could have moved from x86 to PPC if they had wanted to. Or they could have refused Windows certification unless your program was sold for all supported platforms. (You couldn't certify a Windows 95 program unless it also worked on NT.)

Microsoft didn't care. It was and is easier for them to support one CPU; they don't have to deal with all the big-endian quirks of other CPUs.

Don't get me wrong, I'm not saying Microsoft should have done this (though from a hardware point of view they should have), only that they did not.

Big Endian quirks?

Dem's fightin' words! Little Endian is the one with all the quirks. :-)

Personally, I think the x86 architecture single-handedly managed to destroy assembly as a viable programming language. Then again, that was probably a good thing. :-)

> Dem's fightin' words! Littl

Dem's fightin' words! Little Endian is the one with all the quirks. :-)

Agreed: trying to read a memory dump with integers stored in little-endian order is about as much fun as poking yourself in the eye. It hurts.

I really don't understand why Intel used little endian. Sure, some tricks work better in little endian, but other tricks work better in big endian, so from that point of view there is no winner. From a readability point of view, though, big endian is the clear winner, so why did they choose little endian??

Then again, I probably shouldn't be surprised; Intel is absolutely incapable of doing anything elegant: x86 vs. 68k, MMX/SSE vs. AltiVec, Itanium vs. the Alpha RISC ISA...
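
To make the dump-reading complaint concrete, here is a small Haskell sketch showing the byte layout of the 32-bit value 0x01020304 under each convention.

    import Data.Bits (shiftR)
    import Data.Word (Word32, Word8)
    import Text.Printf (printf)

    -- Byte layout of a 32-bit word in each byte order.
    bytesBE, bytesLE :: Word32 -> [Word8]
    bytesBE w = [fromIntegral (w `shiftR` s) | s <- [24, 16, 8, 0]]  -- most significant byte first
    bytesLE   = reverse . bytesBE                                    -- least significant byte first

    main :: IO ()
    main = do
      let w = 0x01020304 :: Word32
      putStrLn ("big-endian:    " ++ concatMap (printf "%02x ") (bytesBE w))
      putStrLn ("little-endian: " ++ concatMap (printf "%02x ") (bytesLE w))
      -- big-endian:    01 02 03 04
      -- little-endian: 04 03 02 01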

When it comes to chips, operating systems & PLs...

...adding on a new feature to an existing target will always seem to win out over starting from scratch. Of course, as time goes by, the original features that were preserved will become useless (or annoying). It's easier to sell a kludge than it is to rewrite.

The x86 has a history that goes back to the 8080 chip (probably even earlier than that). In the case of the MC680x0 chips, Motorola made a decision not to make them consistent with the 6809, making for a much cleaner instruction set. By definition, the RISC chips also chose not to support the convoluted instruction sets, preferring a reduced, more agile set of instructions. With the x86, you've got 8-bit (and even 4-bit) style instruction sets that have been extended to go up to 32 bits (and beyond).

To keep on topic, PL's are not too dissimilar. If you program in a language like C++, Java, C#, you are using languages that have features that go back all the way to the late '60s. Making a clean break has been tried several times, but familiarity is a powerful force.

No, no, no

"To keep on topic, PL's are not too dissimilar. If you program in a language like C++, Java, C#, you are using languages that have features that go back all the way to the late '60s."

If you program in any language, you are using languages with features that go back to the '50s, '60s, and '70s. The difference is that when you program in languages like C, you get the kludges that go all the way back to the late '60s.

On the other hand they appear

On the other hand, they appear to have little trouble with the Xbox 360 (PPC cores).

instruction set theory?

...there is also Don Knuth's MMIX.

My knowledge of assembly language and machine architecture comes from a single undergrad course. It is interesting that a tiny set of instructions (add, sub, shift, load, store...) is behind ALL computer applications, from music and video players to video games and... well, this is obvious for this audience.

I've often wondered if programming language theory has been applied to machine instruction sets. Obviously the goal of mainstream machine instructions must remain absolute and raw speed. However, within that constraint, could there be an instruction set based on the lambda calculus or the pi calculus (I'm not sure if category theory fits)... does that allow an absolute smallest set of instructions that can be combined to be as expressive (and as efficient) as modern assembly languages?
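
One way to make the "smallest instruction set" question concrete is the SKI combinator calculus: three primitive combinators plus application are enough to be Turing-complete. The toy evaluator below is purely illustrative; it says nothing about whether such a machine could be made fast.

    -- S, K, I as the only "instructions"; App is application.
    data Term = S | K | I | App Term Term
      deriving Show

    -- Reduce repeatedly until no rule applies (normal order, outermost first).
    eval :: Term -> Term
    eval t = maybe t eval (step t)
      where
        step (App I x)                 = Just x
        step (App (App K x) _)         = Just x
        step (App (App (App S f) g) x) = Just (App (App f x) (App g x))
        step (App f x) = case step f of
                           Just f' -> Just (App f' x)
                           Nothing -> App f <$> step x
        step _ = Nothing

    main :: IO ()
    main = print (eval (App (App (App S K) K) I))   -- S K K behaves like I: prints I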

There are

There are virtual machines based on these things, though they are usually still much higher level than the typical instruction set. For example, (O)Caml is named after the CAM, the Categorical Abstract Machine (it no longer uses an implementation based on it, though). It would be quite interesting to take a simple π-calculus and derive the corresponding abstract machine (à la some work Danvy has recently participated in). There are various other approaches and virtual machines as well. Often these are used to improve efficiency, but usually at a (much) higher level than machine language. A different way of applying theory to machine instructions is typed assembly languages and proof-carrying code.
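
For a flavour of what deriving an abstract machine from a calculus looks like in miniature, here is a minimal Krivine machine for call-by-name lambda calculus with de Bruijn indices. It is a standard textbook construction written out only as an illustration, not anything specific to Danvy's work.

    -- Terms use de Bruijn indices; Var 0 refers to the nearest enclosing binder.
    data Tm = Var Int | Lam Tm | Ap Tm Tm
      deriving Show

    -- A closure pairs a term with the environment it was built in.
    data Clo = Clo Tm [Clo]
      deriving Show

    -- Machine state: the closure in focus and a stack of argument closures.
    run :: Clo -> [Clo] -> Clo
    run (Clo (Var n)  env) stack       = run (env !! n) stack                 -- variable: look it up
    run (Clo (Lam b)  env) (arg:stack) = run (Clo b (arg : env)) stack        -- beta: bind the argument
    run (Clo (Ap f a) env) stack       = run (Clo f env) (Clo a env : stack)  -- push the argument closure
    run clo                []          = clo                                  -- a lambda with no arguments left: done

    -- (\x. x) (\y. y)  evaluates to  \y. y
    example :: Clo
    example = run (Clo (Ap (Lam (Var 0)) (Lam (Var 0))) []) []

    main :: IO ()
    main = print example   -- Clo (Lam (Var 0)) []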

Intel's iAPX-432

Sometimes a big project that doesn't quite make it can damn an idea for decades, even as the underlying technologies evolve and economics change. I think that's what Intel did (25 years ago) to the idea of chips designed for languages, with their iAPX-432 chip. In the Intel C++ compiler discussion that you (Mark) linked to, Patrick Logan mentions the iAPX-432 and points to Eric Smith's rather comprehensive page about it.