Lambda the Ultimate

BrookGPU
started 12/21/2003; 7:38:06 PM - last post 12/23/2003; 6:51:12 AM

12/21/2003; 7:38:06 PM (reads: 7571, responses: 10)

BrookGPU

via slashdot

As the programmability and performance of modern GPUs continues to increase, many researchers are looking to graphics hardware to solve problems previously performed on general purpose CPUs. In many cases, performing general purpose computation on graphics hardware can provide a significant advantage over implementations on traditional CPUs. However, if GPUs are to become a powerful processing resource, it is important to establish the correct abstraction of the hardware; this will encourage efficient application design as well as an optimizable interface for hardware designers.

They have implemented a data-parallel language based on C that compiles code to execute on the powerful graphics cards used in standard PCs. WOW!

Posted to implementation by Luke Gorrie on 12/21/03; 7:41:20 PM

Luke Gorrie - Re: BrookGPU

12/21/2003; 9:13:29 PM (reads: 751, responses: 0)

lspci says my nVidia card is an "NV17", which I fear means I'm out of luck with my laptop :-(

Bart van der Werf (Bluelive) - Re: BrookGPU

12/22/2003; 1:23:01 AM (reads: 711, responses: 0)

Maybe it's a taste of the future. How long before a massively parallel unit makes it into the CPU, in the same way they now all have a float coprocessor?

Carlos Scheidegger - Re: BrookGPU

12/22/2003; 8:21:22 AM (reads: 672, responses: 2)

(Edit: Ouch, sorry for the very long post. I got a little carried away :)

> lspci says my nVidia card is an "NV17", which I fear means I'm out of luck with my laptop :-(

Pretty much :(. On the NVIDIA side, only the GeForce FX series (NV30 or higher) has real floating-point capabilities. The older cards only work with fixed-point math and offer little to no programmability. If you want to hack on GPU programming a bit, a GeForce FX 5200, while not providing nearly the same horsepower as the higher-end cards, costs only US$60 and offers real FP math and useful programmability.

> Maybe it's a taste of the future. How long before a massively parallel unit makes it into the CPU, in the same way they now all have a float coprocessor?

Well, this is a topic I can talk about a little, since it's what I'm writing my diploma thesis about. The answer is "probably not long", and I'm basing my guess on a purely economic observation. The major computer graphics conferences have seen, this year, a flurry of "X on the GPU" papers, and most of them have reported performance gains of an order of magnitude or more. Examples include conjugate gradients, FFT, and multigrid solvers for PDEs. All of these are directly applicable in industry.

The big thing to notice is the difference between the R&D budgets of Intel and NVIDIA: Intel spent US$4 billion in 2002; NVIDIA, US$150 million. NVIDIA and ATI (the two big players in the GPU market) are expecting a two-fold performance increase by Q3 2004, and I wouldn't bet on such a performance increase in CPUs. This pretty much means to me that a CPU company that provides such streaming alternatives (real ones, not wimpy 3DNow!, SSE or MMX) will have a huge competitive advantage.

Obviously the reason NVIDIA processors are more efficient is that they're much simpler. The streaming programming model is VERY limited. For example, the programmable part in which most of the work is done, the fragment processor, has NO branching, only conditional writes (This rules out recursion :( ). A fragment, in its computer graphics definition, is a pixel that has not yet been written to the screen. Nowadays, though, it can be interpreted as a record of floats that will be the output of the execution of the fragment program for a single "activation record". One fragment program is typically executed in parallel for tens of thousands of fragments, and there is no communication at all between the executions. The only thing that changes is what data is passed to each unit. This simplifies the hardware implementation, and allows massive parallelism. It is speculated (these details are not disclosed by the chip makers) that some of the boards have upwards of 32 ALUs executing in parallel in the fragment processors.
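
(A rough CPU-side sketch of the model described above, not real GPU code. The names fragment_program and run_over_fragments are made up for illustration, and an ordinary Python loop stands in for the parallel hardware. It shows one program applied to every fragment independently, with a 0/1 mask selecting the value instead of a branch.)

```python
def fragment_program(x, threshold):
    """Hypothetical branch-free 'fragment program': clamp x to threshold.

    Both candidate results are available unconditionally, and a 0/1 mask
    picks which one is written, mimicking a conditional write.
    """
    mask = float(x > threshold)  # 1.0 or 0.0, no if/else involved
    return mask * threshold + (1.0 - mask) * x


def run_over_fragments(program, stream, *uniforms):
    """Apply the same program to every fragment independently.

    A plain loop stands in for the hardware's parallel units; no
    invocation can see another invocation's input or output.
    """
    return [program(fragment, *uniforms) for fragment in stream]


clamped = run_over_fragments(fragment_program, [0.25, 0.5, 2.0, 4.0], 1.0)
print(clamped)  # [0.25, 0.5, 1.0, 1.0]
```

Only the per-fragment data differs between invocations; the threshold plays the role of a constant parameter shared by all of them.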

The big caveat is that programming GPUs is a MAJOR pain. Since we are outside of the CPU, we cannot set breakpoints, do interactive debugging, or even print execution traces. Not only that, but we are programming against an API that was developed to create visual effects, not data-parallel computations. For example, this means that we don't have arrays; we have textures that can be interpreted as arrays, provided you use them in the right way. It also means that you can't simply read values from the GPU back to the CPU without ruining performance. There is no memory management: all data has to live inside textures, previously allocated and filled by the CPU. There is surprisingly no integer math, so many applications are ruled out right away. There are ways to do bit-fiddling, involving lookup tables and such, but they're not as efficient as raw FP instructions. Real branching, and thus looping and recursion, has to be done on the CPU. All of this makes programming GPUs for general-purpose computing a black art of sorts. Programs have to be carefully orchestrated to work at all, and finely tuned later to exhibit the potential performance gains.
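
(To make the "textures as arrays" point concrete, here is a small sketch under my own assumptions: a nested Python list stands in for a 2D texture, and pack_as_texture, texel_coords and fetch are hypothetical helper names. On the hardware discussed here the index arithmetic would have to be done in floating point; Python integers are used only for readability.)

```python
import math


def pack_as_texture(values, width, pad=0.0):
    """Lay a 1D array out row by row in a 2D 'texture' of the given width."""
    height = math.ceil(len(values) / width)
    padded = list(values) + [pad] * (width * height - len(values))
    return [padded[row * width:(row + 1) * width] for row in range(height)]


def texel_coords(index, width):
    """Translate a 1D array index into (column, row) texture coordinates."""
    return index % width, index // width


def fetch(texture, index):
    """Read element `index` of the logical array from the 2D texture."""
    col, row = texel_coords(index, len(texture[0]))
    return texture[row][col]


data = [10.0, 11.0, 12.0, 13.0, 14.0]
texture = pack_as_texture(data, width=2)
print(texture)            # [[10.0, 11.0], [12.0, 13.0], [14.0, 0.0]]
print(fetch(texture, 3))  # 13.0
```

The padding in the last row is the kind of detail that has to be handled by hand when the only container available is a rectangular texture.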

This means that GPU programming is really not a panacea, as has been stated in some places. The performance gains are amazing, but they're only needed in a few specialized areas (computational fluid dynamics is the canonical example - my proof-of-concept implementation is 8 to 10 times faster than a textbook CPU implementation). I've read in some places people suggesting implementing OSes on GPUs, and other silly things like that. This is pretty much impossible in the streaming computation model. Note, though, that some interesting uses have already been found for the streaming architecture. There are published papers on cracking password schemes with streaming computing (http://www.cs.unc.edu/Events/News/PxFlCodeCracking.html).

Which brings us to Brook. If Brook delivers what it promises, it may well be the killer app for GPUs. There is a dire need for decent language support on the GPU side, and Brook, while not Haskell, is much better than the assembly-level hacking that is needed right now to turn a GPU into a processor. I'd say Brook would put GPU programming close to where we were when Kernighan and Ritchie came along --- much better than Cobol, but definitely no Haskell or Scheme :)

HTH, Carlos

PS: Streams here are not the SICP streams. They are huge heaps of uniform data that are passed between processors. Each processor does not create or destroy individual pieces, only performs transformations on the pieces. Think real signal processing here (no stream-filter, etc).

andrew cooke - Re: BrookGPU

12/22/2003; 8:46:46 AM (reads: 635, responses: 0)

What modifications would be possible to the API that these things present, so that they would be more usable for non-graphics tasks, without lowering their graphics performance? Presumably we'll see those changes occur in parallel with the adoption of this kind of computing model, as NVIDIA and ATI realise that they have an additional market.

Bart van der Werf (Bluelive) - Re: BrookGPU

12/22/2003; 10:14:11 AM (reads: 596, responses: 0)

I just thought of a reason why Intel wouldn't add a massively parallel coprocessor: the memory bandwidth of such a system would require a specialized, large piece of memory just for the coprocessor, with a very wide bus. Putting a large amount of memory on chip would be too large/expensive, so it would need to be off chip, which would increase the pin count a lot and consume a large amount of motherboard space close to the processor. A video card seems like a much more ideal place: it has a big bus, its own memory, and is close to both the CPU and a likely output for the processed data. If they could add more advanced conditionals and integer math it would be ideal :) Now if only they would make it standard, then I could actually write an application that uses it and be able to share it with somebody else without spending a lot on a specific brand of video card.

Ehud Lamm - Re: BrookGPU

12/22/2003; 10:54:07 AM (reads: 602, responses: 0)

Interesting analysis. Thanks.

Carlos Scheidegger - Re: BrookGPU

12/22/2003; 11:09:20 AM (reads: 583, responses: 0)

> Now if only they would make it standard, then I could actually write an application that uses it and be able to share it with somebody else without spending a lot on a specific brand of video card.

Bart, OpenGL 2.0 has a proposal for GLslang, the OpenGL Shading Language, which would be the OpenGL equivalent of Cg, and presumably would work with different vendors. I say presumably because, right now, only NVIDIA has full IEEE floating-point numbers. ATI only has a reduced-mantissa implementation. But you should keep in mind that Cg is still highly inappropriate for general computation. For example, instead of writing

for i in [0..imax] in parallel do
   for j in [0..jmax] in parallel do
      some_code(3.141592f, i, j);

You would have to write (using the C API for Cg):

/* some_code_program and pi_param are hypothetical handles for the compiled,
   loaded "some_code" program and its float parameter. */
cgGLEnableProfile(CG_PROFILE_FP30); /* enables fragment programs */
cgGLBindProgram(some_code_program);
cgGLSetParameter1f(pi_param, 3.141592f);
glBegin(GL_QUADS);
glVertex2i(0,0);
glVertex2i(imax, 0);
glVertex2i(imax, jmax);
glVertex2i(0, jmax);
glEnd();

The code for "some_code" would be a previously compiled and loaded Cg program. You could, of course, create abstractions for these operations, but the bottom line is that we're trying to code in a different language with different semantics, and we should have a language, or at least a DSL, for that. That's why Brook's so cool.
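
(For what "abstractions for these operations" might look like, here is a toy CPU simulation, not Cg or Brook code. run_kernel is a hypothetical helper that stands in for binding a fragment program and drawing one quad over an imax-by-jmax region; the body of some_code is invented purely so there is something to run.)

```python
def run_kernel(kernel, imax, jmax, *uniforms):
    """Invoke `kernel` once per (i, j) cell of an imax-by-jmax grid.

    Stands in for binding a fragment program and drawing a single quad
    covering the grid; the returned grid plays the role of the render
    target that the fragments are written to.
    """
    return [[kernel(*uniforms, i, j) for j in range(jmax)]
            for i in range(imax)]


def some_code(scale, i, j):
    """Hypothetical kernel body: any pure function of its inputs would do."""
    return scale * (i + j)


result = run_kernel(some_code, 2, 3, 0.5)
print(result)  # [[0.0, 0.5, 1.0], [0.5, 1.0, 1.5]]
```

A wrapper like this hides the quad drawing, but the kernel is still subject to the same restrictions as any fragment program, which is why a proper language for it is attractive.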

> If they could add more advanced conditionals and integer math it would be ideal :)

I wouldn't hold my breath on this one. The simplicity of the streaming architecture is what makes it efficient. I'd bet on an intermediate transformation that would push the conditionals off the GPU instead. (I don't think Brook supports this right now, but some of their early design docs even mentioned recursion.)
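
(One way such a transformation could work, purely as a guess and not something taken from Brook: the host evaluates the condition, splits the stream in two, runs a straight-line kernel over each half, and scatters the results back. Everything below is a CPU simulation with hypothetical names, map_kernel and split_conditional.)

```python
def map_kernel(kernel, stream):
    """Apply one branch-free kernel to every element (GPU stand-in)."""
    return [kernel(x) for x in stream]


def split_conditional(predicate, then_kernel, else_kernel, stream):
    """Compute then_kernel(x) where predicate(x) holds, else else_kernel(x),
    keeping the branch itself on the host side."""
    # Host side: decide which elements take which path.
    then_idx = [i for i, x in enumerate(stream) if predicate(x)]
    else_idx = [i for i, x in enumerate(stream) if not predicate(x)]

    # "Device" side: two passes, each running a straight-line kernel.
    then_out = map_kernel(then_kernel, [stream[i] for i in then_idx])
    else_out = map_kernel(else_kernel, [stream[i] for i in else_idx])

    # Host side: scatter the results back to their original positions.
    result = [None] * len(stream)
    for i, y in zip(then_idx, then_out):
        result[i] = y
    for i, y in zip(else_idx, else_out):
        result[i] = y
    return result


out = split_conditional(lambda x: x < 0.0,
                        lambda x: -x,       # negative inputs: negate
                        lambda x: x * 2.0,  # all others: double
                        [-1.0, 2.0, -3.0, 0.5])
print(out)  # [1.0, 4.0, 3.0, 1.0]
```

The catch is that the predicate has to be available on the host, so if the condition depends on values computed on the GPU, the readback cost mentioned earlier in the thread comes back into play.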

Patrick Logan - Re: BrookGPU

12/22/2003; 3:00:23 PM (reads: 535, responses: 0)

As somewhat of an aside, Intel did some work on massively parallel computing in the 1980s, resulting in a group of employees leaving and founding nCUBE. Another branch of thought turned into the iWarp. On the commercial side for Intel, the ideas became the i960 family.

Of course these were not co-processors. But they're all used for imaging and video among other purposes. Speculation might indicate these ideas may influence future uses of all those transistors that Moore's Law predicts will be available... the whole systems on a chip idea.

(I have no insight into what is actually being considered by Intel for the future.)

Kragen Sitaker - Re: BrookGPU

12/22/2003; 7:10:14 PM (reads: 511, responses: 0)

I was curious to see what license this is distributed under, so I looked. I'm very disappointed.

This looks like an open-and-shut GPL violation case; they're distributing a work derived from cTool, a GPL-licensed program, but they're trying to impose a trademark-like restriction on top of the GPL, and additionally, they're including in the combined work some files with restrictive licenses from nVidia and SGI.

In short, the authors of this work do not have the legal right to distribute it, and neither does anybody else.

-kragen@pobox.com

Luke Gorrie - Re: BrookGPU

12/23/2003; 6:51:12 AM (reads: 457, responses: 0)

It's wonderful to think of having special-purpose hardware available when writing regular programs. This is what made assembler programming on the Amiga so much fun: things like the 3-instruction coprocessor that executed one instruction for each four pixels "beamed" onto the screen. Back then you had 12-year-old kids writing programs that write programs for the coprocessor, all very fun, and with hindsight quite sophisticated (he says without looking at the probably-awful code).

I still remember my horror when I got a PC and looked at how people did graphics hacks on it.. "my god, they just use the CPU! it's all so horribly straight-forward!"

Here's hoping we get a lot of quirky hardware to find hacks for again :-)