Twilight of the GPU

This interview with Tim Sweeney discusses his prediction that graphics rendering will move from special-purpose GPUs back to the CPU:

I expect that in the next generation we'll write 100 percent of our rendering code in a real programming language—not DirectX, not OpenGL, but a language like C++ or CUDA. A real programming language unconstrained by weird API restrictions. Whether that runs on NVIDIA hardware, Intel hardware or ATI hardware is really an independent question. You could potentially run it on any hardware that's capable of running general-purpose code efficiently.

This is driven by the development of cheap multi-core CPUs. Consumers might still buy graphics boards, but they'll be fully programmable multi-core devices:

Intel's forthcoming Larrabee product will be sold as a discrete GPU, but it is essentially a many-core processor, and there's little doubt that forthcoming Larrabee competitors from NVIDIA and ATI will be similarly programmable, even if their individual cores are simpler and more specialized.

How are we going to program these devices? NVIDIA showed the data-parallel paradigm was practical with CUDA (LtU discussion). Now, Tim asks:

...can we take CUDA's restricted feature set—it doesn't support recursion or function pointers—and can we make that into a full C++ compatible language?... From my point of view, the ideal software layer is just to have a vectorizing C++ compiler for every architecture.

The FP lover in me says that data-parallel + C++ = the horror! However, I can appreciate that it is a practical solution to a pressing problem. What do you think? Is Tim right? Is a data-parallel extension to C++ the right way to go? Or is the next version of Unreal going to be written in FRP?



I guess Ct is Intel's answer to my question. I think there was a presentation about this at last year's CUFP. What I recall is a boat-load of FP machinery wrapped up in a dynamic compiler and presented as a C extension. I was quite impressed; unfortunately, Intel has not made Ct available to the general public.


See also OpenCL, supported by Apple and AMD.


I suspect it'll suffer from the same problem facing today's multicore: too difficult to program, and too few programmers.

Admittedly, parallelism in a GPU setting is mostly vector and SIMD work, but using threads and building on top of multiple GPU/CPU cores is another part of the big picture.

I don't think any extension to C or C++ will suddenly make all the multithreading headaches go away. Data parallelism is nice, but we still need a DSL for it, not C++.

DSL for Data Parallel

But we already have a DSL for data parallel programming. It's called Fortran :-)

C++ already does array programming

Array programming has already been done in C++ for decades thanks to the expression-templates technique, which allows one to build any DSEL within C++.

I don't know much about Ct, but from what I saw it just seems to be a library with nothing new in it, except that it may be better optimized, since it's Intel after all.

Compiler technology

Efficient support for nested data parallelism requires extensive program transformations and other advanced compilation technology. For best results you even want to make it target-specific, which is why Intel is using a form of JIT compilation. Mere template trickery cannot compete either way.


Although trickery might be the right term, I don't know if mere is the right adjective for things like Blitz++.

Are GCs compatible with game engines?

IMHO, one reason to use C/C++ for games is that it's not too hard to have well-behaved (for games) memory allocators.

A GC for a game engine should have:
- quasi-real-time performance: it shouldn't induce lag in the game (or only a bounded amount);
- scalability with the number of CPUs: games are the kind of application that will eventually need to use parallelism;
and the language should be able to interoperate with C/C++ libraries as easily as possible.

Is there any language that currently has this kind of GC (real-time + scales across multiple cores)?

SuperCollider has a realtime GC but I don't know if it has the other properties.

Efficient on-the-fly Cycle

Real-Time GC

As you point out, a real-time GC cannot block indefinitely. This can be achieved with an incremental collector that can resume its work after it has been interrupted. Even so, the system needs enough idle time to actually complete a GC cycle and reclaim memory, or it will eventually run out of memory. There are at least two problems lurking here: 1) can we schedule the system; and 2) will it run within a certain amount of memory?

The GC needs to be predictable for a schedulability analysis; one work along these lines is A correct and useful incremental copying garbage collector (non-free link).

Extending this to multiprocessor systems is a hard problem in general. Just scheduling alone on multiprocessor systems is difficult; Björn Andersson's thesis is a start.


The advantage of the GPU lies in (a) its available memory bandwidth, and (b) the fact that the computations performed are designed to effectively exploit locality all the way out to the RAM buffers. General purpose CPUs can't meet this advantage without fundamentally compromising their utility as general purpose devices. A factor of 30 in relative memory performance is more than enough to ensure the use of GPUs for the foreseeable future.

The title is misleading and you have to RTFA

Tim doesn't seem to be predicting the imminent demise of the GPU so much as the rise of the GPU as a general-purpose processor, with general-purpose language compilers targeting it, and the demise of APIs such as DirectX and OpenGL.

Fair enough...

... but the challenge in looking at a GPU as a general purpose compile target is going to be that the state on the GPU will have to be treated as either context switchable or safely shared somehow, and that goes well beyond a standard compile problem. The OS integration issues here are quite challenging.

GPUs being used for general purpose (or at least small vector) computing isn't news. Folks at NIH have been using them that way for years.

Think that's done already

I think GPUs as they work currently are already context switchable. I thought the whole point of Vista's new video driver architecture was to deal with multiple apps using the card (e.g. the desktop compositor and a game)


While Dave Lopez is right to say that this new way of building 3D engines applies both to new GPUs and to Larrabee-like CPUs, I also agreed with your point initially: I didn't see how Larrabee can hope to compete against high-end GPUs, due to the memory bandwidth issue.

Then there was a paper about tile-based rendering, a different way of rendering a 3D scene (which is/was used by PowerVR) that uses less bandwidth, but IMHO this will only allow Intel to compete with low-end GPUs (which may be fine for them financially).

In the long run, stacking some DRAM on the CPU would allow them to compete with high-end GPUs, but there would be cooling issues.

"In the long run stacking

"In the long run stacking some DRAM on the CPU would allow them to compete with high end GPUs, but there would be cooling issue."

Cooling is not the main constraint on stacked dies. DRAMs are fairly low in power dissipation, either per cell or per square millimeter, as compared to CPUs.

The main costs are in hugely increased complexity of packaging.

There have long (at least 35 years, that I have direct experience of) been arguments for stacking dies, but it's almost always much cheaper to just expand in areal dimensions. Which means more cache and SRAM on die. (DRAM technologies are generally different, so there's always the idea of having CPU/GPU/SRAM on one die, DRAM or Flash on another, mounted above or adjacent to, but, like I said, it has almost always been cheaper to not do this.)

(A friend of mine filed for a patent on this in 1977, as a result of our soft error work on memories, but it was not pursued by our company, Intel. There were a _LOT_ fewer chip patents in those days, as nearly everything was extensively cross-licensed.)

--Tim May

Some background reading

A good overview of CUDA and NVIDIA GT200 - a modern GPU - can be found here.

The challenge to the reader is now to propose a decent language that can make efficient use of such an architecture.

Applicative Order Y-combinators

I am sure recursion can be done with an applicative-order Y combinator, even when a language does not support recursion.

It is theoretically possible to port all functional programming languages to CUDA.

You are assuming that CUDA

You are assuming that CUDA supports function pointers, which I don't think is true. From Wikipedia's CUDA page under limitations:

It uses a recursion-free, function-pointer-free subset of the C language, plus some simple extensions. However, a single process must run spread across multiple disjoint memory spaces, unlike other C language runtime environments.