New Chip Heralds a Parallel Future

Missing no chance to stand on my soapbox about the need for easy PL retargeting, I bring you insights from Paul Murphy about our parallel-processing, Linux future.

[T]he product has seen a billion dollars in development work. Two fabs...have been custom-built to make the new processor in large volumes....To the extent that performance information has become available, it is characterized by numbers so high that most people simply dismissed the reports....

The machine is widely referred to as a cell processor, but the cells involved are software, not hardware. Thus a cell is a kind of TCP packet on steroids, containing both data and instructions and linked back to the task of which it forms part via unique identifiers that facilitate results assembly just as the TCP sequence number does.

The basic processor itself appears to be a PowerPC derivative with high-speed built-in local communications, high-speed access to local memory, and up to eight attached processing units broadly akin to the Altivec short array processor used by Apple. The actual product consists of one to eight of these on a chip -- a true grid-on-a-chip approach in which a four-way assembly can, when fully populated, consist of four core CPUs, 32 attached processing units and 512 MB of local memory.
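
Purely as an illustration of the software-cell idea (the shape and every field name below are my own guesses, not anything from IBM or Sony), such a cell amounts to a small self-describing unit of work:

// Hypothetical sketch of a "software cell": a self-contained packet carrying
// both the code to run and the data to run it on, plus identifiers so results
// can be reassembled in order, much as TCP sequence numbers let a byte stream
// be reassembled. None of these names come from the actual design.
public final class SoftwareCell {
    final long taskId;      // which larger task this cell belongs to
    final long sequenceNo;  // where this cell's result fits within that task
    final byte[] program;   // the instructions to execute (opaque here)
    final byte[] data;      // the operands those instructions work on

    public SoftwareCell(long taskId, long sequenceNo, byte[] program, byte[] data) {
        this.taskId = taskId;
        this.sequenceNo = sequenceNo;
        this.program = program;
        this.data = data;
    }
}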

Paul follows up with a shocker.

I'd like to make two outrageous predictions on this: first that it will happen early next year, and secondly that the Linux developer community will, virtually en masse, abandon the x86 in favor of the new machine.

Abandonment is relative. The new processor will have no problem emulating x86, as Paul notes. In the PowerPC line we already have Linux for PowerPC today, complete with a Mac OS X sandbox. From a PL standpoint, however, this development may cattle-prod language folks off their x86 back ends and into some serious compiler refactoring work. I hope so!


I disbelieve

Oh, I easily believe that they're going to make the chip, and I could believe a factor of 10 theoretical performance gain. I just disbelieve it'll kill the x86.

The problem is that to take advantage of the increased performance, you will have to write super-parallel programs. We're not just talking about two or a dozen parallel threads, we're talking about hundreds or thousands.

I know one thing for sure: getting "classic" threads-and-mutexes based multithreaded programs to scale that far is nigh onto impossible. Not with any sort of reliability, that's for sure. Which means changing the paradigm programmers use, in a deep and fundamental way.
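
To make the scaling problem concrete, here is a minimal Java sketch of my own (not from any real codebase): a thousand threads all funneling through one lock, so adding threads mostly adds contention rather than throughput.

import java.util.ArrayList;
import java.util.List;

// Classic threads-and-mutexes style: every worker funnels through one shared
// lock, so past a handful of threads the lock, not the CPUs, sets the pace.
public class ContendedCounter {
    private long total = 0;
    private final Object lock = new Object();

    void work() {
        synchronized (lock) {   // all parallelism stops here
            total++;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ContendedCounter c = new ContendedCounter();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {            // a thousand threads...
            Thread t = new Thread(() -> {
                for (int j = 0; j < 10_000; j++) c.work();
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) t.join();
        System.out.println(c.total);  // correct, but barely faster than serial
    }
}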

In my lifetime (1968-), this has happened maybe once, with the introduction of Object Oriented programming languages -- a transition that started when I was in high school and is still ongoing (trust me: a lot of people programming in C++ are really programming in C, I've seen the code).

Fads come and go. XML seems to be the current fad. Even in my professional life I've seen them go by (anyone remember multimedia programming?). But the deep stuff changes slowly, if at all. We programmers are all too busy reinventing the wheel to consider axles, let alone shock absorbers. If you want the broad majority of programmers to pick something up, it can't be something *challenging*. It has to be more or less exactly what they already know, maybe with some surface changes. C++ vs. Java stuff.

For some applications it'll be nice -- inherently parallel problem sets. Gaming is an obvious place: give me 10 times the FP performance in the CPU, and graphics acceleration becomes less useful. Web serving and database applications are other areas that will benefit.

But don't expect it on the desktop in the next decade.

Re: I disbelieve

Sony and IBM plan to sell these chips in huge volumes. Two new fabs are a huge investment in a chip market as depressed as the one we have seen these last few years. They don't even need the PC desktop to succeed. Paul merely opines that cell PCs don't need video cards, audio cards, etc. So they are both cheaper and more powerful. That will be a hard combination for the market to resist. People are already putting Linux on just about everything in sight, even Apple iPods. It will be a natural fit.

You may recall when the IBM PC emerged. There were better machines around. What the IBM PC had going for it was not Charlie Chaplin, nor technical merit, but cost.

Languages should not focus exclusively on the desktop. Language research struggles for wider and better abstractions, but then cripples itself with narrow desktop strategies. More embedded processors are manufactured each year than there are people on the planet. These processors include DSPs and FPGAs. Even PIC microcontrollers are getting downright respectable. It is time language researchers started thinking about parallel computing, real-time, microcontrol, etc. As I keep saying, governments around the world are dropping money for that sort of thing.

I know one thing for sure: getting "classic" threads-and-mutexes based multithreaded programs to scale that far is nigh onto impossible.

I disagree: since the parallel support is on the processor, it makes software simpler than threads and mutexes. But why does old code have to scale at all? It will just run 100 times faster as-is, under emulation if need be.

The problem is that to take advantage of the increased performance, you will have to write super-parallel programs. We're not just talking about two or a dozen parallel threads, we're talking about hundreds or thousands.

You are right that only naturally parallel languages will make the cut. These languages exist today, and yes, they handle thousands of lightweight threads: Oz, Erlang, Felix, Alice ML, etc. In my mind that outcome will be a very, very good thing. Goodbye, dumbed-down, single-track languages; hello to parallel power.
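
Those languages have their own, far lighter thread constructs; the closest rough approximation I can sketch in plain Java (illustrative only, and nowhere near as cheap) is thousands of small independent tasks handed to a scheduler, with no locks in the application code:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Rough approximation of the "many lightweight tasks" style: thousands of
// small, independent units of work submitted to a pool; no shared mutable
// state, hence no locks in the application code.
public class ManyTasks {
    static long compute(int n) {
        return (long) n * n;   // stand-in for a task's private work
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        for (int i = 0; i < 10_000; i++) {
            final int n = i;
            pool.submit(() -> {
                long result = compute(n);
                // in real code the result would travel back via a Future,
                // a queue, or a message, never through shared variables
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}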

Parallel Programming is Hard

I disagree: since the parallel support is on the processor, it makes software simpler than threads and mutexes. But why does old code have to scale at all? It will just run 100 times faster as-is, under emulation if need be.

How does parallel programming become easier when it is supported in hardware? Serializing a parallel algorithm is easy; parallelizing a serial one is hard. I don't understand how the old code is going to run 100 times faster as-is. Presumably an emulator is going to be doing a translation, and writing the compiler that takes x86 code and parallelizes it isn't going to be much fun.


The problem with architecture today is that we have essentially given up on trying to speed up serial algorithms. The emphasis is on parallelism, instruction-level and up. This is certainly easier on the designer, but Amdahl eventually bites you in the ass.

No, it's not

I don't understand how the old code is going to run 100 times faster as-is.

Oh, gosh, that is simple. What changes is the thread/mutex library (that is the emulation). Wrap the old thread library's API around the new chips, leaving the application code alone. Recompile and link. So now ultra-fast hardware imitates a thread/mutex library that was once purely software. There are two speedups. One is the move to hardware. The other is the change from serial thread multiplexing to real parallel threads. One may count the tantalizing performance reports on this chip as a third speedup factor, i.e., it's just a faster chip.
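
To make the wrapping concrete -- and this is purely a shape sketch, with every name below invented and the "native" layer faked with ordinary Java classes, not any real Cell interface -- the old API keeps its entry points while their bodies delegate to whatever the new hardware exposes:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for the new hardware's scheduler; here it is backed
// by ordinary Java classes purely so the sketch compiles.
final class CellRuntime {
    private static final ExecutorService pool = Executors.newCachedThreadPool();
    private static final ConcurrentHashMap<Long, ReentrantLock> locks = new ConcurrentHashMap<>();
    private static final AtomicLong ids = new AtomicLong();

    static long spawn(Runnable body) { pool.submit(body); return ids.incrementAndGet(); }
    static long newMutex() { long id = ids.incrementAndGet(); locks.put(id, new ReentrantLock()); return id; }
    static void acquire(long mutex) { locks.get(mutex).lock(); }
    static void release(long mutex) { locks.get(mutex).unlock(); }
}

// The old thread/mutex API keeps its signatures; only the bodies change.
// Application code compiled against OldThreads is recompiled, relinked,
// and never sees what now sits underneath.
public final class OldThreads {
    public static long createThread(Runnable body) { return CellRuntime.spawn(body); }
    public static long createMutex()               { return CellRuntime.newMutex(); }
    public static void lock(long mutex)            { CellRuntime.acquire(mutex); }
    public static void unlock(long mutex)          { CellRuntime.release(mutex); }
}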

I don't see any need for x86 emulation per se. Sorry if I confused you. The code just needs to be recompiled, modulo the thread library port, which is more involved, but of limited scope. All of which is just for backwards compatibility anyway. What is exciting is the prospect of new, parallel programs in new, parallel languages.

Serializing a parallel algorithm is easy; parallelizing a serial one is hard.

I proposed no such thing. Presumably, an old program using threads is already parallelized. Algorithms qua algorithms are not programs, and have their own issues. At least with parallel hardware we'll be only theory-constrained, not hardware-constrained.

How does parallel programming become easier when it is supported in hardware?

Parallel programming becomes easier with proper language support. Well, I said "since the parallel support is on the processor" but later qualified with "only naturally parallel languages will make the cut," making my notions explicit, I hope. The idea is of parallel processors coupled to languages that exploit the hardware and remove complexity. A huge difference separates good languages from bad ones on this score. The bad ones create an illusion that parallel work is intrinsically hard, as the CTM preface observes:

Concurrency in Java [and its ancestor, C/C++] is complex...and expensive....Because of these difficulties, Java-taught [and C-taught] programmers conclude that concurrency is a fundamentally complex and expensive concept. Program specifications are designed around the difficulties, often in a contorted way. But these difficulties are not fundamental at all. There are forms of concurrency that are quite useful and yet as easy to program with as sequential programs....Furthermore, it is possible to implement threads, the basic unit of concurrency, almost as cheaply as procedure calls. If the programmer were taught about concurrency in the correct way, then he or she would be able to specify for and program in systems without concurrency restrictions (including improved versions of Java).

Yes, it is.

As for your comment that scaling doesn't matter, I agree that it will be fine as long as there are a LOT of threads with no dependencies between them. Otherwise, the speed of that code is going to be very reliant on the "third speedup factor", and the memory performance of the chip.

I earlier interpreted your comment to mean that the emulator magically translated an x86 program with a few threads to a massively threaded native Cell one. I think that would be difficult to accomplish.

Language support is helpful, but constructing efficient parallel algorithms is always hard, particularly when you have to take into account communications costs (including memory). Coding for Cell sounds a lot like coding for NUMA machines. Performance there is always dependent on memory access costs, an area where languages are unlikely to be of much benefit.

Re: Yes, it is.

As for your comment that scaling doesn't matter

Sigh. Note what I said: why does old code have to scale at all? and what is exciting is the prospect of new, parallel programs in new, parallel languages. The old code stands in contrast to the new, parallel programs. How you derive "scaling doesn't matter" from these remarks is beyond me. Old code just needs a transitional phase, not perpetual maintenance. No one writes 16-bit code anymore, and that change did not take forever. The change to parallel hardware and software may well resemble it.

I earlier interpreted your comment to mean [something that is] difficult to accomplish.

Irrelevant misinterpretation, then.

constructing efficient parallel algorithms is always hard

In the first place, disambiguate algorithms from programs. In the second place, there exist parallel-friendly "algorithms," too. Finite-element simulation engineers would beg to differ with your assessment, as would graphics programmers. Some algorithms can be naturally parallel, just as others can be naturally serial. The problem today is that we have only one form of hardware, so the constraint is tighter than the algorithm itself.

This yes-it-is/no-it-isn't debate feels like Slashdot now, so I will bow out. Sony and IBM appreciate the merits of parallel processors to the tune of a billion dollars, along with the authors of CTM, so I stand in good company.

Maybe Peter Van Roy will offer some thoughts. I have spilled enough electrons.

Re: Yes, it is.

First, my comment was made in the context of Hurt's original remarks on legacy code and I think it accurately reflects your comment: "Why does old code have to scale at all?"

I don't think it's clear that Cell will run old code faster in emulation or even with a recompile. It's going to depend on specifics within the architecture.

Legacy code is important. No one writes 16-bit code anymore, but I know plenty of people who still run it. Cell will have a hard time gaining traction (beyond game machines and supercomputers) if everything needs to be rewritten for it.

Next, the two examples of parallel-friendly algorithms are only friendly on SMP machines. I haven't read anything technical on Cell, but it doesn't appear to have SMP behavior. You keep writing that I need to keep programs separate from algorithms, but the suitability of an algorithm is going to be affected by the architecture, not just its expression. Once you move away from von Neumann, you have to choose some other abstraction. Languages assume those abstractions, and performance suffers if the abstraction doesn't match the underlying architecture. Concurrency in C is painful because C's assumptions don't match what the programmer is trying to accomplish. Similarly, Cell seemingly specifies a very odd set of assumptions that I don't think match the languages you mentioned (I'm not familiar with all of them, but aren't they all PRAM?). Perhaps a clever compiler can hide that complexity, but I doubt it.

Cell looks like an interesting architecture. If it's successful I will be happy, since it will spur compiler research. But your comment about Sony and IBM and a billion dollars brings to mind the obvious parallel of Intel and Itanium (also an interesting architecture).

Bugs

I doubt that most applications found on Linux will function correctly if taken from emulated threading to real threading.
From the description of cell computing, it seems the biggest gain would be in the local memory, a concept that is not very compatible with the global memory model for which most C programs are written.

The life expectancy of the x86

I agree that languages should not focus on the desktop. But if you're going to predict the death of the x86, then you have to predict a change of CPU on the desktop -- or the death of the desktop.

Yes, there are several attempts at serious multi-threading (you forgot Cilk and JoCaml, among others). It's quite possible that one of them is the solution. But notice: the existence of a language doesn't make it popular. None of the languages we've listed comes close to being popular compared to Java or C++. As a crude measure of popularity, wander down to your local bookstore and compare the number of books on Java or C++ to the number of books *total* on all Lisp derivatives.

This isn't right, isn't fair, and isn't good, I agree- but it *is*.

Re: The life expectancy of the x86

Recall I said that Languages should not focus exclusively on the desktop. Also, Abandonment is relative means that I personally expect the x86 to live longer than does the author. Nor do I forget Cilk. You should also review Flow Java from the CTM folks. On popularity, well, 16-bit code was popular once, too.

multiple threads or multiple data

Can somebody explain to me why you'd have to write multithreaded code to take advantage of multiple processors? Why not simply have one thread where every processor calculates a part of the data? Like:

for (Object item : collection) {
    // do something with the item
}

In a lot of cases it's very easy to prove that each item can be handled in parallel with every other item.
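
When the items really are independent, the mechanical version of that idea is just to chunk the collection and give each processor its own slice. A rough Java sketch (sizes and names arbitrary, no claims about the actual hardware):

import java.util.ArrayList;
import java.util.List;

// Data parallelism over independent items: each thread works on its own
// contiguous slice of the collection, so no locks are needed while working.
public class ParallelForEach {
    static void process(int item) {
        // stand-in for "do something with the item"
    }

    public static void main(String[] args) throws InterruptedException {
        List<Integer> collection = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) collection.add(i);

        int workers = Runtime.getRuntime().availableProcessors();
        int chunk = (collection.size() + workers - 1) / workers;
        Thread[] threads = new Thread[workers];

        for (int w = 0; w < workers; w++) {
            final int from = w * chunk;
            final int to = Math.min(collection.size(), from + chunk);
            threads[w] = new Thread(() -> {
                for (int i = from; i < to; i++) process(collection.get(i));
            });
            threads[w].start();
        }
        for (Thread t : threads) t.join();
    }
}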

It isn't easy.

The problem isn't the dependency in the order of the operations, but in the possible side effects of those operations. The compiler would have to prove that doing these things in parallel is equivalent to doing them serially -- in other words, it would have to prove that the piece of code is thread-safe.
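
A single hidden side effect is enough to sink that proof. A contrived Java example (entirely made up for illustration): each loop body looks independent, but all of them append to one unsynchronized list, so parallel and serial execution are no longer equivalent:

import java.util.ArrayList;
import java.util.List;

// The loop body *looks* independent, but the hidden side effect -- appending
// to a shared, unsynchronized list -- means parallel and serial runs are not
// equivalent: under races, updates can be lost or the list corrupted.
public class HiddenSideEffect {
    static final List<Integer> results = new ArrayList<>();   // shared state

    static void handle(int item) {
        results.add(item * item);   // the side effect a compiler must detect
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[8];
        for (int t = 0; t < threads.length; t++) {
            final int base = t * 10_000;
            threads[t] = new Thread(() -> {
                for (int i = 0; i < 10_000; i++) handle(base + i);
            });
            threads[t].start();
        }
        for (Thread t : threads) t.join();
        // run serially this always prints 80000; run in parallel it usually won't
        System.out.println(results.size());
    }
}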

Ok then, but...

What if the programmer simply specifies that it is safe to process the items in parallel -- would it work then? Or are there other problems?

If the programmer is wrong...

the program would crash or give wrong results. This is very tricky if you don't have referential transparency. OTOH, if you have RT, then the compiler can apply any such trick as long as the code isn't explicitly serialized (e.g. with monads).

I think this is where there is much to be gained

I think this is where there is much to be gained.
Personally I would use Java with a few added keywords to spawn these subtasks, but how do you manage to push only the needed pages into a cell's local memory?
A kind of two-stage paging, where the main RAM becomes a cache in front of the virtual memory stored on disk?

How about FP?

Side-effect-free functional programming and array-oriented languages could have an advantage on cell processors.

Pipelines

I remember reading this article on the innards of the Playstation 2 a while back. It seems as if there are obvious benefits to this kind of architecture for systems where a pipelining approach makes sense: gaming/rendering, pro audio/DSP (as a guitarist who likes long FX chains, I start to salivate when I think about this in detail), maybe crypto as well. In other words, it's ideal for situations where you have discrete subsystems doing specialized tasks that can be chained together, as in the PS2's Emotion Engine, rather than when you have a single homogeneous task that you then somehow have to parallelise.

Re: Pipelines

JITting for emulation of older architectures (x86) may be one such application, perhaps something like a parallel version of the SoftPear approach: one pipeline stage translates x86 opcodes into native opcodes, while another executes them.
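
In outline -- a generic producer/consumer sketch in Java, not anything taken from SoftPear or a real JIT -- one stage feeds translated work to the other through a queue, so translation and execution overlap in time:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Two-stage pipeline: one thread "translates" incoming work, a second thread
// "executes" the translated form, and the two stages overlap in time. The
// translation and execution steps here are trivial stand-ins.
public class TranslateExecutePipeline {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> translated = new ArrayBlockingQueue<>(1024);
        String end = "<end-of-stream>";

        Thread translator = new Thread(() -> {
            try {
                for (int pc = 0; pc < 10; pc++) {
                    translated.put("native-op-" + pc);   // stand-in translation
                }
                translated.put(end);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread executor = new Thread(() -> {
            try {
                String op;
                while (!(op = translated.take()).equals(end)) {
                    System.out.println("executing " + op);   // stand-in execution
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        translator.start();
        executor.start();
        translator.join();
        executor.join();
    }
}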

Confirming Footnote

Footnote: Slashdot recently linked to a more detailed technical study that confirms my impressions. There are some choice quotes in the piece and it tackles both the tech details and the market implications.

Ars Technica winces

A small response on Ars Technica winces at some parts of Blachford's analysis.

--Shae Erisson - ScannedInAvian.com

Thank you

After reading the "analysis" article yesterday I felt the urge to post some flamebait; glad that Ars Technica already responded in a nicer way :-)

The Register Articles

Take urges to Slashdot. I'm not following the details, but the Ars Technica contention seems small and pedantic, as Blachford says in his rebuttal. Either way, $1 billion for new fabs speaks for itself. For a wider perspective, see The Register's recent piece, "The Cell chip - what it is, and why you should care" (pt. 1, pt. 2).

But in the end, technical arguments like these won't be the decisive factors. We have to step right back and look at how and why people depend on computer technology, and exactly who in the world stands to benefit from each - and there are many - of the possible "victory" scenarios.

And

This time, Cell is aimed at a different market, one that Wintel has failed to conquer - the living room.

Another footnote, which could be considered good or bad:

Each Cell is given a GUID, a global identifier.