Automatic program parallelization for multicore CPUs as a software problem

I just came across an article saying Intel will have massively multicore CPUs ready in the near future. CPUs of this kind will require a fundamental change in how we write programs, or so they say (there are plenty of discussions about this).

But do we really have to change the way we write software? I wonder if there could be a part of the chip that automatically makes different threads of execution out of a single thread by correlating data dependencies at run-time. Just as with branch prediction, a similar piece of logic could 'predict' data correlations and thus separate the instruction stream into different threads, as if the code were multithreaded. A special lookaside buffer could be used to prevent simultaneous access to memory locations: since data are always fetched from the cache, the cache itself could contain flags for monitoring simultaneous access. The CPU would catch the simultaneous-access event and adjust its threads and prediction statistics accordingly, so that the conflict is avoided next time.

I am asking LtU about this because I think the problem is essentially a software problem. In my mind, an instruction always targets a memory location (either directly or indirectly through registers), so the CPU could monitor dependent instructions and separate them into different threads of execution. The mechanism could be extended to registers, since registers are essentially memory locations inside the CPU.

Assuming that dependency tracking could take place at run-time, what programming-language semantics are required to help the hardware run the software more efficiently? Are C's semantics enough? For example, Fortran loops can be automatically vectorized because they do not contain pointer aliases.

Finally, are purely functional programming languages better, at the semantic level, for automatic parallelization? On the surface it certainly seems so, but what about languages that are translated to C (or equivalent) code? Aren't those programs under the same constraints as C? Would we have to eliminate C as the middleman and encode parallelization hints (coming straight from the FP semantics) directly in the instruction stream?


I wonder if there can be a

I wonder if there could be a part of the chip that automatically makes different threads of execution out of a single thread by correlating data dependencies at run-time. Just as with branch prediction, a similar piece of logic could 'predict' data correlations and thus separate the instruction stream into different threads, as if the code were multithreaded.

Modern super-scalar CPUs already do this in a sense. The instruction stream is analyzed for dependencies and the processor will spread instructions out among several execution units. Unfortunately, there's only so much parallelism that can be extracted in this manner, which is part of the reason for the move to multi-core.

I don't think analyzing a compiled program for automatic thread-level parallelism is something that can be done practically at runtime unless you can make assumptions about the way the source language generates machine code. Pointers make analyzing which parts of a program access which memory locations too complicated.

Finally, are purely functional programming languages better, at the semantic level, for automatic parallelization? On the surface it certainly seems so, but what about languages that are translated to C (or equivalent) code? Aren't those programs under the same constraints as C? Would we have to eliminate C as the middleman and encode parallelization hints (coming straight from the FP semantics) directly in the instruction stream?

Presumably, the compiler from the source language to C would generate a multi-threaded C program as output in such a situation, thus requiring no auto-parallelization by the C compiler itself. I don't think any of the existing parallel FP compilers use C as a middleman, though, but I could be mistaken.

As an aside, I'm working on an impurely functional (well, depending on your exact definition of functional, I suppose) dataflow language with efficient parallel execution as one of its goals (though I'm quite a ways from achieving that particular goal).

Re: new dataflow language

Are there pointers you'd suggest for a newbie to read, specifically discussing the set of dataflow and dataflow-like languages (I'm using that link too much recently) that exist, and how they do or do not succeed in being amenable to high-performance parallelization? Many thanks.

Not the best person to ask

To be honest, I'm not really the best person to ask. I didn't originally set out to create a high-performance parallel language; it's a goal I sort of picked up along the way, and while the current implementation is multi-threaded, it actually runs slower on my dual-core machine when the threads are allowed to run on both cores (obviously I have quite a bit more work to do in this area).

My original goal was to create an easy-to-use visual scripting language for adding simple logic to a templating system for InDesign documents. Dataflow just tends to lend itself to that kind of language, and as I migrated my goals from a very domain-specific language to a more general-purpose one, high-performance parallelization seemed like an obvious, though challenging, goal.

TRIPS == hardware dataflow?

TRIPS is a 'new' architecture from UT Austin. According to this article:

TRIPS is a demonstration of a new class of processing architectures called Explicit Data Graph Execution (EDGE). Unlike conventional architectures that process one instruction at a time, EDGE can process large blocks of information all at once and more efficiently.
...
Each TRIPS chip contains two processing cores, each of which can issue 16 operations per cycle with up to 1,024 instructions in flight simultaneously.

Anyone know what an 'explicit data graph' is?

EDGE

I don't quite get it yet myself, but Compiling for EDGE Architectures looks like it might be a good place to start.

Looks like rather than having the compiler produce an instruction stream that hardware parallelizes as best it can, the compiler produces a sequence of 128-instruction blocks which internally execute in parallel based on explicit data dependencies and externally appear to execute atomically (as one mega-instruction?).