## "Limits to implicit parallelism in functional application"

How much implicit parallelism is there in functional programs?

One answer is "lots (I'd imagine)." Functional languages (and especially non-strict functional languages) can place relatively few constraints on a program's order of evaluation, often allowing multiple evaluations to proceed in parallel--at least in principle. The resulting "implicit" parallelism may be easy and useful to locate and exploit.

Another answer is "not so much (in actual experience)." When you take an existing, non-trivial functional program and try to find practical ways to speed it up by judicious placement of par (or by strategies or whatever), a surprisingly large effort can produce a surprisingly small improvement. So maybe that implicit parallelism just isn't there.

Another is "well, maybe it's there but it's just hard to find." If people can't find much implicit parallelism in real software, maybe it's the fault of the people. Maybe par and strategies and such are just too hard for people to use. Maybe some automatic parallelization mechanisms, yet to be invented, could do much better. Or something.

One interesting new answer is "I don't know. Let's measure it." I've just written a short paper on measuring the implicit parallelism in real Haskell programs, such as the NHC98 compiler. These measurements are in the context of a limit study, so they let us say that there is _no more_ than some certain amount of implicit parallelism present; exploiting even that much might of course be impractical.

The paper is at http://www.detreville.org/papers/Limits.pdf. I'd appreciate comments, of course.

Cheers,
John

## Comment viewing options

### There was a language Id90, a

There was a language Id90, a secret functional programming's weapon of massively parallel execution. It employs so called lenient evaluation order, where everything is evaluated in parallel except lambdas and conditionals. It can be shown that this evaluation order has exponentially more parallelism than eager evaluation. And Id90 particularly shines in numeric computation - experiments done on Thinking Machines CM-5 had shown that most of the computation tasks double the solution speed when number of processors doubled (some of tasks, however, display this behaviour only after some number of processors in system).

Id90 evolved into pH, parallel Haskell.

Lenient evaluation incurs 100%-200% overhead for its' support. But I think it worth it - to sacrifice three times slowdown for doubling the speed almost arbitrarily and almost effortlessly.

### to sacrifice three times

to sacrifice three times slowdown for doubling the speed almost arbitrarily and almost effortlessly.

Not to be picky, but maybe you worded this improperly?

### Might be.

It might be worded improperly. I am not a native speaker. I try to elaborate.

If our algorithm, expressed in Id90, has the complexity of O(C90*N^q/P) and the same algorithm in Fortran (or C) has the complexity O(C95*N^q) we can sacrifice C90 being three times bigger than C95 when we simply can quadruple number of processors P and get speedup over Fortran (or C) version.

### > But I think it worth it -

> But I think it worth it - to sacrifice three times slowdown for doubling the speed almost arbitrarily and almost effortlessly.

Well, I wouldn't use you to do some accounting: investing software development effort now, for a three times slowdown or 2/3 third slowdown given that most CPU sold now are mono or bicore..
Even with a 4-way CPU, as the clock of each CPU in the 4 way computer is lower than one of the single CPU, the 4/3 improvement would show little real benefits..

So is-it reasonable to pay now for potential real benefits when >4 core becomes available for a reasonable price?

That depends of the application of course.. But usually software developers only starts developing for hardware X when a significant proportion of users have it.

### Programs in Id90 were ran on

Programs in Id90 were ran on multiprocessors up to 64 processors. See the links.

And you do not need any effort like #pragma mpi or whatever, because all loops are parallelized automatically.

So, you write your program once and run it multiply times, increasing number of processors to meet your needs without any noticeable slowdown.