Scala Team Wins ERC Grant

The Scala research group at EPFL is excited to announce that they have won a 5-year European Research Council grant of over 2.3 million euros to tackle the "Popular Parallel Programming" challenge. This means that the Scala team will nearly double in size to pursue a truly promising way for industry to harness the parallel processing power of the ever-increasing number of cores available on each chip.

As you can see from a synopsis of the proposal, Scala will be providing the essential mechanisms to allow a simpler programming model for common problems that benefit from parallel processing. The principal innovation is to use "language virtualization", combining polymorphic embeddings with domain-specific optimizations in a staged compilation process.

This may yet lead to very interesting developments.


Extended synopsis

Here's an extended synopsis of the scope and plan for the project.

Which is definitely worth

Which is definitely worth reading, I should add.

The scope seems to include an "Effect System" for Scala

From page 4 of the ext. synopsis: "The goal of this work package is to come up with a general, user-extensible effect system that’s at the same time accurate and lightweight."

This is great to see --

This is great to see -- Scala has definitely become a common host language for parallel DSLs, thanks both to its Java heritage and to its frontend facilities. Consciously boosting this effort, and getting funding to do so, is awesome news!

For such a vision, I have to wonder -- how is this different from similar trends in the Haskell approach? There seems to be more focus on frontend support (syntax, inference), and the use of higher-kinded types is cool. OTOH, I've found tuning and JITs/profile-guided techniques to be essential both for irregularities in small kernels and for scaling to large code bases with no individual kernels: the proposal touches on supercompilation, but the discussion of dynamic support is rather shallow (and I'd argue not intrinsic to advocating DSLs).

With FRP and staged

With FRP and staged compilation among their goals, I certainly look forward to the outcome of this labor.

I find interesting their proposed 'inverted' use of types where dynamic values are explicitly marked Rep[T] and everything else gets a compile-time value during staged compilation.
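To make that concrete, here is a minimal sketch of the inverted-staging idea as I read it (the Rep encoding and the power example are purely illustrative, not the project's actual API): values of type Rep[T] are the ones that survive into the generated program, while plain Scala values are consumed at staging time.

    object StagingSketch {
      // Hypothetical encoding: a Rep[T] is just a piece of next-stage code here.
      case class Rep[T](code: String)

      def mul(a: Rep[Double], b: Rep[Double]): Rep[Double] =
        Rep[Double](s"(${a.code} * ${b.code})")

      // The exponent n is an ordinary Int, so this recursion runs entirely at
      // staging time and is unrolled away; only the multiplications remain.
      def power(x: Rep[Double], n: Int): Rep[Double] =
        if (n == 1) x else mul(x, power(x, n - 1))

      def main(args: Array[String]): Unit =
        println(power(Rep[Double]("x"), 4).code)  // prints (x * (x * (x * x)))
    }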

how is this different from similar trends in the Haskell approach? - Leo

I'd certainly like to see simplified support for staged compilation (runtime specialization / JIT of partially applied functions) in Haskell, perhaps triggered by an identity function similar to 'par' and 'seq'. It would be a huge boost to support for DSLs.

discussion of dynamic support is rather shallow (and I'd argue not intrinsic to advocating DSLs)

Well, the ability to compile DSLs effectively would reduce reluctance by performance-conscious developers to utilize DSLs. It doesn't even matter whether they are 'right' (or 'shallow') to be concerned about performance; it remains a psychological barrier to adoption and a problem for advocacy. I, for example, am a performance-conscious developer, one who likes to understand what happens to code under the hood in order to know where code should be written more carefully and where returns would be marginal. I would be reluctant to accept a DSL unless I was confident of its implementation.

Lowering the barriers to entry, not just for those expressing DSLs but also for those using them, is a very important consideration.

That said, there are other major concerns for using DSLs not touched upon in the extended synopsis, such as expressing type, syntax, and error messages in terms of the DSL code... and integrating the debugger.

Solution?

So, the future Scala solution to parallelism is to give the user cool typechecking and metaprogramming tools so that the user can construct a new ad-hoc DSL to solve his parallelism problems himself, in that DSL, running as staged code within Scala, running inside the .NET virtual machine.

1. This moves the parallelism problem around without solving it.

2. How will heterogeneous computing and this GPU[t] thing actually work? You can't really port .NET to CUDA/OpenCL given their incompleteness, so what can a DSL do other than strcat some code together and pass it to a command-line compiler, and communicate with it using ad-hoc synchronization and data sharing?

seconded

I admit similar skepticism over the final efficacy of this plan. The individual pieces are nice hunks of CS or implementation work, and having them available could indeed make a system that is "nice to look at". Things did stray into hand-waving, though, when they mentioned CUDA and C/MPI. Also, if one reads the research goal from the funding agency, there's all this hullabaloo about the ordained IT crunch when available parallelism wildly outstrips our ability to program to it. But then all the application areas cited here are already well situated with their own effective (if balkanized) parallelism strategies.

I do like the idea of getting things all under one roof to share common infrastructure, and don't want to nip at the heels of someone interested in actually putting in the work of making such integrated systems happen. However, they did a few things that raise flags to me, such as the quick jargonization of their own work (language virtualization, polymorphic embeddings). Perhaps that's more a strategy to get the funding than an explanation of the work.

It also seems they'll be redoing a whole lot of academic and implementation work while they're at it. Even in my little slice of the world I've seen similar projects, such as Rascal and DMS*. They reside within a whole analysis and transformation world apparently undiscovered by the group here, judging by the text and the otherwise extensive references.

* Note I work for the maker of DMS. There's a prominent failed exercise in our company lore where we made a snazzy vector code generator using a DSL layered into C++. The target company didn't buy in because the returns would've been more than 6 months off, with additional complaints about lack of Visual Studio integration. So I have doubts that large groups will jump on this new environment -- everyone "wants" parallelism, but I've seen few big players who are willing to put in the effort to go beyond their current ad-hoc investments into a wider integrated system.

Yep

there's all this hullabaloo about the ordained IT crunch when available parallelism wildly outstrips our ability to program to it. But then all the application areas cited here are already well situated with their own effective (if balkanized) parallelism strategies.

It's not just you who thinks that. The problem is that all grant proposals say this stuff and nobody ever gets called out on it for being lame.

What no grant reviewer will ever dare point out is that domain decomposition will be the most effective approach to parallelism, and that, as you say, these application areas all have decomposition techniques unrelated to the programming language work discussed.

The hugest non sequitur in this grant proposal, the one I laughed at, was the mention of Google's and Facebook's uses for parallelism. It showed either complete naivete or hand-waving; I don't know how to describe it accurately.

The biggest issue that computer scientists can assist with on the tools side, which includes IDEs and languages, is going to have to come from empiricism. That's what Leo was speaking to, and what Tim Sweeney was speaking to. For example, the Google example is a total non sequitur because nobody actually understands how effective Google is at saving power vs. returning results fast. And, as I've pointed out before, it might be more financially and environmentally responsible for Google to model the accuracy of its search results vs. the time it takes users to give up waiting for the results to return vs. quality satisfaction with results (which is different from accuracy). These three compartments of a statistical model are probably more important factors than straight-line parallelizability.

But what about straight-line parallelizability? They didn't cite Kathryn S. McKinley's studies on the so-called Power Wall. This is the real motivator right now for better tools and languages, and the real motivator for better integrating tools and languages. It is the integration of concerns that matters. Data centers at Google, Facebook, and so on are sucking down the world's power supply at incredible rates.

Overall, I'm glad Martin got funding - he's a smart guy and sometimes you just have to trust him - but the proposal was kind of sloppy. But it is also the norm.

2. How will heterogeneous

2. How will heterogeneous computing and this GPU[t] thing actually work? You can't really port .NET to CUDA/OpenCL given their incompleteness [...]

See Bling. Now consider having language-level support for the kinds of syntactic transparency Bling wishes it could provide, to the point that developers largely don't even realize they're programming with lifted values, but it all works as expected.

The point is that experts would write the DSLs to provide optimal synchronization points, scheduling, etc. (or use a provided framework/library) and users would just write programs as normal. This is LINQ, but adding more syntactic flexibility and the ability to write programs that abstract over the concrete representation (due to higher-kinded types).
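As a rough sketch of what "abstracting over the concrete representation" could look like (the Nums trait and the direct backend below are made-up names, not anything from the proposal), the same user program can be interpreted directly or handed to a staged backend that builds and optimizes an IR:

    import scala.language.higherKinds

    object PolyEmbeddingSketch {
      // A tiny DSL interface parameterized by its representation R[_].
      trait Nums[R[_]] {
        def lit(d: Double): R[Double]
        def add(a: R[Double], b: R[Double]): R[Double]
        def mul(a: R[Double], b: R[Double]): R[Double]
      }

      // A user program written once, against the abstract representation.
      def norm2[R[_]](x: R[Double], y: R[Double])(implicit n: Nums[R]): R[Double] =
        n.add(n.mul(x, x), n.mul(y, y))

      // One backend: direct evaluation. A staged backend could instead build an
      // IR from the same program and emit specialized JVM or GPU code.
      type Direct[T] = T
      implicit object DirectNums extends Nums[Direct] {
        def lit(d: Double) = d
        def add(a: Double, b: Double) = a + b
        def mul(a: Double, b: Double) = a * b
      }

      def main(args: Array[String]): Unit =
        println(norm2[Direct](3.0, 4.0))  // 25.0
    }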

Bling works well for

Bling works well for graphics programming of pixel and vertex shaders; Bling in Scala would be even better, given Scala's more powerful and user-friendly type system.

GPGPU is a whole different ballgame. The syntax problem is the least of your concerns; the semantic model for GPGPU is bizarre, and you have to really know exactly what's going on to get decent performance. If the Scala team could make any progress in that regard, it would be a very big deal.

Understood, my point was

Understood, my point was only that the developers would just be programming against a usable API/DSL, and domain experts would be providing the backend and managing the complexities. Bling is just a good illustrative example of this principle in a limited domain. This Scala project will just be taking it to a whole new level, as you say.

Tim, Scala runs primarily on

Tim, Scala runs primarily on the JVM; a port to .NET is currently in the works, but it's not done yet.

The main point that you miss in your reply is that we do not assume one kind of user. The parallel programming domain expert will construct DSLs and give optimization hints that the framework will map to different architectures. The application programmer will then just use the DSLs. Some of these DSLs already exist, see the work on Liszt and OptiML at Stanford.

I explicitly do not claim to have a silver bullet to the PPP challenge. All I think we have is a lever to address the problem with continuous work without starting from scratch each time. Both the framework and individual parallel DSLs can be re-used many times over.

Regarding point 2, you are right that we will never be able to port whole applications to CUDA/OpenCL. That's why domain-specific program analyses are needed to decide what goes where. The low-level work is indeed messy; we'll have to throw some serious resources at making it easy to use and robust. It's good that we have those now.

Towards nice parallelism

I can imagine application code written in ordinary Scala which directly invokes GPU pixel shaders written in a subset of Scala that is typechecked via staged compilation and translated to run on the GPU. This would provide a cleaner and more robust programming environment than the current model, where C++ application code passes GPU shaders written in a different language (GLSL) to OpenGL as strings and performs ad-hoc, error-prone marshalling of shared data.
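Purely as a hypothetical sketch of that model (none of these names come from the proposal): the shader body would be an ordinary, typechecked Scala function over staged values, and staging it would emit GPU source instead of executing on the CPU, with no string pasting in the host code.

    object ShaderSketch {
      case class Rep[T](code: String)  // a staged value: code destined for the GPU

      def mul(a: Rep[Float], b: Rep[Float]): Rep[Float] =
        Rep[Float](s"(${a.code} * ${b.code})")

      // A "pixel shader" written as a Scala function over staged inputs;
      // the Scala typechecker catches errors before any GPU source exists.
      def brighten(color: Rep[Float], gain: Rep[Float]): Rep[Float] =
        mul(color, gain)

      def main(args: Array[String]): Unit =
        // Staging yields shader source rather than a CPU result:
        println(brighten(Rep[Float]("texColor.r"), Rep[Float]("u_gain")).code)
    }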

If the industry's inevitable path is towards extreme heterogeneous computing, then your approach seems about as good as I could imagine. But I tend to take the CPU/GPU dichotomy as a historical anomaly that will likely be undone as CPUs grow more GPU-like, with more cores, wider vector instruction sets, greater focus on aggregate performance per watt, etc.

If the hardware is homogeneous, then I think global language-level solutions to parallelism are viable and much cleaner (e.g. some combination of pure functional programming, transactions, vectorization, and messaging).

Need a crystal ball here

You might be right that the current CPU/GPU distinction will go away again in the future. I also think the trend is towards instruction set convergence, but on the other hand current computer architecture trends seem to favor heterogeneous architectures, where some processors are more powerful than others.

But I believe staging will be essential even in a completely homogeneous computer architecture setting. The reason is that you need a way to extract massive parallelism from applications, and domain knowledge + staging is inherently more powerful at this than global language-level solutions such as functional programming, transactions, messaging, etc. Not that the latter are not useful; of course they are. But I have not yet seen evidence that they can scale many applications well beyond, say, 100 cores.

But we can't know for sure, of course, without having tried this avenue and compared it with others. That's what the research is for.

Sorry if this appears blunt, but...

I also think the trend is towards instruction set convergence

This would be great, but is it not pie-in-the-sky thinking? I don't see anything in the future which tells me that (for example) GPUs are anywhere close to giving up their texture fetch units, or that CPUs are anywhere close to having texture fetch units. If you've tried to access memory with any decent sort of performance on a GPGPU task, you end up having to use texture units, or Cell-style manually managed private memory.

I don't think any of the above will appear anytime soon on a CPU instruction set, so what's your evidence for that statement?

Evidence

Intel, AMD, and NVIDIA are each shipping (or will shortly ship) chips that integrate CPU cores with vector instruction sets, even more vectorized GPU cores, and various bits of fixed-function hardware.

Intel is about to ship tens of millions of Sandy Bridge chips that have CPU cores with 8-wide vector operations, and texture sampling units on-die. In another generation, those vectors could be 16-wide and the texture units could be generally accessible across the on-chip bus -- as was the case with the (working, but not publicly released) Larrabee chips Intel built two years ago.

Each generation, NVIDIA GPUs add more cache and improve substantially for general memory access operations, while growing ever more programmable.

These trends can't realistically be construed as anything other than instruction set convergence.

Intel is about to ship tens

Intel is about to ship tens of millions of Sandy Bridge chips that have CPU cores with 8-wide vector operations, and texture sampling units on-die.

Ah, fantastic. Thanks for correcting me (that was news to me at least).

Also,

This is the direction things were heading in the 1980s, until the AI Winter killed research into heterogeneous hardware architectures. [Edit: And a lot of the biggest names in hardware at Intel cut their teeth in that era of hardware architecture.]

There is still a lot of programming language research from the 1980s that goes uncited, some of which contains good ideas. Usually when I bring it up to current leading researchers, they just ignore it. It is kind of interesting; there is simply too much new stuff for people to care about the old stuff. We had an LtU story about this recently, written by Jack Dennis and Peter Denning... although those authors went even further back in the search for a good history lesson.

Even with instruction set

Even with instruction set convergence, staging and meta-programming can still be useful. You can create a custom physics engine in Bling using high-level abstractions, and the engine will be dynamically output to CLR code. The advantage is that the generated CLR code is lower-level and more efficient than the code the user would have written directly; for example, it doesn't create garbage when doing physics-engine computations.
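Here's a toy illustration of that point (made-up names, not Bling's API): the high-level vector wrapper exists only at staging time, so the generated code is flat scalar arithmetic that allocates nothing.

    object NoGarbageSketch {
      case class Rep(code: String)  // a staged scalar
      def add(a: Rep, b: Rep) = Rep(s"(${a.code} + ${b.code})")
      def mul(a: Rep, b: Rep) = Rep(s"(${a.code} * ${b.code})")

      // The Vec wrapper is used only while staging; it never appears in the output.
      case class Vec(x: Rep, y: Rep) {
        def +(o: Vec) = Vec(add(x, o.x), add(y, o.y))
        def *(s: Rep) = Vec(mul(x, s), mul(y, s))
      }

      def main(args: Array[String]): Unit = {
        val pos = Vec(Rep("px"), Rep("py"))
        val vel = Vec(Rep("vx"), Rep("vy"))
        val next = pos + vel * Rep("dt")  // an Euler step, written at a high level
        // The generated code is plain scalar arithmetic -- no Vec objects, no garbage:
        println(next.x.code)  // (px + (vx * dt))
        println(next.y.code)  // (py + (vy * dt))
      }
    }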

Two separate issues

1. How convincing is this approach to parallelism?
2. What will this do to Scala?

1. How convincing is this

1. How convincing is this approach to parallelism?

Time will tell; I think it will be interesting to see how many of the problem domains they list will actually be addressed. I am kind of curious what sort of benchmarks already exist out there for some of this stuff, like mesh solvers. I honestly don't know how you benchmark some of the domains they listed. And some, like Google Search, don't have benchmarks and carry an implicit suggestion that there exists a JVM capable of running Scala code to outperform the Google code [edit: that is kind of interesting, since it basically says that whatever solution Scala comes up with, they will depend on something like Azul Systems, Google's Dalvik team, and/or .NET engineers to actually make it work]. That implicit suggestion is a big Whatever/WhoCares since it can never be tested.

Does anyone know of benchmarks for mesh solving?

Just to clarify: The

Just to clarify: The synopsis is evidently not the whole proposal. The whole proposal runs to about 40 pages. These proposals are generally not published, because they contain very detailed plans of future research that might be scooped by others (in the worst case by someone taking out patents on this work).

Whose money

These proposals are generally not published

Quite ironic, given that they are applying for funding from public money.

Funding Scala is probably among the best uses of these monies, but the setup is still illogical.

While the ways in which

While the ways in which science gets funded deserve to be discussed and criticized, I am not sure LtU is the appropriate place. Nor is doing it in this thread fair to Scala.

Related

A recent topic (also with Martin Odersky) seems closely related to several goals of the proposed Scala project.