Why Multi-Core is Easy and Internet is Hard

I was recently invited to be on the panel Reinventing Audio and Music Computation for Many-Core Processors at the International Computer Music Conference (ICMC 2008), Belfast, Ireland, Aug. 2008. In my position statement, The Challenges and Opportunities of Multiple Processors: Why Multi-Core Processors are Easy and Internet is Hard, I explain why programming multi-core processors is basically a sociological problem (the technical problems were solved long ago) and why programming loosely coupled systems (like the Internet) still has a lot of technical challenges. I am curious to hear what the LtU community thinks about this.

REST

REST proponents would say that programming systems like the Internet is just a sociological problem, as the technical problems are all solved by REST. I'm not far from believing in it myself; it's hard to see what is hard about loosely coupled distributed systems that can't be solved with it.

I'm not far from believing

I'm not far from believing in it myself

Not sure what's hard to believe. A REST-ful URI is just a lambda name! If you then maintain memory-safety, you now have a secure, distributed lambda calculus.
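
To make that concrete, here is a minimal sketch in Python (purely illustrative; this is not Waterken's or the web-calculus's actual API): an unguessable URI acts as the name of a closure, and knowing the URI is the only authority needed to invoke what it designates.

import secrets

registry = {}  # unguessable token -> closure

def export(fn):
    # Publish a closure under a fresh, unguessable name and return its "URI".
    token = secrets.token_urlsafe(16)
    registry[token] = fn
    return "https://example.org/cap/" + token   # hypothetical host

def invoke(uri, *args):
    # Dereference the URI: look up the closure it names and apply it.
    token = uri.rsplit("/", 1)[-1]
    return registry[token](*args)

balance = export(lambda: 42)      # a read-only "getter" capability
print(invoke(balance))            # anyone holding the URI can call it; prints 42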

Not only URIs

Besides URIs we have other defining traits in REST (e.g. HATEOAS, uniform interface), some of which aren't well mapped to well-known theories. Also nothing says that a distributed lambda calculus is a good fit for distributed systems when compared to distributed Pi or Join.

The actual comment was about believing that it's just a sociological problem. IMHO there are some issues that are still open (e.g. service versioning, authentication) and need some engineering work. Don't get me wrong, I've been working solely with REST-based architectures this year and I'm not looking back to anything else (to work within the Internet).

Also nothing says that a

Also nothing says that a distributed lambda calculus is a good fit for distributed systems when compared to distributed Pi or Join.

It's closest to a distributed lambda calculus with futures. The Waterken server was influenced heavily by E.

(e.g. service versioning, authentication)

I highly recommend you read the site I linked to. Tyler Close's web-calculus/web-keys work discusses all of these issues and solves them quite elegantly IMO.

cap-talk is an old friend

I've been a lurker on cap-talk for some time now; Waterken is what I point people to when they ask me about caps on the web ;). The fact that the web-calculus exists and is elegant is great, but there's a distance between having a theoretical solution and it being practical. I'm not trying to diminish Tyler's efforts or abilities, but there are many great ideas that don't scale. I think that the web-calculus is the solution to the problems I discussed above, but that's just my opinion; until there's a large base using and deploying it, we don't know whether it'll work as intended. I would bet on the web-calculus (and caps in general), but we can only say the problem is solved when it is "used in anger" by late adopters.

I'm not trying to diminish

I'm not trying to diminish Tyler's efforts or abilities, but there are many great ideas that don't scale.

Very true. If you've been following cap-talk, then I think a Horton implementation on the web would get Waterken close enough to implementing existing ACL-based policies to reap the benefits of both worlds.

There are also scalability concerns around traffic volume and development tools, but I think they're all fundamentally solvable.

REST?

REST says nothing about the distributed algorithms needed to coordinate or collaborate. REST provides global naming of resources and conventions to build client/server applications on the Internet. This is only the first step.

'solved long ago' eh?

Interesting read. But I think the (repeated) blunt statement that 'multi-core concurrency was solved long ago' is simplistic and hard to back up. There certainly are a lot of techniques available and documented, but specific situations or performance requirements often make it necessary to go beyond existing techniques, since none of them (as far as I know) is general, composable, efficient, and silver-bullety enough to solve all problems.

Two Generals' Problem

The first thing that comes to mind when you ask "what is hard about loosely coupled distributed systems" is the Two Generals' Problem, and the referenced position statement boils down to exactly that.

A bigger distinction should be made

between "loosely coupled" systems, and "widely distributed" systems.

In the former case, one can often assume the absence of malicious entities, a single administrative domain, as well as more desirable technical parameters (bandwidth, latency). In the second case, one must seek to guard against attacks of all sorts; you may be dealing with untrusted parties, you won't have control over all the network nodes, etc.

And to be pedantic, in between SMP/multicore and loosely coupled you have various NUMA architectures--in which you lose the "god's eye view" but generally can assume an absence of partial failure. Of course, the relationship between a CPU and a GPU is often NUMA, though the modern GPU is special-purpose enough that programming one is often not considered "multiprocessing" (it's just considered another peripheral).

A simplistic "chart"

Sequential code, private to an organization:
* Deterministic scheduling: yes
* Topology: Fixed.
* Migration: N/A
* Global consensus ("god's eye view"): yes
* Partial failure: no
* Malicious third parties: no
* Malicious clients: no
* Administrative domains: single
* Bandwidth: Extremely high
* Latency: Extremely low (nanoseconds to tens of nanoseconds)

Multithreaded code, private to an organization, running on a single CPU with pre-emption:
* Deterministic scheduling: no
* Topology: Fixed
* Migration: N/A
* Global consensus: yes
* Partial failure: no
* Malicious third parties: no
* Malicious clients: no
* Administrative domains: single
* Bandwidth: Extremely high.
* Latency: Extremely low (nanoseconds to tens of nanoseconds)

SMP/Multicore, private to an organization:
* Deterministic scheduling: no
* Topology: Extremely stable (occasional CPU upgrade in some systems)
* Migration: Trivial (one processor as good as another)
* Global consensus: Possible, though more expensive (cache coherency issues)
* Partial failure: no
* Malicious third parties (M3P): no
* Malicious clients (MC): no
* Administrative domains: single
* Bandwidth: Extremely high
* Latency: Very low (tens to hundreds of nanoseconds, cache coherency, again)

NUMA architectures:
* Deterministic scheduling: no
* Topology: Very stable (occasional CPU upgrade in some systems)
* Migration: Somewhat trivial--support typically provided by the underlying OS.
* Global consensus: Expensive, often "no"
* Partial failure: no
* M3P: no
* MC: no
* Administrative domains: single
* Bandwidth: Very high
* Latency: Low (hundreds of nanoseconds to microseconds)

Dedicated clusters with fixed topology, decoupled or isolated from an organization's intranet:

* Deterministic scheduling: no
* Topology: Stable-very stable (such systems are not reconfigured without good reason)
* Migration: somewhat difficult--though often supported by OS or grid/clustering software.
* Global consensus: No
* Partial failure: Yes, though uncommon and usually quickly detectable
* M3P: no
* MC: no
* Administrative domains: single (administrative responsibility may be subdivided, though there is usually one person "at the top")
* Bandwidth: High
* Latency: Medium-low (tens to hundreds of microseconds)

Intranets in general and similar private distributed systems

* Deterministic scheduling: no
* Topology: moderately stable (often changes to topology, though most are harmless or published in advance)
* Migration: difficult.
* Global consensus: No
* Partial failure: Yes, though somewhat uncommon and usually quickly detectable
* M3P: no
* MC: no
* Administrative domains: single (administrative responsibility may be subdivided, though there is usually one person "at the top")
* Bandwidth: High
* Latency: Medium (hundreds of microseconds to milliseconds or more)

Clusters, other loosely coupled systems, via a VPN (all nodes private to an organization, but over a public network; possibly with negotiated QoS)
* Deterministic scheduling: no
* Topology: moderately stable (often changes to topology, though most are harmless or published in advance). Topology of intermediate network is often a "don't care" and is abstracted away by VPN.
* Migration: difficult.
* Global consensus: No
* Partial failure: Yes
* M3P: yes, though modern VPN abstractions reduce the risk via transport-layer encryption (transparent to the application).
* MC: no
* Administrative domains: multiple. Nodes are technically under a single point of control, though geographically distributed sysadmins are often better modeled as multiple domains. Network service provider(s) are additional domains, though they are often required to provide some level of QoS even if best-effort.
* Bandwidth: Medium
* Latency: Medium (milliseconds to tens of milliseconds)

Distributed systems over the open internet, all nodes private to an organization:
* Deterministic scheduling: no
* Topology: unstable
* Migration: difficult.
* Global consensus: No
* Partial failure: Yes, with fault isolation difficult.
* M3P: yes. In addition, some service providers may be hostile to some applications (e.g., BitTorrent).
* MC: no
* Administrative domains: multiple. Intermediate domains may be difficult to identify, and intermediate service providers may be indifferent to endpoints.
* Bandwidth: Medium-low
* Latency: High (tens to hundreds of milliseconds)

External services over the open Internet:
* Deterministic scheduling: no
* Topology: very unstable; client nodes come and go at will.
* Migration: difficult.
* Global consensus: No
* Partial failure: Yes, with fault isolation very difficult.
* M3P: yes. In addition, some service providers may be hostile to some applications (e.g., BitTorrent).
* MC: yes, tons of them.
* Administrative domains: multiple, including clients under different domains.
* Bandwidth: Low, especially if the client is over dial-up or another low-bandwidth connection.
* Latency: Low to very low (hundreds of milliseconds or more)

LtU Marketing Department

| Scenario | Deterministic scheduling | Topology | Migration | Global consensus ("god's eye view") | Partial failure | Malicious third parties | Malicious clients | Administrative domains | Bandwidth | Latency |
|---|---|---|---|---|---|---|---|---|---|---|
| Sequential code, private to an organization | yes | Fixed | N/A | yes | no | no | no | single | Extremely high | Extremely low (nanoseconds to tens of nanoseconds) |
| Multithreaded code, private to an organization, running on a single CPU with pre-emption | no | Fixed | N/A | yes | no | no | no | single | Extremely high | Extremely low (nanoseconds to tens of nanoseconds) |
| SMP/Multicore, private to an organization | no | Extremely stable (occasional CPU upgrade in some systems) | Trivial (one processor as good as another) | Possible, though more expensive (cache coherency issues) | no | no | no | single | Extremely high | Very low (tens to hundreds of nanoseconds; cache coherency, again) |
| NUMA architectures | no | Very stable (occasional CPU upgrade in some systems) | Somewhat trivial--support typically provided by the underlying OS | Expensive, often "no" | no | no | no | single | Very high | Low (hundreds of nanoseconds to microseconds) |
| Dedicated clusters with fixed topology, decoupled or isolated from an organization's intranet | no | Stable to very stable (such systems are not reconfigured without good reason) | Somewhat difficult--though often supported by OS or grid/clustering software | No | Yes, though uncommon and usually quickly detectable | no | no | single (administrative responsibility may be subdivided, though there is usually one person "at the top") | High | Medium-low (tens to hundreds of microseconds) |
| Intranets in general and similar private distributed systems | no | Moderately stable (often changes to topology, though most are harmless or published in advance) | Difficult | No | Yes, though somewhat uncommon and usually quickly detectable | no | no | single (administrative responsibility may be subdivided, though there is usually one person "at the top") | High | Medium (hundreds of microseconds to milliseconds or more) |
| Clusters and other loosely coupled systems over a VPN (all nodes private to an organization, but over a public network; possibly with negotiated QoS) | no | Moderately stable (often changes to topology, though most are harmless or published in advance); topology of the intermediate network is often a "don't care" and is abstracted away by the VPN | Difficult | No | Yes | Yes, though modern VPN abstractions reduce the risk via transport-layer encryption (transparent to the application) | no | multiple. Nodes are technically under a single point of control, though geographically distributed sysadmins are often better modeled as multiple domains. Network service provider(s) are additional domains, though they are often required to provide some level of QoS even if best-effort | Medium | Medium (milliseconds to tens of milliseconds) |
| Distributed systems over the open internet, all nodes private to an organization | no | Unstable | Difficult | No | Yes, with fault isolation difficult | Yes; in addition, some service providers may be hostile to some applications (e.g., BitTorrent) | no | multiple. Intermediate domains may be difficult to identify, and intermediate service providers may be indifferent to endpoints | Medium-low | High (tens to hundreds of milliseconds) |
| External services over the open Internet | no | Very unstable; client nodes come and go at will | Difficult | No | Yes, with fault isolation very difficult | Yes; in addition, some service providers may be hostile to some applications (e.g., BitTorrent) | yes, tons of them | multiple, including clients under different domains | Low, especially if the client is over dial-up or another low-bandwidth connection | High to very high (hundreds of milliseconds or more) |

open internet latency

Hm. I disagree with the statement that "External services over the open Internet:" have "low to very low" latency. My experience with ISPs in Russia directly contradicts this statement.

Whoops...

Obviously, that should say "high" latency. Brain fart.

Internal security risks

Developers consistently underestimate the risk of clients on an intranet. The simple fact of the matter is that many security breaches are "inside jobs," or at least have some help from the inside. The larger or higher-profile your organization is, the greater your risk.

True.

No code is safe when run on a compromised system; the security model in the chart above deliberately excludes the possibility of viruses/trojans which have infected a system--focusing instead on attacks that are enabled by running over public infrastructure. Perhaps a questionable assumption, but a simplifying one.

(On second thought--perhaps applications which have user interfaces ought to be considered distributed applications, and things like mice, keyboards, monitors, and other peripherals ought to be considered "nodes"? Many attacks against a system are done by compromising such peripherals--think of keystroke loggers and the like.)

Making a bigger distinction

Yes, you are right, there are lots of different kinds of loosely coupled systems. The chart gives a good overview of the different problem dimensions: partial failure (yes/no), imperfect failure detection (yes/no), bandwidth (high/low), latency (high/low), malicious entities (yes/no), etc. The dataflow techniques that work well for multi-core processors can also be used in some of these loosely coupled systems. MapReduce, for example, works well in systems with partial failures. A fortiori, these two points reinforce the conclusions of the position paper.
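
As a minimal sketch of why that works (Python, illustrative only; this is not how Hadoop or Google's MapReduce is implemented): map tasks are deterministic and side-effect free, so a task lost to a crashed worker can simply be rerun, and rerunning it cannot change the final answer.

import random
from collections import Counter

def flaky_map_task(chunk):
    # Word-count one chunk; randomly "crash" to simulate a failed worker.
    if random.random() < 0.3:
        raise RuntimeError("worker lost")
    return Counter(chunk.split())

def run_job(chunks, max_attempts=5):
    partials = []
    for chunk in chunks:
        for _ in range(max_attempts):      # a real scheduler would reassign
            try:                           # the task to another worker
                partials.append(flaky_map_task(chunk))
                break
            except RuntimeError:
                continue
    return sum(partials, Counter())        # the "reduce" step

print(run_job(["to be or not to be", "be here now"]))

Determinism is what makes the retry safe: no matter how many times a map task is re-executed, the combined result stays the same.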

Solved?

Well, the problem of multicore programming is solved as long as we take for granted that dataflow programming is superior in all ways to all mainstream programming languages.

Because, if it is not superior in all ways, then the problem of multicore programming is not really solved, just transferred.

Dataflow programming

Marijn Haverbeke says: none of them (as far as I know) is general, composable, efficient, and silver-bullety enough to solve all problems

dcsobral says: solved as long as we take for granted that dataflow programming is superior in all ways to all mainstream programming languages

Interesting reactions. Of course it's true that dataflow programming does not solve all problems and neither is it superior in all ways to mainstream programming languages. But it does give you what you need to program multi-core processors easily. Why not take a look?

Proving an Impossibility

Can you show me that there can't be a better way than dataflow programming to make multi-core easy? :)

While dataflow programming's power is obvious at a glance, there is surely a reason that I, a programming languages enthusiast, student, and inventor, have never heard of any of the dataflow languages listed by Wikipedia. I expect more popular success from a Silver Bullet for multi-core, especially these days.

Is there something I'm missing which should be telling me that dataflow programming is The Way?

Wikipedia

Wikipedia lists quite a few dataflow languages, some of which have wiki pages of their own.

Edit: Please disregard this utterly stupid comment, I've obviously misread your post.

Monotonic dataflow (deterministic concurrency)

I can't show you that there can't be a better way than dataflow to make multi-core easy. But monotonic dataflow (a.k.a. deterministic concurrency) is really very easy: it's not harder than usual functional programming. I.e., if you know how to program without explicit state, then you have all you need for programming multi-core. Chapter 4 is the place to look!

I can't blame you for not having heard of dataflow languages. It's a problem of education: most computing curricula teach almost nothing about programming languages. They teach programming as if it were just a cookbook of object-oriented techniques. It's a great tragedy.

Data flow is a natural model

Data flow is a natural model for many domains (music, hardware, signal processors), but, at least in my experience with Flapjax, application-level abstractions common to VB/RoR/Cocoa-style programs are often hard to encode (though many that are annoying in functional/imperative languages are much easier!). In particular, I've recently been struggling with structured, versioned, and (inclusive) shared data. I saw a couple of programs listed on your home page, which hopefully will give me a better intuition as to why you believe such common application tasks are solved problems. First I'll need to learn Oz/Mozart, but luckily I finally snagged a copy of CTM.

Different dataflow

Your examples (music, hardware, signal processors, Flapjax) are not what PvR calls dataflow programming. A variable in Flapjax is a stream of values (like a mutable variable). An Oz dataflow variable is only one value, like in functional programming. The difference between an Oz variable and, for example, a Scheme variable is that a Scheme variable always has a value wherever it is visible, whereas a dataflow variable can be unbound even when it is in scope.

Scheme:

(let ((a 3)) ; you have to give it a value
  ... a ...)

Oz:

local A in % you don't have to give it a value
  ...   % here A is not bound
  A = 3 % now it's bound
  ...
  A = 4 % illegal, dataflow variables have a fixed value once they're bound
end

Using an unbound dataflow variable blocks the current thread until the variable is bound by another thread.

An example: you need to do two computations in parallel and then combine their results.

local A B in
  thread 
    A = <computation 1>
  end
  thread 
    B = <computation 2>
  end

  <combine A B>
end

This program first spawns two threads, one for each computation. Then it immediately tries to combine the results. The results probably aren't available yet (A and B are not bound), so the <combine A B> step waits until A and B are bound. A better explanation is in CTM (with much more advanced examples, like storing unbound dataflow variables in data structures).
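
For readers who know futures rather than Oz, the same two-computations-then-combine pattern looks roughly like this in Python (an analogy only, and a loose one: a future must be dereferenced explicitly with result(), whereas an Oz dataflow variable blocks implicitly wherever its value is needed):

from concurrent.futures import ThreadPoolExecutor

def computation_1():
    return 3

def computation_2():
    return 4

with ThreadPoolExecutor() as pool:
    a = pool.submit(computation_1)      # like: thread A = <computation 1> end
    b = pool.submit(computation_2)      # like: thread B = <computation 2> end
    combined = a.result() + b.result()  # like <combine A B>: blocks until both are bound

print(combined)  # 7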

Kinds of dataflow

There are at least three kinds of dataflow:

  • monotonic dataflow: very similar to pure functional programming, can use lazy evaluation (in Oz: chapter 4 of CTM)
  • nonmonotonic dataflow: tokens moving along wires; similar to concurrent logic programming (in Oz: section 5.8 of CTM)
  • functional reactive programming: refined version of the previous one that does not create extra nondeterminism (in Oz: see FRP in Oz)

(To keep things clear I leave out synchronous programming for now.) All can be used to program multi-core processors, but the first is by far the easiest. This programming paradigms taxonomy shows how they relate to each other and to other paradigms.

It would be cool if....

CTM were to one day get an extra "chapter" discussing FRP. You and Prof. Haridi are probably busy, and a second edition is probably not due out for a couple more years (if not the better part of a decade), but if you're looking for additional material to cover, there's one idea.

New versions of CTM

There will be a new printing (the fourth) of CTM available in July that corrects all the errors on the errata page. I am collecting raw material (including about FRP) for a possible second edition, but I don't think it will happen for several years (11 years passed between the first and second editions of SICP!).

A variable in Flapjax is a

A variable in Flapjax is a stream of values (like a mutable variable). An Oz dataflow variable is only one value, like in functional programming.

It seems easy to derive a stream from a dataflow variable. Just resolve the variable to a (current value, closure to compute the next value) pair.

No closures

Having a closure sort of defeats the point of using dataflow variables. Instead, what you do is represent the stream as a list whose tail is a dataflow variable. Very much like a lazy stream, which in fact can be viewed as a special case.
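
Continuing the futures analogy from above (a sketch only, not how Oz does it): each cons cell's tail is a future that the producer binds later, so the consumer blocks on the tail itself instead of invoking a closure.

from concurrent.futures import Future
import threading

def produce(n, tail):
    # Bind the current tail to a cell (value, new_tail), then keep going.
    for i in range(n):
        new_tail = Future()
        tail.set_result((i, new_tail))   # "bind" the dataflow tail
        tail = new_tail
    tail.set_result(None)                # end of stream

head = Future()
threading.Thread(target=produce, args=(5, head)).start()

cell = head.result()                     # blocks until the producer binds it
while cell is not None:
    value, tail = cell
    print(value)
    cell = tail.result()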

A FRP variable seems like a

An FRP variable seems like a declarative variable sans blocking semantics. FRP values, denotationally, are something like F = T -> (H + F), where T is time (perhaps on a discrete clock) and H is the host domain - they are never undefined in the Oz sense. For example,

a = getJson('http://1');
b = getJson('http://2');
c = a + b;

Perhaps you could encode blocking:

a = getJson('http://1');
b = getJson('http://2');
c = definedP(a) && definedP(b) ? a + b : NotDefined ;

This 'dP(a) && dP(b)' check could get annoying, but is systematic, so we could imagine creating a sugar for it or building it in. This wouldn't get you the scoping ability of Oz, but that can be achieved by a simple hoisting transform on an Oz style 'declare' statement. I'm not actually convinced that breaking linear flow is worth it (at least for the trivial examples I've read so far) - that was one of the big wins people report of our system!

I'm trudging through CTM this week - such a use of dataflow seems similar to futures, forces, etc., and not what I think people envision when suggesting dataflow for multicore, but I'll keep digging for the dataflow aspect of it. Not to say these ideas are not interesting - several of my friends are revisiting cooperative scheduling work related to it.

EDIT: changed 'x' to '+' in the denotation. Such a value should be flattenable to 'T -> H', but this flattening can be annoying.

Article is somewhat misleading

I think that the text quoted below is misleading (while formally correct):

We now understand how to make an all-or-none broadcast (all receive or none receive) that works even though there might be processor failures during the algorithm. We understand how to make consensus (agreement among many parties) even though there may be communication problems or processor failures during the algorithm.

It gives the impression that the problems are solved in general, while they are known to be solvable only in some restricted forms. See Hundreds of Impossibility Results for Distributed Computing for more details.

Impossibility results in distributed computing

You're right, a basic property of distributed computing is that many things are impossible. What I intended to say (maybe not clearly enough) is that the area of distributed computing is well understood, not that there exist algorithms to solve every problem. For example, a basic result that everybody should know is the impossibility of distributed consensus with a single faulty process. Assume asynchronous reliable communications (messages arrive after any finite time in any order). Assume at most one process may crash (stop permanently) at any time and that we cannot detect this failure directly. (These assumptions are essential to the result!) Then it is impossible for all processes to arrive at consensus on a single binary value. This basic result was first proved in 1985. A consequence is that totally asynchronous systems cannot have any kind of fault tolerance. To have fault tolerance we need something more, like failure detection or timing constraints on message delays.

How the Hidden Hand Shapes the Market for Software Reliability

Have you seen this paper: How the Hidden Hand Shapes the Market for Software Reliability by Birman et al.? It's a little unfocussed, but it has an interesting digression on the overly subtle definition of 'impossibility' used by FLP.

"For example, a fault-
tolerant leader election protocol might be tricked into never
quite electing a leader by an endless succession of temporary
network disconnections that mimic crashes. As a practical
matter, these patterns of message delay are extremely
improbable – less likely, for example, than software bugs,
hardware faults, power outages, etc. Yet they are central to
establishing the impossibility of guaranteeing that consensus
will be reached in bounded time. "

The paper then goes on to talk about the ways in which real-world distributed systems have access to all sorts of powerful features denied to the models used by FLP.

Computer music and multi-core

The problem specifically for computer music is that there has been a lot of focus in multi-core computing on maximizing throughput and not as much on real-time issues. When OSes, languages, or libraries provide various facilities, they are often aimed at maximizing throughput. For audio rendering, which has a hard deadline to meet, minimizing latency is more important than maximizing throughput. Even something as simple as calloc'ing memory is a problem in computer music, when the calloc code conspires with the virtual memory manager to zero-fill a block of memory when the page is accessed instead of when it is allocated. In this case throughput is optimized for most programs, but audio programs wind up page-faulting while trying to fill a buffer in real time. Thus audio programmers wind up having to implement a lot of infrastructure themselves: lock-free data structures, real-time allocators and garbage collectors, etc.

Anthropomorphisms

There is an important difference between multi-core and the Internet. Basically, multi-core is clearly an engineering concept subject to physical laws and principles. The Internet is quite different. For whatever reason, we anthropomorphize the Internet. This means it is subject to metaphysical laws and principles. Another expression for this might be "artificial intelligence", the anthropomorphic equivalent of human intelligence.

Dataflow languages

Dataflow programming very often suits audio programming, but not in general. And even for audio programming, dataflow languages impose limitations: you have to mess with busses instead of calling functions to return sample values. Dataflow programming is, however, a little reminiscent of patching a mixer board, and that familiarity is important, although it's not an ideal advantage, since it's just familiarity and not an actual advantage.

I don't think dataflow languages are the perfect solution for audio programming, but rather just a convenient way to get high performance without having to spend a long time waiting for the computer to generate optimal code.

Multicore computing?

The claim that the technical challenges of multicore computing 'were all solved long ago' really surprised me!

While it may be true that a few aloof gurus from academia *think* that they solved them long ago, and that there is indeed a sociological problem with engineers, the challenges of heterogeneous multicore computing are still to be addressed. I'm working on complex, heterogeneous systems, and the feedback we've had from academia is below expectations: they could work with simplistic systems, but as soon as we went beyond the 'hello world' app their programming model could not deliver. They gave up and we now develop our own tools.

Details

What kind of problem were you addressing? How and why did their "programming model" fail to deliver? What was it supposed to deliver? What are you basing your tools on?

Embedded design

What kind of problem were you addressing?
That's in the embedded space. We have a complex video decoder/encoder that is made of heterogeneous hardware blocks (processors, VLIW-like processors, specialized hardware...). All the blocks are connected through shared memory and different kinds of hardware 'channels' with FIFOs. An important point is that these channels may contain both data and addresses, so some blocks are basically just acting as finite state machines rather than processors.

All these features are non-standard, but in the embedded space the usual abstractions (shared memory, transactional memory, ...) do not give suitable locality and/or are too costly in terms of hardware.

How and why did their "programming model" fail to deliver?
They came up with the usual programming model: tasks, shared memory, and some FIFOs. Without going into the details, let me say that our hardware/software partitioning (the one devised by our engineers) was not even representable with their programming model. That's why I'm saying that not all the technical challenges are resolved yet. But the discussion with these guys was very interesting and I learned a lot.

What was it supposed to deliver?
A standardized way of looking at our system - e.g. a unified programming model.

What are you basing your tools on?
Intuition.

Anyway, the outcome is that we could not abstract our system and find a unified programming model (yet). To have a system with good performance, the software guys needed to solve a lot of multicore-related issues that were specific to the algorithm and the hardware being implemented.

I would agree, though, that my comments do not apply to SMP and homogeneous hardware, where most of the communication is standardized.

Obviously, I don't know your constraints

...and cannot comment on the specifics of your architecture. But I *am* involved in the video equipment industry, and have encountered in my career numerous baroque and badly-done BMP (bizarro multiprocessing) designs, usually done by hardware guys who haven't a clue about software (or about the performance capabilities of modern CPU architectures). I also am familiar with what you can get away with using off-the-shelf x86-class microprocessors (either alone, or in tandem).

Unless you're building a reference-grade MPEG encoder, one that dances and sings and tries to squeeze every last bit out of the encoded bitstream using all the wonderful tricks in the MPEG playbook--that, or you're trying to real-time-encode a large number of streams in parallel (MPEG, H.264, or whatever else)--I have a hard time imagining what you are doing that can't be accomplished using stock PC components and stock PC architectures.

And if you ARE doing a reference-grade or highly-parallel design--you're probably building a gadget that you can sell for thousands, if not tens of thousands of dollars. So I'm also puzzled by your cost constraints (though am not too terribly surprised--I've seen projects fail around here due to management not willing to authorize sufficient manufacturing cost in order to build HW capable of meeting the requirements, thinking that engineering can square the circle with only a straightedge and compass, given enough unpaid overtime).

When I hear of the need to pass pointers between nodes in a non-SMP design, that smells bad. When I hear of designs guided by "intuition" that appear to not correspond to the relevant theory, that smells really bad.

My particular, off-the-cuff advice is: unless you are Sun Microsystems (or some other experienced builder of massively parallel supercomputers), avoid NUMA-ish designs. Either stick to homogeneous SMP (in which case the chipset and OS do the hard parts for you) or highly decoupled multiprocessing (using FIFOs only, not arbitrary shared memory). If you don't care about latency, on-board Ethernet is a great way to connect nodes in a loosely coupled multiprocessor.

Make sure your hardware/software partitioning is appropriate--you want to a) balance the workload of the computing nodes, b) as a practical matter, ensure that high-volume I/O can be done using DMA on your general-purpose processors, and c) minimize round trips between nodes. For example, using part of an FPGA as a coprocessor to do DCTs and IDCTs on behalf of the main processor is generally not a good idea. Distributed nodes should do LARGE tasks, not small ones.

And if your system is non-orthogonal in its design, the "programming model" should reflect that.

Did you program your

Did you program your hardware in C? If so, how is Cilk deficient?

Aloof gurus

a few aloof gurus from academia *think* that they solved them long ago

This is why I love LtU: it's where aloof gurus meet pedal to the metal practitioners :-).

It's too bad there is no LtU

It's too bad there is no LtU workshop attached to a conference. These meetings would be even more fun over dinner and drinks :-)

Lambda the Ultimate Conference

If you make it, people will come.

I've been thinking about

I've been thinking about doing something like this at some point.

Where would it be held?

Where would it be held? POPL? ICFP? OOPSLA/ECOOP? The programming language community is a bit fragmented and many people (myself included) don't cross conference boundaries. Although LtU is FP-oriented, a lot of OOPSLA people lurk here (see a Scala BOF at OOPSLA). BOFs at conferences are probably appropriate, although you'd be competing for time with other, more focused interest groups.

Yes

There's nothing stopping there being multiple workshops, though ideally one would like a "main" one.

Warmup

How about a quick pub night in Ehud's new home of California? I'm paying a short visit to San Francisco next week and I'd enjoy putting some faces to names. Wednesday evening perhaps?

This would be LtU Beer Night #2 after Ehud, Denis, and I inaugurated the event in Jerusalem last year. :-)