## Languages & Niches

Hi there, I've been reading LtU for a while now, but I finally took the plunge and registered. I'm mostly self-taught, and I can appreciate the quote by Dominic Fox on the Policies page. Hopefully, at worst I'm a newbie and not a wannabe. Anyway, I'm working on designing my own language; I've settled most of the design choices and even started writing parts of the compiler front-end.

Before I finally start setting things in stone, though, I wanted to step back and make sure I'm not missing anything. I think it's interesting that the past few stories have been more about the "pragmatics" (if I can sneak in a linguistics term) of languages: adoption, tool support, an active community. I found the comments on the story about R really interesting, especially starting here.

I think Dennis Ritchie's advice on not expecting to become a superstar is apt too, but I figure if I'm going to put a lot into this project, shoot for the moon, right? When it comes to adoption, though, I think Richard Gabriel's "Worse is Better" makes sense. Needs arise, or go unfilled, that existing languages don't address. It's almost evolutionary: a new niche opens up, and the first language into it, even if it's far from perfect, gets the opportunity and support to adapt.

I've read a lot of opinions that handling concurrency well is the current big hurdle to clear, but I've also read a few logical arguments that concurrency is a hyped issue. I'm far from the most experienced or smartest about this, so I was wondering if anyone, based on their own research or work, had a different opinion on what the great unfilled niche is. My own impression is that a toolchain and primitives that simplify parallelism over data, and air-tight security (like E?), are actually the stronger needs, but I could see concurrency being valuable for real-time systems.

Obviously, I'm going to make the language as complete as possible, but there's a point past which you just burn yourself out. If there are any articles or papers I should check out, I'd really appreciate it. Besides that, I hope this isn't too long or off topic, and I look forward to being part of the site.

### concurrency....

Hi Mike,

I'm like you, a newbie who has given some thought to what he wants from a language. At http://code.google.com/p/lohr/wiki/LohrTutorial I have started writing what is essentially a tutorial for my imagined language.

Anyway, I have some thoughts about concurrency...

1) I don't think that concurrency is an over-hyped niche. Concurrency will be the only way to improve execution performance for the foreseeable future.

2) Full-blown coroutines. Node.js is currently getting a lot of attention, but have you ever tried to take a piece of even somewhat complicated imperative code and rewrite it in the event-driven, asynchronous style used in node.js? It *really* hurts my brain.
But with full-blown coroutines I could write code in a straightforward procedural style -- and when a routine has to block while waiting for something, the coroutine can merely be suspended until that something is available. Thus I could have the asynchronous goodness of node.js without twisting my code into a giant hairball of callbacks.

3) Automatic parallelization. I can imagine a language where every slot that holds data implements a 'future'. When a piece of code accesses a slot, it can find out whether the data is available; if not, the associated coroutine is suspended until the data arrives. I suspect that such a language can be automatically parallelized. Automatic parallelization is a big deal: http://www.paulgraham.com/ambitious.html
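These two ideas can be sketched together. A minimal sketch in Python's asyncio (standing in for the imagined language; the `slot`/`producer`/`consumer` names are mine): every data slot is a future, and a coroutine that reads an unfilled slot is suspended until a producer fills it, so the code reads procedurally with no callback nesting.

```python
import asyncio

async def producer(slot, value):
    await asyncio.sleep(0)      # stands in for real I/O or computation
    slot.set_result(value)      # filling the slot wakes any waiting reader

async def consumer(slot_a, slot_b):
    a = await slot_a            # suspends here until slot_a is filled
    b = await slot_b            # straight-line code, no callbacks
    return a + b

async def main():
    loop = asyncio.get_running_loop()
    slot_a, slot_b = loop.create_future(), loop.create_future()
    result, _, _ = await asyncio.gather(
        consumer(slot_a, slot_b),   # the reader starts before the data exists
        producer(slot_a, 1),
        producer(slot_b, 2),
    )
    return result
```

Running `asyncio.run(main())` returns 3 regardless of which producer runs first; that scheduling freedom is exactly what an automatic parallelizer could exploit.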

### For the foreseeable future

For the foreseeable future you'll be able to get far larger gains out of compiler optimizations that eliminate the cost of high-level languages than out of parallelization. For example, look at the improvements that JavaScript has seen. If your language is saner from a compiler's perspective, so that you can reasonably do ahead-of-time compilation, then you can get even closer to C.

Furthermore, many and perhaps almost all applications where performance matters are fairly easy to parallelize (graphics, scientific computing, most machine learning, web sites, etc). And applications where performance is really critical are going to be written in C/C++ anyway.

So I do think parallelization is over-hyped, even in the cases where good performance is desirable (in many cases it does not even matter, for example many applications are fine with taking a 100x performance hit by using Ruby).

### Agreeing with most of what

Agreeing with most of what you said, but taking an "easy" to parallelize problem and implementing it in CUDA is actually very hard. Additionally, MapReduce + Java is fairly scalable, and the overhead of Java over C/C++ is not a big deal (i.e., it's in the noise for big-data parallelization problems).

### Right, in general getting an

Right, in general getting an actual speedup is *hard*. What I meant is that formulating the data flow in such a way that it could in theory be executed in parallel is easy, and that is what a lot of the hype about functional programming is about: "you can evaluate any two expressions in parallel!". The hard part is making the overhead of parallelizing small enough that it is a net win; in particular, getting the data layout and access patterns right is hard, and functional programming alone doesn't help with this. I'm pretty sure that if you talk to, e.g., a numerical linear algebra/FFT expert, he will tell you that multi-core parallelization is one of the least important steps of optimization, compared to data locality, SIMD, etc. Though perhaps it will become more important as the number of cores increases; but because of that, data locality will also become even more important.
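The locality point is easy to illustrate even without SIMD. A toy sketch (Python, names mine; Python lists only approximate a flat row-major array): both loops compute the same sum, but the row-major walk touches memory sequentially while the column-major walk strides across rows, and on real hardware with a row-major layout that difference often dwarfs what an extra core buys you.

```python
def sum_row_major(matrix):
    # Sequential access: consecutive iterations touch adjacent elements.
    total = 0
    for row in matrix:
        for x in row:
            total += x
    return total

def sum_col_major(matrix):
    # Strided access: consecutive iterations jump between rows.
    total = 0
    for j in range(len(matrix[0])):
        for row in matrix:
            total += row[j]
    return total

m = [[1, 2, 3], [4, 5, 6]]
assert sum_row_major(m) == sum_col_major(m) == 21
```

Same result, different access pattern: the optimization is invisible in the code's semantics, which is why purity alone doesn't deliver it.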

### concurrent vs. parallel

> 1) I don't think that concurrency is an over-hyped niche. Concurrency will be the only way to improve execution performance for the foreseeable future.

Concurrency is the ability to start two tasks and have their executions overlap. I think you mean parallel execution, which is basically doing two things at the same time. Concurrency doesn't imply parallelization (you can have concurrency on a single core), nor does parallelization imply concurrency (you can automatically parallelize a single thread of execution).

The distinction is very important because each term serves a different master, and languages optimized for one usually aren't optimized for the other. Edit: Erlang is concurrent, not parallel; CUDA is parallel, but not really that concurrent until recently (parallel execution on the GPU does not depend on concurrency -- actually, concurrency makes everything a pain in the arse).
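A toy illustration of concurrency without parallelism (Python generators, all names mine): two tasks make overlapping progress on a single thread under a round-robin scheduler, with no parallel execution anywhere.

```python
def task(name, steps):
    # Each yield is a point where the scheduler may switch tasks.
    for i in range(steps):
        yield f"{name}:{i}"

def run_interleaved(*tasks):
    # Round-robin scheduler: one thread, overlapping task lifetimes.
    trace, queue = [], list(tasks)
    while queue:
        current = queue.pop(0)
        try:
            trace.append(next(current))
            queue.append(current)       # not finished: back of the queue
        except StopIteration:
            pass                        # finished: drop it
    return trace

assert run_interleaved(task("A", 2), task("B", 2)) == ["A:0", "B:0", "A:1", "B:1"]
```

The executions overlap in time (concurrency), yet nothing ever runs simultaneously, which is exactly why the two terms shouldn't be conflated.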

### both

Actually, I was thinking of both. I know that mentioning node.js and parallelization confused the topic, but the asynchronous node.js style is useful for implementing both parallelization and concurrency. Node.js uses the asynchronous style for parallelization, but the asynchronous idiom would also be useful for implementing concurrency, where the tasks are distributed across a small fixed set of threads. Of course, that will introduce the need for some sort of concurrency control. I would love to try a language that had coroutines, automatic parallelization/concurrency, and software transactional memory.

BTW, it seems to me that the word 'parallelization' is often used to mean concurrency. I think this is because the word 'concurrentization' doesn't exactly sound like a real word :-).
See, for instance, Speculative Parallelization Using Software Multi-threaded Transactions, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.154.9206

### Concurrency is the ability

> Concurrency is the ability to start two tasks and have their executions overlap. I think you mean parallel execution, which is basically doing two things at the same time. Concurrency doesn't imply parallelization (you can have concurrency on a single core), nor does parallelization imply concurrency (you can automatically parallelize a single thread of execution).

But concurrency without data races makes parallelism easy, right?

And the other way around: if you can do parallelism, making the parallel threads explicit gives you a concurrent point of view on the program (which you can then sequentialise).

Erlang is an example of concurrent programming which parallelises well.

### My experiences with

My experiences with parallelizing programs seem to be very different from yours. I disagree with most of these statements.

### My experiences with

> My experiences with parallelizing programs seem to be very different from yours. I disagree with most of these statements.

I know about the theory, not about the practical experience. I suspect we are not exactly on the same page. Do you mind giving more details?

### But concurrency without data

> But concurrency without data races makes parallelism easy

Parallelism is often about carefully moving memory. Even when it isn't, things like SIMD/GPUs are subject to instruction divergence -- even across seemingly independent threads. Sometimes even finding the parallelism is hard. Speculation is hard, both in finding it and in exploiting it (and for the latter, FP/purity as currently understood can easily be more wrong than right).

Such issues might sound like HPC/matrix-multiply concerns, but I've repeatedly hit them in parallelizing browser algorithms (after you do the simple things).

> And the other way around: if you can do parallelism, making the parallel threads explicit gives you a concurrent point of view on the program (which you can then sequentialise).

I might have misread your statement here: I can imagine exposing a thread id (which is an easy if ugly way to guide memory) and therefore thread, but use regions/types etc. to disallow racey concurrent access across threads.

> Erlang is an example of concurrent programming which parallelises well.

It distributes well across nodes, but has terrible per-node performance. Parallelism is gauged by scale AND performance-per-watt.

I see a lot of PL people (e.g., PLDI/POPL crowd) thinking along your lines, so it's not that crazy. Parallelism is no longer theoretical outside of the HPC scene, so we're only now seeing how it turns out in the mainstream.

### Parallelism is often about carefully moving memory

> Parallelism is often about carefully moving memory. Even when it isn't, things like SIMD/GPUs are subject to instruction divergence -- even across seemingly independent threads. Sometimes even finding the parallelism is hard.

You are talking about a particular brand of parallelism here, SIMD. My thoughts were more related to multicore CPUs, which are not the big story right now but are still worthy of interest, I think.

> Speculation is hard, both in finding it and exploiting it (and for the latter, FP/purity as currently understood can easily be more wrong than right).

FP/purity is a property which can be exploited for parallelization. It does not tell you whether that is the right thing to do. Am I right?

> Such issues might sound like HPC/matrix multiply concerns, but I've repeatedly hit them in parallelizing browser algorithms (after you do the simple things).

> And the other way around: if you can do parallelism, making the parallel threads explicit gives you a concurrent point of view on the program (which you can then sequentialise).

> I might have misread your statement here: I can imagine exposing a thread id (which is an easy if ugly way to guide memory) and therefore thread, but use regions/types etc. to disallow racey concurrent access across threads.

If you are starting from a parallel algorithm, is there a chance of racey concurrent access at all?

> Erlang is an example of concurrent programming which parallelises well.

> It distributes well across nodes, but has terrible per-node performance. Parallelism is gauged by scale AND performance-per-watt.

Is the terrible performance per watt because it does not exploit specialized GPU hardware? What makes it bad?

> I see a lot of PL people (e.g., PLDI/POPL crowd) thinking along your lines, so it's not that crazy. Parallelism is no longer theoretical outside of the HPC scene, so we're only now seeing how it turns out in the mainstream.

I guess they did not get their beliefs challenged.

### Parallelism has mostly

Parallelism has mostly turned into a discussion about SIMD. Concurrency isn't about making the program run faster; it's more about not making it run slower as it contends to safely access/update concurrently shared data. They are two very different problems, which is why we have different names for them!

I believe Erlang's goal was never really high performance; instead, it was designed for high availability and concurrency. It will be interesting when languages are judged on their perf/watt capabilities vs. how many cycles they can burn through.

Sometimes programming languages are promoted in ways that are not merited or useful; e.g., learn FP now because of multi-core... huh? If you make this claim, show me the money now! I mean, you can't really claim to support parallel programming if you are slower than the non-parallel alternatives; HPC is about real performance now vs. perceived potential future performance.

### You are talking about a

> You are talking about a particular brand of parallelism here, SIMD.
> My thoughts were more related to multicore CPUs.

For speculation and memory, I actually had concrete multicore examples in mind, but they apply to most parallel algorithms and architectures. For example, communication avoiding algorithms are shaking up the multicore and distributed world, and these are rehashes of old concerns about memory optimizations.

I agree with Sean's post on this thread, though with reservations on SIMD. He's right in that most important parallelism seems to be data parallelism -- multiple users on a server, a document tree in a webpage, a blocked matrix. A lot of the killer apps, and especially what we can implement, are polarized to be SIMD/SIMT or MIMD, and I think he's observing that (in the algorithms/kernels world) there is more SIMD. The server/SaaS world is the data-parallel MIMD world of multiple users, and the cloud-compute world (e.g., a pagerank job) is a mix of both.

I'm less sure about this SIMD/MIMD divide going forward: my own experiences with speculation and hot paths suggest that programs we believe are sequential can be run MIMD, and those that are MIMD can often have SIMD components. Enabling economically relevant development of them is an open problem that a lot of people are looking at and making progress on.

> FP/purity is a property which can be exploited for parallelization. It does not tell if it is the right thing to do. Am I right?

Yep! My contention is the degree of "can be exploited," both today and in the foreseeable future. FP, as we know it, has been oversold. (It's a hard problem; most parallel languages, in theory and practice, don't really do that much for you.)

> Is the terrible performance per watt because it does not exploit specialized GPU hardware? What makes it bad?

The base language implementation is poor for things it wasn't designed to do. It's designed for massively concurrent and reliable packet switching etc., not computing on the individual packets. Once compute is involved, exploiting finer-grained parallelism and more efficient language implementations (even sequential ones) will get you more compute/$. After sunk costs, compute/$ is basically perf/Watt.

Let's say you don't care about compute. Your computation may be bandwidth/memory-bound. Erlang might even be worse: you just exploded your footprint and introduced huge delays between steps of the computation!

> If you are starting from a parallel algorithm, is there a chance of racey concurrent access at all?

I'm not sure what you mean. An algorithm where everything is independent? What about, for example, computations that read from a shared data structure but don't write to it? We still have to be careful about access to that data. A GC'd language might try to mark the nominally read-only data, and the caches (or cluster nodes) need to concurrently get duplicates somehow: these are the low-level concerns that programmers (or language/framework designers) have to deal with today.

> I guess they did not get their beliefs challenged.

Some :) The program this year is heavy on parallelism, and while some of it seems bizarre/naive, there are many solid sounding pieces that I'm looking forward to. Everyone has to learn :)

I've mainly thought about it as coming down to whether you have shared data or not. From a computation standpoint, they look alike because they're both about doing several things at once, logically if not in reality (time-slices). From the data/state view, though, they're almost polar opposites: concurrency only becomes an issue when your threads depend on common, mutable data or state, and parallelism only really happens when you're sure they don't.

I think it was Linus Torvalds (who's notoriously opinionated, so I add a grain of salt) who argued that concurrent programming doesn't have many performance benefits. It's something you do when you absolutely have to for shared data (an online bank account with multiple logins is the example I've always heard), but you get the same performance from clusters and multiple cores, without the headaches, by just running separate, unthreaded processes.

I'm not really a systems programmer, though, so I could be missing something. Apparently this topic comes up a lot.

### Concurrent programming

Concurrent programming usually doesn't have the specific performance benefits associated with parallel programming, since the styles of programming are fairly different (and orthogonal). Parallel programming is often (but not always) about executing the same instructions across different data (sharded or tiled); concurrent programming is about executing different instructions around the same data, and making sure those instructions are safe with respect to how they access and update the data. Conflating the two terms sounds harmless but often leads to conflated observations (this code is concurrent, so it must execute very fast!).

Concurrency without shared data is simply made parallel, but it's not a very interesting case.
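The two shapes can be put side by side in a sketch (Python's standard threading tools, names mine): the parallel shape maps one function over disjoint data, while the concurrent shape has tasks contending for one shared balance, where the lock exists for safety, not speed.

```python
from concurrent.futures import ThreadPoolExecutor
import threading

def parallel_square(xs):
    # Parallel shape: same instructions over disjoint pieces of data.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda x: x * x, xs))

def concurrent_deposits(amounts):
    # Concurrent shape: different tasks, one shared mutable balance.
    balance, lock = [0], threading.Lock()
    def deposit(amount):
        with lock:                      # correctness, not speed
            balance[0] += amount
    with ThreadPoolExecutor() as pool:
        list(pool.map(deposit, amounts))
    return balance[0]

assert parallel_square([1, 2, 3]) == [1, 4, 9]
assert concurrent_deposits([5] * 20) == 100
```

In the first function each element is independent, so more workers can only help; in the second the lock serializes the interesting part, so more workers mostly add contention.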

### Concurrency without shared

> Concurrency without shared data is simply made parallel, but it's not a very interesting case.

Not true at all. Consider the Brock-Ackerman anomaly: two processes are not actually parallel because, although they are modularly independent, they observe each other's outputs via inputs, leading to ambiguous and non-deterministic computational results. Concurrency entails the potential for feedback. Parallelism is a different property; I wouldn't call it an orthogonal property, either.

For the Brock-Ackerman anomaly, even if each independent process is "operating in parallel", e.g. continuously crunching on streams of data, that still says nothing about the fact that when you compose the two together you get interesting feedback behavior. There is no meaningful parallelism property for the whole program, then, unless you carefully refine the semantics of the program, e.g. to Kahn process networks, to disallow the anomaly.

### Concurrency implies control and information

Concurrency is not "whether you have shared data or not".

Consider two controllers, each trying to target an object with a missile. One goes first and misses by some amount. The other controller observes the miss and makes a (noisy) estimate of how far off it was.

There is no requirement that they share the same data. Hmm... why? Because they are each using sensors to create data, rather than sharing it. This means the information is shared implicitly and imperfectly. The concurrency condition is that the second controller should only fire its missile after the first controller goes, to optimize its chances of striking the target.

It's not about shared data. It could be shared communication, shared goals, etc.

Here is another example. A bunch of robo-copters are flying in the air, and there is massive signal interference jamming all radio communications, so they literally cannot send messages to each other to cooperate. So how do they cooperate? Implicitly!

For example, let's take a team game like basketball. How does everyone on the Celtics know where to go on the hardwood floor? Why don't some players just run up in the stands and get a hot dog? Why does the tall guy stand under the hoop? Is all this happening concurrently? Yes! Are they all talking to each other? Not explicitly. One player can see where players are on the floor, and those players can in turn see that one player.

Back to robo-copters. If robo-copters have sensors for "seeing" their environment, then they can surely "see" each other and notice they are all pointing to the same object. Now let's say they are trying to swarm something, how do they all optimize their positions without talking?

### I think we agree...

> Concurrency is not "whether you have shared data or not".

I really like your examples, but I think we might be talking past each other. I mentioned common state later in my comment, but looking back at it, people probably interpreted "shared data" as a statement about implementation (semaphores, global variables, etc.).

My thoughts tend to get ahead of themselves, and I think my reasoning was that for a routine to work correctly, its logic needs to somehow take into account the state of its problem domain (which may include the real world). If there are concurrent processes involved, then that state will be de facto shared. Where I'm probably getting ahead of myself is in immediately picturing that "something" being modeled as data by the routine or the programmer (but it's not always modeled, which is where negative side effects come from, right? ;-)

> I'm like you, a newbie that has given some thought to what he wants from a language.

I've made notes of all the websites you mentioned, and I'll definitely check out your thoughts.

> Node.js is currently getting a lot of attention but have you ever tried to take a piece of even somewhat complicated imperative code and rewrite it in the event-driven, asynchronous style used in node.js? It *really* hurts my brain.

I've never tried node.js, but it looks interesting from the website. You mentioned event-driven, and I was wondering if you've ever looked at Tcl. I doubt it's as well supported and optimized as JavaScript, or has libraries as custom-tailored for networking as node.js, but it might give you some ideas, and it does have coroutines. Google's Go is a newer one that has easy threads as a main selling point, but I don't know how good it would be for asynchronous programming.

> For the foreseeable future you'll be able to get far larger gains out of compiler optimizations that eliminate the cost of high-level languages than out of parallelization.

I'm actually glad to hear that because as far as performance ideas go, that's been my main approach so far.

> Agreeing with most of what you said, but taking an "easy" to parallelize problem and implementing it in CUDA is actually very hard.

I've never messed with CUDA or OpenCL, but I could see how concurrency would throw a wrench in things. What exactly is it about CUDA that makes programs hard to implement in it? Is it too low-level, or something else?

I appreciate all the comments; they're very helpful. On the concurrency issue, I've been picturing an approach a little like Clojure's. Anyway, I wanted to check that there isn't some obscure programming problem that will explode in the next few years, leaving my language to have missed the boat for lack of some extra work.

### another vote for concurrency support

I think the niche I describe below is narrow, so perhaps not what you want. You use the term "adoption" a couple of times, leading me to suppose you focus in part on volume of demand. (Thus a narrow app context would not apply much, since demand is low.)

> Mike Abolazemi: I was wondering if anyone, based on their own research or work, had a different opinion on what the great, unfilled niche is.

If you wanted to quantify that, you might express an unfilled niche as pain P times the number of developers D feeling that pain. You might be thinking of a big D but a smaller P, while I'm thinking of a small developer count D but a big pain P. (That's my warning proviso.)

I've recently been thinking about the diagram below to explain context for a painful situation in embedded concurrent systems programming in C. The figure illustrates a situation where it would be useful to rewrite C to C, which is something I mentioned recently. (And it's still at the top of my todo list.)

This is basically a vote in favor of better concurrency support. If you make a new language, I'd find it useful if it had green threads running in fixed memory, so an engine using it could be embedded in an unforgiving execution environment. Further, I want to know what happened, so your tool-chain should have logging, tracing, and stats support. If it helps, think of my focus as evidence-based programming: it's not true unless there's evidence, no matter how good the story sounds.

What I need is many thousands of concurrent actors, imposing few constraints on the host env, and leaving behind a conclusive trail of evidence for analysis, profiling, and debugging. (And heck, you get extra points if you can browse that evidence from elsewhere, presumably using a web browser.)

Below you can see a bunch of layers I named alphabetically. Each layer corresponds to something in my current production environment, which I'm not going to describe. But similar sorts of layers would apply to any realistic production env, so this is pretty generic. It's pretty hard to describe the goals of a C-to-C rewriter to get stackless, CPS-transformed green threads without explaining use cases in terms of a diagram like this.

I do layers Foxtrot to Juliet inclusive, completely, from design to end debugging, in a soup-to-nuts implementation with docs, etc. But I'm not actually using green threads, though I now sorely wish I were.

• Alpha: hardware env (host box, maybe virtualized)
• Bravo: software env (operating system, like Linux)
• Charlie: system env (app platform, shadows Bravo api as needed)
• Delta: at least one process instance of Charlie per processor
• Echo: plugin instance conforming to Delta's abstract module api
• Foxtrot: embedded poll-able async engine with queue-in/queue-out api
• Golf: tasks (connections) [lightweight processes (aka "lanes" or "flows")]
• Hotel: actors [lightweight green threads (aka "jobs")]
• India: utils [maps, scheduler, finite state machines, etc]
• Juliet: client api to async engines in other OS processes [cache/storage]
• Kilo: daemons and any client-visible feedback channels in shared memory
```
+--Alpha----------------------------------------------------------------------+
|+--Bravo--------------------------------------------------------------------+|
||+--Charlie----------------------------------------------------------------+||
||| +--Delta--------------------------+ +--Delta--------------------------+ |||
||| | +--Echo-------+ +--Echo-------+ | | +--Echo-------+ +--Echo-------+ | |||
||| | | +-F-+ +-F-+ | | +-F-+ +-F-+ | | | | +-F-+ +-F-+ | | +-F-+ +-F-+ | | |||
||| | | |G G| |G G| | | |G G| |G G| | | | | |G G| |G G| | | |G G| |G G| | | |||
||| | | +---+ +---+ | | +---+ +---+ | | | | +---+ +---+ | | +---+ +---+ | | |||
||| | +-------------+ +-------------+ | | +-------------+ +-------------+ | |||
||| +---------------------------------+ +---------------------------------+ |||
||+-------------------------------------------------------------------------+||
|+---------------------------------------------------------------------------+|
+-----------------------------------------------------------------------------+
```

In practice, layers Echo and Foxtrot are one-to-one, so showing multiple Foxtrot instances inside Echo makes no sense -- unless you're debugging and each Foxtrot is actually a different system, where a test F2 is used to drive the production F1.

Okay, now I can explain what the C-to-C rewriter is for. The figure above shows something that is very, very hard to test: an 11 on a scale from 1 to 10, which means it's not tested as well as I'd like. The goal of the rewriter is to make complex things uniformly complex, at about 8 on a hypothetical complexity scale (so it wins if the problem is more complex and loses if the problem is less complex).

Say current production layers are A1, B1, C1, D1, E1, F1, G1, H1, I1, J1, K1, etc. I want to write a simulation of top layers with the C-to-C rewriter, so I can use A2, B2, C2, D2, and E2 to drive all the version 1 layers in integration and unit tests.

It's very hard to get the actual production system to go through the full range of phase space permitted by system description and api contracts. So a simulation would be ideal. But it requires a non-deterministic, concurrent, async engine of the sort described by layers Foxtrot and lower. Any such async Foxtrot api can be driven by any other, because this sort of thing composes fairly well, since the boundary between inside and outside is async, and can be folded into any topology without causing strain.

So here's my suggestion: make it possible to write at least two versions of Foxtrot layers in your language, where one can drive the other in simulation, while providing all the tools to gather and browse evidence needed to vet production systems.

### Unclear on the engine, but I've got metadata

Woah, that's a lot of layers; I salute you, good Sir. I think problems that cause a few programmers big headaches are just as legitimate. My thought was that at some point between conception and end-user distribution, there has to be a lot of demand to really drive a community, which leads to comprehensive libraries, docs, tools, etc. That could take the form of lots of end users (or really big ones, like companies or governments) wanting something from a small group of overworked programmers.

I may be getting confused, but by "async engine," I'm guessing you mean the actual schedulers for all of these jobs are in the Foxtrot layer? It sounds like you have to use the earlier layers as they're given to you, and I'm not sure about the performance, but could the engine be moved into the Delta layer? Couldn't it provide asynchronicity to all the following layers, with Foxtrot being more of a narrow interface that gathers data and requests for Echo? Then different Foxtrots could still drive each other by routing commands through Delta (which, again, may be too much overhead).

Your comments on evidence-based programming, though, I understand completely, and I'm glad to say I've been thinking about that too. My approach is based on the idea that builds and runtimes generate a lot of metadata. The trick is to have a quick, automatic way to sift out the good stuff and present it not just to the developer, but to other tools and even the program itself (if you're into reflection). I also think the metadata angle makes a good foundation for a lot of the ideas about developer experience that dmbarbour mentions below.

### can't be too clear

I'm obligated to be unclear, to the extent I can't disclose how systems function where I work. (And I avoid nuance about what's good and bad in my opinion.) But lots of layers is just par for the course, and doesn't reveal info. I recently formed a plan to start writing docs with a "What if?" focus, in imitation of science fiction tradition, where we consider implications of givens in systems design. For example, here we might ask, "What if we have many layers?"

Seeking high demand for a healthy community sounds like a good plan. Lower demand but with a strong focus might also serve, but that's a conjecture since we see it less.

Every layer with async behavior has access to a scheduler, or rolls its own. (If you call a function, and it replies later whenever it wishes, and not necessarily when the function returns, that's an async model. As a special case, a reply can occur before the function returns -- sometimes requiring protection against accidental re-entrancy.) If an API is abstract and async, it can use its own scheduler, or else one you provide, but it doesn't have to tell you which one.
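The "reply before the function returns" case is subtle enough that a tiny sketch may help. This is a minimal, hypothetical async-call shape in Python (all names invented): the callee answers synchronously when it can, and through a scheduler when it can't, and the synchronous path is exactly where accidental re-entrancy bites.

```python
class Scheduler:
    """A trivial run-to-completion scheduler standing in for a layer's own."""
    def __init__(self):
        self.queue = []

    def post(self, fn):
        self.queue.append(fn)

    def run(self):
        while self.queue:
            self.queue.pop(0)()

def fetch_async(key, reply, scheduler, cache={"a": 1}):
    """Reply immediately if the answer is cached, else reply via the scheduler."""
    if key in cache:
        reply(cache[key])   # synchronous reply: callback runs before we return!
    else:
        scheduler.post(lambda: reply(None))   # asynchronous reply, later

results = []
sched = Scheduler()
fetch_async("a", results.append, sched)   # replies re-entrantly, before returning
fetch_async("b", results.append, sched)   # replies later, from the scheduler
sched.run()
print(results)  # [1, None]
```

A caller that assumes the callback can only fire after `fetch_async` returns is exactly the code that breaks in the re-entrant case.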

In this case, several layers have their own schedulers, and the foxtrot layer has its own. (Each async API is like a scheduler-to-scheduler boundary.) Moving layers and schedulers around can be done in some cases. A layer can exist so an engineer can adapt one layer to another while adding value in the adapter. For a global scheduler to be easily used by all layers, it would need to be unusually clean, re-usable, and sharp-edge-free, and this reads as over-engineering to many folks. But it's still a good idea. The foxtrot layer makes its scheduler available to the next layer underneath, and it does get used there, but foxtrot doesn't formally know that. Meanwhile, delta and echo don't want to see another scheduler, nor provide a re-usable one.

On the topic of gathering metadata, I've been planning a scheme based on the idea, "What if the expensive monitoring layer for analysis, profiling, etc., is a proxy in between one layer and another, so it can be inserted dynamically when debugging?" You could put it in another process and expose a browse-able HTTP view, without unduly interfering with the process under study (beyond slowing things down with latency).
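As a rough illustration of that proxy idea (all names here are hypothetical stand-ins for the real layers): wrap one layer's interface so every call is recorded with its timing, without the layer underneath knowing. The recorded log is what you would then expose, say over an HTTP view, from another process.

```python
import time

class MonitoringProxy:
    """Sits between two layers, forwarding calls and logging each one."""
    def __init__(self, target):
        self._target = target
        self.log = []            # entries of (method name, args, elapsed seconds)

    def __getattr__(self, name):
        method = getattr(self._target, name)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = method(*args, **kwargs)
            self.log.append((name, args, time.perf_counter() - start))
            return result
        return wrapper

class EchoLayer:
    """Stand-in for one of the real layers under study."""
    def handle(self, request):
        return request.upper()

layer = MonitoringProxy(EchoLayer())   # inserted dynamically when debugging
print(layer.handle("ping"))            # behavior is unchanged: PING
print(len(layer.log))                  # but every call was recorded: 1
```

Because the proxy satisfies the same interface, the layer above can't tell whether it's talking to the real layer or the monitored one -- which is what makes dynamic insertion possible.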

### Live Programming, Interactive Programming, Gestures and Touch

Concurrency is important, but how much can you contribute there? It took me years of study just to get to the point where I understand much of what has already been done and the various pitfalls. Just pick a proven concurrency paradigm for now - I'd suggest shared-nothing threads similar to the E language or the new JavaScript Q library.

A valuable area right now - one that only takes some vision and creativity - would be to target touch, voice, and gesture-based programming. Sean McDirmid has initiated interesting discussions [1][2] regarding this space and its connection to PL design. And I recently blogged about that relationship as well.

Live programming - editing in a live buffer - is also a worthy area to pursue. Live programming offers far richer and more efficient feedback than a REPL ever will. See Inventing on Principle if you haven't already.

And interactive programming makes this more of a dialog between user and machine. I.e., perhaps the machine can fill gaps in a specification, work around minor errors in a live system, or highlight ambiguities and potential problems.

All three areas go together well, along with Reactive programming (continuous signals instead of one-time events) and possibly some visual programming.

With the new technologies coming out this year or next, I've been thinking a lot about how programming might be achieved with Google glasses and a LEAP or a glove. This is more than a niche; this is a frontier.

### Concurrency is a fact of life, usability is the future?

I know I'm a novice when it comes to a lot of these concepts, especially from a theoretical, semantics perspective. I mainly became interested in E for the security ideas, but I haven't looked into it yet and I may like their approach. The one thing I do want (and I know this is a big functional/imperative debate) is some mutable data, which is why I'm still leaning towards something like Clojure. I've decided not to allow direct, destructive reassignment (because it wrecks the logical meaning of equality), but I was planning on a few cases of mutable data, changeable in certain situations.
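To sketch the design I have in mind (the `Cell` name and `swap` API are hypothetical, loosely modeled on Clojure's atoms): ordinary data stays immutable so equality keeps its logical meaning, while the few sanctioned cases of mutable state live in explicit cells, changed only through an update function.

```python
class Cell:
    """The one sanctioned place for mutation: updated only via a function."""
    def __init__(self, value):
        self._value = value

    def deref(self):
        return self._value

    def swap(self, fn, *args):
        self._value = fn(self._value, *args)
        return self._value

# Ordinary data stays immutable (tuples here), so two equal values are
# interchangeable and equality keeps its meaning...
point = (1, 2)

# ...while state that genuinely must change lives in an explicit Cell.
counter = Cell(0)
counter.swap(lambda v, n: v + n, 5)
print(counter.deref())  # 5
```

The point of the split is that a reader can assume any plain value is stable, and mutation is visible wherever a `Cell` appears in a signature.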

I'm not sure how much a language itself can be optimized for new interaction methods. My gut feeling is that the burden ultimately falls more on the development and build tools than on syntax or semantics. But one of the design points I like to keep in mind is that the toolchain is just as important as the language. I think there will always be some use for canonical plaintext, if for no other reason than sometimes the futuristic, Minority Report gloves break, but I've tried to keep my language as clear and simple to map to as possible. Reactive programming is really cool, and as more of a cognitive notion, I do have plans for trying to make it central to objects in my language.

As I commented above though, I do think each phase in the build process just tossing out all of its metadata is a huge gap. If you could distill it well, not only could you do better profiling and debugging, but I think it would help a lot with live/interactive/visual/tactile programming. One of the big jumps to visual/tactile programming is replacing linear steps through code with viewing and editing things in 2 or 3 dimensions. I could be wrong, but I would think it helpful if the machine didn't have to step through computations each time you refresh or change something.

And for the live/interactive programming, I haven't seen the Bret Victor presentation yet, but he's an interesting guy (I'm eventually going to blog a partial agreement with his ideas about math), and I've had his presentation bookmarked for a while. I would think build metadata would help a lot more here too, especially if you aren't using a dynamic, interpreted language. Is it even possible to do live programming with something like C or Java, at least if you use a JIT compiler?

I've been really glad to hear all of your opinions though. I feel sometimes like I'm in over my head, but I'm actually very encouraged now. The ideas I've had aren't irrelevant or unworkable, and hopefully, I'll be releasing an alpha version of a good language in the near future.

### Language design and

Language design and programming model have a deep impact on what tools are feasible, what a development environment can elegantly provide (without impedance mismatch), and ultimately impact programmer and user experience.

Live C and Java can be done, especially with a little programmer discipline. But there would be much awkwardness, e.g. with respect to malloc'd structures and new objects - if you change the code so the object would have been created differently, or never created, how do you explain this inconsistency between the object and your code in a safe way? No amount of build and metadata would make that less awkward. Hence the need for programmer discipline.

For open systems, the programming model is more important. Features such as undo, prediction, reactivity, and consistency need a lot of cross-cutting cooperation. And a programming model captures those concerns. A programming model can be expressed in a software architecture, not just a language. But it helps if the language is already close to the model - otherwise you'll spend a lot of time adapting libraries to fit the architecture.

For most people, parallelism comes in the form of multitasking. That is, we want our computers to do five hundred different things at once, but it's fine with us if they do it by running five hundred different programs at once. The idea that you can claim more of a computer's power for a single program by engaging more of its processing units for that program is true, but only really relevant if those other processing units would otherwise be idle (generally true on users' desktops; generally not true on server machines that have thousands of sessions or database requests or whatever to take care of anyway). This kind of multitasking is MIMD parallelism. To a first approximation, MIMD is the only kind of parallelism supported by today's general-purpose machines and languages.

But there are a lot of tasks (graphics processing and audio codecs being the most familiar and widespread) that aren't MIMD. These are mainly pipeline-oriented and/or SIMD. In a pipelined task you've got many processing units doing one thing over and over, each running a simple "program" that fits in one cache line for that processing unit, and handing the data to the next processing unit which does the next "step" of the process using a slightly different one-cache-line "program". So different processors are doing different (simple) things and handing data from one to the other. In an SIMD task you've got many processors all doing the same thing at the same time - but each has its own bit of memory and each is working on different data.
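To make the SIMD shape concrete, here's a toy sketch: one "instruction stream" (the body of `axpy`) applied uniformly across many lanes of data. A Python list comprehension only models the programming shape, not the performance; real SIMD hardware runs the lanes in lock-step.

```python
def axpy(a, xs, ys):
    """y := a*x + y on every lane -- the same operation on different data."""
    return [a * x + y for x, y in zip(xs, ys)]

# Four "lanes"; SIMD hardware would execute all of them on the same
# instruction at the same time, each with its own data.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [10.0, 20.0, 30.0, 40.0]
print(axpy(2.0, xs, ys))  # [12.0, 24.0, 36.0, 48.0]
```

Notice there is no per-lane control flow: that uniformity is what lets the lanes share one instruction bus.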

These kinds of tasks are generally handled by specialized hardware: A GPU or DSP.

The crushing advantage of non-MIMD parallelism, and the reason it scales so well to thousands rather than merely a dozen or so units, is that both cases eliminate or drastically reduce the memory-bandwidth requirement for instructions and cache coherence. A pipelined program is distributed in tiny chunks to all processing units and then runs, on potentially massive data, without the need for bandwidth devoted to fetching more instructions. An SIMD task has a single shared instruction bus that all the processing units see, and they all run the same instruction at the same time. You can have as many of them as you need, and they don't contend for instruction bandwidth at all; they are like rowers all listening to the same drummer.

Furthermore, your cache coherence requirements mainly amount to making sure that each unit can hand off its data to at most one or two "neighbor" units - and in specialized hardware there are usually dedicated channels for that.

Imagine a thousand processing units with no cache coherence contention and no instruction pipeline stalls. And that is a pretty accurate description of a modern GPU.

Leveraging these capabilities when possible is a *huge* win for performance. A GPU is often doing thousands of times as much "work" (measured in additions and multiplications) as the CPU on the same system. These specialized parallel hardware units are programmable, and can be used to do a surprising variety of tasks. For example, if you can represent a neuron in a way that maps feedforward and feedback to matrix multiplications, you can run a neural network on your GPU, *way* faster than it will run on any (or all) of your regular processing units. Likewise, if you can model gravitational interactions in a way amenable to treatment by your GPU or DSP, you can calculate orbital mechanics way faster than you could on your CPU.
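The neuron-to-matrix mapping can be sketched in a few lines. This is purely illustrative Python with made-up weights; the point is that a feedforward layer is just y = f(Wx + b), and that dense multiply-add shape is exactly what the GPU executes thousands of lanes at a time.

```python
import math

def matvec(W, x):
    """Dense matrix-vector product -- the multiply-adds a GPU is built for."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def feedforward(W, b, x):
    """One neural-network layer: elementwise sigmoid of W @ x + b."""
    z = [zi + bi for zi, bi in zip(matvec(W, x), b)]
    return [1.0 / (1.0 + math.exp(-zi)) for zi in z]

# Toy 2-neuron layer on a 2-element input; weights are arbitrary.
W = [[0.5, -0.5],
     [1.0, 1.0]]
b = [0.0, -1.0]
print(feedforward(W, b, [1.0, 1.0]))
```

Stack layers and batch the inputs into a matrix, and the whole network reduces to matrix-matrix multiplications plus an elementwise nonlinearity -- which is why the GPU version runs so much faster than the CPU one.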

And such hacks have been done, here and there, by a few determined people who write directly to binary interfaces. But right now, there's approximately ZERO support for non-MIMD parallelism in any available programming language I'm aware of. Enable the user to take advantage of the non-MIMD parallel hardware for user-written programs and you'll be able to turn in performance for some tasks that blows everything else out of the water.

Of course the Operating Systems are in the way right now; you're going to have to either get cooperation from OS development teams, or subvert or defeat a couple different types of barriers which may involve a need to hack kernels and flash BIOSes, to get at the goodies. Also, it's not going to be easy to come up with an API or programming model that can be supported by most of the very wide variety of hardware that's out there.

But oh man, are the performance numbers ever worth it.

Ray

### SIMD parallelism in PLs

Enable the user to take advantage of the non-MIMD parallel hardware for user-written programs and you'll be able to turn in performance for some tasks that blows everything else out of the water.

For some tasks, yes. The GPU can be a powerful opportunity.

Haskell has effective support for GPU computations via the Accelerate library. You do need to describe contiguous portions of your computation in terms of simple vector and matrix operations to make this effective.

### Different Angles

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Different Angles

context: in response to forum post "Languages & Niches" (By Mike Abolazemi at 2012-05-21 23:54) (http://lambda-the-ultimate.org/node/4519) on lambda-the-ultimate (The Programming Language Weblog)

The capability model is not for satisfying "stronger needs" or "air-tight security". These are simply properties that follow from banishing ambient authority (although it still does not solve covert channel attacks). Ambient authority cannot exist, yet all mainstream PL/OS/software/hardware* designers try to implement it, resulting in an insecure system that does not support executing code/using devices that you haven't created yourself or verified from top to bottom, and a system that does not let you choose how to "wire" things.

Concurrency is also not to satisfy "stronger needs" nor performance. Notice how your GUI freezes for the duration of any computation? That's because it's using cooperative multitasking; if one piece of code anywhere in the million lines of libraries takes too long, the entire program is stuck for that duration. This kind of programming is clearly intrinsically flawed, as nobody is going to verify the transitive closure of libraries on any current system.

Capabilities and concurrency are simple solutions to composability problems:
- Capabilities allow local reasoning about whether a process does what it should do, and what process is using what resources. E has this.
- Preemptive multitasking with lack of shared memory allows local reasoning about the behaviour of a process without having to worry about whether it blocks the entire system or another process is blocking it. You only care about what messages the process is waiting on. Erlang has this.
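To illustrate the second point, here is a minimal shared-nothing sketch in Python (threads standing in for Erlang-style processes; the isolation is by discipline here, not enforced by the runtime): each process owns its state outright, and the only thing it ever blocks on is its mailbox.

```python
import queue
import threading

def counter_process(mailbox, replies):
    count = 0                 # private state: no other thread ever touches it
    while True:
        msg = mailbox.get()   # the ONLY thing this process waits on
        if msg == "stop":
            break
        if msg == "incr":
            count += 1
        elif msg == "get":
            replies.put(count)   # state leaves only as a message

mailbox, replies = queue.Queue(), queue.Queue()
t = threading.Thread(target=counter_process, args=(mailbox, replies))
t.start()
for _ in range(3):
    mailbox.put("incr")
mailbox.put("get")
result = replies.get()
print(result)                 # 3
mailbox.put("stop")
t.join()
```

Reasoning stays local: to understand `counter_process` you only need its message protocol, never a lock order or a shared heap.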

One thing that is terribly under-researched is small trusted code bases. With the quality of architectures today, it's no wonder nothing works. People need to be able to verify from the ground up that their system is sane. But this is currently impossible because everything depends on C, which is not feasible for a person to understand. So you finally spent a few years and wrote a C compiler that is correct? Well, too bad, because less than 0.001% of programmers actually understand how to write C correctly (although they think they do), and even if they do, some library probably forces them not to.

*Ambient authority is less common in hardware simply because the nature of the universe disallows it (devices can't automatically plug into your TV's output jacks). However, consider Thunderbolt, FireWire, PCMCIA, ExpressCard, etc., where any device you plug into your computer has read and write access to all your memory (or at least the lower 4GB).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.17 (GNU/Linux)

iEYEARECAAYFAk/KNk4ACgkQ3PGpByoQpZEvcACeJozoABG/yDnK+8FXgL+N0ryG
CsYAn11CePgfCxNGbcb/wGiUCnkgSmOp
=oc8/
-----END PGP SIGNATURE-----

### Has someone verified that

Has someone verified that this is the real longpoke?

### Concurrency is also not to

Concurrency is also not to satisfy "stronger needs" nor performance. Notice how your GUI freezes for the duration of any computation? That's because it's using cooperative multitasking; if one piece of code anywhere in the million lines of libraries takes too long, the entire program is stuck for that duration. This kind of programming is clearly intrinsically flawed, as nobody is going to verify the transitive closure of libraries on any current system.

Not so simple. GUIs can freeze for other reasons. Most graphics APIs, like Win32, use an "immediate mode clipping system". These systems became popular due to the cost of memory on desktops; in an immediate-mode clipping system, only one component contributes to the color of a pixel. But they also have a tremendous downside: constantly switching z-order leads to "thrashing" in the drawing algorithm, which is why Win32 GUI apps tend to hang as the system sorts out what visuals need to be displayed to the user. Users think their entire operating system has croaked.

With retained mode graphics systems, the entire visualization is cached in memory, and multiple "visual objects" cached in memory can contribute to the picture drawn on the screen. If we didn't retain the paint, we would have to callback to user code. Further, if we wanted to do compositing, we would get stuck in the shared memory concurrency environment you refer to.
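A toy contrast of the two models, in case it helps (purely illustrative Python): immediate mode would call back into user drawing code on every repaint, while retained mode keeps a scene of visual objects in memory and repaints by compositing them, with no user callback needed.

```python
class Rect:
    """A retained visual object: the system keeps it, not the user's code."""
    def __init__(self, x0, x1, color):
        self.x0, self.x1, self.color = x0, x1, color

    def covers(self, x):
        return self.x0 <= x < self.x1

def composite(scene, width):
    """Retained-mode repaint: multiple objects contribute; topmost wins."""
    pixels = []
    for x in range(width):
        color = "bg"
        for obj in scene:          # later objects sit higher in z-order
            if obj.covers(x):
                color = obj.color
        pixels.append(color)
    return pixels

scene = [Rect(0, 4, "red"), Rect(2, 6, "blue")]   # cached in memory
print(composite(scene, 8))
```

Because the scene is retained, a z-order change just means re-compositing cached objects -- no thrashing back through user drawing code.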

That being said, I agree with your thrust. Shared-nothing concurrency is the only approach that works at scale, as the Internet demonstrates. It was really sad when I attended OOPSLA 2010 and saw so many shared-memory concurrency papers. Half the presenters were living off their Ph.D. theses from the '80s or '90s, when people thought shared-memory concurrency could work (even when people were saying it could not). Seeing old, bad ideas re-hashed was just sad.

### Ah, verification

I really appreciate your comment. I have some family members whose work touches on computer security, and from them and the papers, it seems to be a big issue where a lot of people are still looking for answers. On top of that, it's a topic I still need to learn a lot more about. I brought up capabilities in E because I had read about them before in relation to Coyotos, where I got the impression their primary purpose was security.

I have looked at Erlang some; I liked the node concept, and I know it has a great reputation for high-availability code. Honestly, the idea of hot-loading patches probably stuck with me the most, though. I do remember thinking the explicit message passing... broke a lot of interfaces and just forced the developer to become a postman instead of a traffic cop. I've mainly just jotted down notes for low-level ideas so far (to start, my parser will probably just target LLVM IR, with possibly 1 or 2 middle layers), but I understand why locks are seen as almost barbaric.

Formal verification is something else I don't know much about (yet), but I can see why it's so tantalizing. It's also a topic more in line with this blog, isn't it? I know there's a recommended reading list (I just started reading SICP, and the programmer/wizard comparison in the preface has already hooked me), but is there a go-to source I should start with? I have classical logic down, and thanks to Haskell and abstract algebra, I can tread water in category theory. But my understanding of modern logic and formal semantics is really weak. I did manage to soldier through Davis, Sigal, & Weyuker's computability book. The automata theory and formal languages clicked particularly well, but the last few chapters on semantics: whoosh, right over my head. Something about partial definitions and suprema of posets, and then my brain imploded.