The End of the GPU Roadmap

The slides from Tim Sweeney's High Performance Graphics 2009 keynote expand upon his Next Mainstream Programming Language talk, which we discussed some years back. To summarise, Tim makes the following points (in my interpretation):

  • The current GPU pipeline is too limiting and too difficult to program
  • The CPU roadmaps from Intel and others show a convergence between the CPU and GPU. We're going to see a lot of cores with vector units hanging off a shared cache. The cores are getting simpler, but programming them is getting harder
  • Programs that are purely functional can be automatically vectorised, and the new CPUs are going to extend the scope of vectorisation to general programs
  • Economic reality demands tools to make programming this hardware easier. New languages seem the only viable way forward

It's an interesting talk with a bit more detail at the bit-bashing end than the previous talk. The question is: who is going to make this language?

Who will step forward?

I went to an Nvidia talk where the presenter mentioned Copperhead (data parallel Python) but was very dismissive of open source projects. I think Nvidia see themselves as a hardware company, with software a necessary distraction. OpenCL, the main alternative, is a library; I think the principal players are too wedded to the C runtime to make the necessary leap. Academia, on the other hand, doesn't like getting down to the hardware level. Efforts at systems languages like BitC and Cyclone have not spurred a surge of new research. (Admittedly, getting your hands on, say, a Larrabee does not seem easy.) It really does seem that this kind of work will slip through the cracks, which is sad and ironic, as functional languages are absolutely ideal for this target domain.

One attempt

Well, I'm giving it a go: http://github.com/girving/duck. The rough long term idea is to allow full access to the source code inside any closure, which should allow extremely aggressive optimizations to be written inside the language and applied at any time during execution. None of that is visible from the code at this point though; it's currently a dirt slow, featureless interpreter. :)

It's a bit silly linking to this when I haven't gotten around to writing up the rationale in a coherent form, but your comment was so forlorn and pessimistic that I couldn't resist.

Full steam for IPO

I actually think it would make a great project for a startup. I'm glad someone is having a crack at it.

Open source

Hopefully it also makes a good side project for someone otherwise happily employed...and anyone else who wants to help out, of course. :)

Collaboration

I've also been developing ideas in this area for a year or two. If others are interested in pooling knowledge, let me know. I could host a forum so that we can help each other out.

EDIT: I already happen to have a rather appropriately named domain for such a forum at http://lambda.lucid3d.org/. For those that are interested, join me there. As with LtU, please use your real name for your user name and keep discussions on the technical side and on topic. Thanks!

Empty shell

I've also been developing ideas in this area for a year or two

Hopefully that site doesn't chronicle your progress :). I'd suggest at a minimum, write some text that explains your goals and how you're envisioning this collaboration would work. Maybe even put a title on the page other than "Wiki". Also, the current format is confusing: lots of needless collapsing stuff, text that looks like the title of an empty section that's actually a link with a description below it (linking to the actual empty section).

No, no not at all :)

Hopefully that site doesn't chronicle your progress :)

Honestly in the presence of the people on LtU I'm just a padawan ;). I do have a little experience working in the Virtual Reality industry though.
But no, my plan is certainly not to chronicle my own ideas. Whereas LtU is a blog I would like Lucid3D to act more as a forum in roughly the same spirit as ompf.org (what I would describe as an elitist forum for real-time ray tracing gurus).

Additionally I would suggest that people keep posting all LtU relevant material here and use Lucid3D only for more implementation specific discussions that fall outside of LtU's domain. There's no reason to scatter a great existing knowledge base. The forum is intended as a platform for discussing your personal designs and asking for help since LtU is not an appropriate place for such discussions afaik.

Also, the current format is confusing: lots of needless collapsing stuff, text that looks like the title of an empty section that's actually a link with a description below it (linking to the actual empty section).

I've only spent about 30 minutes configuring the site, so apologies for that! I'll do some more work on it as I get time tonight. In the meantime, please make some suggestions on the "Site Discussions" page if you think that this is a worthwhile idea or if you would like to help out.

OpenCL

As I understand OpenCL, it's not really a library or a language, but a curious chimera of the two:

You have:

(a) a library for probing hardware, creating contexts, compiling kernels and dispatching tasks to task units, and
(b) a C99-like language with added data and task parallelism primitives in which to write the kernels.

Perhaps the "next" language (for the author's given definition of next) will be one that unifies and simplifies the two separate aspects.

There's after all no reason why higher-level languages can't generate OpenCL kernel code as an intermediate language, and wrap and simplify the process of loading and dispatching the kernels in the same hypothetical language.

Compiling to OpenCL

Compiling high-level code down to OpenCL is certainly feasible, but there's a severe granularity problem in today's GPU-based OpenCL and CUDA implementations which makes this impractical: typically there is a multi-million clock cycle latency between starting a vector operation on the GPU and seeing it complete, due to a combination of the pipelining involved and the PCI Express bus latency. This means compiler techniques like loop vectorization can't be automatically enabled, since the compiler would need to know, statically, that the loop is going to run for millions of instructions in order for the vector optimization not to cause a horrid performance loss.

What you really want is a processor that can run both scalar and vector code, and transition between the two either instantly or at worst in a few tens of nanoseconds. Then, a compiler can apply a variety of automated vector transformations (e.g. loop vectorization, or the broader set of tools in nested data parallel languages like NESL) with the guarantee that, no matter how degenerate the data the user provides, the loop is at worst only marginally slower than the scalar version.

Today's CPUs fail in this regard because their vector instruction sets are crippled (lacking scatter/gather memory access). GPUs fail because their cores are incapable of fully dynamic control flow, as defined by the 1950s stored-program computing model. Upcoming vector CPU architectures like Intel's Larrabee succeed, and I assume NVIDIA and ATI have similar projects in the works.

This is possible to some

This is possible to some extent today: you can express GPU code in a high-level language, say C++, C#, or Haskell. Rather than interpret the code directly (execute it), reify the code as trees that are used to generate shader programs. As the shaders are being generated, detect CPU dependencies (e.g., uniforms, textures) and automatically deal with boilerplate (calls to (a)). Projects that do this:

* Sh (now RapidMind or whatever) for C++.
* Conal's Vertigo for Haskell (not sure if uniforms are explicitly declared or inferred).
* My Bling project for C#. (Also, check out a project called Brahma that runs on top of LINQ.)

So (a) and (b) can be combined, and it's definitely where I think GPU programming should go.
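
To make the reification idea concrete, here is a minimal Haskell sketch in the spirit of these projects (the Expr type, the Uniform constructor, and brightness are all made-up names, not any project's actual API): arithmetic written against Expr builds a tree that a backend could walk to emit HLSL/GLSL, and CPU-side dependencies show up as Uniform nodes.

  -- A tiny embedded expression language: evaluating "shader" code at this
  -- type produces a syntax tree instead of a number.
  data Expr
    = Lit Float
    | Uniform String        -- a CPU-side dependency, detected at reify time
    | Add Expr Expr
    | Mul Expr Expr
    deriving Show

  instance Num Expr where
    fromInteger = Lit . fromInteger
    (+)         = Add
    (*)         = Mul
    negate e    = Mul (Lit (-1)) e
    abs         = error "abs: not needed for this sketch"
    signum      = error "signum: not needed for this sketch"

  -- "Shader" code written against Expr builds a tree; a code generator can
  -- walk it to emit shader source and collect the uniforms it mentions.
  brightness :: Expr -> Expr
  brightness x = Uniform "gain" * x + Lit 0.1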

I'm working on vertex shader support in my newest version of Bling. In this case, we can automatically infer the interpolated parameters that need to be transferred from vertex to pixel shaders. Someday, we can also support compute shaders, but I'm just focusing on graphics for now because...

Tim's point is right though: even if they are easy to write, you can't just whip out a compute shader/kernel to solve some arbitrary problem quickly on the GPU, because of the cost of getting data to/from the GPU. I'm not sure the right solution will be Larrabee. One of the reasons GPUs are fast is their restricted memory model. Take that away and what will we have?

Conal Elliott's Vertigo

Conal Elliott has done a lot of excellent work in this area. Especially relevant to the OP is his work on Vertigo [pdf].

It's taking me a while to read his small library of papers. It has inspired me to pursue logic reactive programming as the basis for scene-graphs. It seems to me that logic programming would offer an even better basis for composition, concurrency, and optimizations (in particular, optimizations for level-of-detail & occlusion). I'm pursuing the concept on wiki-wiki as DeclarativeGui.

Why a completely new language?

"The question is: who is going to make this language?"

Why not extend/adapt an existing language? Although it's not suitable out of the box at the moment, I'd personally like to see Haskell or a Haskell derivative in games, and I gave a talk on exactly this subject just a few days ago.

I note all of Sweeney's references were to research done in Haskell.

"Who will step forward?"
I think we necessarily need to see a mix of industry, academia and hardware. I expect the will to collaborate exists, but coordination is definitely lacking at the moment. The end user is usually the best driver in these situations, so I'd like to see games companies taking the lead.

Vector Haskell?

Haskell is awesome, but it's not the perfect language for concurrency because of its pervasive use of lazy evaluation. To vectorize or multithread a map, fold, or comprehension, you'd need to know in advance that every element of the resulting data structure will be fully demanded later. Otherwise, you could only vectorize the generation of thunks, leaving the actual work to be done later at demand-time. And even then, unless you can determine that demanding one element implies demanding all the other elements, you won't be able to vectorize the access to the elements either.
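
A rough illustration of the contrast in plain Haskell (a sketch assuming the parallel package; the function names are mine): the lazy map only builds thunks, while the strict parallel map evaluates every element, so the whole loop body is known to run and can be farmed out to cores or vector units.

  import Control.Parallel.Strategies (parMap, rdeepseq)

  -- Lazy map: only allocates thunks; the real work happens later, at demand time.
  lazySquares :: [Double] -> [Double]
  lazySquares = map (\x -> x * x)

  -- Strict parallel map: every element is evaluated (in parallel sparks)
  -- rather than waiting for demand, so the work is known up front.
  strictSquares :: [Double] -> [Double]
  strictSquares = parMap rdeepseq (\x -> x * x)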

To transform lazy code into eager code soundly requires strictness analysis, which is fairly limited in the presence of dynamic control-flow. Things like modern graphics algorithms, though inherently very vector-friendly, have dynamic components (e.g. pixel shaders) which will work against a static strictness analysis.

Of course, Haskell is side-effect free, and that's a huge plus for both concurrency and vectorization. So, the ideal parallel language looks more like Haskell than C++, but I don't think it's exactly today's Haskell. You really don't want lazy evaluation as the default in a concurrent language -- it either prohibits or greatly complicates the ability to run code that is definitely encountered at runtime, but whose values aren't necessarily demanded.

You don't want old-fashioned eager evaluation either. That breaks the recursive programming techniques that are key to writing useful algorithms in languages that lack side-effects. The most promising alternatives seem to be Id90's "lenient evaluation" (always run all encountered code regardless of demand, but allow it to be delayed to support recursion), and the suspension/resumption mechanism of Concurrent Constraint Logic Programming languages (run all encountered code, suspending parts that have yet-unsatisfied dependencies, and resuming them when their dependencies complete). These effectively provide the much more powerful dynamic equivalent of static strictness analysis.

Laziness should be a separate type

I think you can get most of what you need out of a normal eagerly evaluated language plus a trick to delay argument evaluation. E.g.,

(&&) :: Bool -> Delayed Bool -> Bool    -- Delayed a behaves like () -> a
(&&) x y = case x of False -> False; True -> y ()

Given an expression like (x && y), the compiler sees (&&) first, notices the magic "Delayed" annotation, and wraps up y as a thunk. Since it amounts to type-dependent syntactic sugar which disappears immediately once types are fixed, there's no complexity penalty for optimizations or hardware mapping.

In fact, I think you can make Delayed a comonad

And you can probably borrow and dualise a large chunk of the Haskell monad libraries to make this useful.
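
For example, taking Delayed a to mean () -> a, a comonad instance is only a few lines (a sketch using the comonad package; the Delay/force names are made up):

  import Control.Comonad (Comonad (..))

  -- Hypothetical wrapper so we can write instances.
  newtype Delayed a = Delay (() -> a)

  force :: Delayed a -> a
  force (Delay t) = t ()

  instance Functor Delayed where
    fmap f (Delay t) = Delay (f . t)

  instance Comonad Delayed where
    extract     = force                -- run the thunk
    duplicate d = Delay (\() -> d)     -- re-delay the delayed value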

I've wondered if it's possible to define a language which makes a strict separation between data and codata and ensures that computations on data are strict and on codata are lazy. But in practice you want to mix data and codata and I don't know if a compiler could ever infer what you want in these cases.

Data and Codata

I've wondered if it's possible to define a language which makes a strict separation between data and codata and ensures that computations on data are strict and on codata are lazy. But in practice you want to mix data and codata and I don't know if a compiler could ever infer what you want in these cases.

The separation wouldn't be a problem. Charity makes a strict distinction, as does the language I'm working on. Mixing the two is also quite easy (you "interweave" rather than merge, such that an inductive type may have codata elements and codata may return inductive types) and the compiler knows for each value whether it is data or codata because the two are interleaved - there isn't any ambiguity, and no problem for inference.
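
As a rough Haskell approximation of that interleaving (Haskell doesn't actually enforce the data/codata split, so strictness annotations stand in for the inductive side):

  -- Data: finite, built strictly.
  data List a   = Nil | Cons !a !(List a)

  -- Codata: observed lazily, never "finished".
  data Stream a = Stream a (Stream a)

  -- Interleaving rather than merging: a finite list of infinite streams.
  type Mixed a = List (Stream a)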

But laziness is still useful for computations over inductive data, even for those languages that mix the two, especially when dealing with very 'large' data values (i.e. multi-media sized objects and scene-graphs). And laziness (call-by-need rather than plain call-by-name) on codata can be achieved by memoization, but the benefit isn't always worth the price (especially if the codata will persist for quite some time but is accessed only rarely).

I am convinced we should simply ensure the languages terminate without failure (at least below a special non-termination type or program layer). This allows lazy, strict, lenient, and concurrent evaluations to all have the same semantics. At that point, you could annotate functions or call-sites with 'advised' optimizations - like memoization, concurrency, delay. The compiler would be free to follow this advice, or ignore it based on non-local information (profiling, perhaps, or data-flow analysis).

Isn't that standard for total programming?

In Coq, for example, functions on data must be total, and definitions of codata must be productive. This is certainly enough to justify strict evaluation of data and lazy evaluation of codata - I think that's how normalisation actually works. I'm not sure how it is implemented; it seems like it might require having a strict and a lazy version of sufficiently polymorphic functions, depending on which mode they are called in.

Conor McBride wrote a bit on handling laziness with explicit delay in the types. This might be enough to ensure productivity without the usual syntactic restrictions.

lenient still too slow

AFAIK, lenient evaluation was still for non-strict languages and so required strictness analysis if you want to decide statically to evaluate portions of code eagerly and sequentially. Evaluating eagerly and in parallel was fine, but then you still have the synchronization overhead. It seems that the fastest point in the design space would be more lenient than lenient, lifting the non-strict restriction and admitting non-deterministic behavior (with respect to non-termination).

Total functional

If you can guarantee your code is type-safe and terminates, then functional code will be deterministic regardless of evaluation order.

Non-termination can be achieved in a higher layer (e.g. hooking into a frame-demand) or via a special type.

Push-pull functional-reactive with total functions, or logic-reactive with a terminating interactive logic (where one can insert, update, and delete statements and automatically impact scenes using derived data), seems like one 'right balance' here. The termination properties avoid divergence issues but allow deterministic concurrency, and it's push-pull in that the demand for a value introduces subscriptions on the other end.

This style is very cache-aware, as it avoids recomputing data that hasn't changed. Shared computations can reduce costs. Being demand-driven also cuts down resource costs obtaining data that isn't required, so it distributes efficiently.

Even better, with shared distributed computations one can achieve efficient multi-cast for ad-hoc queried data. That's the sort of feature that allows worlds to massively scale using rules-based physics and AIs, as shared computations and multi-cast - on a p2p overlay - allow the number of global, complex rules and analyses in the game to scale with the CPU and network resources, with the number of players. (This can't be done using a star architecture. Multi-cast trees are required.)

Good Point

The semantics of correct code (terminates, no type errors, all preconditions satisfied, no other semantic errors) really should be deterministic independent of evaluation order in an expression without side effects. So at least in theory lazy evaluation shouldn't be a show stopper in any sense.

Examples of recursive programming techniques

Could you give a few examples of the recursive programming techniques that you want to use but which break down under eager evaluation? I'm guessing you don't have "zip [1..] xs" in mind, or do you?

I imagine he meant something

I imagine he meant something like (in ad hoc notation):

obj = record {
   x = 10
   y = obj.x + 1
}
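
For comparison, here is the same knot-tying in actual Haskell (the field names are mine); it relies on non-strict evaluation and would diverge, or have to be rejected, under naive eager evaluation:

  data Obj = Obj { objX :: Int, objY :: Int }

  -- Self-reference works because the field values are not forced at
  -- construction time; objY obj evaluates to 11 when demanded.
  obj :: Obj
  obj = Obj { objX = 10, objY = objX obj + 1 }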

How about a strict data parallel subset?

Haskell's nested data parallelism uses fully strict arrays as its main data structure. That way the low-level, highly data parallel stuff is efficient, while still allowing you all the goodness of laziness for the higher-level stuff... So the vectorizable part *is* strict, while the high-level parts are lazy (by default).
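
A small illustration of that split, using the vector package's unboxed arrays rather than the nested data parallelism extension itself (an assumption on my part, but the arrays are strict and unboxed by construction, while the surrounding code stays lazy as usual):

  import qualified Data.Vector.Unboxed as U

  -- The inner, data-parallel kernel works on strict unboxed arrays;
  -- nothing here produces thunks, so it is safe to vectorize.
  dotProduct :: U.Vector Double -> U.Vector Double -> Double
  dotProduct xs ys = U.sum (U.zipWith (*) xs ys)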

Is the algorithmic expression at the language layer?

Or is the parallelism optimization at the propagation of the algorithms of the world model layer?

http://lambda-the-ultimate.org/node/3637#comment-51620

Tim Sweeney wrote:

...To vectorize or multithread a map, fold, or comprehension, you'd need to know in advance that every element of the resulting data structure will be fully demanded later...

...You don't want old-fashioned eager evaluation either. That breaks the recursive programming techniques that are key to writing useful algorithms in languages that lack side-effects...

For it to happen at the language layer, the language's expression and optimization capability has to be able to see the algorithm that needs to be optimally vectorized. Can we optimally express the world state and interaction model (rules) of instances (actors) in a scene in recursive functional logic? Can the functional language then find the optimal vectorization?

RobJellinghaus wrote:

...The real world doesn't have "game-object state" scaling problems, because all interactions and all reactions are local. Each real-world object continuously reacts locally to all the forces affecting it...

The algorithmic distinction is not the spatial scope of an interaction's effect, but rather the spacetime granularity of the method by which the asynchronous interactions of the real world are atomically modeled on (the numerous cores of) a serial, imperative machine. Interactions that occur asynchronously in the real, analog world are modeled either with batch integration or with incremental propagation -- the latter needing only per-increment transactions (mutexing). I understand (and am interested in) the conceptual direction of your point, which is towards a more granular, per-instance, autonomous update-increment model, with less global (batch) management state.

Sum-of-forces (purely local) approach to game world state?

Tim, it's always thought-provoking to see your forays into programming language design. (It's my fortieth birthday today, so thank you for a nice birthday present! :-)

I have been wondering about the game-world problem. I don't know that I believe in STM as a way to get serious scaling; the jury still seems very much out on whether the cure is worse than the disease. It's still a model that involves a lot of shared state, and that will just never massively scale.

The real world doesn't have "game-object state" scaling problems, because all interactions and all reactions are local. Each real-world object continuously reacts locally to all the forces affecting it.

So is there a possibility that changing the simulation model to be "force-based" -- where each object reacts to the sum of all inputs arriving on it at each time step, and those reactions are only visible to other objects as a result of subsequent physics simulation -- would enable this "shared world state" model to be broken? In other words, all interactions are mediated through the (pure-functional, massively parallel) physics model. So objects never send messages to each other; they just affect the world locally at each time step, and the world affects the other objects.

That's the only kind of paradigm I can think of that would let all object simulation really be local. Is this kind of model fundamentally too difficult to script? Would it be like trying to program all world objects with crayon physics? Or could it be made intuitive?

In general, is it plausible to contemplate this kind of simulation model for game-world objects, as opposed to the shared-memory model they've always had?

Ditto and Caching

I also think it's awesome that Tim posts in LtU. One thing I'd mention too, sort of along the same lines (I think) is that game world state and graphics state do seem pretty, how to put this, cache-negligent. We're so used to sending world geometry or running world simulation every frame, that we don't stop and try to cache things (or even pre-emptively cache ranges of calculations). This could be similar to how you mention local state (not sure if I understand how forces factor in, but it sounds interesting).

Where I'm coming from is rendering anti-aliased GUI objects. Many of them. I'm writing this news client that has thousands of news updates (from various news sources), and so I render (with cairo) many objects (obviously not as many as GTA IV or UT3). But, quite in keeping with lazy evaluation, I do the 2d rendering on demand. And only when they change (via some animation, or picking, etc.). And even then, if the change is a pure 3d transformation (no actual re-rendering is needed with cairo), it all stays in texture memory. It's sort of like billboarding, but there is actually a mini pipeline for the 2d GUI objects that is fed (on demand) into the greater 3d pipeline.

Anyway, maybe this is unrelated. Perhaps if you're rendering a large monster, or simulating physics for a complex world, there's not a whole lot you can cache per frame (e.g., what if the camera rotates half a degree). But I do think there is maybe a tendency to dismiss lazy evaluation a bit too eagerly... Whether it means stringing together many pipelines (various RTTs cached along the way) and layering them like a bunch of Haskell combinator functions, or algebras like Porter-Duff that allow for memoizing between vector processors...I don't know. But it's interesting.

re: 3d caching

reminds me of ms talisman. sorta.

Physics steps

I'm not too sure I get it, but it seems problematic to me to get the integration evaluated in this way. Consider if you have a heavy object colliding with a chain of light objects which in turn collides with another heavy object. Since this requires numerical integration in order to solve in any reasonable time, you will need to accumulate all the forces somehow. Usually this involves building a matrix of physics constraints and then integrating.
From what I know, most physics engines would perform these steps:

  • Collect constraints (Collision Detection)
  • Solve constraint system (Numerical Integration)
  • Apply the result to the objects (Collision Response)

I'm a bit too inexperienced to imagine exactly how this will occur in a functional paradigm, but I imagine it would still involve these three steps.

EDIT: Although, perhaps this is what you mean?
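
As a hedged sketch of what the three steps above might look like as a pure pipeline (all the types are invented and the solver bodies are stubs; the point is only the shape of the composition):

  type Vec = (Double, Double)

  data Body       = Body { pos :: Vec, vel :: Vec }
  data Constraint = Contact Int Int Vec               -- two body indices plus a contact normal

  collectConstraints :: [Body] -> [Constraint]        -- 1. collision detection
  collectConstraints _ = []                           -- stub

  solveConstraints :: [Body] -> [Constraint] -> [Vec] -- 2. numerical integration (per-body impulses)
  solveConstraints bodies _ = map (const (0, 0)) bodies  -- stub

  applyResponse :: Double -> [Body] -> [Vec] -> [Body]   -- 3. collision response + integrate
  applyResponse dt bodies _impulses =
    [ Body (px + vx * dt, py + vy * dt) (vx, vy) | Body (px, py) (vx, vy) <- bodies ]

  -- The whole step is a pure function from world state to world state.
  step :: Double -> [Body] -> [Body]
  step dt bodies = applyResponse dt bodies (solveConstraints bodies (collectConstraints bodies))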

Yes to sum-of-forces

Hi Rob,

I'm delighted to read your thoughts on continuous local interactions. I particularly resonate with "all interactions are mediated through the (pure-functional, massively parallel) physics model".

I'm not sure whether you have continuous time or discrete time in mind for the semantic/programming model. Designers of computer graphics interfaces have mostly moved to continuous space, but have been less willing to use continuous time. My preference is continuous for both time and space, for composability (including zoomability/resolution-independence) and semantic simplicity, leaving introduction of discreteness to the "rendering" step, just as is done spatially.

I've been looking at this

I've been looking at this area for a while. In particular, I've found that the sum-of-forces approach rarely works in practice as many movements we want to express can't be expressed cleanly as forces, and counteracting other forces is extremely difficult (how do you express dragging with a mouse???). Instead, constraints on either properties (e.g., position) or their derivatives (e.g., velocity or acceleration) work a bit better.

As for discrete vs. continuous, intermediate world solutions are rendered into frames to be displayed, so I'm not sure how continuous semantics could be applied to a physics simulation. Well, actually I'm a bit wrong: some constraints (e.g., on collision) must be solved at once together, while others are solved incrementally (via relaxation) and their per-frame intermediate solutions are displayed. But I agree that discrete vs. continuous time is a detail that shouldn't be exposed in the API.

I'm a bit intrigued by Rob's idea about "local" physics. As far as I can tell, physics engines always simulate "the world" at once, since dependencies between bodies are volatile. Determining body interactions via collision detection is the most difficult physics step to parallelize, and GPU solutions often amount to rendering all bodies to a frame and detecting their overlapping pixels!!

Continuous time

Designers of computer graphics interfaces have mostly moved to continuous space, but have been less willing to use continuous time.

What graphics interfaces do you have in mind here? The two popular low level graphics APIs, Direct3D and OpenGL, are basically time agnostic. They see things in terms of a sequence of frames to render.

My opinion is that this is a good thing, and time doesn't need to be singled out as a special value. We can view things as a sequentially updated state with the ability to declare values which are functions of that state (and are thus automatically updated when their inputs change). Then time can be just another piece of state which things can functionally depend on.

Graphics APIs and the discrete time bias

Designers of computer graphics interfaces have mostly moved to continuous space, but have been less willing to use continuous time.

What graphics interfaces do you have in mind here? The two popular low level graphics APIs, Direct3D and OpenGL, are basically time agnostic. They see things in terms of a sequence of frames to render.

Seeing things as a "sequence of frames to render" is the discrete bias I have in mind.

Frame sequences are usually discrete samplings over continuous time (e.g., trajectories), just as pixel arrays are discrete samplings of continuous space. With typical computer graphics APIs (e.g., Direct3D and OpenGL) the spatial discretization/sampling is done automatically (by library & hardware), while the temporal discretization must be done explicitly by the programmer.

Given that typical graphics APIs are imperative, I imagine it would be pretty difficult for them not to inherit the discrete-time bias of the imperative model of computation. That implicit bias is so pervasive and so little discussed (in my hearing at least) that I suspect it usually goes unnoticed. As I understand it, the imperative model captures change via state change, and state changes happen only discretely. Paraphrasing Woody Allen's joke from the opening of Annie Hall, the problem with the imperative model from a naturalness POV is not just that mutation is awful but that it's served in such small portions.

My opinion is that this is a good thing, and time doesn't need to be singled out as a special value. We can view things as a sequentially updated state with the ability to declare values which are functions of that state (and are thus automatically updated when their inputs change). Then time can be just another piece of state which things can functionally depend on.

In "sequentially updated state", I again hear a hear a discrete (explicitly sampled) perspective being applied to time, while embracing a continuous (implicitly sampled) perspective applied to space.

I might not be suggesting that time be treated as special as you thought. I see all dynamic values as semantically "functions of time" in the sense that they vary. With what do they vary? With time (i.e., they're dynamic). Among dynamic values, time itself is notable ("special") only in that it's semantically the identity function. Talking about "time" is just the simplest precise way I know of to talk about dynamic values in general. Likewise, talking about "space" is the simplest precise way I know for talking about 2D & 3D phenomena.

Regardless of graphics API

Regardless of the Graphics API, various user-input, network input, game state, and gameplay logic or physics (e.g. switches, collisions) will tend to be 'discrete'. I know of no way to get around that.

What is it you imagine a Graphics API accepting functions over both time and space would look like? and how would it integrate with game-state and gameplay logic? At the moment, the discretized time offers programmers a well defined space (between discrete frames) to inject the gameplay logic that affects the next frame.

I can see this happening to a small degree, i.e. describing 'simple' bezier-style arcs or vectors for colors/pixels/objects over the 'time' dimension (allowing the graphics library or hardware to do its own interpolation). This sort of feature could be the baseline for the graphics API, while the gameplay logic essentially runs parallel and occasionally tweaks or overrides the display model, plus keeps the two in synch using some sort of timing API (stop time, start time, timing heartbeats, with a likely relationship to 'realtime').

Whether the GPU supports time or not could be handled by the API (i.e. asynch rendering thread vs. hardware support).

You say that "dynamic values vary with time". I tend to think it's the other way around. Time varies as a causal relationship between observations. Time is, fundamentally, relative to the observer. Time itself is, semantically, a partial ordering on predictable, shared observations over dynamic values.

As a consequence, adding time to a graphics API introduces synchronization issues that do not otherwise exist, but also allows introduction of a partial-ordering of updates for future interpolation as a single message or display-list.

One challenge I see with this design: it will need to rely a great deal on isolation of changes in the display model, even more so than frame-by-frame designs, because the amount of data pushed to the graphics API will be much, much larger (as it covers not only the data 'now', but also the data about what to display at all points in the near-term future - i.e. perhaps the next 100-300 milliseconds). This would likely require bandwidth optimizations. The graphics API might, for example, be built around (and effectively require) tweaking a display model hidden behind the API barrier.

Effecting the correct such tweaks without rebuilding the entire model would be... a challenge, unless the graphics API is sufficiently high level to hold a scene-graph (annotated with temporal information) that is semantically close to the gameplay logic. Effectively, we'd need not just a graphics API, but a monolithic gameplay engine replete with animations, triggers, dataflow hooks/subscriptions into external game state, stored procedures, etc.

I don't think we're far apart

I don't think we're far apart, but maybe two differences:
1) I don't think the time dependency belongs in the graphics API, and
2) I don't think time should be special, foundationally

Frame sequences are usually discrete samplings over continuous time (e.g., trajectories), just as pixel arrays are discrete samplings of continuous space. With typical computer graphics APIs (e.g., Direct3D and OpenGL) the spatial discretization/sampling is done automatically (by library & hardware), while the temporal discretization must be done explicitly by the programmer.

In a modern game, you have lots of code updating the world simulation that doesn't know about frame sampling. From the point of view of that code, rendering might as well be continuous. Then you have lots of rendering code, building up a frame to render for a single time. The render loop in between these two that decides when to discretely sample the simulation isn't much code. Time is different than space for a graphics API because there is only one time that matters: right now.

I think we agree that the main step that can be improved is the sampling of the simulation state - if the world entities have positions, etc that are functionally parameterized by time, it's easy to sample them. But I don't think the graphics API should know about that step.

Secondly, I don't think time is particularly special. As a thought experiment, consider how much of this design would carry over to a machine with no clock (also assume unpredictable time for computation). My view is that everything in the core design should carry over to such a system. We still want to be able to express functional relationships of UI elements in terms of the mouse position, for example.

Of course, you can always postulate an immeasurable time and define your denotations in terms of it, but I think a cleaner approach is to just dispense with the concept of an index and let it be implicit by order in a sequence.

Not just now...

Time is different than space for a graphics API because there is only one time that matters: right now.

It'd be nice to achieve the sort of uber-smooth interpolation that might be possible if the graphics API knows not only about 'right now' but also about how (barring interruption) to render, say, the next few milliseconds. Given a time-aware API, it would be feasible to increase framerates almost independently of the gameplay rendering loop.

As a thought experiment, consider how much of this design would carry over to a machine with no clock (also assume unpredictable time for computation). My view is that everything in the core design should carry over to such a system. We still want to be able to express functional relationships of UI elements in terms of the mouse position, for example.

While I agree that time ain't special, it's a tad unreasonable to request one design things on the premise that commonly available resources suddenly won't be available. Perhaps we should also design all our systems assuming no keyboard or mouse... there are probably more systems without keyboard and mouse than there are with monitors yet without clocks.

'Now' makes for a good separation of concerns

Given a time-aware API, it would be feasible to increase framerates almost independently of the gameplay rendering loop.

The time-aware API that I believe Conal is talking about is essentially putting the graphics API in charge of your gameplay rendering loop. Anything less (linearly interpolating between positions, for example) is just a hack. Of course, it's not going to be correct for the graphics API to advance game time proportionally to wall clock time, since things like networking conditions may result in slowing/stopping the passage of game time. And then what if you want to detect low framerate conditions and automatically reduce rendering quality? Is that going to be part of the API?

While I agree that time ain't special, it's a tad unreasonable to request one design things on the premise that commonly available resources suddenly won't be available.

Well, I did use the words 'thought experiment'. To me it's about finding the right central concepts, and this line of thought leads me to believe time isn't one of them. YMMV.

Performance

The time-aware API that I believe Conal is talking about is essentially putting the graphics API in charge of your gameplay rendering loop. Anything less (linearly interpolating between positions, for example) is just a hack.

Performance for real-time rendering is all about achieving a pipeline of intelligent hacks...

Putting the graphics API in charge of the gameplay rendering loop is doable. Separation of concerns may occur at a different layer, such as scene-graph management. There are many "mediocre" options for separation-of-concerns. "Now" is just one of them, with its own costs and advantages.

Supposing the scene-graph (or elements thereof) were described as a function of time, one could always render this scene-graph in a system without a clock by setting time for the scene-graph to +infinity or an approximation thereof (that is: draw it as it should be after all time-dependent elements have stopped moving). Alternatively, give the UI a slider to control time.

Why?

Performance for real-time rendering is all about achieving a pipeline of intelligent hacks...

Right, but generally when you employ a hack it's because it's simpler or more performant than the alternative.

Putting the graphics API in charge of the gameplay rendering loop is doable. Separation of concerns may occur at a different layer, such as scene-graph management.

Right, but what would be the point of doing that? Note that if you want an API that performs some simple interpolation between frames, you can write that thin layer over the style of graphics API we have now.

Graphics API

if you want an API that performs some simple interpolation between frames, you can write that thin layer over the style of graphics API we have now

I'm curious as to what "Graphics API" means to you that wouldn't include said new layer as part of a larger graphics API. Where are you assuming the graphics API stops?

There are plenty of reasons to push scene-graph into the graphics API and manage it from the outside. Doing so allows the implementation underlying the graphics API to do a great deal more work regarding level-of-detail, interpolations for animations, and triggering demand-driven updates to state of the scene (i.e. for higher-level detail, more recent data) based on actual and anticipated need (i.e. hooks for subscriptions). Time annotations for POV can both improve anticipation of need and provide smooth camera transitions. The scene-graph design also supports real-time rendering solutions, because the rendering can be divided from the updates in a wait-free manner. It can also save bandwidth, avoiding unnecessary updates.

Scene-graph as part of the graphics API isn't uncommon. I think it an excellent design - and I'm hardly alone in that assessment, as you'll find many graphics APIs built atop scene-graphs. Annotating the data describing the scene-graph with forecasts or waypoints seems, to me, a natural generalization of animations - a way to ensure they run smoothly.

The cost of a scene-graph is a binding between gameplay code and the scene-graph description language, and the partial loss of flexibility (though in at least one design I've considered for use, you can render straight OpenGL to the tune of callback triggers).

I'm curious as to what

I'm curious as to what "Graphics API" means to you that wouldn't include said new layer as part of a larger graphics API. Where are you assuming the graphics API stops?

I was specifically referring to OpenGL and Direct3D there. You're right, there is room for higher level APIs. My point is that I think temporal concerns belong on a higher level API that's more intimately tied to your game logic. Also, I agree there are advantages to having a complete static scene graph (scene descriptor) when rendering a frame, even at the level of Direct3D or OpenGL, rather than sending things down in batches like we do now.

retained-mode 3D APIs

Any retained-mode API (Java3D, WPF3D) is already a higher-level API where time becomes implicit. WPF3D in particular supports databinding, so changes to a dependency property are automatically considered when re-rendering occurs. But retained-mode APIs seem to be going nowhere in terms of performance and flexibility, so they don't seem to be the future.

Bling also promotes implicit time while supporting a programmable graphics model. Code can refer to dynamic values, which will be refreshed during implicit re-rendering. Since physics relationships are expressed as declarative constraints, it's possible to create an entire simulation without hooking into frame rendering.

Immediate-mode Scene-graphs

I'm also not fond of explicit (imperative) maintenance of retained-mode scene-graphs. I agree with your assessment regarding their flexibility, at least. Work by various groups (such as Quantum3D, Unreal, etc.) has done much for their performance, if you're willing to shell out for licenses.

Describing a scene-graph as a logical or constraint relationship over dynamic values - or whole dynamic databases - is a very reasonable design as the basis for both graphics and user-interface. The 'scene-graph' created from doing so is essentially a cached view of a database. It isn't quite retained-mode because it can be regenerated from the original specification, and knows how to maintain itself.

Bling seems to be aiming in that direction, or at least a similar direction. But integrating level-of-detail, multi-media (textures, models, animations), occlusion (i.e. keeping Bling from unnecessarily processing occluded items) and such will remain a challenge. And the whole system may still benefit from explicit time, as it may aid with cache coherence.

How much have you achieved involving the creation of a graphics API that sucks its display tasks through a Bling straw? What performance are you seeing? How much explicit 'retained mode' mapping gets involved at the edge?

Retained API that supports

A retained API that supports databinding can be seen as at least semi-declarative. WPF 3D supports both databinding and imperative assignment (though admittedly the databinding granularity is way off for point collections).

Bling is still a work in progress and I've just added initial DirectX support recently. I'm still trying to perfect the basic vertex/pixel shader auto-partition as well as texture rendering relationships (since I want to do heat diffusion visualizations on the GPU). Basically, there is a laundry list of things to do, and a bunch of questions on how to do them; e.g., geometry shaders (these are weirdly imperative but mostly can be seen as a primitive map), stream out, culling and depth/stencil buffers, instancing, transparent constant buffer management...and much more (then what about DirectX 11...). Hopefully, something like occlusion is easily dealt with by the GPU transparently, but I'm not really sure, I'm just learning about all of this stuff.

Code generation in Bling allows us to potentially have the same performance as DirectX 10, which is basically what I'm seeing now with the few features that I've implemented so far. The biggest issue seems to be just crafting the straw; e.g., what does a geometry shader look like in a functional world?

Continuous Time

Re: "Designers of computer graphics interfaces have mostly moved to continuous space, but have been less willing to use continuous time."

Interesting observation.

Perhaps this is why: in a complex system, it's difficult to derive even an approximate model of the system's state at times distant from those we already know, without first deriving the interval between.

Put another way, knowledge of state(t) allows us to directly approximate state(t + x) or state(t - x) for "small" x values. However, we can't scale this technique to state(t + 10x) or state(t + 100x) without risking massive errors -- moving objects going right through walls, for example.
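
A one-line explicit Euler step (a hypothetical sketch) makes this concrete: the update is only trustworthy for small dt, which is exactly why the simulation has to walk forward in small increments.

  -- Explicit Euler step for a falling body: position and velocity advance by
  -- one small dt. Taking one big jump (say, 10x the step) is what lets a
  -- fast-moving object pass straight through a wall.
  eulerStep :: Double -> (Double, Double) -> (Double, Double)
  eulerStep dt (position, velocity) = (position + velocity * dt, velocity + g * dt)
    where g = -9.81   -- constant acceleration assumed for the sketch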

So, unlike in space, we can't move freely around in time in single jumps. If we want to move around in time, we must go there in tiny steps. This then leads us to standardize upon a "simulation step size", whose choice has a major bearing upon our simulation's dynamic behavior.

Hence the discrete bias remains more prevalent in time than in space?

discrete vs tiny

Hi Ben. I enjoyed your musings very much, and I'm having trouble following your train of thought at a couple points.

Here's what I think you're saying: If we know state(t), then we can make fairly accurate guesses about state(t+epsilon) for small |epsilon|. And then you think of some such epsilon as a "jump"/"step" -- hence a discrete point of view. My hunch is that this sequence of thoughts is strongly guided by its conclusion, i.e., it's a route one might take to get to discrete thinking only if one is already in the habit of discrete thinking. On the other hand, I tend to think continuously, so I see the interval from t-epsilon to t+epsilon as a continuous interval, not a discrete jump. I'd like to hear your reaction to my imagined reconstruction of your thinking and to the continuous alternative.

In a complex system, its difficult to derive even an approximate model of the system's state at times distant from those we already know, without first deriving the interval between.

I guess by "model", you're assuming some kind of closed-form formula. I'm thinking of a different sort of model -- something like functional reactive programming, which describes a dynamic (time-varying) system exactly, though not a close-form formula.

Vector Haskell? That's the rough idea!

Tim, I'd agree today's Haskell is not a straightforward shoo-in. And I'll check out Id90; it sounds interesting.

The laziness debate, and the related fallout of space leaks are definitely areas that need attention. Not to mention figuring out how to parallelise it effectively, though these things are interrelated.

My talk was less about Haskell than it was about how and why we might go about developing a language suitable for games.

Developing a language that the games industry could use in practice requires significantly more of a shared goal than currently exists. Industry, hardware and academia will all need to see roughly eye-to-eye on some things as discussion needs a common foundation. This includes 'basic' things, like the idea that functional languages are a major element here. Although there are some interested game industry figures, they are still small in number and I haven't heard many voices from hardware agreeing with the notion of functional languages yet.

This is partly why I focused on Haskell over any other functional language. I think it's important that there is some starting point, and Haskell is the closest and most mature thing to the perfect platonic language that currently exists.

The other obvious reason being that Haskell is awesome. :)

Why fp and for what? E.g.,

Why fp and for what?

E.g., relational/constraints and dataflow/FRP derivatives make more sense to me for animation/positioning. I don't even know where to begin for AI (traditional, rule based, ml...).

Of course, these are pure, if that's what you meant.

Perhaps you meant the engines to drive all this? (Shaders, physics, etc.) A good question is how much can be lifted out relative to current practices...

I don't even know where to

I don't even know where to begin for AI (traditional, rule based, ml...).

Rule-based programming (logic programming with state interaction) is excellent for AIs, at least with regards to decision-making, intelligent reaction, and motivation. In Inform 7, you'd create rules like:

If a soldier sees footprints for which he doesn't know the source, he becomes suspicious.

If a soldier hears a noise for which he doesn't know the source, he becomes suspicious.

A soldier knows the source of footprints made by people he sees.

A soldier knows the source of noises made by people he sees.

A soldier is a person. People can usually see themselves.

Suspicious soldiers will search the nearby area for an enemy.

Suspicion is an emotional state.

Emotional states have expiration times associated with them. This time is usually an hour.

Suspicion's expiration time is usually 30 seconds.

It's interesting how Inform7 takes huge 'rulebooks', combines them into huge databases of rules, and automatically determines what state is needed for each character, device, object, etc.. The above rules would automatically associate knowledge databases and emotional states with soldiers, and so on.

Anyhow, that sort of rule-based 'AI' is also very useful for defining game-world physics, puzzles, and other gameplay concerns. One can even use it to describe interesting animations (fidgeting, looking around suspiciously, details such as how the guard occasionally picks his nose or cleans his gun). But it isn't very good for describing layout and scene-graphs. It also isn't very good for "planning" tasks (route planning, strategy, trying to ensure elements of a HUD don't occlude one another). I.e. the rules-based layer would decide that a soldier chooses to clean his gun, but the actual animations (movements of the arms and fingers, placing pieces of the gun on a table, etc.) would need some extra support.

For scene-graphs (layout, display, POV), I fully agree that constraint, reactive-logic, or reactive-functional programming seems most appropriate. Ideally, we'd want the scene-graph to hook directly into the state-of-game information that is visible to the player, as keeping some information 'hidden' is part and parcel of (almost) any game.

Display should occur in a pure, side-effect free, 'immediate mode' fashion, such that the only retained information is cached computations (which can be regenerated easily enough) and such that display itself does not invoke updates on game state (Heisenberg uncertainty principle notwithstanding - we want programming to be easier than reality, and heisen-bugs are to be avoided).

For 'planning', another aspect of AI (after making a decision), constraint programming seems like the only real way to go.

While no single paradigm is sufficient, I don't believe OOP and FP are good for all that much in game development... except perhaps as an assembly language for higher-level logic, resource management, and module management.

I suspect (though maybe it's just hope) that Actors model style OOP, along with special support for subscription to event-sources and data-sources and functional transforms, will prove very successful as a distributed/secure/persistent/modular-program assembly language, to which higher languages compile. Most programming will occur in embedded DSLs that compile to these.

Many abstractions

Tim's slides suggest he envisages using different abstractions at different levels of the game. At the rendering engine, which is what this talk is primarily about, it is mostly (all?) data parallelism. At the game engine layer he mentions STM and task parallelism.

FRP and FP

Why fp and for what?

E.g., relational/constraints and dataflow/FRP derivatives make more sense to me for animation/positioning.

FRP is an application (instance) of functional programming. Dynamic (time-varying) quantities are pure, immutable values (semantically, functions of time).

See TBAG for an FRP predecessor that combined immutable time functions and mutable constraints. FRP grew out of the question of what might be a declarative/immutable replacement for TBAG's assertion and retraction of constraints.
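
The classic denotational sketch of this, in Haskell (a sketch with assumed names, not any particular library's API): a behavior is simply a function of time, time itself is the identity behavior, and lifting pure functions over behaviors is ordinary composition.

  type Time = Double

  newtype Behavior a = Behavior (Time -> a)   -- a value that varies with time
  newtype Event a    = Event [(Time, a)]      -- time-stamped occurrences

  time :: Behavior Time
  time = Behavior id                          -- time is (semantically) the identity function

  lift1 :: (a -> b) -> Behavior a -> Behavior b
  lift1 f (Behavior at) = Behavior (f . at)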

I don't agree

FRP applies functional programming, but is not an instance of it. If anything, you should say it's the other way around: the elements for functional programming are a subset of the elements for functional reactive programming, and therefore functional programming is a potential application (instance) of FRP.

Reactive programming is well understood to describe automated propagation of changes to observers. FRP is the same, using functions (as opposed to reactive imperative programming, which may execute behaviors with side-effects while propagating changes). The reactive nature of FRP makes for a very different programming style than does plain functional. In particular, it allows data-binding.

Data-binding allows values to be tied together in useful ways for animation and positioning.

FRP does not need to be "semantically, functions of time". You could just as easily say they're "semantically, functions over the state of a world" - a world that may or may not have 'time'. The ability to ask for a value at an explicit time would suggest some need for a temporal database of any data-binding elements, plus the ability to block upon receiving a request for a future value.

I always thought that the

I always thought that the essence of FRP was the tunneling of time (or change over time) in expression composition syntax so that time-change details are hidden from the programmer (what I call implicit, but I see this mentioned as explicit also). The reactive-nature of FRP is then a consequence of this time hiding (rather, programs are automatically reactive vs. having to write boilerplate code that make them reactive).

I'm not sure what the relationship between data-binding and FRP is. FRP inspires a syntax (expression composition that hides time-change details) that can make databinding more convenient. For example, we wrap WPF databinding in Bling with expression composition syntax, which hides databinding details from users while transparently setting up binding operations and implementing custom value converters in much the same way that a FRP program would automatically build/manipulate a dependency graph (or however FRP is implemented these days). Similar FRP-inspired syntax can also make expressing a physics simulation more convenient (also done in Bling). I believe FRP-inspired syntax can also make GPU programming easier (work in progress).

Conal, by extension, I'd

Conal, by extension, I'd argue that impure programs are also functional: just write a pure interpreter... However, I don't think this is a useful line of discussion. I was contrasting a general framework for handling dependencies over time (say, plain vanilla Haskell with fudgets or whatnot) vs. a more specialized one (Haskell with your FRP libraries/EDSLs). My point was that 'FP', without such qualifications, might be *worse* than an impure but specialized system.

By FRP, I typically mean something like a higher-order synchronous dataflow (SDF) language with nodes as first-class values. Further, assuming it is a robust implementation, I'd imagine domain-specific conveniences like integrating a notion of a timeline and thus the ability to drop frames when behind schedule.

Btw, for the connection to data-binding.. data-binding, in systems I've inspected, does not have good synchrony guarantees nor is first-order in a useful way (just flat templates until you dig around messy underlying APIs). Neither is inherent, but as you start fixing these, you're essentially edging towards the FRP featureset. Syntactic support for applying functions over changing values is actually a debatable feature of FRP implementations (disappointingly, I ran a user study that ended up arguing against transparency!).

Edit: a final feature of FRP highlighted by Conal's work is abstractions over continuous values -- I don't view this as essential, but, in many domains, it is great!

I'm not sure what the

I'm not sure what the relationship between data-binding and FRP is.

As I understand it, databinding is a way of making the pervasive use of the subject-observer pattern more convenient. This design pattern is the basic building block used to build dataflow networks, in which we have a network of cells with dynamic dependencies on one another. (E.g, spreadsheets are an example of this.) These cells may also make use of local state, in addition to their dataflow dependencies.

Then, we program the observers to compute values lazily but propagate change notifications eagerly. Then, an interactive system works with an event loop that updates the values in distinguished input cells, and reads the values in distinguished output cells to get the output values at each time step, with a minimum of recomputation.

We can use these networks to implement an FRP library, which a) restricts these networks to series-parallel form, and b) uses local state to implement feedback. The cool thing is that you can reason about the imperative combinators exactly as if they had the functional stream-transducer semantics, even though they work through pervasive stateful updates everywhere.

I've got a draft about this.
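
For readers unfamiliar with the pattern, here is a hypothetical sketch of such a cell in imperative Haskell (my illustration, not the paper's code): the value is recomputed lazily on read, while invalidation is pushed eagerly to observers.

  import Data.IORef

  -- A cell computes its value lazily on demand, but eagerly notifies
  -- downstream observers when it becomes stale.
  data Cell a = Cell
    { cellValue     :: IORef (Maybe a)   -- Nothing = stale, needs recompute
    , cellCompute   :: IO a              -- reads the cell's inputs
    , cellObservers :: IORef [IO ()]     -- downstream invalidation callbacks
    }

  readCell :: Cell a -> IO a
  readCell c = do
    cached <- readIORef (cellValue c)
    case cached of
      Just v  -> return v
      Nothing -> do
        v <- cellCompute c
        writeIORef (cellValue c) (Just v)
        return v

  invalidate :: Cell a -> IO ()
  invalidate c = do
    writeIORef (cellValue c) Nothing
    readIORef (cellObservers c) >>= sequence_   -- eager change propagation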

Should I think about this,

Should I think about this, in effect, as an approach to showing that an FRP implementation preserves a notion of local-state semantics, and/or as a way of clarifying what this notion means in terms of heap objects? A benefit of an approach like the dataflow semantics of FrTime is that it makes this notion explicit, though at the expense of the clarity of more typical denotational representations. I haven't thought about verifying implementations of them, however: it sounds more like the goal is to show that the library faithfully maintains these properties.

[[Would like to know the mindset for reading this -- as someone who is interested in advancing FRP, I found the intro very unclear in terms of the take-away for me and came back here]]

The paper is primarily about

The paper is primarily about some new proof techniques in separation logic. The dataflow/FRP stuff is a test/demo that they work, because this stuff, like GUIs, is some of the trickiest imperative code I know of. The code for the libraries there is basically what you would immediately write if you were asked to implement an FRP library in an imperative style.

What makes this tricky to verify is that we've got basically one trick to verify imperative code, and it doesn't apply here. That trick is separation: if you know that your program acts on a part of the heap, you also know that it doesn't act on the rest of the heap. But with observers and notifications, you run some code, and then your program runs off and modifies all sorts of listeners you don't necessarily know anything about. So the standard trick in separation logic/linear logic/etc doesn't apply, and I tried to come up with a generalization that does apply.

The takeaway, from the POV of an FRP guy, lies in the fact that we end up with the ability to justify optimizations of the imperative implementation, in terms of reasoning about the functional specification. So, for example, in the FRP calculus we have an equational theory with equations like:

  
  (lift f) >>> (lift g)  == lift (g o f) 

If these combinators are implemented in terms of notifications and callbacks, it would be nice to rewrite terms like the one on the left into the one on the right, because it lets us eliminate the callback structure needed to propagate notifications from the result of f to the input of g. However, it's really hard to justify these optimizations, because the implementation is doing all these crazy low-level pointer-mashing things, and changing it changes what happens in non-obvious ways.
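
As a purely functional reading of these combinators (my sketch, not the paper's verified imperative implementation), the equation is just map fusion on lists:

  -- Stream-transducer reading of the combinators.
  newtype SF a b = SF { runSF :: [a] -> [b] }

  lift :: (a -> b) -> SF a b
  lift f = SF (map f)

  (>>>) :: SF a b -> SF b c -> SF a c
  SF f >>> SF g = SF (g . f)

  -- lift f >>> lift g  =  SF (map g . map f)  =  SF (map (g . f))  =  lift (g . f),
  -- i.e. the equation above follows from map fusion on the functional side.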

But suppose we show that the imperative implementation implements the functional spec. Then we can prove equations on the functional side using only basic mathematical reasoning, and this establishes that transformations of the imperative code continue to realize the desired functional spec.

NVIDIA Fermi/Nexus

http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIAFermiArchitectureWhitepaper.pdf. Also a nice short write up at Ars.

More like Larrabee than expected for this generation. The language support on this GPU becomes more complete (exceptions, function pointers!), along with more advanced development tools (profilers and debuggers). The big question is... are we going to be stuck with C++/Fortran for programming GPUs, or can someone come up with a better functional alternative?

a stack on forth?

port Forth to it, then target that as a back-end for Scheme, ML, etc.? :-}