Spark: Modular, Composable Shaders for Graphics Hardware

If you are interested in the intersection of PL and computer graphics, I invite you to read our SIGGRAPH paper on Spark, a real-time shading language that aims to improve support for good software engineering practice. From the abstract:

In creating complex real-time shaders, programmers should be able to decompose code into independent, localized modules of their choosing. Current real-time shading languages, however, enforce a fixed decomposition into per-pipeline-stage procedures. Program concerns at other scales - including those that cross-cut multiple pipeline stages - cannot be expressed as reusable modules.

We present a shading language, Spark, and its implementation for modern graphics hardware that improves support for separation of concerns into modules. A Spark shader class can encapsulate code that maps to more than one pipeline stage, and can be extended and composed using object-oriented inheritance. In our tests, shaders written in Spark achieve performance within 2% of HLSL.

You can access the paper through either our Intel or Stanford pages. The Intel page also has a link to the source code for our system.

If you are in Vancouver this week for SIGGRAPH, you can catch my talk on Thursday.


Bling has done the cross

Very nice!

Bling has done the cross-cutting of shader stages (CPU, vertex, and pixel) for a few years now (for DirectX 10): you write the C# code and the system decides which stage it should execute in based on dependency tracing (HLSL and CPU code are generated dynamically via staged execution). I got stuck with my technique at geometry shaders, although I think it could handle hull shaders if I ever move to DX11 (dynamic compilation is too expensive in DX11, so my approach is less viable there).

I'm kind of amazed that this paper got through the PC without citing Conal Elliott's Vertigo work. You should definitely check it out; it sets a high standard from the point of view of shader composability.

I'll look more closely at Bling

I've taken some time to look at Bling in the past, but hadn't noticed that the splitting of code across stages was automatic. Is it similar in approach to Renaissance, where all code nominally executes at the highest possible rate, but can be hoisted at the discretion of the compiler? As you note, dealing with the new pipeline stages (Geometry Shaders and Tessellation) presents significant new challenges.

I was aware of Vertigo when we were pursuing this project, and in general I'm a big fan of Conal Elliott's work. The decision not to cite Vertigo was neither accidental nor malicious; we simply felt that other work was more closely related to ours, or more important as background.

I wasn't aware of

I wasn't aware of Renaissance, but yes, it hoists code up to the highest possible level based on a dependency analysis (although using staged compilation as an implementation convenience). The problem with the new shader stages is that they are less functional [edit: they are groupwise, using your terminology] than the original VS/PS stages [pointwise]; it is much more difficult to abstract them away via simple dependency analysis. I think the future of this approach lies in software rendering, where the pipeline is unfixed and you simply stage your computations in a compute shader. The hardware isn't there yet, but it's getting there fast (say ~5 years). The idea then would be to store shading details as computations and data alongside what's being shaded.
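
To make "dependency analysis" concrete, here is a minimal Haskell sketch of the idea (made-up names; this is neither Bling's nor Renaissance's actual code): rates form a total order, and an expression may execute at the slowest stage that can supply all of its inputs.

data Rate = Constant | Uniform | PerVertex | PerFragment
  deriving (Eq, Ord, Show)

data Expr
  = Input Rate String   -- an input bound at a given rate
  | Lit Double
  | Add Expr Expr
  | Mul Expr Expr

-- The inferred rate of an expression is the join (max) of the
-- rates of everything it depends on.
rateOf :: Expr -> Rate
rateOf (Input r _) = r
rateOf (Lit _)     = Constant
rateOf (Add a b)   = max (rateOf a) (rateOf b)
rateOf (Mul a b)   = max (rateOf a) (rateOf b)

-- A uniform scale applied to a per-vertex position is itself
-- per-vertex work, so it can be hoisted out of the fragment stage:
--   rateOf (Mul (Input Uniform "scale") (Input PerVertex "pos"))
--     == PerVertex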

I was aware of Vertigo when we were pursuing this project, and in general I'm a big fan of Conal Elliott's work. The decision not to cite Vertigo was neither accidental nor malicious; we simply felt that other work was more closely related to ours, or more important as background.

And all the closely related work just happens to be from the SIGGRAPH community, which is very convenient for a SIGGRAPH paper :). For a SIGPLAN audience, you would have to be more precise about what you mean by composability, and a contrast with Conal's work would be useful to that end. It would be nice to eventually see closer cooperation between the two communities; I believe we both have something to bring to the table.

OpenCL

How does this relate to OpenCL? Using the interop extension (which seems to be pretty standard), it is possible to share video memory directly between the OpenCL and OpenGL runtimes. Shader programs can then be executed as kernels and the results used directly as vertices or textures in rendering. OpenCL's memory model gives kernels shared-memory input and output, allowing both scatter and gather operations. This would allow pixel shading, vertex shading, and tessellation operations to be performed from the same code.

Without the separation into logical pipeline stages, each of the different rendering steps would be written in a common language using the same memory interface. How does this approach compare to Spark?

If you did this in OpenCL

If you did this in OpenCL without using the fixed pipeline, this is essentially GPU-based software rendering, and you'd have to break everything up yourself (say, if you are doing volumetric rendering). You would still probably create your own logical pipeline, since it's unlikely you could do rendering in one big parallel pass. And if you were doing something more traditional (pushing polygons), you would still have to rasterize between vertex and pixel processing (which I don't think is possible in OpenCL, but please correct me if I'm wrong; see Sweeney 2008 for more info).

In that case, from my reading of the paper, Spark is designed to handle new kinds of stages: it could modularize code in your pipeline, and would be able to assist in plumbing data through to the pipeline stages.

Rasterisation

If you mean rasterize in the sense of having the relevant parts of the hardware pipeline transform and light the polygons for you then I think you would need a separate OpenGL pass to handle it. I think Sweeney meant that if you were writing the entire pipeline as software (in CUDA / OpenCL / ...) then the rasterisation step would not be any different structurally to the rest of the code. The program would still be a series of kernel passes over some custom data-structure.

Currently there isn't any specific support for doing that directly in CUDA / OpenCL, although part of what you need is there in the dependency DAG exported through the API. By itself that would imply quite a low level of abstraction in terms of cutting the problem up into independent kernels and then scheduling them. It sounds like Spark (or something similar) could be useful there. One nice thing that is missing at the moment is a partial evaluator. It's easy to parameterise kernels using constant memory that is faster to read, but I would rather not execute that part of a kernel at all at runtime.
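
To make that wish concrete, here is a minimal Haskell sketch of the kind of specialization I have in mind (illustrative names only, not any real OpenCL or CUDA API): fold away the parts of a kernel expression that depend only on values known before launch.

import qualified Data.Map as M

data Expr
  = Var String
  | Lit Double
  | Add Expr Expr
  | Mul Expr Expr
  deriving (Eq, Show)

-- Specialize an expression against known launch-time constants,
-- folding any subexpression whose operands become literals.
pe :: M.Map String Double -> Expr -> Expr
pe env (Var x)   = maybe (Var x) Lit (M.lookup x env)
pe _   (Lit d)   = Lit d
pe env (Add a b) = simp Add (+) (pe env a) (pe env b)
pe env (Mul a b) = simp Mul (*) (pe env a) (pe env b)

simp :: (Expr -> Expr -> Expr) -> (Double -> Double -> Double)
     -> Expr -> Expr -> Expr
simp _ op (Lit x) (Lit y) = Lit (x `op` y)
simp c _  a       b       = c a b

-- pe (M.fromList [("k", 2)]) (Mul (Var "k") (Add (Lit 1) (Lit 3)))
-- yields Lit 8: the residual kernel does none of that arithmetic
-- at runtime.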

Domain Specificity

Sean's response gets to the heart of the matter: a fast software renderer will typically be structured much like the hardware pipelines, with significant buffering, distribution, or sorting of work between stages. Spark should be equally applicable to a software rendering pipeline as to a hardware one.

More important than the distinction between hardware and software in a rendering system is the distinction between framework and application. Typically the framework for a rendering system defines the overall scheduling and parallelization strategy, while application code in shaders defines the inner loops. This decomposition means that shader programmers typically do not need to be aware of the details of parallel execution (since at the end of the day they are usually writing pure functions).
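
As a toy illustration of that last point (a sketch in Haskell, not Spark or GLSL syntax), a typical per-fragment shading computation is just a pure function from interpolated inputs to a color:

type Vec3 = (Double, Double, Double)

dot3 :: Vec3 -> Vec3 -> Double
dot3 (x1,y1,z1) (x2,y2,z2) = x1*x2 + y1*y2 + z1*z2

-- Per-fragment Lambert diffuse: no side effects, so the framework
-- is free to schedule it over fragments in any order, or all in
-- parallel.
diffuse :: Vec3 -> Vec3 -> Vec3 -> Vec3
diffuse lightDir normal (r,g,b) =
  let k = max 0 (dot3 lightDir normal)
  in  (k*r, k*g, k*b)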

In that sense, OpenCL might be a poor choice for a "shading language", since it lacks the domain-specific features and restrictions that languages like GLSL or Spark have. An OpenCL programmer sees a more-or-less von Neumann machine, and can perform all kinds of side-effecting operations.

Asides:

  • A wrinkle arises in my above argument, since Direct3D 11 now allows these same kinds of general side effects in its Pixel Shader stage.
  • If you are interested in pure software rendering pipelines, then be sure to check out the OptiX system from NVIDIA, as well as NVIDIA's recent HPG publications on their CUDA rasterizer and voxelizer.

Also:

One nice thing that is missing at the moment is a partial evaluator. It's easy to parameterise kernels using constant memory that is faster to read, but I would rather not execute that part of a kernel at all at runtime.

Agreed.

Like RTSL, Spark has a @Constant computation rate to represent compile-time computation, but it does not yet support @Constant parameters. That is one of my wish-list features, so I hope to implement it as I continue to develop the system.

Partial evaluation

After a second read, I'm wondering if you have looked at memoisation of values across stage boundaries. An example would be your comments in section 2.4 in reverse, i.e. a fragment shader that relies on a value that is going to change at a lower rate (per vertex or per group). Rather than the unpredictable semantics of writing to a variable "belonging" to an earlier stage, I mean the problem of spotting a computation that depends only on reading that variable and lifting it to an earlier stage in the pipeline. The answer may well be in the paper, but I can't tell what kind of code motion you are doing to your imperative subroutines when they are composed together. Overall, your paper is a very interesting read; the separate execution environments in the pipeline stages are one of the most convincing arguments I've come across for separating and composing concerns in code.

Predictability

If I understand you correctly, you are describing the kind of approach taken in Renaissance: a computation may be hoisted to an earlier stage if the compiler can prove that this won't change semantics. If you can detect linear operations (in the geometric sense) then you can be quite aggressive at this (and Spark supports a type-class-like Linear[T] constraint).
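
The intuition behind the linearity condition, as a small sketch (illustrative Haskell, not Spark's actual Linear[T]): for a linear f, interpolating outputs is the same as applying f to interpolated inputs, so evaluating f per-vertex and letting the rasterizer interpolate is exact.

lerp :: Double -> Double -> Double -> Double
lerp t a b = (1 - t) * a + t * b

-- A linear function of its argument:
f :: Double -> Double
f x = 3 * x

-- For all t, a, b:  lerp t (f a) (f b) == f (lerp t a b),
-- so f may safely be hoisted to an earlier stage. A non-linear
-- g (say, g x = x * x) breaks this equation, and hoisting it
-- would change the rendered image.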

Performing this hoisting involves a time/space tradeoff, though, and it would be very difficult for a compiler to evaluate this tradeoff robustly. If you compute and store a value early in the pipeline, then you incur a cost in bandwidth (to propagate that value along the pipeline) and storage (in finite on-chip buffers). In the limit, the amount of storage you use can begin to limit your ability to exploit parallelism.

Given this situation, both Spark and RTSL err on the side of predictability and follow simple inference algorithms that deterministically assign computations to stages. An expert programmer can take advantage of this predictability to achieve the partitioning they want.

I have heard the anecdote that John Carmack was initially quite skeptical of RTSL because he had been led to believe that an optimizer would assign computations to stages behind the programmer's back, so there is at least anecdotal support for our decision. If our target users were non-experts, of course, this might not be as desirable a choice.

None of this detail is in the paper, simply because of space constraints. Thanks for bringing this up.

GPipe

FWIW, I agree with your design decision to keep it predictable. At a meta-level, it is a lot easier to add an expert system to a predictable system than it is to wrest a predictable system from the 'expert'.

I'm wondering what it would take to get features of Spark into Haskell, i.e. perhaps as an EDSL above GPipe.

Interesting

There is a lot of similarity there already; a GPipe Fragment Float is in principle quite similar to a Spark @Fragment float, with the GPipe rasterize functions being equivalent to the "plumbing operators" that typically get applied implicitly in Spark.

The biggest difference I can see (and I'll admit to not knowing much about GPipe) is that Spark scopes values with these rate-qualified types inside of shader classes. This is a limited form of virtual types, so a @Fragment float in one shader class isn't type-compatible with a @Fragment float from another shader class.

Does GPipe draw this distinction, or can I confuse the system by adding together two Fragment Float values that were derived by rasterizing different primitives? My read of the docs says that I can, since Fragment Float is an instance of Num. I initially tried to mock up the system that eventually became Spark in OCaml with a similar approach to this, but my inability to prevent that case was one of the things that pushed me toward virtual types.

Certainly if anybody is interested in cherry-picking ideas from Spark and bringing them to GPipe or other projects, I would be happy to lend my perspective. As Sean noted earlier in this thread, there could be a benefit from closer cooperation between graphics researchers and the "real" PLT researchers (I'm obviously just a dilettante).

Rate-qualified types

In GPipe, you can add two Fragments together, but you cannot compose them arbitrarily - i.e. you are limited to a relatively 'point-free' composition style using functor and monoid operators. This applies to both the primitive stream and the fragment stream. Architectural constraints, I think, serve a similar role.

But, while I've twiddled with GPipe a little, I can't claim to be an expert with it. Using phantom types or the like to protect stages might serve as a useful enhancement for the GPipe model.
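
Here is a sketch of what I mean (hypothetical types, not GPipe's actual interface): give rasterize a rank-2 type so that each rasterization introduces a fresh phantom tag, and fragments from different passes can no longer be mixed.

{-# LANGUAGE RankNTypes #-}

newtype PrimitiveStream = PrimitiveStream [Double]
newtype Fragment prim a = Fragment a

-- Addition is only allowed between fragments sharing a tag:
addF :: Num a => Fragment prim a -> Fragment prim a -> Fragment prim a
addF (Fragment x) (Fragment y) = Fragment (x + y)

-- Each call introduces a fresh, abstract 'prim', so the result of
-- one rasterization cannot unify with the result of another; code
-- that tries to addF across two rasterize calls is rejected at
-- compile time.
rasterize :: PrimitiveStream
          -> (forall prim. [Fragment prim Double] -> r) -> r
rasterize (PrimitiveStream xs) k = k (map Fragment xs)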

Regarding rate-qualified types more generally:

I have been developing an asynchronous arrows model that allows me to control composition very precisely between elements that are potentially at different times or places, or that update at different rates. This was developed in the context of distributed systems, but I think it will prove very effective for managing a GPU shader pipeline (or even more than one GPU).

Asynchronous arrows introduce an asynchronous product type (x :&: y) representing values (signals, in RDP) at different times or places, and an asynchronous sum type (x :|: y) representing an exclusive choice of behavior (like if/else). Asynchronous sums and products are not values themselves, and are only used in the context of behaviors.

I currently have 24 behaviors for asynchronous arrows in RDP, 16 of which are generic to asynchronous arrows:

-- behaviors for asynchronous arrows
data B a x y where
  -- Domain Behavior Extension
  Bop :: a x y -> B a x y

  -- Category
  Bid :: B a x x
  Bseq :: B a x y -> B a y z -> B a x z
  
  -- Product
  Bprod :: B a x y -> B a x' y' -> B a (x :&: x') (y :&: y')
  Bswap :: B a (x :&: y) (y :&: x)
  Bdup  :: B a x (x :&: x)
  Bfst  :: B a (x :&: y) x
  Bsnd  :: B a (x :&: y) y
  Bapl  :: B a (x :&: (y :&: z)) ((x :&: y) :&: z)
  Bapr  :: B a ((x :&: y) :&: z) (x :&: (y :&: z))

  -- Sum
  Bsum  :: B a x y -> B a x' y' -> B a (x :|: x') (y :|: y')
  Bmirror :: B a (x :|: y) (y :|: x)
  Bmerge :: B a (x :|: x) x
  Binl  :: B a x (x :|: y)
  Binr  :: B a y (x :|: y)
  Basl  :: B a (x :|: (y :|: z)) ((x :|: y) :|: z)
  Basr  :: B a ((x :|: y) :|: z) (x :|: (y :|: z))

-- RDP Signal Behaviors
data S t a x y where
  -- Effects and Dynamic Behaviors Extension
  Sop :: a x y -> S t a x y

  -- Basic Signals Operations
  Sfmap  :: (SigFun s f t) => f x y -> S t a (s x) (s y)
  Sdrop  :: (Signal s t) => S t a (s x) (s ())
  Sconv  :: (SigLift s s' t) => S t a (s x) (s' x)
  
  -- Signals Composition
  Szip   :: (Signal s t, Ord (Diff t)) 
         => S t a (s x :&: s y) (s (x,y))
  Ssplit :: (SigSplit s t)
         => S t a (s (Either x y)) (s x :|: s y)

  -- Time and Buffering Control
  Sdelay :: (SigDelay s t) => Diff t -> S t a (s x) (s x)
  Speek  :: (SigPeek s t) => Diff t -> S t a (s x) (s x :|: s ())
  Ssynch :: (Ord (Diff t)) => S t a x x

'Ssynch' is a multi-path synchronizer that equalizes logical delay on every branch. This is useful because we don't always know the history of delays that led to a value, and 'Ssynch' serves as a logical 'zero' point - albeit statelessly (relying on the fact that RDP has only static delays at every given stage of dynamic behavior). For an asynchronous product, this allows synchronized operations on distributed systems. In the case of an asynchronous sum, for RDP signals, Ssynch supports a perfect transition between effects - no gaps or overlaps. It can also serve as a clean transition between GPU stages - when we want one.
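
As a small usage sketch against the GADTs above: zipping two signal branches whose delay histories may differ is just synchronize-then-zip, and delaying one branch composes cleanly with it.

-- Align two branches of unknown relative delay, then zip them.
syncZip :: (Signal s t, Ord (Diff t))
        => B (S t a) (s x :&: s y) (s (x, y))
syncZip = Bop Ssynch `Bseq` Bop Szip

-- Delay only the second branch; Ssynch then raises the first
-- branch's logical delay to match before the zip.
delaySecond :: (Signal s t, SigDelay s t, Ord (Diff t))
            => Diff t -> B (S t a) (s x :&: s y) (s (x, y))
delaySecond dt = Bprod Bid (Bop (Sdelay dt)) `Bseq` syncZip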

RDP is much more expressive than most declarative languages. One can statelessly achieve many idioms that are traditionally considered stateful. Use of sum types provides static choice (e.g. if/then/else). With demand-effects, one can utilize a demand-monitor (a (s x) (s ()), a (s ()) (s [x])) to combine information from independently developed behaviors. This would allow, for example, cross-cutting concerns to be combined before building a dynamic behavior that manages the GPU.

I expect my RDP model will prove adept at controlling a GPU pipeline, or even many of them, while also supporting application- and service-level requirements (e.g. loading textures just in time, anticipating user behavior and what will be viewed next, etc.).

I've known for many years that the problems of secure distributed heterogeneous computing are remarkably parallel to the problems of high-performance heterogeneous embedded computing (e.g. with GPUs and FPGAs). I find it fascinating to watch programming models converge.

But, today, I still need a DSL to ensure the types are GPU-compatible and such. So the work done in GPipe and Spark definitely interests me.

Performing this hoisting

Performing this hoisting involves a time/space tradeoff, though, and it would be very difficult for a compiler to evaluate this tradeoff robustly. If you compute and store a value early in the pipeline, then you incur a cost in bandwidth (to propagate that value along the pipeline) and storage (in finite on-chip buffers). In the limit, the amount of storage you use can begin to limit your ability to exploit parallelism.

There is also a quality tradeoff: in Bling, for normal calculation, you have to explicitly shove the expression into the pixel shader (given that interpolation via the vertex shader is not good enough). Bling provides control via a rate operator (PerPixel()).

I have heard the anecdote that John Carmack was initially quite skeptical of RTSL because he had been led to believe that an optimizer would assign computations to stages behind the programmer's back, so there is at least anecdotal support for our decision. If our target users were non-experts, of course, this might not be as desirable a choice.

The right analogy is garbage collection: control vs. ease of use. Some programmers are going to benefit from an automatic system, even if they don't get the fastest possible solution and have to deal with more degenerate cases where automation can't deliver a desirable outcome. I'm for defaulting to automatic, but allowing manual control as necessary and desired.

Re: Quality

There is also a quality tradeoff: in Bling, for normal calculation, you have to explicitly shove the expression into the pixel shader (given that interpolation via the vertex shader is not good enough). Bling provides control via a rate operator (PerPixel()).

Do you currently assume that all texture fetches should be performed per-fragment ("pixel" in D3D-speak)? Otherwise I would expect you to see the same problem with texture fetches being hoisted and performed per-vertex, with a similar degradation in quality.

I must say, this thread is giving me a lot of great pointers to work that I'll need to cite and discuss in my dissertation. Thank you, LtU.

All texture accesses must

All texture accesses must occur in pixel/fragment shaders, so all dependent computations are implicitly shoved down into the PS. DX10 didn't offer very good access to textures in the vertex shader, so it was a simplifying assumption that they were all performed in the PS. For Bling, I copied Conal's computational geometry examples from his Vertigo paper, so there was a real decision to make with respect to per-vertex or per-pixel normals.
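
To make the quality issue concrete, a quick sketch (illustrative Haskell, toy numbers): interpolating unit normals shortens them, so lighting computed per-vertex and interpolated comes out darker than lighting computed per-pixel from a renormalized normal.

type Vec3 = (Double, Double, Double)

lerpV :: Double -> Vec3 -> Vec3 -> Vec3
lerpV t (x1,y1,z1) (x2,y2,z2) =
  (x1 + t*(x2-x1), y1 + t*(y2-y1), z1 + t*(z2-z1))

normalize :: Vec3 -> Vec3
normalize (x,y,z) = let l = sqrt (x*x + y*y + z*z) in (x/l, y/l, z/l)

-- Midway between unit normals (1,0,0) and (0,1,0), the raw
-- interpolant has length sqrt 0.5 ~= 0.707, so a Lambert term
-- computed from it is about 29% too dark; normalizing (or
-- recomputing the normal) per pixel recovers a unit vector.
midRaw, midFixed :: Vec3
midRaw   = lerpV 0.5 (1,0,0) (0,1,0)  -- (0.5, 0.5, 0.0)
midFixed = normalize midRaw           -- (~0.707, ~0.707, 0.0)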

Sounds like it will be a fun dissertation to write!