Declarative Assembler

Declarative Assembler. Christopher Kumar Anand, Jacques Carette, Wolfram Kahl, Cale Gibbard, Ryan Lortie. SQRL Technical Report #20, 2004.

As part of a larger project, we have built a declarative assembly language. This language enables us to specify multiple code paths to compute particular quantities, giving the instruction scheduler more flexibility in balancing execution resources for superscalar execution. The instruction scheduler is also innovative in that it includes aggressive pipelining, and exhaustive (but lazy) search for optimal instruction schedules. We present some examples where our approach has produced very promising results.

I think this paper is a nice follow-up to the recent discussion of SPE, because it goes further than that paper by analyzing what data are necessary to achieve the ultimate goal of optimal or near-optimal instruction scheduling on superscalar architectures. In other words, it strongly suggests that we can do better than simply embedding low-level instructions in a high-level language: instead, we can embed a graph of desired results (vertices) and instructions for reaching them (edges), which can be analyzed to exploit cache coherency, pipelining tricks, instruction-level parallelism, etc., or which could be supplied to an external scheduling algorithm based on some entirely different paradigm.
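
To make that concrete, here is a minimal Haskell sketch of such a representation. All names are hypothetical, and this is far simpler than the paper's actual code graphs; it only records which quantities each instruction consumes and produces.

    -- Vertices are named quantities; each instruction is an edge
    -- from the quantities it consumes to the quantities it produces.
    type Quantity = String

    data Instruction = Instruction
      { insnOp   :: String      -- e.g. "fma", "shufb"
      , insnUses :: [Quantity]  -- quantities consumed
      , insnDefs :: [Quantity]  -- quantities produced
      }

    -- The scheduler is free to pick any subset of edges that
    -- derives the requested outputs from the available inputs.
    data CodeGraph = CodeGraph
      { cgInputs  :: [Quantity]
      , cgOutputs :: [Quantity]
      , cgEdges   :: [Instruction]
      }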

Thanks

I should point out that there is a reason this is a technical report -- it was rejected by the conference (PADL) we sent it to. And you know what? The referees were right. The paper is much too vague about the semantics of our language, and about what exactly a "scheduler" is [since that's the heart of what gives the whole thing a proper foundation].

The semantics aspects were much better addressed in Control Flow Semantics for Assembly-Level Data-Flow Graphs (pdf).

My colleagues continued the low-level work on scheduling, and that work is in A Domain-Specific Language for the Generation of Optimized SIMD-Parallel Assembly Code, which concentrates on Cell scheduling.

Having said that, we still firmly believe in one fundamental idea: that one ought to be able to give multiple ways to compute a given quantity (different code paths), and then let a scheduler decide which path is the proper one to choose "in context". While efficiency is an obvious goal, there are different ways to achieve it. This gets especially interesting when dealing with numerical codes, where you can provide algorithms for evaluating a special function (anything from Abramowitz and Stegun) at different precisions; by this I mean 3 decimal digits of accuracy, 4 digits, ... up to full double precision (i.e. 53 bits). Some computations frequently need only rather low precision [e.g. plotting], so one can be vastly more efficient.
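
As a toy illustration of what such a table of code paths could look like (all names and cost numbers here are made up; this is not the interface from any of the papers):

    data Variant = Variant
      { digits :: Int              -- decimal digits of accuracy delivered
      , cost   :: Int              -- rough cycle estimate
      , run    :: Double -> Double -- the evaluation routine itself
      }

    -- e.g. exp near 0, by truncated Taylor series of increasing order
    expVariants :: [Variant]       -- assumed sorted by increasing cost
    expVariants =
      [ Variant 3  6  (\x -> 1 + x + x*x/2)                    -- ~3 digits, small x
      , Variant 5  10 (\x -> 1 + x + x*x/2 + x^3/6 + x^4/24)   -- ~5 digits, small x
      , Variant 16 40 exp                                      -- full double precision
      ]

    -- cheapest variant meeting the accuracy required in context
    -- (assumes the list ends with a full-precision fallback)
    pick :: Int -> [Variant] -> Variant
    pick need = head . filter ((>= need) . digits)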

Especially on SPUs

This is especially important for SPUs, as there are multiple ways to get the "job done". The instruction set is designed such that there are several ways to perform the same logical operation. For instance, the operation "transpose a 4x4 matrix" can be done using 8 shuffles (8 odd instructions), but also using 1 even and 7 odd instructions, or 2 even and 6 odd, or even 10 even and 3 odd instructions. The right choice depends on whether the loop is even-bound or odd-bound, because that limits the minimum initiation interval: II >= mII = max(OddInstructions, EvenInstructions) for SPUs.
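
Here is a small Haskell sketch of that balancing argument, using exactly the transpose variants listed above (the names are mine; a real scheduler must of course also account for latencies and dependences):

    import Data.List (minimumBy)
    import Data.Ord  (comparing)

    -- instruction counts on the two SPU pipelines
    data PipeCount = PipeCount { evenOps :: Int, oddOps :: Int }

    -- the pipeline-balance lower bound quoted above
    mII :: PipeCount -> Int
    mII (PipeCount e o) = max e o

    -- candidate 4x4 transposes: (even, odd) instruction counts
    transposes :: [PipeCount]
    transposes = [PipeCount 0 8, PipeCount 1 7, PipeCount 2 6, PipeCount 10 3]

    -- given the rest of the loop body, pick the transpose that keeps
    -- the combined loop as balanced as possible
    bestTranspose :: PipeCount -> PipeCount
    bestTranspose (PipeCount e o) = minimumBy (comparing combined) transposes
      where combined (PipeCount ve vo) = mII (PipeCount (e + ve) (o + vo))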

Re: Thanks

Thanks for the links to the other writeups; I look forward to exploring them!

Loop-Scheduling

This paper touches on a huge area within optimizing compilers. The main problem is that of efficiently scheduling loops under the constraints of minimizing register pressure (though usually not a problem on SPUs) and minimizing register transfers, as well as minimizing the initiation interval (II), the number of cycles between the starts of successive iterations, which effectively determines how fast the loop runs (total time = n * II + initialization time + prologue time + epilogue time).

The problem is NP-complete, but there are quite a few practical approaches. Several authors have used integer linear programming (ILP), and some have used constraint logic programming (CLP) to solve it. There are also well-known heuristics that work fairly well, and there are methods that use random seeding to quickly find fairly good solutions (though not optimal in the sense described above).
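
For flavour, here is a toy Haskell sketch of the two standard lower bounds on II that such schedulers start from; this is software-pipelining folklore rather than anything specific to this paper, and the search then proceeds II = mII, mII + 1, ... until a feasible schedule is found:

    -- resource bound: (ops needing a unit class, units available);
    -- both argument lists are assumed non-empty
    resMII :: [(Int, Int)] -> Int
    resMII classes = maximum [ ceilDiv ops units | (ops, units) <- classes ]

    -- recurrence bound: (total latency around a dependence cycle,
    -- loop-carried iteration distance of that cycle)
    recMII :: [(Int, Int)] -> Int
    recMII cycles = maximum [ ceilDiv lat dist | (lat, dist) <- cycles ]

    ceilDiv :: Int -> Int -> Int
    ceilDiv a b = (a + b - 1) `div` b

    -- starting point of the search for a feasible II
    minII :: [(Int, Int)] -> [(Int, Int)] -> Int
    minII classes cycles = max (resMII classes) (recMII cycles)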

Of course, the more complex the loops get, the slower and more complicated the solutions become. Branchless single loops are much easier to handle than multiple loops with branches.

If DSELs are languages, how can you implement them?

Thank you for pointing out the connections to our cousin-discipline's daily practice; you've hit on one of the two key design points that I think the paper raises.

In my mind, this paper raises two basic questions:

  1. what is an interesting DSEL?
  2. where are DSELs useful and at what cost (relative to other techniques)?

The answer the paper gives to the first question, and the one that appears to be elaborated in Jacques' new citations, is the one that you helpfully pointed out; namely, that "interesting DSELs include ones that optimize instruction scheduling in the way that optimizing compilers do".

However, the answer that it gives to the second question (implicit in your comment, but not strongly articulated) is, to my mind, even more interesting. This series of papers, together with the SPE paper and some older work on DSELs for graphics cards, constraint programming, hardware synthesis and modeling, security, etc., argues that a domain-specific embedded language can be implemented by an interpreter (as is common), by a clever assembler (as this paper suggests), or even by an optimizing compiler. Once we are aware of that, we greatly change our view of what problems can be solved, at what cost, by domain-specific embedded languages.
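
A minimal Haskell sketch of that point (the datatype and functions are purely illustrative, not taken from any of the cited papers): one deep embedding, two implementations of very different character, where the code-generating one is free to analyze and rewrite the term before emitting anything.

    -- one deeply embedded expression language...
    data Expr
      = Lit Double
      | Var String
      | Add Expr Expr
      | Mul Expr Expr

    -- ...implemented by a straightforward interpreter...
    eval :: [(String, Double)] -> Expr -> Double
    eval _   (Lit d)   = d
    eval env (Var v)   = maybe (error ("unbound: " ++ v)) id (lookup v env)
    eval env (Add a b) = eval env a + eval env b
    eval env (Mul a b) = eval env a * eval env b

    -- ...or by a code generator, which an optimizing back end could
    -- precede with rescheduling, CSE, strength reduction, etc.
    compile :: Expr -> String
    compile (Lit d)   = show d
    compile (Var v)   = v
    compile (Add a b) = "(" ++ compile a ++ " + " ++ compile b ++ ")"
    compile (Mul a b) = "(" ++ compile a ++ " * " ++ compile b ++ ")"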

Reasoning about code graphs

Those code graphs look more complex than the ones one usually finds in the term-graph literature...

They look well-suited for treatment by Yves Lafont's algebra for reasoning about circuits: see, e.g., his Towards an Algebraic Theory of Boolean Circuits, which may provide a good, computationally tractable theory of equality for the class of code graphs, and might provide a basis for an alternative semantics of code graphs.