Register Allocation By Model Transformer Semantics -- need for early comments

Hello,

I'm drafting a paper on a new method for register allocation. Because of my limited experience, I have no idea whether anyone has done this kind of research before, so I hope to get some early comments here before deciding to submit it to a conference. I would appreciate your interest.

Here is the abstract of it:

"Register allocation has long been formulated as a graph coloring problem, coloring the dependency graph with physical registers. Such a formulation does not fully capture the goal of the allocation, which is to minimize the traffic between registers and memory. Minimizing the number of allocated registers, the goal of graph coloring, is an imperfect proxy for the real goal. Linear scan has been proposed as an alternative to graph coloring, but in essence, it can be viewed as a greedy algorithm for graph coloring: coloring the graph vertices not in the order of degrees, but in the order of their occurrence in the program. Thus it suffers from almost the same constraints as graph coloring. In this article, I propose a new method of register allocation based on the ideas of model transformer semantics (MTS) and static cache replacement (SCR). Model transformer semantics captures the semantics of registers and the stack. Static cache replacement relaxes the assumptions made by graph coloring and linear scan, aiming directly at reducing register-memory traffic. The method explores a much larger solution space than that of graph coloring and linear scan, thus providing more opportunities for optimization. It seamlessly performs live range splitting, an optimization which is only found in extensions to graph coloring and linear scan. Also, it greatly simplifies the compiler, and its semantics-based approach provides possibilities for simplifying the formal verification of compilers."

Its full text can be found here. I also gave a talk at Indiana University in November 2011. Here are the slides: [PPT] [PDF].

Hi Yin, I did not have time

Hi Yin, I did not have time to look at your paper in detail, but I look forward to reading it. I skipped to your references section to see the background. I think papers on spill code optimization may be relevant:

"Post Register Allocation Spill Code Optimization" Lupo & Wilken

"Spill Code Minimization by Spill Code Motion" Akira Koseki, Hideaki Komatsu, and Toshio Nakatani

"Precise register-allocation spill-code costing and placement" [Chris Lupo's dissertation--may require an email to him]

"Spill code minimization via interference region spilling" Bergner et al.

"Progressive spill code placement" Dietmar Ebner, Bernhard Scholz, and Andreas Krall

Thanks for the references.

Thanks for the references. Spill code optimization is one of the important topics we have discussed but not implemented in my compiler. I'll look into the papers.

Your paper seems primarily

Your paper seems primarily focused on live range splitting. There is an interesting paper, Register Allocation Deconstructed, that looks at the relative importance of live range splitting, coalescing, register assignment, and spill code optimization. They compare a set of configurations, including optimal register allocation and existing heuristics, and conclude that live range splitting doesn't really matter:

The ability to insert moves (that is, split live ranges or undo coalescing) has surprisingly little impact on code quality. What benefit there is can primarily be obtained by allowing move insertions only at basic block boundaries.

They also conclude that coalescing and register assignment are adequately solved by existing heuristics, as they perform almost as well as the optimal algorithm. They conclude that spill code optimization is most important:

When optimizing for performance on a modern processor, spill code optimization is of paramount importance. Furthermore, the various components of register allocation can be treated separately without significantly impacting performance. Therefore, when targeting processor performance, new register allocator designs should focus on solving the spill code optimization problem as the coalescing, move insertion, and register assignment problems are adequately solved using existing heuristics

Your paper does address this in limited form in section 6. I have not read your paper in detail, but as far as I understand, it can still be improved a lot. For example, if you have this code:

    x = ...
    while(...){
      y = ...
      // x not used in loop
    }
    // x used here

Then your algorithm will spill x to the stack inside the loop:

    x = ...
    while(...){
      save x
      y = ...
      load x
    }
    // x used here

A much better solution is:

    x = ...
    save x
    while(...){
      y = ...
    }
    load x
    // x used here

Though perhaps (and even likely) I have misunderstood your algorithm. Implementation of your register allocator in e.g. LLVM and comparison with LLVM's existing register allocator is of course required to really evaluate your method.

conceptual differences

My method is not really focused on live range splitting; that is just something naturally and seamlessly included. There are some conceptual differences between my method and earlier ones. The register allocator contains only two passes: one for liveness analysis, another for the actual allocation, splitting, and so on. The whole allocation process is seamless and doesn't really contain the phases mentioned in the Register Allocation Deconstructed paper.
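A liveness pass like the one mentioned above can be sketched as a standard backward scan. This is a minimal illustration over straight-line code using hypothetical `(dest, uses)` instruction tuples, not the allocator's actual implementation (a real compiler iterates to a fixed point over the control-flow graph):

```python
def liveness(instrs):
    """Backward liveness over straight-line code.

    instrs: list of (dest, uses) tuples, one per instruction.
    Returns the live-out set for each instruction.
    """
    live = set()
    live_out = [None] * len(instrs)
    for i in range(len(instrs) - 1, -1, -1):
        dest, uses = instrs[i]
        live_out[i] = set(live)
        live.discard(dest)   # dest is defined here, so not live above
        live.update(uses)    # operands must be live before this point
    return live_out
```

For example, for `x = ...; y = x; z = y`, the live-out of the first instruction is `{x}` and of the second is `{y}`.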

One of the interesting aspects is that it doesn't really "spill" variables. Spilling means putting the whole live range of a variable on the stack; my method just swaps variables out to the stack and may load them back later. So I don't think "spill code optimization" applies to the method.

As for your while loop example, the method does generate something close to the latter code. A variable is only loaded into a register when it is actually referenced but not in any register.
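The load-on-use behaviour described above might be sketched roughly as follows. The helper name `ensure_in_register` and its parameters are my own invention for illustration, and the eviction policy is left arbitrary, unlike a real allocator:

```python
def ensure_in_register(var, regs, free, spilled, emit):
    """Load-on-use: fetch a variable into a register only when it is
    referenced and not already resident.

    regs: {var: register} for resident variables
    free: list of unused registers
    spilled: set of variables currently on the stack
    emit: callback receiving generated instructions as strings
    """
    if var in regs:
        return regs[var]          # already resident: no traffic at all
    if free:
        r = free.pop()
    else:
        # Evict some resident variable to the stack. Victim choice is
        # where real policies differ; here it is arbitrary.
        victim, r = regs.popitem()
        spilled.add(victim)
        emit(f"save {victim}")
    regs[var] = r
    emit(f"load {var}")
    return r
```

With one register, touching `x` then `y` produces `load x`, `save x`, `load y`, i.e. the save happens only at the point the register is actually needed for something else.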

How does your algorithm work

How does your algorithm work in that situation, and where in your paper is that described? My understanding is that it arrives at the `y = ...` statement and then notices that it's out of registers. So it puts the only candidate, namely x, on the stack, INSIDE the loop. Or is this reasoning incorrect? How does your algorithm place x on the stack OUTSIDE the loop?

no while loop

Sorry, I meant that the load of x will be inserted outside of the loop; the save will indeed be inside the loop, just above `y = ...`.

But in fact, there are currently no while loops (or for loops) in my intermediate language. The compiler handles tail calls properly, so loops can be transformed into tail-recursive calls. But in either case, the save of x will be INSIDE the loop. I will think of a way to improve this. Thank you so much for the suggestions.

Minimiz[ing] the traffic between registers and memory

Register allocation has long been formulated as a graph coloring problem, coloring the dependency graph with physical registers. Such a formulation does not fully capture the goal of the allocation, which is to minimize the traffic between registers and memory.

I don't think that "minimiz[ing] the traffic between registers and memory" fully captures the goal of register allocation either. This goal statement leads, in the limit, to regarding register-register moves and swaps as free... and some work on SSA-style register allocation does just that.

Assigning a nil cost to register moves and swaps does make most register-allocation-related problems easy, if not trivial. We can simply have arbitrary assignments from values to locations (registers or stack slots) at each instruction, and insert spill-free permutation code between each assignment. Since register-register moves are free, so is the permutation code. We've just redefined our problem from NP-complete (for a decision version, i.e. "can this be allocated in k registers?") to trivially polynomial-time (is MAXLIVE ≤ k). This sort of simplification makes me suspicious.
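For concreteness, the MAXLIVE ≤ k test mentioned above is trivial to compute once live ranges are known. A sketch using hypothetical `(start, end)` interval pairs (end exclusive) for the live ranges:

```python
def maxlive(ranges):
    """Peak number of simultaneously live ranges (MAXLIVE).

    ranges: list of (start, end) program points, end exclusive.
    If register-register moves were truly free, "allocatable in k
    registers" would reduce to maxlive(ranges) <= k.
    """
    events = []
    for start, end in ranges:
        events.append((start, 1))    # range becomes live
        events.append((end, -1))     # range dies (sorts before a
                                     # same-point start, end exclusive)
    events.sort()
    cur = peak = 0
    for _, delta in events:
        cur += delta
        peak = max(peak, cur)
    return peak
```

This is the easy half; as the comment above argues, the hard (and performance-relevant) part is everything the free-moves assumption sweeps under the rug.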

In practice, algorithms based on the assumption that register moves or swaps are free nevertheless include heuristics to forcibly coalesce live ranges, in order to reduce the number of moves and swaps. Otherwise, performance quickly becomes atrocious: even register-register move instructions consume memory bandwidth, I-cache space, and decoding and scheduling resources. So, the goal is *not* strictly to minimise register-memory moves; the statement is a useful approximation, but forgetting that it's an approximation can easily lead us to trivial and useless solutions. In fact, I would argue that it's more harmful to performance than the approximation made by colouring-based allocators (values or variables must be assigned exactly one location for all program points), which merely replaces penalized constraints with hard ones.

With hardware register

With hardware register renaming, register-register moves are pretty close to free, at least in small quantities. But even so, you're right: more modern SSA-based algorithms do attempt to avoid register-register moves. I think the real win is that SSA-based algorithms are vastly simpler than classical algorithms and often outperform them (in allocation time) by an order of magnitude.

Overall, modern processors are so much better at dealing with non-optimal code that the gap created by not attempting to eliminate every last register-register move is closing fast, and might not exist at all. Register allocation results these days are often in such a small margin (<1%) that measurement error and poor statistical techniques are a real concern.

SSA-like, but not SSA

I'd like to mention that although my method doesn't rely on SSA form, it does implicit renaming on variables so that each variable can be assigned only once within the _same_ straight-line code. So it probably has similar advantages to SSA-based methods.

But I do allow the same variable to be assigned in _different_ branches. At the end of branches, variables may be shuffled to consistent allocations (again preferred register table can reduce the shuffling code). I don't know how this compares with SSA, although it certainly prevents duplication of the context after the "double assignment".
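The shuffling to consistent allocations at the end of branches is an instance of the classic parallel-move problem. A sketch of sequentializing such a shuffle, breaking cycles with one scratch location (the `{dest: src}` encoding and the `tmp` name are my assumptions, not the paper's):

```python
def sequentialize(moves, tmp="tmp"):
    """Turn a parallel shuffle {dest: src} into sequential moves.

    Chains are emitted end-first; cycles are broken by saving one
    location to a scratch register. A sketch of the classic algorithm.
    """
    out = []
    pending = {d: s for d, s in moves.items() if d != s}  # drop self-moves
    while pending:
        # A dest that no pending move still reads can be written safely.
        ready = [d for d in pending if d not in pending.values()]
        if ready:
            d = ready[0]
            out.append((d, pending.pop(d)))
        else:
            # Only cycles remain: save one value to tmp to break a cycle,
            # then redirect readers of that location to tmp.
            d = next(iter(pending))
            out.append((tmp, d))
            pending = {dst: (tmp if s == d else s)
                       for dst, s in pending.items()}
    return out
```

For a swap `{a: b, b: a}` this emits `tmp = a; a = b; b = tmp`, which is exactly the shuffle code whose cost the preferred register table tries to avoid.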

register-register move is not considered free

Although my talk slides don't mention it, there are indeed some efforts to reduce register-register moves, although register-memory moves have priority.

I have a coalescing pass before register allocation, so register-register moves don't happen except in two cases: 1) x86-like instruction sets where an operand is both read and written, and 2) shuffling code at branches and call sites.

The latter was addressed (although briefly) in the paper by using a "preferred register table" which is propagated between branches. But you are right: this is probably where the NP-completeness really lies. Without this shuffling, register allocation would be a polynomial-time problem. The preferred register table is a greedy heuristic for reducing register-register moves in shuffles. I haven't yet done experiments to evaluate the performance, so I don't really know how much this trick helps.
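The greedy use of a preferred register table might look something like this sketch (a hypothetical helper illustrating the stated idea, not the paper's actual data structure):

```python
def pick_register(var, free, preferred):
    """Greedy allocation with a preferred register table.

    free: list of currently unused registers
    preferred: {var: register} recording where var lived on another
               branch; honoring it avoids shuffle code at the join.
    """
    want = preferred.get(var)
    if want in free:
        free.remove(want)
        return want          # preference satisfied: no shuffle needed
    return free.pop()        # otherwise take any free register
```

If `x` held `r1` on the other branch and `r1` is still free, `x` gets `r1` and no move is emitted at the join; only when preferences collide does shuffle code appear.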