Tagged Arithmetic Optimization

Has there been a paper describing optimizations to arithmetic for tagged pointer representations? For instance, if we steal one bit to distinguish pointers from integers and use 0 as the tag value for integers, we don't need any shifting to normalize the result of an integer addition back to its tagged representation:

-- integer i is represented as a word shifted by 1 bit:
-- word = tag(int)
tag(int) = int << 1
         = int * 2

-- int = untag(word)
untag(word) = word >> 1
            = word / 2

-- no shift needed; subtraction the same
addition = int0 + int1
         = tag( untag(word0) + untag(word1) )
         = tag( (word0 / 2) + (word1 / 2) )
         = tag( (word0 + word1) / 2)
         = 2 * (word0 + word1) / 2
         = word0 + word1

-- one right shift needed to normalize
multiplication = int0 * int1
               = tag( untag(word0) * untag(word1) )
               = tag( (word0 / 2) * (word1 / 2) )
               = tag( word0 * word1 / 4 )
               = 2 * word0 * word1 / 4
               = word0 * word1 / 2

-- one left shift needed to normalize
division = int0 / int1
         = tag( untag(word0) / untag(word1) )
         = tag( (word0 / 2) / (word1 / 2) )
         = tag(word0 / word1)
         = 2 * (word0 / word1)
etc.

Obviously these can be derived from basic arithmetic, but it's a bit tedious, so I was wondering if there was a reference describing a full set of such identities, or any more sophisticated optimizations. Perhaps a tag calculus of sorts.

Of course, care must be taken to handle overflow on the word when adding and multiplying, but I think "branch on overflow" is a fairly common instruction, and would provide for an efficient implementation to tweak the result.

LtU isn't really the place

LtU isn't really the place for an implementation discussion of this sort. Get in touch with me offline.

Really OT?

Hmm, this isn't a discussion of any particular language implementation, and I'm not trying to design it myself, just asking for any references, so I think it's on-topic for LtU, just like GC discussions are generally considered on-topic. Most functional language implementations utilize some sort of tagged representation, so these sorts of optimizations could have a significant impact on their numerical performance.

A detailed discussion, as

A detailed discussion, such as seems to be developing, goes into low-level details that may not be of general interest. The original question, especially as a request for references, seems to me to be on topic.

If there is interest in having a discussion of the hardware issues involved etc., I suggest taking it offline.

Avoiding overflow

Thanks, I didn't know about this nice trick, but IMHO you should take into account the need to avoid overflow: I wouldn't implement multiplication as 'word0 * word1 / 2', but as 'word0 * ( word1 / 2 )'.

What I mean is that 'branch on overflow' is fine when there is a true overflow, but not when it is a 'false' one, as in 'word0 * word1 / 2'.

I wouldn't implement

I wouldn't implement multiplication as 'word0 * word1 / 2', but as 'word0 * ( word1 / 2 )'

Good idea actually, since you lose no precision, given that word = int * 2. So the multiplication is essentially: word0 * untag(word1).

ZINC paper

IIRC, that technique is described in Xavier Leroy's ZINC paper.

You are correct. Here's the

You are correct. Here's the ZINC paper. For some reason I had trouble finding it before.

Leroy describes an encoding where the low-order bit is 1, not 0 as I suggested, so a little more arithmetic is required. He mentions the other encoding, noting only that memory accesses through pointers might be slower, because pointers must be masked before being used. But if you're using constructor inlining techniques like GHC's, you're masking the pointer anyway, so the extra tag bit costs nothing.

[[Current ML implementations

[[Current ML implementations require lots of resources, both in CPU power and in memory (for instance, CAML needs at least three MIPS and five megabytes of memory to get decent response time).]]

I didn't notice that it was a 1990 paper at first so this sentence made me smile.

Given the slowness of misaligned access to objects (and the non-portability of this feature), IMHO masking the pointer would be faster. As for whether it's best to associate the 0 tag with integers or with pointers, I think only measuring the performance difference in the generated code can help decide.

pointer-tagging with non-0 bits

On some architectures, memory addressing modes include a constant offset, so you can subtract 1 from a pointer and then dereference it at no extra cost. (I'm pretty sure I read about this in some programming language implementation paper; but I can't remember what paper (or even what language).)

pointer-tagging with non-0 bits

Yes, almost all architectures include a (register + offset) addressing mode (IA-64 is a rare exception), but where using an offset makes the instruction longer (x86), there is a small cost in fetch and cache pressure. It is not an uncommon tactic, as it frees up the all-0 tag for unboxed fixnums.

Lisp implementations sometimes use (register + offset) addressing to get runtime type checks for free on machines that trap misaligned loads/stores — SPARC, MIPS, Alpha etc. The trap itself is expensive but usually acceptable for an exceptional condition.

Unfortunately,

Unfortunately, (register+offset) addressing wouldn't work in the presence of constructor inlining, since the next 1 or 2 highest bits are used to encode constructors tags.

Unscientific benchmarks, constructor discrimination

Some unscientific microbenchmarks based on Ian Piumarta's COLA framework imply that a tag of 1 performs better at -O3, but a tag of 0 performs better at every other optimization level. With constructor inlining, I'd speculate a tag of 0 would be superior on average, but I'll have to measure for myself when I get around to implementing this.

Leroy's ZINC paper suggests that constructor inlining wouldn't actually improve execution time because it introduces more branching (section 4.3.3, "Constant Constructors"). Essentially, treating empty constructors as integers introduces an extra branch when pattern matching since you must now test if it's an integer first. If instead we statically allocate a constant heap object for all instances of a constant constructor, then constructor tag access is uniform.

This is only for the case of constant constructors, not constructor inlining in general. Have there been any measurements on this specific claim? I know Haskell does constructor inlining, but I haven't come across this specific measurement.

A thought occurred to me while reading this section: the tag of the constructor could be a relative jump offset to a code fragment which then jumps to the particular case handler for a given pattern match. Typically, constructor discrimination is:

  1. Fetch constructor tag (potential cache miss)
  2. Add tag value to base dispatch table address
  3. Fetch address from table (potential cache miss)
  4. Indirect jump to address to handle case (high chance of pipeline mispredict)

Alternatives include linear or binary search on constructor tags, which potentially incur multiple branch mispredicts, though no cache misses. Using the tag itself as a relative offset consists of:

  1. Fetch constructor tag (potential cache miss)
  2. Relative jump to current PC+tag (high chance of mispredict)
  3. Direct jump to case handler (no mispredict)

So we eliminate at least one potential cache miss, though it's unclear how likely the jump table will be in cache. Has this idea been studied at all? I haven't been able to find anything, but I'd be very surprised if I was the first to think of it.

Incorrect

I assume you meant: i << 1 = i * 2 as well as i >> 1 = i / 2. [Warning, this is not correct!]

Notice that with negative numbers, it is not true that i >> 1 = i / 2. For instance, (-1) >> 1 equals -1 whereas (-1)/2 equals 0.

To avoid these problems, just use /2 instead of >>1. The compiler is supposed to optimize that.

I wasn't using the C

I wasn't trying to use the C convention, just specifying the identities using shift operators.

Good point re: negative numbers. It must be an arithmetic shift, not a logical shift, to preserve the sign. Multiplication by a power of two and shifting supposedly have the same cost on modern CPUs, so it shouldn't matter which is used. Still, the shifts make it very clear what is going on at the bit level.

Even arithmetic shifts to

Even arithmetic shifts to the right do not yield correct results for negative numbers: they round down instead of toward zero (as in his example). Don't strength-reduce divides using an arithmetic shift right unless you can show the dividend is non-negative. It's rumored that this is the most common compiler error ever.

Thanks for the heads up!

Thanks for the heads up!

Not here

In the general case you're right, but not here, I think: remember these are 0-tagged integers, so the low bit is always 0.

So in this case I think x >> 1 (arithmetic shift) is equal to x / 2, since x is always even and in particular cannot be -1.

optimizing right-shift/divide

The compiler is supposed to optimize that.

Well, it will try; but there's a limit to how much it can do. For instance, on x86, gcc produces 4 instructions for i/4 but only one instruction for i>>2.

Note that for the tagged-arithmetic examples in the original post, the difference between arithmetic right shift and division is unimportant, since the division is always exact.

(-1)/2 equals 0

Of course, this depends on your programming language; for instance, I believe that C99 defines division this way, but that earlier versions of C left this operation implementation-defined.

Implementation-defined

Thanks for reminding me about the C strategy to semantic issues; dodge, weave, bob, and poof, gone :-)

Q: "What does that operator do?"
A: "Oh....whatever the hardware feels like that day."

Why tags

So may I ask why tagged arithmetic at all? Why not go for a precise GC with refmaps for the registers, the stack, and all heap entities? Is your language statically typed?

BTW, there are some crazy schemes you could explore for tags, like storing the tag bits for heap values separate from the heap itself (i.e. one bit for each word of the entire heap that is updated as it is mutated; essentially a giant refmap). You can do this for stack locations as well, but for registers you may need to temporarily expand values to 64-bits or use a separate register for the tag bit, or a separate tag bit register which stores the tag bits of all the other registers. This scheme has the overall advantage of allowing the full 32 bit range of values for integers.

BTW, there are some crazy

BTW, there are some crazy schemes you could explore for tags, like storing the tag bits for heap values separate from the heap itself

I actually linked to one such scheme in the original post. They're simply more complicated than a simple tagging scheme one can implement with C macros. When one is just trying to get a simple interpreter up and running, simplicity is paramount. :-)

Well I scanned that paper

Well I scanned that paper and am aghast that it was publishable in 2006. Alas, JVMs have been doing precise GC with refmaps per object since their beginning (and Smalltalk VMs even before that).

This paper is doing statically computed reference maps. I was talking about having a dynamic reference map that stores the tag bits for values separate from the values themselves, with no static analysis necessary.

Really? I thought the only

Really? I thought the only bit maps involved were card marking updated pointers. All polymorphic Java types are boxed, so I thought handling unboxed polymorphic types without code specialization was relatively new.

Unless by "bit maps" you mean the type descriptor linked from the heap object which describes the layout. Functional languages try to avoid this because of the extra 4 bytes per object overhead for the descriptor pointer -- although you could eliminate that by placing the object tag in the type descriptor, but then constructor discrimination requires an extra memory access.

I'd very much appreciate a link to get me started if you have one!

Data Space Refmaps

If you're going to keep exact information, might as well do it in style. Why not keep exact type information for every register and every entry on the stack (and, by extension, every 'live' entry in the heap)?

A basic design would be to give each 'frame' in the stack a designated slot indicating the 'class' or 'vtable' of that stack frame - i.e. a pointer to a collection of methods that return extra information about that stack frame. That extra information would include:

  • the size of that stack frame
  • the type + representation of every entry on the stack frame
  • a list of 'trashed' registers for that stack frame (could be used for JIT compilation and optimized register allocation; one can assume that the caller saved any relevant frame pointers prior to calling this stack frame)
  • type + representation information for every single register currently in use by that stack frame
  • special notations for 'inner' pointers (pointers into an array, for example) indicating that they don't need to be collected individually; the compiler can enforce that inner pointers always have a paired outer pointer earlier on the stack
  • information for exception handling (e.g. which handlers are available, etc.)
  • information for destruction of the frame, in particular which 'destructors' need to be run (e.g. 'fini' statements, or D's 'scope()' statements, or C++'s destructors)
  • possibly information for the debugger, such as variable names, current line in source code based on current return address and PC, function names, etc.

The information wouldn't be entirely constant for the stack frame. That is, these would effectively be methods on stack 'objects' that take, as input, the pointer to that particular frame, including the return address and PC from the frame above it on the stack (if this frame doesn't happen to be at the top of the stack). If a frame has dynamic extents (as per 'salloc()'), then the stack would automatically contain a dynamic entry indicating its current size that could be referenced by the frame object's methods.

So: each 'frame' on the stack is an object (potentially of variant size) with a common set of methods, which are written by the compiler and guaranteed to never throw exceptions and to never consume dynamic memory to access.

The registers, stack, and heap entries would not need to be tagged with respect to the GC. They may still contain tags for polymorphism and variants, such as tags representing constructors for values, or pointing to a virtual table for objects.

This works for both mutable and single-assignment stacks.

The reflection provided by this approach is very good for debuggers, but extra symbol information could easily either be stripped down to the bare minimum necessary for the GC or pushed into a separate file for the debugger.

The amount of static information necessary for this approach is higher than for tagged GC, since one must essentially create a 'frame class' for every function that is exported or not fully inlined, and the frame class (and the methods associated with it) would be stored in the text space. OTOH, one might be able to make up for it by having even more 'packed' representations of the heap and by avoiding the sort of bit-twiddling one would need to keep a dynamic map of pointers up-to-date at all times.

Admittedly, this is for compiled languages, but I imagine you could make it work for a JIT.

Anyhow, I've never gotten around to actually testing this approach for performance, since I then got caught up in subjects a bit further from bare metal.... Chances are that I reinvented something someone else has already done, so I'd bet you can find some information out there. I can only say that it looks, to me, to be very promising, and is very tweakable for further optimizations, such as region inference, or region reference counting given immutable acyclic structures. The frame methods and compiler could cooperate to report information as to whether a reference to an object or region is still owned by just that stack, possibly by an earlier frame on that stack, or shared among threads/actors/asynchronous procedures... allowing programmer-predictable collection-on-the-go when collecting frames (programmers would take advantage of any guaranteed optimization the same way they do with TCO).

TIL uses "nearly" tag-free

TIL uses "nearly" tag-free GC (see "TIL: A Type-Directed Optimizing Compiler" -- only heap allocated data has tags). That paper has benchmarks against SML/NJ if you're interested.

Fully tag-free GC is typically done by runtime type reconstruction. This approach was pioneered by Goldberg, "Tag-Free Garbage Collection for Strongly Typed Languages". Goldberg also proved that failing to reconstruct the type of an object means that object is garbage.

"Tag-Free Garbage Collection Using Explicit Type Parameters" has some benchmarks for the Gallium compiler.

Finally, there's Henderson's "Accurate Garbage Collection in an Uncooperative Environment", aka "shadow stacks". Some of the ideas you describe are the techniques used in these papers.

The .NET CLR also uses exact type information techniques, since parametric arguments are fully unboxed on the stack and in objects.

Thanks for the GC links.

Thanks for the GC links. I'll check them out and see whether my design of two years ago needs an infusion of new ideas.

Damn, those TIL guys already

Damn, those TIL guys already thought of my idea. I'll have to read more closely. By runtime type reconstruction, do they mean walking the stack to determine the type parameters for a particular method invocation?

This is pretty much the

This is pretty much the standard approach taken by all modern JVMs, though the stack frame layout usually isn't a fully reified object, but just a pile of bits hanging off the executable code somewhere in the VM. One complication is that a single stack frame has many different points in its code at which it may be interrupted (i.e. safepoints), and each may have a different set of live values, thus requiring a mapping from program counter location to a refmap for registers and stack spill slots.

Of course JVMs don't have to deal with parametric polymorphism, so no passing of explicit type parameters is necessary.

Static map

One complication is that the single stack frame has many different points in its code that it may be interrupted (i.e. safepoints), and each may have a different set of live values, thus requiring a mapping from program counter location to a refmap for registers and stack spill slots.

Well, one maps the return address (return program counter) of the frame above to the set of live values (or the current program counter, for the top of stack). But that mapping can still be computed statically.