Subclass, superclass, or siblings under an abstract superclass?

The usual pattern in software development is to put the simplest thing as a superclass and things with more complex behavior as a subclass.

But what happens when the values of the so-called superclass are actually a proper subset of the values of the subclass?

For example, in a neural network program, there are two classes: "Neuron" and "Neural_Structure." A Neural_Structure has zero or more input points and one or more output points, one or more of which are also backpropagation feedback points. An example of a Neural_Structure would be a unit in a "Pooling layer", which takes three inputs and then forward-propagates one of them - whichever has the greatest absolute value, or the most-positive, or the most-negative value.

And the immediate impulse is that Neural_Structure should be a subclass of Neuron because it is the more complex object, and to make it you need to add methods and override a bunch of Neuron's methods.

But Neural_Structure generalizes Neuron rather than restricts it. It's the difference between an upper-bound and a lower-bound relationship on the type. A Neuron is a Neural_Structure having one input point and one output point, which is also a feedback point.

And most PLs don't have a mechanism that would allow us to say "class Neural_Structure generalizes class Neuron", add concrete or abstract methods, override a bunch of Neuron's methods, and then use a Neuron anywhere a Neural_Structure is required.

So I wind up with an abstract superclass Neural_Structure where Neuron and Pooling_Node are both subclasses, and that's obviously the correct structure under classical OO paradigms. But "generalizes" rather than "extends" would have been more efficient, because Neural_Structure adds overhead and handling that Neuron doesn't need but which Neuron inherits.
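To make that concrete, the structure I'm describing looks roughly like this (the method names and signatures here are invented for the sketch, not my actual interface):

#include <vector>

// Abstract supertype: pure interface, no state.
struct Neural_Structure {
    virtual int num_inputs()  const = 0;
    virtual int num_outputs() const = 0;
    virtual void forward(const std::vector<double>& in, std::vector<double>& out) = 0;
    virtual void backward(const std::vector<double>& grad_out, std::vector<double>& grad_in) = 0;
    virtual ~Neural_Structure() = default;
};

// The "one input point, one output point which is also a feedback point" case.
struct Neuron : Neural_Structure {
    int num_inputs()  const override { return 1; }
    int num_outputs() const override { return 1; }
    void forward(const std::vector<double>& in, std::vector<double>& out) override {
        out[0] = in[0];                      // placeholder transfer function
    }
    void backward(const std::vector<double>& grad_out, std::vector<double>& grad_in) override {
        grad_in[0] = grad_out[0];            // feedback flows back through the single input
    }
};

// Pooling_Node, Echo_Node, etc. become further siblings implementing the same
// interface, and Neuron drags along whatever generality the interface forces on it.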

Thoughts?

Re: Thoughts?

0) i have nothing new to contribute, only regurgitations.
1) all models (taxonomies, too) are a lie.
2) think for yourself. beware immediate impulses.
3) bog-standard run-of-the-mill main-stream inheritance is 99.9% of the time just evil crap in the long run: subclassing vs. subtyping; we should be composing, not inheriting.
4) structural typing (however on earth that might be different from duck typing :-) is perhaps the right default, instead of nominative.

Looks like you modeled it

Looks like you modeled it correctly. You can put simpler and more efficient versions of the input and output manipulators into the Neuron class and then implement Neuron's Neural_Structure interface with wrappers around those simpler functions. I think it's a flaw of many OOP languages that the Neuron-is-a-Neural_Structure relationship has to be declared with Neuron and baked into Neuron's definition.

If your concern is

If your concern is minimizing overhead, then use flat arrays. If you have a matrix library at hand, you can implement feed-forward in one line and backpropagation in a handful of lines, and this will be orders of magnitude faster than the OO version as well as orders of magnitude less code.
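For example, with Eigen (just one possible matrix library; nothing above names a specific one, and the layer sizes and learning rate below are made up), one dense layer's forward pass is a single line and its backward pass only a handful more:

#include <Eigen/Dense>
#include <iostream>

int main() {
    const int inputs = 4, outputs = 3;
    Eigen::MatrixXd W = Eigen::MatrixXd::Random(outputs, inputs);   // weights
    Eigen::VectorXd b = Eigen::VectorXd::Zero(outputs);             // biases
    Eigen::VectorXd x = Eigen::VectorXd::Random(inputs);            // one input sample
    Eigen::VectorXd t = Eigen::VectorXd::Random(outputs);           // its target

    // Feed forward: one line.
    Eigen::VectorXd h = (W * x + b).array().tanh().matrix();

    // Backpropagation for this layer: a handful of lines.
    Eigen::VectorXd dtanh = Eigen::VectorXd::Ones(outputs) - h.cwiseProduct(h);
    Eigen::VectorXd delta = (h - t).cwiseProduct(dtanh);    // error term at the outputs
    W -= 0.1 * (delta * x.transpose());                     // gradient step on the weights
    b -= 0.1 * delta;

    std::cout << h.transpose() << std::endl;
}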

Named collections

I agree with this answer. To me it sounds like a neuron_structure is a collection of neurons. As such I would use an existing collection type, say an Array to store them, but create a new named type, something you can't do in C++, but you can in Ada and Haskell. Something like (Haskell):

data NeuronStructure = NS (Array Neuron)

I would be tempted to build the structures more specifically so:

data Layer = Layer (Array Neuron)
data Stack = Stack (Array Layer)
data Group = Group (Array Stack)

Where layer is obvious, stack is a multi-layered block, and a group joins two stacks side by side (normally as part of a wider stack).

In terms of OO I would say a neural structure is not the same thing as a neuron, they have no business inheriting from each other. It would be like having a car as a sub-class of a car-park.

Not a collection

No, it's not a collection interface. A Neural_Structure is 'atomic' in that it can't be subdivided. It represents something that can be used in the context of a neural network, whether or not it is actually a neuron. There is a different class that represents collections of them.

Now that you bring it up though, the similarities between individual Neural_Structures and collections of them, in terms of interface, raise the question of whether there should be another abstract superclass from which both inherit.

One example of a Neural_Structure is a Pooling_Node. It isn't a neuron, it's just a node that receives 'a few' neural signals (3 or 4 where a neuron usually receives hundreds), selects one by its value, and outputs that. During backpropagation, it directs the backprop signal to whichever of its incoming connections provided the signal that it passed along. So it functions within a neural network and has a constructive response to backpropagation training, but it isn't a neuron, nor is it a collection of them.
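Roughly, a Pooling_Node's behaviour is the following (a sketch with invented member names; the real class has more bookkeeping):

#include <cmath>
#include <vector>

// Forward: pick the input with the largest absolute value and remember which one won.
// Backward: route the whole training signal to that winner and nothing to the others.
struct Pooling_Node {
    int winner = 0;

    double forward(const std::vector<double>& in) {
        winner = 0;
        for (int i = 1; i < (int)in.size(); ++i)
            if (std::fabs(in[i]) > std::fabs(in[winner])) winner = i;
        return in[winner];                       // pass the selected signal along
    }

    void backward(double grad_out, std::vector<double>& grad_in) {
        for (double& g : grad_in) g = 0.0;       // the losers get no feedback
        grad_in[winner] = grad_out;              // the winner gets the full signal
    }
};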

Another example is an Echo_Node. It receives exactly one neural signal, and then outputs all of the last dozen it's received on different output points. (The network is loop-recurrent, so there is forward-propagation from the 'output' nodes back to most of the 'input' layer). It ignores backpropagation on all of the 'history' output points, but passes a backprop signal along to the neuron from whence its 'current' output came.

Anyway, there's about a dozen different kinds that do various different things with their inputs and then can be trained (or at least appropriately pass training along to one or more of their inputs) during backpropagation.

Neural structures

Okay, I probably misinterpreted the term. Most of those things can be considered neurons with special transfer functions, so having them all have a common interface makes sense. I would have neuron-structure as an interface implemented by all the different flavours of neuron.

SuperGlue had a feature

SuperGlue had a feature where you could declare your subclasses as well as your superclasses, which works kind of well. E.g. say you have Draw and Point as classes; then we can express a rule where any object that extends both Draw and Point also extends DrawablePoint. I believe I called this "reverse inheritance" at the time, and it is easy enough to express in a rule-driven language (e.g. in Prolog, you can always write P(), Q(): R()).

Implicit coercions in Scala also allow you to express ad hoc subtyping relationships after the fact, as long as they don't involve new state (since you get a copy of the state on each coercion).

But what you really might want are mixin layers (as described above). You can also do that with Scala.

type relations

I almost miss the question. But being able to say X generalizes Y seems useful, so Y can be used as an X even when Y didn't know anything about X in the first place. It fits with design goals associated with limiting need-to-know (NtK), usually called law-of-Demeter (LoD), except the latter tends to carry a lot of baggage (a monkey trap for devs who hold opinions strongly).

Inheritance is famously quirky in unfortunate ways, since asymmetries come up in NtK relationships, making it awkward to effect some types of substitution. Some languages support effective subtyping without explicit inheritance, so you can say "use Y as an X" without saddling X and Y with too much baggage, though it might make declaring that relation more verbose than otherwise.

If Y is mutable, and some places can see a given instance as Y while others see that same instance as an X, you can get a problem if you wished the X-ness would propagate to places that see only a Y. Obviously that problem can't happen when Y is immutable, which makes it easier for functional languages to wrap one sort of value as another, just to satisfy new type signatures.

If wrapping is cheap, you can make Y look like an X by having a looks-like-an-X<T> generic type specialized to Y. Memory management (and identity) when state is mutable is the only thing stopping you from unifying that generic type with X by allocating a new copy when needed. See, my train of thought devolves to free association when I can't pin down the question.

Inheritance

I don't think I quite believe in the benefits of implementation inheritance either. It's infamously quirky as you say, and most uses of it, unless one is very careful and quite insightful, turn out to be abuses. So maybe we should leave all the superclasses completely abstract and have the classes at the leaf nodes of the inheritance tree all provide their own implementations for the methods of all the interfaces they inherit.

That would be incredibly

That would be incredibly verbose, and would sacrifice much of the encapsulation that we depend on in OO designs.

Implementation inheritance can be abused, ya, but is not usually the pressing problem in a program design. The idea that I can define a bit of my object here and a bit of my object somewhere else is in itself quite useful when deconstructing problems. OO is a way of problem solving; what results might not be an optimal solution, but it enabled the journey in the first place. We could argue for "refactoring" once the program design is discovered, and I would have no problem with that.

OCaml classes

One should look at classes in OCaml. Inheritance is purely a matter of implementation convenience. Classes make objects which are structurally typed. Therefore you can write an algorithm which will work with any number of completely different objects provided they have a matching interface.

So roughly: a class is a function which constructs an object. A class type specifies an interface. Objects can match an interface if they have "enough" methods.

IMHO this is the right way, if you really have to use OO stuff and accept the limitation of the covariance problem.

Structural typing only works

Structural typing only works if you want to forgo by-type encapsulation. It also doesn't help Neuron satisfy the Neural_Structure interface if it doesn't already.

That was kind of the whole

That was kind of the whole point of declaring Neural_Structure to be an abstract class. It lays out an interface that I'm responsible for making absolutely every child class respond to, and then I want to be able to use values of those child classes in any context where that superclass interface is sufficient to make the usage valid -- i.e., in variables that I declare to hold members of the supertype.

But noooo..... "invalid cast" my arse. If I wanted it to be invalid, I wouldn't have declared the darn thing to be a variable of the abstract class in the first place!

It implements every blinking method that I'm allowed to call on members of the superclass; what the heck is the point of not allowing me to use it?

It would actually be kind of

It would actually be kind of easy for a nominal OO language to provide structural interfaces and generate wrappers on coercion to them, but I guess languages don't do it because of the overhead (C# supports dynamic, but that is probably not what you want). And actually, overhead is an issue in any language that does that, since you can't really rely on nicely organized v-tables anymore (which even makes the auto-wrapper approach look appealing). Of course, if you always have modification access to the class you are trying to coerce to the structural type, then you can just make the structural type an interface, implement it in the class, and be done with it (signature matching will take care of the rest).

Ergh. This language doesn't do what I want with inheritance.

So, I declared something to be a collection of Neural_Structure, and now it says it's a type error if I try to put any subtype of Neural_Structure in it.

So what the hell good was having a supertype anyway, if I can't treat collections of subtypes in a uniform way or assign subtype values freely to variables of the supertype?

Grrf. I suppose it wants me to make each and every collection of a specialized subtype into a different type, and then I won't be able to use them all together without kicking dead whales down the beach....

I'm about ready to say screw OO with inheritance the way they do it and just have one struct type with pointers to functions that get directly assigned, the way I did OO in C, or capture function pointers in closures the way I did OO in Scheme.

I'm sorry, I've run into this braindead behavior in several languages now, and I have to ask, is there any rational reason to even have OO baked into a language if the implementation is so stupid that you can't use a member of a child class anywhere you could use a member of the parent class?

Languages?

It probably helps to know what the language is (I can't see any mention in the topic). For example, if it's C++ you would need a collection of pointers, not a collection of objects; if it's Haskell you would need to use an existential type to do what you want (which is a fancy name for a type where you can only access the contents using the declared interfaces).
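For the C++ case, that means something like this (a sketch reusing the class names from the topic; the bodies are placeholders):

#include <cstdio>
#include <memory>
#include <vector>

struct Neural_Structure { virtual double out() const = 0; virtual ~Neural_Structure() = default; };
struct Neuron       : Neural_Structure { double out() const override { return 1.0; } };
struct Pooling_Node : Neural_Structure { double out() const override { return 2.0; } };

int main() {
    // A vector of Neural_Structure *values* can't hold the subtypes (they may differ
    // in size), but a vector of pointers to the base can, and dispatches virtually.
    std::vector<std::unique_ptr<Neural_Structure>> layer;
    layer.emplace_back(new Neuron);
    layer.emplace_back(new Pooling_Node);

    double sum = 0.0;
    for (const auto& n : layer) sum += n->out();
    std::printf("%f\n", sum);                    // prints 3.000000
}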

It should be possible in any

It should be possible in any OO language but C++. C++ is weird since it allows you to declare instances of classes as values, but as values they do not allow for subsumed assignment. I'm sure it made sense at the time, but this is a decision that no one else has copied (C# has an extensible value type, but they don't allow for inheritance).

Only pay for what you use.

It is because when you declare an array of X in C++, the storage space for X is actually in the array cells. This is one of the reasons C/C++ outperforms other languages. Languages that allow subtyping in containers never do this, but instead store a pointer to the object in the array. In other words they hide the dereference cost and make you pay it whether you need it or not. As one of the key design rules of C++ was that you only pay for what you need, the decision of whether to store the objects in the array or externally is left up to the programmer. If you want polymorphism, use a pointer and pay the cost; if you want monomorphism, use the value and get the better performance.
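Concretely, the choice looks like this (sketch):

#include <vector>

struct Base    { virtual int f() const { return 0; } virtual ~Base() = default; };
struct Derived : Base { int extra[8]; int f() const override { return 1; } };

// Monomorphic: Derived objects live directly in the array cells, contiguously,
// with no per-element indirection - but the element type is fixed.
std::vector<Derived> dense(1000);

// Polymorphic: the cells hold pointers; every access pays a dereference (and
// usually a cache miss) plus a virtual dispatch, but any subtype fits.
std::vector<Base*> mixed;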

C# structs work like that,

C# structs work like that, and can help it achieve C-like performance numbers when GC can be avoided. The mistake with C++ is that it did that to classes, rather than having a separate kind (structs) that could exist unboxed. Of course, C# comes way after C++, and got to learn from its mistakes.

Reference types and value types

I am not sure separating reference and value types at definition is a great idea. It is limiting, and may mean you have to define some things twice. My personal preference is that all things should behave as values, but that there should be special operations to pass values as references. I am not sure I like C++'s explicit exposure of the programmer to reference and pointer types either. Ada does better with in/out/inout arguments and access types. Ada has a limitation that you cannot return by reference, which is annoying as it makes containers and iterators awkward to define. In the end no language I have found does exactly what I want; maybe that's because I don't think like other people, or maybe there are lots of people who will really like the language I am developing. I'm not really sure which.

I never use pointer or

I never use pointer or reference types in C# (ok, maybe nullables but very rarely). It's just in, out, ref, most of the time.

Aren't you looking for autoboxing?

Aren't you looking for autoboxing? C# does that for structs, IIUC, so I'm not sure what you'd have to define twice. But I do wonder whether a generic pair would specialize to struct arguments, and store them unboxed, or whether they'd be boxed.

I'm thinking of

struct Pair<S, T> {
   S s; T t;
}

Not Sure

I'm not really sure. I think autoboxing hides the cost of boxing/unboxing. I think I want things to be monomorphic values, and the programmer explicitly states where polymorphism is allowed. For example in Haskell you would have an array of existential types to allow dynamic polymorphism, this seems a reasonable approach.

Boxing is a bit complicated

Boxing is a bit complicated in C#: it will box on a cast to object, but generics are specialized and avoid boxing costs, in which case it works much like Haskell.

What's wrong with everything

What's wrong with "everything is a value, but there is a type Ref[T] or Ptr[T] for references"? That seems simple and consistent. Everything is a value, but some values are references. Of course now you open Pandora's box of lifetimes. Either you have a syntactic restriction like C#, or a borrow checker like Rust, or you integrate the pointers with the GC so that the values they point to don't get GC'd too early, like in Go.

Nothing

There's nothing wrong with that, and that is exactly what Ada's "Access" types are. Ada has a way of ensuring they are valid too: you cannot return an access to a variable from the function in which the variable is defined.

Better semantic modeling

I'd like to see (and am working on) better semantic modeling of the kinds of idioms that are used in e.g. C++ and a distinction between value semantics and implementation. For example, what is a reference? I suspect that simple and efficient memory management policies can be used if we do just a little bit better job of describing the situation to the compiler.

Yes, agreed. Rust goes in

Yes, agreed. Rust goes in that direction, but I think more can be done by going a bit further away from C++, and a bit closer to the functional + linear types world.

Do you have an example?

Do you have an example?

An example

You didn't ask me, but one example I can give is the difference between a reference and a live reference. A reference is just an (abstract) name and a live reference is a name that lives in a particular subset at a given time. The C++ notion of deletion corresponds to removing a name from the live set and the C++ memory policy corresponds to only associating values to live names. Checking "liveness" typically requires dependent types or theorem proving, but an implementation can retain memory safety by doing run-time checks for liveness when it hasn't been proven. Or you can sacrifice memory safety for performance. [Edit: The preceding is meant to convey a broad idea, not to sketch actual language features]

The main idea to me is choosing value semantics that allow possible fast implementations. In general I don't like the Rust approach of integrating low level (e.g. ownership) semantics with the value semantics, even if they've done a better job of it than C++ did.

For example I think

For example I think separating a pointer from the capability to access the pointer is conceptually clearer (and more expressive). The pointer can be freely copied, and can even be live after the data is already deallocated, as long as the capability to dereference it is linear and needs to be handed in when you deallocate the memory. Rust's choice of coupling the two may be more ergonomic in practice, however.

Reading between the lines a

Reading between the lines a little bit, it sounds like you're proposing this scheme:

- Pointers are names and possibly support operations on them (i.e. comparison) other than de-reference.
- To dereference a pointer, you ask the pointed-to contents to borrow a deref cap that provides access.
- The pointed-to contents must be live to obtain that cap. This raises the question of whether it's possible to ask whether a pointer is live, or whether attempting to obtain a deref cap for a non-live pointer is an error. Maybe you want to support either.
- Deref caps are linear and all borrowed caps must be returned before the contents can be deleted/made non-live.

Is that roughly it?

Yes, roughly, with a small

Yes, roughly, with a small difference. You don't ask the pointer for a deref cap, you must already own it.

For Rust style borrowed pointers you need a more elaborate scheme, but the principle is the same. When you allocate memory you get a pointer as well as a mutate cap. A mutate cap allows you to write to the pointer and allows you to deallocate it (which consumes the mutate cap). In addition you can turn the mutate cap into a readonly cap, which can be split into multiple copies (keeping track of how many pieces it was split into), and later combined back together into a mutate cap. This ensures that at any point in time there is either one mutate cap, or possibly multiple readonly caps. Of course you'd want inference for these capability parameters, just like you want inference for type parameters.

You don't ask the pointer for a deref cap

I'm missing something. You need a specific mutate cap that implicitly already knows which pointer goes with it, don't you? So what's the point of the pointer?

The point of the pointer is

The point of the pointer is to be a run-time memory address. A mutate / read cap is valid for an entire region / data structure / region of data structures, and is a purely compile time construct.

Regions

Is the part I was missing. Thanks.

Well, that makes performance sense at least.

It complicates the implementation somewhat. Yes, I am using C++ at the moment. The reason is that I really want that performance. This is GPU-bound, and chasing pointers all over the place would make it bus-bound instead and DRASTICALLY slower, because all the processing would have to move from the GPU to the CPU.

Getting performance as high as I want for this means NOT indirecting through and scattering objects in the heap.

I can afford one indirection per array though - move an array into the GPU, operate on all those elements, move it out, move the next one in... so I guess I'll have to implement a wrapper class with a union pointer to a directly allocated array of (one of) the subtypes and either a runtime-typing field to tell what subtype they are or, more likely, method pointers to the methods of the subtype.
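In outline, something like this (a sketch with made-up names; the real thing also has to carry the CUDA buffer handles):

#include <cstddef>

struct Neuron       { float out; };
struct Pooling_Node { float out; };

// One indirection per *array*, not per element: the wrapper owns a contiguous block
// of exactly one subtype, plus a runtime tag and a method pointer bound to that subtype.
struct Node_Array {
    enum Kind { NEURON, POOLING } kind;
    union {
        Neuron*       neurons;
        Pooling_Node* pooling;
    } data;
    std::size_t count;
    void (*feedforward)(Node_Array& self);       // processes the whole block at once
};

static void feedforward_neurons(Node_Array& self) {
    for (std::size_t i = 0; i < self.count; ++i)
        self.data.neurons[i].out += 1.0f;        // placeholder for the real kernel work
}

static Node_Array make_neuron_array(std::size_t n) {
    return Node_Array{ Node_Array::NEURON, { new Neuron[n]() }, n, &feedforward_neurons };
}

// usage: Node_Array a = make_neuron_array(1024); a.feedforward(a);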

Not quite kicking dead whales down the beach yet, but it is getting a bit ugly....

Dynamic polymorphism isn't

Dynamic polymorphism isn't really one of those things you should be using with C++ CUDA, or other GPU frameworks I suppose. You CAN do it (as long as you allocate the objects on the host), but it is not going to be very performant.

I suggest using a different object model that separates out state into arrays. You can then use flyweights or whatever to create object fronts from the arrays if necessary. Proper SIMT probably means you want to operate on several similar objects at once anyway.
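I.e. something shaped roughly like this (a structure-of-arrays sketch with invented field names), where each operation works over whole arrays rather than single objects:

#include <cstddef>
#include <vector>

// Structure-of-arrays: all the state for one kind of node lives in flat, contiguous
// arrays that can be shipped to the GPU and processed in bulk; an "object" is just
// an index into them.
struct Neuron_Block {
    std::vector<float> input;
    std::vector<float> output;
    std::vector<float> bias;
};

// A flyweight "object front": looks like a single neuron, but is only a view.
struct Neuron_Ref {
    Neuron_Block* block;
    std::size_t   i;
    float out() const { return block->output[i]; }
};

void feedforward(Neuron_Block& b) {              // operates on the whole block at once
    for (std::size_t i = 0; i < b.output.size(); ++i)
        b.output[i] = b.input[i] + b.bias[i];    // placeholder per-element work
}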

Note that going functional doesn't leave you better off; they don't have a nice story for array updates on GPUs yet.

This OO is no help.

Well, I guess I'm up to kicking dead dolphins down the beach anyway...

It turns out that there's really no performant implementation except to make arrays of different-subtype objects be different types. Which more or less destroys any clarity and nice organization that might otherwise have resulted from the OO system.

So... long story short, I may as well drop back to C since the type system doesn't help organize the code.

There are many kinds of

There are many kinds of object systems beyond the ones you are used to. You'll probably just reinvent them as you poke around your problem, good luck!

Well, yeah. There are SEMANTICALLY good OO systems.

But unfortunately the semantically good OO systems don't get down to the grotty code generation bits close enough to let me do this particular thing.

I got really spoiled using CLOS and a few other highly extended OO systems I guess - expecting C++ to take the hint that code working on two arguments of immediate arrays of an abstract supertype was actually intended to generate specializations for each and every combination of subtype was probably a bit much.

But I have a couple problems with CLOS, too. It's attached to Common Lisp and I cannot for the life of me figure out how to extend it to generate the specializations and output code in a way that doesn't involve pointer-chasing at a sufficiently low level to screw up the CUDA interface.

In case I didn't bug you

In case I didn't bug you already: Have you looked into metaprogramming systems with support for CUDA? LMS, Terra... not sure how production-ready they are, but they have been used to generate highly performant GPU code.

Sadly, C++ doesn't auto-specialize, I think ever (except maybe C++.NET). A C++ virtual call is an actual dog-slow virtual call, unlike in proper OO VMs. The only solution is template metaprogramming.

dog legs

I was just interested in your source for:

A C++ virtual call is an actual dog-slow virtual call

If a coworker said that, I'd bet them a large sum of money it could not be substantiated. :-) I'd assume it was hearsay, rather than based on data. It would be hard to write a test small enough that the extra indirection was observable, and then you would have to isolate the effect of jiggle caused by branch target alignment. (Any slow case observed must prove a tight loop did not jump to a slower, mis-aligned branch target, in contrast to the faster case.)

In theory, the only extra cost is a read from a vtable (to get the function pointer), whose cache line should not be a miss if that object type is in frequent use. For this to get observable in real world tests, you'd have to use virtual function calls pervasively with nearly every call, so cache line pressure was increased.

Templates vs Virtual Methods

There is a measurable performance difference between templates and virtual methods, and in complicated cases template performance can be much faster than virtual methods. So this part is true.

However C++ virtual method calls are by no means slow compared to other methods of dynamic dispatch. The vtable beats "if" and "switch" statements. It's faster to use a virtual method than switch/case on a tag field.

What I don't believe (without evidence) is that other languages manage to optimise their virtual method calls better. I have not seen Java or C# outperform C++ compiled with a decent compiler (especially if you use profile-guided-optimisation, which I tend to use as then you don't have to worry about branch order when writing the code).

Evidence?

However C++ virtual method calls are by no means slow compared to other methods of dynamic dispatch. The vtable beats "if" and "switch" statements. It's faster to use a virtual method than switch/case on a tag field.

Do you have evidence for that? I highly doubt that indirect jumps loaded from some table are faster than branches on any of today's CPUs. Plus, you cannot inline vtable dispatch to avoid the function call overhead, while you can inline branches easily.

Virtual dispatch no slower.

I did some testing whilst developing my Prolog interpreter that showed that virtual methods were faster. I quickly wrote a test program to confirm these findings on a Core-i7, and with 5 cases there is really no significant difference in performance over a 100,000,000 element array with either g++ or clang++. I wonder if the performance difference I saw with the Prolog implementation was due to needing nested switch/case in unification to do the double dispatch? So whilst I probably can't justify virtual dispatch being faster in general, it would appear to be no slower on a current CPU.

Cold paths, big programs

Sure, modern CPUs seem to do much better at predicting indirect branches, but it's hard to draw conclusions from microbenchmarks that hit the same (few) indirect branches 100,000,000 times in a row. The reason is that you're only testing the hot path, and CPUs have limited space for all of their predictors. On cold paths, which exist in big programs with lots and lots of branches, I doubt indirect branches are on par with direct branches, since static prediction (without history) still does pretty well for direct branches. Unfortunately, the death-by-a-thousand-cuts drag on performance from these little inefficiencies spread all over the place is really hard to measure and attribute.

Real problems for benchmarking

That's why I prefer benchmarking with real problems.

However, a lot of the stuff where it is really worth optimising is tight loops with high repetition counts. If a loop isn't going round a billion times then nobody will notice the difference in performance. The test case is actually fairly common, where you have a polymorphic type in a container. This could be a program abstract syntax tree, or the neuron layers mentioned in the original post, etc. If the container contents are not polymorphic, then you won't need virtual methods in the first place, and you can use direct calls and inlining. It's very rare that you have a large program that requires equal optimisation everywhere, and I suspect having it not fit into cache will be more of a performance problem.

Direct branches will clearly have a problem with the cases that are further down the branch tree, so there must be some number of options (or some distribution of options, such that N are equally likely to be chosen) beyond which indirect jumps perform better. An interesting question is what 'N' is for different distributions of branch targets.

I also wonder: if you could have a closed class hierarchy, that is, some kind of definition scoping that means you cannot add a new derived class outside of a shared header that all places using the hierarchy can see, would that enable the compiler to inline virtual method calls as branch trees?
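C++ does already have a small version of that hint: marking a class (or a virtual method) final closes off further overriding, and compilers can then devirtualize calls made through the final type. A sketch (invented names):

struct Base {
    virtual int f(int x) const = 0;
    virtual ~Base() = default;
};

struct Leaf final : Base {                 // "final": no class may derive from Leaf
    int f(int x) const override { return x + 1; }
};

int via_base(const Base& b, int x) { return b.f(x); }  // still an indirect call
int via_leaf(const Leaf& l, int x) { return l.f(x); }  // compiler may call or inline Leaf::f directly

It only helps when the static type at the call site is already the leaf, though, which is not the whole-hierarchy inlining described above.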

Cold rather than hot

In the test code there are two phases, initialisation and then test. The working set of data contains 100M items, so it is much larger than the branch prediction cache. At the end of the initialisation phase it is guaranteed that all but the tail of the working set has been evicted. The test phase then runs through in sequence: it is guaranteed that only static rather than dynamic prediction is being used. This is testing the cold behaviour of the branch prediction cache.

This is a very unrealistic test case: how many programs exhibit zero temporal locality in instruction despatch? To do so they must be either branch free or contain only loops that have minimum trace lengths larger than the size of the relevant cache.

There is no branching behaviour in the trace being executed: every "dynamic" despatch can be preloaded ahead of execution. A modern processor can perform the pointer chasing ahead of time if there are no dynamic decisions about which trace in the pointer graph is being computed.

As a minimum, to be realistic, the set of objects being despatched on should be much smaller, and each object should be reused some number of times to allow branch prediction to work more effectively. This would have the effect that the saturating counters in the branch predictor would model the distribution of instruction types in the interpreted program - a large benefit for direct branching that is being ignored in this testcase. But in order to simulate the dynamic penalty for virtual despatch correctly there must be some feedback from the instruction sequence to which instruction sequence is selected for execution. A testcase that models straight-line code in the target program is not realistic in any way.

Tl;dr the presented test case is too cold rather than too hot.

Reducing array size.

So I have reduced the array size. The iterations remain 100,000,000 with the lookup modulo the array size.

With 100,000 elements there is no significant difference in performance between the if-tree and virtual methods with clang++ or g++.

With 10,000 elements there is no significant difference with clang++, but the if/then/else seems slightly faster with g++ (~30%).

With 1,000 elements there is no significant performance difference with clang++ (maybe 10% for if/then/else). With g++ there is a significant performance improvement for if/then/else (400%).

L1 data

So it sounds like you were completely dominated by L1 data cache misses. L1 is pretty small on Intel chips; just 32KB per core usually. There are multiple prefetchers, but it looks like you are just saturating memory bandwidth.

In general I think one should write assembly if they want to microbenchmark low-level things like individual branches. As you see, compilers can do quite surprising things, and unless you are studying the assembly output carefully, you probably aren't testing what you think you are.

If you want to be sure to

If you want to be sure to count mispredictions, you should actually ask the CPU performance counters. And you should indeed ensure that the assembler is what you expect; but writing it yourself seems (a) overkill (b) dangerous, because you might unintentionally get performance wrong.

I believe the inlining point

I believe the inlining point to be solid, and something that a C++ benchmark won't observe. Mispredictions on indirect branches might indeed be a much smaller/solved problem, but *on the latest generations of Intel CPUs* — we discussed this here:

http://lambda-the-ultimate.org/node/5218

Keean: double-dispatch doesn't seem the standard scenario. More generally, I don't distrust you, but your claim seems extraordinary to me, and performance is known to be hard to measure (as you also know). So I won't be convinced until I see, in essence, evidence sufficient for a paper.
(Not that you have to convince me, of course).

Inlining is not relevant

If you can statically determine the dispatch, then generics and templates are the way to do things, and I would not use a virtual method. In my code a virtual method call only occurs when the dispatch is runtime-dynamic, i.e. descending an AST derived from a program loaded from a file at runtime. Have a look at my parser combinator library, where the combinators all get statically resolved, and you explicitly use a 'handle' where you cannot determine the type statically. The general procedure is to assume everything is static, and then insert the dynamic handles only where necessary, when the compiler complains it cannot determine the type statically.

Intel CPUs since the Pentium-M have included indirect-branch prediction, which works for dynamic virtual functions; however, branch prediction does not provide that much improvement when the switch statement targets are evenly distributed.

Inlining is relevant outside

Inlining is relevant outside of C++, because good VMs will do it even for runtime dispatches. We're probably mixing the discussion on C++ and the one on VMs.

First, according to the other LtU thread, predictors made *huge* strides from the Pentium-M, *for the interpreter scenario*. They benchmark Nehalem, Sandy Bridge and Haswell, and in their scenario

The global branch misprediction rate observed when executing interpreters drops from a dramatic 12-20 MPKI range on Nehalem to a only 0.5-2 MPKI range on Haswell.

where MPKI = MisPrediction per Kilo-Instructions; that is, on Nehalem, "about 240 to 320 cycles are lost every 1 kiloinstructions" on interpreters. Since their scenario is different, I'd love to see them analyze virtual dispatch, but I'll make do with their info.

Next, good VMs can optimize dispatches where the target isn't known statically. In C++ you can indeed make the polymorphism static (as you describe), but only when you know statically the targets; as I mentioned, I'm familiar with that technology. Good VMs can do the same even when you only know at runtime which target is the most common, or which ones are the most common; they can also inline that target. The VMs leave guards in, but ensure that the branch prediction will assume those guards don't fail. A very good indirect branch prediction could make some of those transformations unneeded, *except* that it can't do the optimizations enabled by inlining.
In C++, a call through a pointer prevents any inlining, because the functions you call might not yet have been written. If you have a switch statement, the code to run is there, and (unlike what I claimed earlier) could in principle be inlined under suitable advanced heuristics (e.g. profile-guided optimization). However, PGO can fail on substantially different input, unlike VMs.
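To make that concrete, the guarded, speculatively inlined call site such a VM emits has roughly this shape (illustration only, with invented names; a real JIT emits machine code and can deoptimize when the guard starts failing):

#include <cstdio>

struct Class  { int id; };
struct Object { const Class* klass; int field; };

static const Class int_box{1}, str_box{2};
static const Class* expected_klass = &int_box;    // the receiver class profiling found hot

static int generic_dispatch(const Object& o) {    // the out-of-line, fully dynamic path
    return (o.klass == &int_box) ? o.field + 1 : 0;
}

static int guarded_call(const Object& o) {
    if (o.klass == expected_klass)                // cheap guard, predicted to succeed
        return o.field + 1;                       // the hot target, inlined and optimizable
    return generic_dispatch(o);                   // rare path: fall back (or deoptimize)
}

int main() {
    Object a{&int_box, 41}, b{&str_box, 7};
    std::printf("%d %d\n", guarded_call(a), guarded_call(b));   // prints "42 0"
}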

An indirect jump could indeed be a bit faster than a badly optimized switch on modern CPUs. Even when inlining enables no optimization, this doesn't explain yet why a virtual call would be faster than a switch in general. In fact, that depends on how the switch is compiled — it can use a jump table (thus an indirect jump), or compile to the equivalent of nested if. If those ifs are in non-optimal order, this might end up being slightly slower — but if branch prediction works, you'll only pay the cost of the extra predicted tests and jumps, so I'd expect at best a small overhead.

Above I've ignored the case when jump targets are evenly distributed and too many. In VM lingo, such call sites are called *megamorphic*. Branch predictors, VMs and compilers alike can't help there, so this case matters little in comparisons.

Megamorphic, and uneven distributions.

Ah, well I am testing almost the megamorphic case: the branches are evenly distributed, although there are only 5 cases. I think the indirect-branch prediction would have about the same effect as the direct branch prediction.

So I re-ran the test with probabilities halving for each option (so 50% of the time it will be "one", 50% of those that are not "one" are "two", etc.), and again over 100,000,000 elements there is no significant difference in performance with either g++ or clang++.

After profile-guided optimisation, there is still no significant difference.

The vtable beats "if" and

The vtable beats "if" and "switch" statements.

Not a chance. Indirect dispatch has gotten faster with today's branch predictors, as we recently discussed, but there's still no way it can match simple branches.

My tests reveal otherwise

First let me admit that testing does not show vtables to be faster (it may have been something about the application that made them faster in that specific instance). My tests on a Core-i7 show the performance for vtable and switch/case is about the same, with no significant difference over 100,000,000 iterations. Now whilst this isn't good enough for a research paper, it's good enough for me to believe empirical testing over speculation. The switch/case is definitely not better for the random distribution case, nor the biased distribution cases I tested.

Here's the code if anyone is interested in experimenting with it:

#include <iostream>
#include <cstdint>
#include <cstdlib>
#include <vector>

extern "C" {
#include <sys/resource.h>
}

using namespace std;

inline uint64_t rtime() {
    struct rusage rusage;
    getrusage(RUSAGE_SELF, &rusage);
    return 1000000 * static_cast<uint64_t>(rusage.ru_utime.tv_sec)
        + static_cast<uint64_t>(rusage.ru_utime.tv_usec);
}

struct abstract_base {
    virtual int test(int x) = 0;
};

struct derived1 : public abstract_base {
    unsigned y;
    derived1(int y) : y(y) {}
    virtual int test(int const x) override {return x + y;}
};

struct derived2 : public abstract_base {
    int y;
    derived2(int y) : y(y) {}
    virtual int test(int const x) override {return x - y;}
};

struct derived3 : public abstract_base {
    short y;
    derived3(int y) :  y(y) {}
    virtual int test(int const x) override {return x * y;}
};

struct derived4 : public abstract_base {
    short y;
    derived4(int y) :  y(y) {}
    virtual int test(int const x) override {return x / ((y == 0) ? 1 : y);}
};

struct derived5 : public abstract_base {
    int y;
    derived5(int y) :  y(y) {}
    virtual int test(int const x) override {return y - x;}
};

uint64_t test1(int size) {
    vector<abstract_base*> a(size);
    srand(size);
    for (int i = 0; i < size; ++i) {
        if (rand() % 2) {
            a[i] = new derived1(rand());
        } else if (rand() % 2) {
            a[i] = new derived2(rand());
        } else if (rand() % 2) {
            a[i] = new derived3(rand());
        } else if (rand() % 2) {
            a[i] = new derived4(rand());
        } else {
            a[i] = new derived5(rand());
        }
    }

    int x = 0;
    uint64_t t1 = rtime();
    for (int i = 0; i < size; ++i) {
        x = a[i]->test(x);
    }
    uint64_t t2 = rtime();
    cout << "result1 : " << x << endl;
    return t2 - t1;
}

struct thing {
    enum {one, two, three, four, five} tag;
    union {
        unsigned y1;
        int y2;
        short y3;
        short y4;
        int y5;
    } value;
};

uint64_t test2(int size) {
    vector<thing*> a(size);
    srand(size);
    for (int i = 0; i < size; ++i) {
        if (rand() % 2) {
            a[i] = new thing;
            a[i]->tag = thing::one;
            a[i]->value.y1 = rand();
        } else if (rand() % 2) {
            a[i] = new thing;
            a[i]->tag = thing::two;
            a[i]->value.y2 = rand();
        } else if (rand() % 2) {
            a[i] = new thing {};
            a[i]->tag = thing::three;
            a[i]->value.y3 = rand();
        } else if (rand() % 2) {
            a[i] = new thing;
            a[i]->tag = thing::four;
            a[i]->value.y4 = rand();
        } else {
            a[i] = new thing;
            a[i]->tag = thing::five;
            a[i]->value.y5 = rand();
        }
    }

    int x = 0;
    uint64_t t1 = rtime();
    for (int i = 0; i < size; ++i) {
        switch (a[i]->tag) {
            case thing::one: {
                int v = a[i]->value.y1;
                x = x + v;
                break;
            }
            case thing::two: {
                int v = a[i]->value.y2;
                x = x - v;
                break;
            }
            case thing::three: {
                int v = a[i]->value.y3;
                x = x * v;
                break;
            }
            case thing::four: {
                int v = a[i]->value.y4;
                x = x / ((v == 0) ? 1 : v);
                break;
            }
            case thing::five: {
                int v = a[i]->value.y5;
                x = v - x;
                break;
            }
        }
    }
    uint64_t t2 = rtime();
    cout << "result2 : " << x << endl;
    return t2 - t1;
}

int main() {
    int size = 100000000;
    cout << "test" << endl;
    uint64_t t1 = test1(size);
    cout << "virtual methods : " << t1 << "us" << endl;
    uint64_t t2 = test2(size);
    cout << "switch/case     : " << t2 << "us" << endl;
}

To compile I used:

clang++ -march=native -O3 -flto -std=c++11 -o test test.cpp
g++ -march=native -O3 -flto -std=c++11 -o test test.cpp

For profile guided compilation I used:

rm *.gcda
g++ -march=native -O3 -flto -std=c++11 -o test test.cpp -fprofile-generate
./test
g++ -march=native -O3 -flto -std=c++11 -o test test.cpp -fprofile-use

The machine is:

Linux ra 4.0.6-tuxonice #1 SMP Thu Sep 24 19:17:46 BST 2015 x86_64
Intel(R) Core(TM) i7-3612QM CPU @ 2.10GHz GenuineIntel GNU/Linux

g++ (Gentoo 4.9.3 p1.1, pie-0.6.2) 4.9.3

clang version 3.6.2 (tags/RELEASE_362/final)
Target: x86_64-pc-linux-gnu
Thread model: posix

clang++ results:

result1 : 898510266
virtual methods : 2155937us
result2 : 898510266
switch/case     : 923936us

result1 : 898510266
virtual methods : 2128584us
result2 : 898510266
switch/case     : 2059621us

result1 : 898510266
virtual methods : 2085846us
result2 : 898510266
switch/case     : 2131867us
keean@ra ~/code/Clors $ ./test
test
result1 : 898510266
virtual methods : 2065496us
result2 : 898510266
switch/case     : 2054573us

result1 : 898510266
virtual methods : 2087931us
result2 : 898510266
switch/case     : 2065548us

result1 : 898510266
virtual methods : 2025488us
result2 : 898510266
switch/case     : 2056149us

Now to me the first run looks like an anomaly; all the other runs show no significant difference.

g++ results:

result1 : 898510266
virtual methods : 2007482us
result2 : 898510266
switch/case     : 1990807us

result1 : 898510266
virtual methods : 1976209us
result2 : 898510266
switch/case     : 1426730us

result1 : 898510266
virtual methods : 1963883us
result2 : 898510266
switch/case     : 1975236us

result1 : 898510266
virtual methods : 2021076us
result2 : 898510266
switch/case     : 1947571us

result1 : 898510266
virtual methods : 865201us
result2 : 898510266
switch/case     : 826346us
keean@ra ~/code/Clors $ ./test
test
result1 : 898510266
virtual methods : 859406us
result2 : 898510266
switch/case     : 824846us

result1 : 898510266
virtual methods : 1819161us
result2 : 898510266
switch/case     : 1933315us

result1 : 898510266
virtual methods : 1992200us
result2 : 898510266
switch/case     : 1982372us

Again apart from one outlier, they look about the same.

g++ profile-guided:

result1 : 898510266
virtual methods : 1897211us
result2 : 898510266
switch/case     : 1935109us

result1 : 898510266
virtual methods : 1908610us
result2 : 898510266
switch/case     : 1998429us

result1 : 898510266
virtual methods : 1897874us
result2 : 898510266
switch/case     : 1972943us

result1 : 898510266
virtual methods : 1930932us
result2 : 898510266
switch/case     : 1926157us

result1 : 898510266
virtual methods : 1913850us
result2 : 898510266
switch/case     : 1813990us

result1 : 898510266
virtual methods : 1956103us
result2 : 898510266
switch/case     : 1968317us

result1 : 898510266
virtual methods : 1905247us
result2 : 898510266
switch/case     : 1827178us

result1 : 898510266
virtual methods : 828129us
result2 : 898510266
switch/case     : 834788us

result1 : 898510266
virtual methods : 1931714us
result2 : 898510266
switch/case     : 1822788us

There does not seem to be a statistically significant difference here either.

Have you verified that the

Have you verified that the machine code that is generated is actually doing an indirect call?

Verified Assembly

Here's the assembly generated for the virtual method:

    movq    0(%rbp), %rdi
    movl    %ebx, %esi
    movq    (%rdi), %rax
    call    *(%rax)

And here's the assembly for the switch/case:

    movq    0(%r13), %rdx
    cmpl    $4, (%rdx)
    ja  .L67
    movl    (%rdx), %eax
    jmp *.L69(,%rax,8)

So it looks like it generates an indirect jump for switch.

So I probably need to rerun with 'if/else' instead of switch/case. Replacing the switch/case with:

auto const t = a[i]->tag;  // the enum inside 'thing' is unnamed, so let the compiler deduce its type
if (t == thing::one) {
    int const v = a[i]->value.y1;
    x = x + v;
} else if (t == thing::two) {
    int const v = a[i]->value.y2;
    x = x - v;
} else if (t == thing::three) {
    int const v = a[i]->value.y3;
    x = x * v;
} else if (t == thing::four) {
    int const v = a[i]->value.y4;
    x = x / ((v == 0) ? 1 : v);
} else {
    int const v = a[i]->value.y5;
    x = v - x;
}

The assembly for g++ if/then/else looks like this:

.L106:
case1:
    addl    4(%rax), %ebx
.L68:
iftree_end:
:
:
:
iftree_start:
    movq    0(%r13), %rax
    movl    (%rax), %edx
    testl   %edx, %edx
    je  .L106
    cmpl    $1, %edx
    je  .L107
    cmpl    $2, %edx
    je  .L108
    cmpl    $3, %edx
    je  .L109
case5:
    movl    4(%rax), %eax
    subl    %ebx, %eax
    movl    %eax, %ebx
    jmp .L68
:
:
:
.L107:
case2:
    subl    4(%rax), %ebx
    jmp .L68
    .p2align 4,,10
    .p2align 3
.L108:
case3:
    movswl  4(%rax), %eax
    imull   %eax, %ebx
    jmp .L68
:
:
:
.L109:
    .cfi_restore_state
case4:
    movswl  4(%rax), %ecx
    movl    %ebx, %eax
    testl   %ecx, %ecx
    cmove   %esi, %ecx
    cltd
    idivl   %ecx
    movl    %eax, %ebx
    jmp .L68

The results with g++ look like this:

result1 : -1716938139
virtual methods : 1839387us
result2 : -1716938139
switch/case     : 1859421us

result1 : -1716938139
virtual methods : 865422us
result2 : -1716938139
switch/case     : 1503111us

result1 : -1716938139
virtual methods : 1982474us
result2 : -1716938139
switch/case     : 1945302us

result1 : -1716938139
virtual methods : 1992082us
result2 : -1716938139
switch/case     : 854776us

result1 : -1716938139
virtual methods : 1999764us
result2 : -1716938139
switch/case     : 1393510us

result1 : -1716938139
virtual methods : 2030440us
result2 : -1716938139
switch/case     : 2007008us

result1 : -1716938139
virtual methods : 1922737us
result2 : -1716938139
switch/case     : 1783523us

result1 : -1716938139
virtual methods : 2002159us
result2 : -1716938139
switch/case     : 1984027us

There doesn't appear to be a statistically significant difference with if/then/else using g++ either. As the branches are already ordered, profile-guided optimisation makes little difference.

results for clang++:

result1 : -1716938139
virtual methods : 876797us
result2 : -1716938139
switch/case     : 1661169us

result1 : -1716938139
virtual methods : 2096607us
result2 : -1716938139
switch/case     : 1671649us

result1 : -1716938139
virtual methods : 2056776us
result2 : -1716938139
switch/case     : 1542670us

result1 : -1716938139
virtual methods : 1466639us
result2 : -1716938139
switch/case     : 1631889us

result1 : -1716938139
virtual methods : 2081096us
result2 : -1716938139
switch/case     : 1412045us

result1 : -1716938139
virtual methods : 2129466us
result2 : -1716938139
switch/case     : 1340515us

result1 : -1716938139
virtual methods : 1396443us
result2 : -1716938139
switch/case     : 747637us

result1 : -1716938139
virtual methods : 968393us
result2 : -1716938139
switch/case     : 1572550us

I would cautiously say that with clang++ if/then/else has a slight performance advantage over switch/case and virtual methods. It's interesting that clang++ manages to do this but g++ does not, so it appears to be highly compiler dependent.

This is the assembly that clang++ generates for if/then/else:

iftree_start:
    movq    (%rsi), %rax
    movl    (%rax), %eax
    cmpq    $3, %rax
    jbe .LBB1_31
# BB#36:                                #   in Loop: Header=BB1_30 Depth=1
case5:
    movq    (%rsi), %rax
    movl    4(%rax), %eax
    subl    %ebx, %eax
    jmp .LBB1_37
    .align  16, 0x90
.LBB1_31:                               #   in Loop: Header=BB1_30 Depth=1
    jmpq    *.LJTI1_0(,%rax,8)
.LBB1_32:                               #   in Loop: Header=BB1_30 Depth=1
case1:
    movq    (%rsi), %rax
    addl    4(%rax), %ebx
    jmp .LBB1_38
.LBB1_33:                               #   in Loop: Header=BB1_30 Depth=1
case2:
    movq    (%rsi), %rax
    subl    4(%rax), %ebx
    jmp .LBB1_38
.LBB1_34:                               #   in Loop: Header=BB1_30 Depth=1
case3:
    movq    (%rsi), %rax
    movswl  4(%rax), %eax
    imull   %eax, %ebx
    jmp .LBB1_38
.LBB1_35:                               #   in Loop: Header=BB1_30 Depth=1
case4:
    movq    (%rsi), %rax
    movswl  4(%rax), %edi
    testl   %edi, %edi
    cmovel  %ecx, %edi
    movl    %ebx, %eax
    cltd
    idivl   %edi
    .align  16, 0x90
.LBB1_37:                               #   in Loop: Header=BB1_30 Depth=1
    movl    %eax, %ebx
.LBB1_38:                               #   in Loop: Header=BB1_30 Depth=1
iftree_end:

So my guess would be that clang++'s code layout fits into the L1 cache (edit: L1 seems a bit big, maybe the re-order buffer?), and the performance difference is not due to any branch prediction, but simply to the compact and contiguous nature of the code. It may be that if the core of the loop were bigger, doing more real work, the advantage would disappear. I also wonder why g++ uses such a weird memory layout for the code.

Nice work. If I'm not

Nice work. If I'm not mistaken then clang is doing an indirect jump whereas gcc is doing a series of tests, so that could be the difference. It's weird that clang compiles the ifs to a more efficient indirect jump than the switch!?

If it's due to caching it could be the uop cache, which is very small.

Assembly for each pattern.

Both clang++ and g++ compile if/then/else to a tree of branches, but with different memory layouts and choices of branch (jne vs je).

Both clang++ and g++ seem to switch between a branch-tree or an indirect jump depending on the number of cases and whether there is a default option. They both happen to generate indirect jumps for the sample code, but this changes if I make the last switch option a "default", when g++ seems to generate a branch-tree, and clang++ a combination of branches and indirect-jumps. There are obviously some heuristics going on here.

Both clang++ and g++ generate an indirect call for virtual functions.

Both clang++ and g++ compile

Both clang++ and g++ compile if/then/else to a tree of branches, but with different memory layouts and choices of branch (jne vs je).

What about:

This is the assembly that clang++ generates for if/then/else:

[... code ...]
jmpq    *.LJTI1_0(,%rax,8)
[... code ...]

?

You are right.

I missed that. It would explain why only g++ gets the 400% performance improvement with smaller datasets (~1000 elements), and clang++ does not. I wonder why clang++ prefers indirect-jumps over branch-trees?

No idea, but it appears to

No idea, but it appears to be a performance bug in this case. The LLVM developers may be interested in a bug report.

Clang just botched the

Clang just botched the heuristic (I guess). Maybe everybody knows that, but given enough tests, an indirect jump is faster; that's why a compiled switch sometimes uses a jump table. Clang's heuristic is wrong here, but it can be nice that clang can compile nested ifs to a jump table.

Computed Goto For Efficient Dispatch Tables

I thought this was interesting and relevant to the above thread:

http://eli.thegreenplace.net/2012/07/12/computed-goto-for-efficient-dispatch-tables

What is interesting is that it shows one of the cases where computed gotos are more efficient due to the way branch prediction works. I wonder if this is why my unification algorithm performed better with virtual functions instead of a switch/case block: because it used double dispatch, there are efficiencies in the pairings that don't get expressed in the same way, or maybe I could have structured the branch tree differently knowing how branch prediction works.
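For anyone not clicking through, the trick looks roughly like this (a toy sketch using the GNU labels-as-values extension, which gcc and clang both accept): each handler ends with its own indirect jump, so the predictor gets one branch per opcode instead of a single shared dispatch branch.

#include <cstdio>

// Toy bytecode: 0 = halt, 1 = increment, 2 = decrement.
static int run(const int* pc) {
    static void* table[] = { &&op_halt, &&op_inc, &&op_dec };   // label addresses (GNU &&)
    int acc = 0;
    goto *table[*pc++];                        // initial dispatch
op_inc:  ++acc; goto *table[*pc++];            // re-dispatch at the end of each handler
op_dec:  --acc; goto *table[*pc++];
op_halt: return acc;
}

int main() {
    const int prog[] = {1, 1, 1, 2, 0};        // inc, inc, inc, dec, halt
    std::printf("%d\n", run(prog));            // prints 2
}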

Low level optimizations become obsolete

they keep adding new optimizations to these processors (then taking them back out for small-silicon-footprint versions).

And they're so complex that they can't document them all. And there are so many versions of processors out there.

So in the end, for the sake of sanity, I'd say differences in speed of a factor of 2 or so should just be considered noise. I remember papers on all sorts of weird optimizations that became obsolete as processors got smarter - self-modifying code, etc. In the long run, the processors will catch up to common coding practices.

Sorry, no.

You are basically saying "The layers below me will just magically become faster. I'm only responsible for whole factors of 2."

This bugs me a lot. Not just because it's a dodge of responsibility, but because, employed at every layer, you get lots of factors of 2. Not only do you get slow, but you usually end up with so much variance that there's even more "noise" and that factor of 2 keeps growing.

I'll give you an example. These days there is a loop stream detector in Intel chips that detects a stream of instructions that forms a loop if it falls into fewer than N cache lines. That loop stream detector skips the decode step for those cache lines and can make loops 40% faster! The larger your machine code, the more cache lines. Lots of optimizations all over the compiler try to reduce the code to fewer instructions, move code out of loops, select more efficient encodings, select more beneficial registers to achieve smaller prefixes, avoid moves, etc. All of which help just a little tiny bit--perhaps not much to even measure. But every one that reduces the code size by a little bit improves the chances that a small loop will fit in this LSB and go 40% faster. Are all those low-level optimizations obsolete? No. Resoundingly no.

You can take one or two little optimizations away and probably not notice, until you fall off the cliffs one by one. First you miss the LSB buffer size, then you run out of BHB entries, then the instruction cache thrashes or the pipeline stalls due to memory dependencies on the _stack_ because you decided not to try very hard to allocate registers, and on and on. If you do nothing, you will get shit code. Maybe your micro experiments don't show it, but scale up all the cuts, and death ensues.

And they're so complex that they can't document them all. And there are so many versions of processors out there.

That's another dodge. Most low-level optimizations are obvious no-harm improvements and it'd be silly not to do them.

It's based on research since

It's based on research since the 80s on fast dynamic dispatch. What you're missing is that indirect jumps are hard to predict, except on recent CPUs, and mispredictions require flushing the pipeline; on a Pentium 4, I'd guess that wastes 30-40 cycles (the pipeline depth). What we're debating in the rest of the subthread is whether some recent Intel CPUs improve branch prediction dramatically, and what the effect is.

To be fair, I'm just comparing genuine C++ virtual calls with efficient VMs. Ruby, Python etc. are even slower.

For more, you can look at the literature pointers in V8 docs:
https://developers.google.com/v8/design?csw=1
starting from "Efficient Implementation of the Smalltalk-80 System".

Slow vms are a choice

There is no reason that Ruby couldn't flatten its virtual tables to use the same mechanism as C++. The tables would be longer and dynamic but the mechanism could be the same.

I implemented that in Racket (Scheme) once and tested it at an order of magnitude faster than the OO system that came with it.

But getting the hobbled macro system to implement all of the features of OO was beyond me, so I gave up. The code for the existing OO system was so horrible as to scare anyone away.
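
For concreteness, here is a rough sketch (in C++, not the Racket mentioned above, and with invented names) of what a flattened, dynamic method table looks like: each class owns one flat array of function pointers indexed by a global selector id, so a send is one load and one indirect call, just like a C++ vtable, only growable:

#include <cstddef>
#include <cstdio>
#include <vector>

struct Object;                          // opaque receiver
using Method = void (*)(Object*);       // simplified: no arguments, no return value

struct Class {
    std::vector<Method> table;          // one flat table per class, indexed by selector id
    void define(std::size_t selector, Method m) {
        if (table.size() <= selector) table.resize(selector + 1, nullptr);
        table[selector] = m;
    }
};

struct Object { Class* cls; };

// A send is one load plus one indirect call, the same shape as a C++ vtable
// call, just with a miss check because the table can grow at run time.
void send(Object* self, std::size_t selector) {
    Method m = selector < self->cls->table.size() ? self->cls->table[selector] : nullptr;
    if (m) m(self); else std::puts("does not understand");
}

int main() {
    Class point;
    point.define(0, [](Object*) { std::puts("to_s on a point"); });
    Object p{&point};
    send(&p, 0);    // hits the flat table
    send(&p, 1);    // miss: falls back to the "does not understand" path
}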

"Dynamic tables" aren't a

"Dynamic tables" aren't a trivial complication, but I agree that after Google V8 we basically know what to do — I and a colleague prototyped this for a Python subset for this course.

But PyPy people claim that the engineering effort to reimplement the full semantics in a compiler is too big, simply because Python's and Ruby's OO has too many special cases to get right; anecdotally, Google's Unladen Swallow gave up, so they might have a point.

Yet it's generally agreed that the authors of the Python and Ruby interpreters ignored a lot of consolidated wisdom in VM design altogether, without (convincing) reasons.

Generic Vectors

It sounds like neurons are best represented by vectors, and operations on them are probably easier using some kind of vector operation. There is still a point to objects (and templates in C++) as you can write generic operations over the whole vector. Also using templates you can write operations where the neuron-structure is a type parameter.
Edit: my phone seems determined to "correct" generic to genetic for some reason.
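
As a minimal sketch of that last point (the Neuron type and its activate() member are placeholders, not the poster's code), a template can take the node type as a parameter and still operate on the whole vector:

#include <cstddef>
#include <vector>

// A placeholder node type; any type with an activate(float) member works.
struct Neuron {
    float bias = 0.0f;
    float activate(float x) const { return x + bias > 0.0f ? x + bias : 0.0f; }
};

// Generic operation over a whole vector, with the node type as a parameter.
template <typename Node>
std::vector<float> feed_layer(const std::vector<Node>& layer, float input) {
    std::vector<float> out;
    out.reserve(layer.size());
    for (const Node& n : layer)
        out.push_back(n.activate(input));
    return out;
}

int main() {
    std::vector<Neuron> layer(4);
    return feed_layer(layer, 1.0f).size() == 4 ? 0 : 1;   // monomorphised for Neuron
}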

I figured out how to do it.

I can't decide whether this solution is horribly beautiful or wonderfully ugly, but I figured out a way to do it.

With 12 different kinds of Neural_Structure, I need 144 feedforward routines and 144 backpropagation routines. I had sort of hoped that the OO system would generate those for me if I wrote code that acts on immediate arrays of Neural_Structure, but apparently it won't. To be polymorphic code, those need to be pointers with a cast operation on dereferencing. But those need to be tight little inner loops and matrix operations for CUDA to work properly, so any indirection to an object on the heap is wrong.

But, because all twelve of those types implement the same interface, all of those 144 versions of those two routines have exactly the same source code except for two type names each.

So I created a complex-but-sane macro that generates those functions given two type names (in both directions, _unless_ both names are the same), then a crazy variadic recursive macro that expands into a call to that one plus a recursive call to itself, iterating over all type pairs.

The result: 78 recursive expansions of the crazy macro resulting in 78 calls to the sane macro, expanding to 4 routine definitions in all but 12 cases and 2 routine definitions in the last 12 cases, for a total of 288 routines defined by two macros.

Sometimes I'm a bit ashamed of my M4 black belt; it's a bit like being the best there is at doing utterly horrible things.

So now I have two "overloaded calls" with 144 specialized versions each, and a whole lot of inlining and parallel operations, and no pointer chasing. Which is what I had wanted the OO system to do for me but it wouldn't, and which I suspect template metaprogramming could do but I couldn't figure out how to make that work.
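
For readers who want the flavor without the M4, here is a rough sketch of the same cross-product idea in plain CPP, with two placeholder node types standing in for the twelve real ones and a trivial loop body. It is not Ray's actual macro; the full 12x12 expansion is exactly the part that M4 (or a second level of X-macro indirection) automates:

#include <cstddef>

// Two placeholder node types standing in for the twelve real ones.
struct Neuron      { float out; };
struct PoolingNode { float out; };

// The "sane" macro: one tight, non-indirect routine per ordered type pair.
#define DEFINE_FEEDFORWARD(From, To)                                          \
    void feedforward_##From##_##To(const From* src, To* dst, std::size_t n) { \
        for (std::size_t i = 0; i < n; ++i)                                   \
            dst[i].out = src[i].out;                                          \
    }

// The cross product, written out by hand here; the "crazy" outer macro is
// what iterates over all the pairs automatically.
DEFINE_FEEDFORWARD(Neuron, Neuron)
DEFINE_FEEDFORWARD(Neuron, PoolingNode)
DEFINE_FEEDFORWARD(PoolingNode, Neuron)
DEFINE_FEEDFORWARD(PoolingNode, PoolingNode)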

Whatever. To be fair, the OO system isn't useless. It will be very beneficial in the rest of the code; the neural network is the central bit, but it needs a whole lot of support and framework and serialization and inputs and outputs and test case handling and so on. And that stuff doesn't need to be tight inner loops, so pointer indirections are no big deal there.

No, you just figured out OO is screwed

What you have just figured out is in fact that OO is provably stuffed. It's called the covariance problem. Anyone making claims about OO representing abstractions in general is a fraud. In particular anyone teaching OO at a university should be fired and if legal put in gaol for the rest of their natural life.

In the abstract, a binary operation A * A -> X is enough. In C++ it is easy to write the abstract class A with a pure virtual method accepting an A reference or pointer argument.

But now you have two implementations of A, call them B and C. How do you implement the binary method?

The answer is quite simple. It is completely and utterly impossible. It's that simple. You cannot do it. Clearly, to do so you must know the encoding (representation details) of both B and C, and clearly that breaks encapsulation. For just two implementations of A you need FOUR routines, but OO provides only linear dispatch.

Quadratic doesn't fit into linear. That's it. End of OO.

There's a special case. When a method has only *invariant* arguments, this is equivalent to a fixed family of methods with no arguments. In that case there is no covariant argument and it works. So for this special case OO is actually useful: device drivers which read and write characters are a good use of classes and virtual dispatch. I also use them heavily to represent function closures in my programming language Felix (one distinct abstract base for each function type!)

The point is: OO does not in any way shape or form represent any kind of general method of handling abstraction. There is ONLY one known way to do that, namely category theory, which is in fact the theory of abstraction, that is, representation of things by functions.

Functional languages do not have the covariance problem: for example, if you use a discriminated union (variant, sum type) to unify the types B and C, the language forces you to implement the four required routines. There may be shortcuts if the operator is symmetric, of course.
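
As a concrete illustration of the sum-type version (C++17, with invented types B and C and an invented Combine visitor, not anyone's real code): std::visit over two variants refuses to compile unless all four ordered combinations are covered, which is exactly the quadratic table being described:

#include <cstdio>
#include <variant>

struct B { int   bits;  };
struct C { float level; };
using A = std::variant<B, C>;

// One routine per ordered pair; each one sees both concrete representations.
// std::visit will not compile unless all four combinations are handled.
struct Combine {
    int operator()(const B& l, const B& r) const { return l.bits & r.bits; }
    int operator()(const B& l, const C& r) const { return l.bits * static_cast<int>(r.level); }
    int operator()(const C& l, const B& r) const { return static_cast<int>(l.level) * r.bits; }
    int operator()(const C& l, const C& r) const { return static_cast<int>(l.level + r.level); }
};

int main() {
    A x = B{6}, y = C{2.0f};
    std::printf("%d\n", std::visit(Combine{}, x, y));   // picks the (B, C) case: 12
}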

The point is, there is no pretense here of abstraction. If the language has subtyping and a sound type system, it works where it is valid and not otherwise. In this respect, C++'s type system is sound: it prevents you from actually implementing your binary method with four routines. Unless you break encapsulation by using a dynamic cast chain, which of course only works if you have a fixed, known set of derived classes.

Please note Bjarne Stroustrup went to great pains to ensure the C++ community used the term *derived class* and NOT subclass. Classes and inheritance are features of C++, but it is not an OO language.

So the answer to your question, "How do you model a covariant method in an OO language?", is simple: you can't. Don't try. Give up on the rubbish you're using and use a language or method which makes no claims to be OO. If you're writing C++, use functional methods: templates, for example.

You presume one-dimensional VMT's.

There is no law of the universe that says virtual method tables are required to be one-dimensional. There is absolutely no reason why type dispatch should not be a reference into a two-dimensional or three-dimensional (etc.) array, depending on how many polymorphic arguments a routine has; indeed I've used systems that handle OO type dispatch in exactly that way (CLOS to name one - but I couldn't get CLOS to play nice with the REST of my problem here).

Although I wouldn't mind if the system generates a "Did you really mean to do this?" warning before it makes more than, say, 500 different type specializations of any routine.

Oh, for the love of braindead zombies...

And another body blow to OO has been visited upon me. C++ has no dynamic dispatch on anything except the receiver: overloaded functions are resolved statically, so they can't pick a specialization based on the run-time types of their arguments.

Sigh. A slight revision to my macrology, and now I have 288 routines named feedforward_1_1, feedforward_1_2, feedforward_1_3, ..... feedforward_12_12 and likewise backprop_1_1, backprop_1_2, etc.

And I will now make wrapper routines that do their own damn function dispatch using a static constant 2-dimensional array of function pointers.

I swear this is like trying to get a slug through a salt maze.
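
In case it helps anyone following along, here is a small self-contained sketch of that wrapper idea, with two placeholder node types instead of twelve and invented names throughout (and using a template rather than macros to stamp out the specializations, just to keep the sketch short): dispatch happens once per pair of arrays through a static 2-D table of function pointers, and the specialized loop itself contains no per-element branching:

#include <cstddef>
#include <cstdio>

struct Neuron      { float out; };
struct PoolingNode { float out; };
enum Tag { NEURON = 0, POOLING = 1, NUM_TYPES = 2 };

// Every specialized routine shares one type-erased signature.
using FeedForwardFn = void (*)(const void* src, void* dst, std::size_t n);

template <typename From, typename To>
void feedforward_impl(const void* src, void* dst, std::size_t n) {
    const From* s = static_cast<const From*>(src);
    To*         d = static_cast<To*>(dst);
    for (std::size_t i = 0; i < n; ++i)
        d[i].out = s[i].out;            // tight inner loop, no branching
}

// The 2-D dispatch table: one entry per (source type, destination type) pair.
constexpr FeedForwardFn feedforward_table[NUM_TYPES][NUM_TYPES] = {
    { feedforward_impl<Neuron, Neuron>,      feedforward_impl<Neuron, PoolingNode>      },
    { feedforward_impl<PoolingNode, Neuron>, feedforward_impl<PoolingNode, PoolingNode> },
};

// The wrapper does its own dispatch: one table lookup per pair of arrays.
void feedforward(Tag from, const void* src, Tag to, void* dst, std::size_t n) {
    feedforward_table[from][to](src, dst, n);
}

int main() {
    Neuron      a[3] = { {1.0f}, {2.0f}, {3.0f} };
    PoolingNode b[3] = {};
    feedforward(NEURON, a, POOLING, b, 3);
    std::printf("%g\n", b[2].out);      // 3
}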

Template Specialisation

If you know all the types statically, template specialisation uses all the type arguments, so it's the way to go. It also has the added advantage that all the polymorphism gets resolved by monomorphisation, resulting in better performance.
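
A minimal sketch of that route, with placeholder node types: the routine is parameterized on both element types, so each call site picks its specialization at compile time and there is nothing left to dispatch at run time:

#include <cstddef>
#include <vector>

struct Neuron      { float out; };
struct PoolingNode { float out; };

// Parameterized on both node types: every call site is monomorphised, so
// there is no indirect call and no run-time type check inside the loop.
template <typename From, typename To>
void feedforward(const std::vector<From>& src, std::vector<To>& dst) {
    dst.resize(src.size());
    for (std::size_t i = 0; i < src.size(); ++i)
        dst[i].out = src[i].out;
}

int main() {
    std::vector<Neuron> layer1(8, Neuron{1.0f});
    std::vector<PoolingNode> layer2;
    feedforward(layer1, layer2);            // resolved entirely at compile time
    return layer2.size() == 8 ? 0 : 1;
}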

Ages ago, I've used the full

Ages ago, I used the full bag of C++ template metaprogramming tricks for this (including CRTP). But I find it hard to recommend in general, unless you can justify getting a C++ black belt rather than, say, a D course (even in D some template metaprogramming is easier to express, but I'm not sure C++'s style of static polymorphism becomes easy to express).

Once upon a time

maybe fifteen years back, I had a notion for what I called "owned multimethods". The idea was that any object could know how to handle various messages with parameters of various types and itself in various positions (not just the front position). When the dispatcher is presented with a message, it may use a method belonging to any of the parameters. I remember, I compared it to students doing a group project; they schedule a meeting place and time, all get together, there's an awkward pause, then somebody says, "does anyone know what we're supposed to be doing?"

But the idea raised a variety of surrounding questions that ultimately led me to rather lose faith in static typing, so I never implemented it.

C++isms

Wouldn't that be a multi-method that is a friend of all the classes?

My personal preference is for separating data hiding from overloading, so that data hiding is achieved by something like a closure, and overloading by something like type classes. Objects are then the fixed points of types.

Coherence

How often do your types vary? If it's often, you are going to get screwed by branching hazards during SIMT execution. To avoid that you need to either lay your objects out in the vector so that same-typed elements are relatively contiguous, or eliminate branches in your behavior somehow (e.g. by using expression-based conditionals).

Types vary at interfaces between layers.

Each contiguous array is an array of like-typed elements. My method specializations need to take an array of any one Neural_Structure type and feedforward (or backpropagate) to an array of any one Neural_Structure type. There may be absolutely no type branching or pointer chasing from inside those inner loops.

Type dispatch (figuring out exactly which inner loop to call in the first place) requires knowing exactly which two subtypes we have arrays of. I can do it with a two-dimensional array of function pointers if the compiler does something dumber than that with function type dispatch.

edit: And the compiler did something dumber than that with type dispatch. It looks only at the first argument to do type dispatch and calls any specialization on any other argument an error. So now I need to make my own method table for both of those routines.

If the entire arrays are

If the entire arrays are monomorphic, then I think templates are the way to go, but you have to plumb exact array types all the way through.

This would work in C# at least, using generics (and all the dispatch would be resolved at compile time for structs if they are used as the type arguments).

Macros not Templates?

Surely you mean templates? You can make the functions parametric on the types that need to change. Parametric types are a far safer thing than the string manipulation that goes on in 'C' macros.

There is probably a way to do that. I didn't find it.

I must confess that I have the M4-fu born of decades of experience writing C programs and the template metaprogramming-fu of maybe 5 or 6 C++ projects under my belt. Most of what I've written has been either very high-level code (in Lisps or the like) or very low level code that has to go down to the grotty-bits level in exhaustive verification to be sure that things are NOT happening.

Much of what I've written is crypto and security related, and when you have to verify things down to the bit level before deploying - for example, to specifically guarantee that things like copying sensitive data ARE NOT happening, even if that's a faster way to produce the desired return, and to make sure that writes to delete data at sensitive locations ARE happening, even if they aren't necessary to produce the desired function return - C is preferred, because it makes the generated machine code easier to read and easier to relate to the source code, and makes it harder to hide sneaky stuff.

C++ is frowned on in those fields because the machine code is harder to read and harder to relate directly to the source, because operator and call overloading can make source code that looks like it does something obvious do something else, because references can look just like direct uses of variables, and because it gives malicious actors other sneaky ways to write code that misleads reviewers into examining the wrong routine, or that does something subtly different from what it "obviously" appears to be doing.

So I'm relatively inexperienced in C++, and I didn't figure out how to get the template system to do what I needed.

I don't doubt that there's a way, and I did spend some time trying to figure it out, because "it's C++ and that's the way you're supposed to do it in C++". It's just not a way I found before realizing that (A) I already knew how to do it using m4 macros, and (B) C++ code goes through m4 as well as template expansion.

C++ code doesn't actually go

C++ code doesn't actually go through m4 — but I agree it is still similarly expanded, and I guess you know that.

I'm sympathetic to your concerns on crypto in C++. I've casually discussed similar problems for crypto in Haskell O_O.

Which compiler?

I think m4 preprocessing is compiler specific. g++ puts both C and C++ through the preprocessor, so you can write C macros in C++. I suppose some compilers not having a preprocessor pass is a good reason not to use macros in C++ though.

M4 != CPP, CPP is a standard part of C++

Sorry, but that sounds... wrong. And there's some bigger misunderstanding in this discussion.

TL;DR: The C preprocessor is standard, also in C++, and you can't even use headers without it; M4 is a *different* preprocessor, but it *seems* that all mentions of M4 in this discussion (apart from my ones) should probably read "C preprocessor" O_O.

CPP (the C preprocessor) is a standard part of the C++ standard (see Sec. 2.1, at least in my draft), and some standard library APIs are *specified* to be macros — starting from NULL. Also headers are handled by the preprocessor. However, the *name* CPP might be GCC-specific, and is not used in the standard as far as I know. It handles `#include`, `#define` and so forth.

M4 is a different preprocessor, standardized by POSIX (https://en.wikipedia.org/wiki/M4_(computer_language), https://www.gnu.org/software/m4/manual/m4.html). Very different from CPP. It's used to implement autoconf, if you're into that kind of thing. I never touched it. Example from the manual:

define(`foo', `Macro `foo'.')dnl A very simple macro, indeed.
foo
⇒Macro foo.

Until now, I assumed that whenever Ray Dillinger talked about M4 preprocessing, he talked about really using M4 as an additional preprocessing pass, presumably because his macros were too fancy for CPP.
Now I suspect that he always talked about the real C/C++ preprocessor. And I guess that using M4 for C is a reasonable option (they have very different lexers).

You are correct. My

You are correct. My makefile directly calls M4, and M4 macros are processed before CPP is invoked by GCC.

I could also have done it with sed, but didn't. I don't know if I could do it with CPP; I think I probably could. But if I'm willing to expend that amount of effort, I'll spend it on figuring out C++'s native template metaprogramming system.

Partial Evaluation

But "generalizes" rather than "extends" would have been more efficient, because Neural_Structure adds overhead and handling that Neuron doesn't need but which Neuron inherits.

If your layout of node types is static over a run then partial evaluation should remove all of the overhead. In the case that you have a switch/case or if-tree that checks each type and decides which of the specific 288 bodies to execute, those tags could be extracted from the objects and stored in a separate array. That array is then constant (static) for any particular topology that you are testing. Cmix would be a good candidate to try and partially evaluate with respect to the contents of that array.

If you are lucky, generating that array as an explicit constant value and then compiling your code against it may have the same effect. It depends on whether or not the compiler can propagate the constants per-element into the branch code - at which point it should eliminate it and call the routine directly.
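
A rough illustration of that lucky case (invented names, and no Cmix involved): if the per-node tags live in a constexpr array, an ordinary constant-propagating compiler has everything it needs to fold the tag test away after unrolling:

#include <cstdio>

enum Kind { NEURON, POOLING };

inline float do_neuron(float x)  { return x > 0 ? x : 0; }
inline float do_pooling(float x) { return x * 0.5f; }

// Static topology: this array is a compile-time constant for a given build.
constexpr Kind layer_kinds[3] = { NEURON, NEURON, POOLING };

float run(float x) {
    for (Kind k : layer_kinds)
        // With layer_kinds constexpr, a constant-propagating compiler can
        // resolve this branch per iteration once the loop is unrolled.
        x = (k == NEURON) ? do_neuron(x) : do_pooling(x);
    return x;
}

int main() { std::printf("%g\n", run(2.0f)); }   // prints 1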

Compilers not good at partial evaluation.

For ages C/C++ compilers did very little of this. Now if you declare something "constexpr" it may get statically calculated.

An experiment: write a Mandelbrot generator and hard-code the coordinates into the program. In theory the Mandelbrot set can be calculated statically, but I don't know of a single compiler that will do so.

I suspect that partial evaluation is nowhere near as widespread and well developed as people think it is.
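
For what it's worth, here is a small constexpr sketch of that experiment (C++14, with arbitrary names and iteration cap): the escape time for hard-coded coordinates can be computed at compile time, but only because the programmer forces it with constexpr and static_assert, which rather supports the point that compilers don't do this spontaneously:

#include <cstdio>

constexpr int escape_time(double cr, double ci, int max_iter = 256) {
    double zr = 0.0, zi = 0.0;
    for (int i = 0; i < max_iter; ++i) {
        double zr2 = zr * zr - zi * zi + cr;
        zi = 2.0 * zr * zi + ci;
        zr = zr2;
        if (zr * zr + zi * zi > 4.0) return i;   // escaped
    }
    return max_iter;                             // assumed inside the set
}

// static_assert forces the evaluation to happen in the compiler, not at run time.
static_assert(escape_time(0.0, 0.0) == 256, "origin is in the set");
static_assert(escape_time(2.0, 2.0) < 4,    "far points escape quickly");

int main() { std::printf("%d\n", escape_time(-0.75, 0.1)); }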

Constant propagation is a form of partial evaluation

It is true that partial evaluation in its most general form is unusual, but that was not the form necessary here. Constant propagation is one special case of partial evaluation. Constant propagation is implemented in every compiler that I can think of. Partial evaluation of this particular problem can be performed using constant propagation.

The availability of partial evaluation in general is not relevant: the specific case needed for this partial evaluation is widespread. Sometimes when you do not have the tools that you need, the tools that you have can be used in unusual ways. Not every problem is a nail, but a hammer makes a surprisingly good screwdriver when you are desperate enough.

Inlining

Inlining is another very common form of partial evaluation.

True.

Without seeing any of Ray's code it's hard to guess which technique would instantiate his family of methods with the least pain. If I'm guessing blind I would probably use templates - where the implementation in C++ is a bit laborious for indicating static binding times, but at least it allows integer parameters, which would probably be helpful given the varying connection topologies of the neuron types in the "hierarchy".