## function overriding workaround in dynamically typed languages

hi,
i had a chance to make my hand dirty with ruby. IMHO untypedness of these dynamically typed language is more a nuisance than as a help. define a class, and if you want your behaviour different depending upon the type of object, only option left is check if it is one of the kind of object you are interested and then work accordingly..

def fun(inobj)
if inobj.kind_of?(obj1)
do something..
elsif inobj.kindof?(obj2)
do something.
endif

except these switch cases, is there any better way of handling the situation, where overridden function would have been helpful (in pure OO sense)

regards
chinmay

## Comment viewing options

### Google up on ruby ducktyping.

Google up on ruby duck typing.

The "multi" package at http://multi.rubyforge.org/ --

multi(:fib, 0) { |x| 0 }
multi(:fib, 1) { |x| 1 }
multi(:fib, Integer) { |x| fib(x-1) + fib(x-2) }
fib(1)  # => 1
fib(15) # => 610


It's probably not "the Ruby way," but it's surprisingly non-painful.

### Indeed

Indeed. Languages such as Ruby excel at polymorphism, and Ruby's typing plays a large role in making all of the capability simple to understand.

### RuleDispatch

For Python there's RuleDispatch

### In Ruby, you can modify classes.

Quoted from Ruby in Twenty Minutes:

In Ruby, you can open a class up again and modify it. The changes will be present in any new objects you create and even available in existing objects of that class.

So, you could do:

 class Obj1 def _fun_ do something... end end class Obj2 def _fun_ do something... end end def fun(inobj) inobj._fun_ end 

I believe that is a considerable advantage of dinamically typed languages (note that this can be done by a dinamically loaded library, which would thus change some class of the main application and then use the objects, created by the application, as arguments to the application's methods that weren't expecting such object at all!!). Does anyone have an idea how this could be done in statically typed languages (or, at least, without significant performance penalty)?

### Define significant, I think

Define significant, I think the dynamic lookup needed to implement this in Ruby is considerable (well, unless they are using some fancy Self-like optimizations). Scala supports implicit coercions, so we can add new methods to objects during compilation. It doesn't cover all situations where duck typing is needed, but captures many use cases.

### I don't think type feedback

I don't think type feedback is all that fancy really. It isn't practical for one off scripts as Ruby is typically used for, it probably wouldn't even get fired off on 90% of the things ruby is used for; it works better with a compiler, dynamic compilation and long running programs or persistant systems.

### Well, you should consider

The idioms for developing in Scala and Ruby are different, so maybe the 90% is addressed by other mechanisms (e.g., mixins and virtual class constructions). There are cases where dynamic refinement and adaptation are useful, but in general these mechanisms are used to do things that are necessarily or fairly static.

At any rate, it would be nice if you could share some specific examples. This is in my area of interest.

### +1

It's in my area of interest as well...

### Self is a dynamically typed,

Self is a dynamically typed, pototype based OO language and system. This creates a bit of flexibility in terms of what you can do in these situations but usually, you'd do what Tom suggests.

Say you have two objects, x, y and z and you're writing a method which you want to work on all objects so you write (in Io sudo code):
 obj2 fun1 := method(obj2, obj2 foo) 
now x, y and z don't have method foo, so because Self (and Io) are prototyping languages you define it!
 x foo := ... y foo := ... z foo := ... 
Now, running "obj2 fun1(x)" works just fine for x, y and z, without any sort of case or if statement.

Now with Self's type feedback system, say you run "obj2 fun1(...)" 100 times from a specific bit of code and 98 times it's x and 1 time it's y and 1 time it's z. This bit of code would probably get optimized to something from:
 obj2 fun1(...) 
to having an internal representation of (assuming x foo isn't too big):
 if(obj2.type() == "x") { *contents of x foo* } else { goto unoptimized_version; } 
That optimizes the common case by inlining the method call and avoiding dynamic dispatch, without too much effort.

It's all in one of the self white papers, so that's all I'll say about it, but type feedback is an optimization technique, not a work around for case statements, the language provides those.

I'm not familiar with Scala, but I'd imagine you could do some sort of pattern matching to avoid writing out if/case statements.

### Cheap polymorphic inline caches

A large part of the gains from such a transformation will be the optimisaton opportunities uncovered by the inlining. However, if we only focus on the misprediction cost, there are simple ways to emulate the effect of polymorphic inline caches. Indirect branch prediction lets us 'fake' inline caches (even with the basic last-taken heuristic), without runtime code generation (since mispredictions are so expensive on modern cpus, the lookup cost seems nearly negligible). We can use the same indirect branch prediction heuristic for PICs.

For example,
 addr : address of the function to call hash : hash value of addr (in [0, 3]), potentially precomputed

 

if (0 == hash)    (*addr)(); else if (1 == hash)    (*addr)(); else if (2 == hash)    (*addr)(); else    (*addr)(); 
(ignore the fact that a compiler will often fold the branches away)

If the call site is polymorphic (but not too much), we can expect the ~3 call targets to have different hash values, and thus be distributed to different indirect call instructions. The only misprediction will be in the conditional branches, and we expect, on average, at most 1 misprediction/call. Branch predictors are much more sophisticated for conditionals than for indirect branches. Empirical results indicate that, overall, the penalty for monomorphic call sites is small, with large gains for polymorphic sites with well-predictable patterns.

### Or even better, use binary

Or even better, use binary tree dispatch and have log2(n) runtime instead of linear [1]. The above check becomes a nested if statement:

if (2 < hash)
if (0 == hash)
else /*1 == hash */
else /* 2 == hash */


The improvement is significant above 4 or 5 cases IIRC.

[1] See refs I added to: http://en.wikipedia.org/wiki/Virtual_method_table

### Cuckoo

Io uses cuckoo hashing, which is O(1) for lookup without exception.

### The problem is not so much

The problem is not so much the big-O of the lookup algorithm, in many cases it's the CPU's branch mispredictions and cache misses that dominate virtual dispatch. See the two papers I referenced in the wikipedia article for details. The linear or binary if-branch approach can be easily predicted by modern processors which leads to no pipeline stalls. For hash-based lookups, you incur (possibly multiple) indirect memory accesses which often can't be predicted.

### Io and cuckoo hashing.

Io uses cuckoo hashing, which is O(1) for lookup without exception.

Except that that only applies to a lookup in a single object. Io has to search the object and its inheritance tree for every message send. So the expected time to lookup a message is linear in the number of objects in an object's inheritance tree.

Also, if you are trying to improve overall performance then you'd want to optimize the average case, not the worst case. Linear probing outperforms cuckoo hashing in the average case for the most common operations.

### More mispredicts...

Binary search leads to more mispredictions than a linear search (expected) [in addition to being more bug-prone ;)]. Note that the graph is logarithmic on the x axis. The situation isn't getting much better with newer architectures either.

That link discusses "searches" based on scanning a linear region of memory, so I agree, cache misses due to the random memory access patterns alone favour linear search.

The kind of search we are talking about is an unrolled loop with no memory references (all register compares). If you check the referenced papers, binary beats linear with as few as 4-5 cases, and it should be obvious that this is the case: since the hash is known ahead of time (before any comparisons), and the hash values are statically known, branch prediction should be fairly straightforward, and stalls almost non-existant (though still not as straightforward as predicting a linear sequence of comparisons).

### (This is getting a bit OT. http://www.quicktopic.com/ ?)

For a polymorphic site, a binary search does *not* lead to easily predicted patterns! The link actually discusses search trees exactly like the one you describe (to determine whether a value falls in [0, 0.25), [0.25, 0.50), ..., with constant intervals). Assume an uniform distribution among 2^n values. A binary search will lead to n/2 mispredicted branches and n/2 well predicted ones (since each branch is taken with p=.5). A linear search will instead have, on average, (2^n)/2 - 1 well predicted branches and 1 misprediction (simply assume branches are never taken).

Mispredicts are much more expensive, compared to well predicted branches, than they were on P-III. One would expect the size at which a binary search tree is more efficient than a linear scan to be larger. Moreover, branch prediction units are more sophisticated; I believe, but have not tested empirically, that a linear scan leads to more easily extracted patterns (than a binary search) when the call targets themselves follow a simple pattern. For example, when the target follows a cycle with a period of 4, as was tested in http://discontinuity.info/~pkhuong/branch-summary.txt, there is next to no penalty compared to the same code tested with a single constant target, on both P4s and Opterons.

(the code is at http://discontinuity.info/~pkhuong/branch.c)

### A binary search will lead to

A binary search will lead to n/2 mispredicted branches and n/2 well predicted ones (since each branch is taken with p=.5)

p=0.5 is unjustified. Call patterns are not (necessarily) random, and the analysis in the article assumes a random distribution. In fact, a quick search on branch misprediction yields numbers below 20%, which means a binary search quickly catches up to linear search. Of course, call patters are application-dependent, so results will vary.

As the paper I referenced concludes [1], the optimal dispatch technique depends on the call patterns, and overall they conclude:

For numbers of possible types up to 4, if sequences are most efficient. Between 4 and 10, binary tree dispatch is generally preferable. For more types, the best implementation is a classical table-based implementation such as currently provided by most JVMs for virtual calls. These are safe, conservative bets, that generally provide a significant improvement and, when not optimal, result only in a small decrease in performance.

Finally, these measurements expose architectural features (especially branch predictors) of the target hardware. For instance, when executing virtual calls the Pentium III branch target buffer ensures that constant patterns have performance nearly identical to that of slowly changing stepped patterns, whereas this is not the case for the UltraSparc III. Similarly, when executing if sequences, small cyclic patterns are predicted accurately by the Pentium's conditional branch predictor, which, for all JVMs, results in better performance on small cyclic patterns than on random patterns.

I don't doubt that mispredictions on the P4 are worse than the PIII, due to its ridiculously long pipeline, but as you say, this may be offset by a more sophisticated predictor. Since we're dealing with JITs (or at least, that's what I assume we're talking about), and I presume that the number of polymorphic types in an application generally falls between 3-8 (subjective experience in C#), that binary tree is an efficient, safe default if you had to choose a single dispatch technique.

One thing I'm not sure about: what is the misprediction cost of Intel's Core 2 and Core 2 Duo, since they are based on the PIII core?

Re: offtopic

I'm not sure if it's entirely offtopic, since we're discussing optimal polymorphic dispatch techniques which have direct applicability to virtual machines.

[1] 4.2 Discussion, http://www.sagecertification.org/events/javavm02/full_papers/zendra/zendra_html/index.html

### No JIT

The point is that we can have some of the advantages of PICs (which require a JIT) with static code, by exposing some of the potential regularity to the branch prediction units better than a simple indirect branch does. That is why the original snippet uses indirect branches. The code doesn't have to change, and, if there is no collision in the hash values for the call targets, each indirect branch will be well-behaved (only ever point to one address during a given period of time). If the set of (dynamic) targets changes, there'll be few mispredictions, but we expect the predictors to adapt quickly.

Does anyone have an idea how this could be done in statically typed languages (or, at least, without significant performance penalty)?

the traditional decorator pattern allows for this kind of object behavioural modification in static oo languages.

i think that this ease of modification and extension of classes is rather a huge disadvantage of dynamic languages. it encourages bad design: for any significantly complex system you would very much try to avoid behavioural modifications of this category, which lead to opaque and weakly specified component protocols (rendering such systems hard to analyse and predict).

dynamic languages profit from clear and well-defined interfaces, even though their agile nature and some best-practise methodologies might often suggest different.

### typical dynamic lookup cost

Sean McDirmid: dynamic lookup needed to implement this in Ruby is considerable

I'd expect ten to one hundred processor instructions if care is taken. It'd be fancy if memoization and caching are fancy. More expensive than C++ vtable dispatch by a small factor, but often informally amortized over a fair amount of code because low level tight loops rarely need dynamic dispatch spang in the middle. (It's something a coder might specifically aim to avoid doing per byte, say, when processing large streams.)

### I guess we should also

Also consider missed optimization opportunities; i.e., inlining. Again, the Self work addresses this, but I'd reason there is still a measurable hit in performance. Also, I don't think the Self optimization techniques are being used in dynamic languages, or even if the techniques are effective over the long term. It would be nice to see more retrospective papers on Self and what has happened to its technology.

### Cost of runtime optimization

There's also the fact that languages like Ruby are often used for scripting or web-related tasks (or other brief I/O-bound applications), and a given "program" may run for a few tenths of a second at most. It's unlikely that any optimization that impacts startup times would be permissible.

### I think Self would only

I think Self would only optimize code after it executed a few times so that it remained responsive. The Sun's Hotspot compiler also kind of works like this. The idea being that with a profiling JIT, you could have your cake (responsiveness) and eat it (throughput) too. If the Ruby program only runs for a short time, I have no idea why we are even talking about performance :)

### If the Ruby program only

If the Ruby program only runs for a short time, I have no idea why we are even talking about performance :)

It may run very often in its lifetime :)

Seriously. This type recording procedure is interesting. But I have no idea why it must be done any time the program starts? Usually the type profile doesn't show much variation after a run and if an unexpected signature occurs the program can still branch into unoptimized code or restart logging and updating the profile for a next run. I suspect JITs are not really necessary.

### I consider small programs

I consider small programs that run often as very similar to programs that run continually like interactive programs. In this case, the online profiling and optimization could be pretty effective.

### JIT + image

I consider small programs that run often as very similar to programs that run continually like interactive programs.

But it has an impact when a JIT is involved because most of the profiling and compilation effort is spent short after startup when the JIT doesn't yet know anything about the behaviour of the program. Later it has to respond to few changes occasionally. When the program terminates information is wasted - unless it is stored in an image that includes cached native code. A JIT + an image would do the job.

I guess one could variate this JIT + image pattern and provide something simpler e.g. a simple type recorder and a postprocessor that factors and compiles code according to the collected data.

### You're right that JITs are

You're right that JITs are not necessary, though byte code probably is, either static or run time generated (ie. Python/Lua). Ruby could do the same (YARV?). I don't believe that being image based (ie. Smalltalk/Self/Common Lisp) is necessary either. The Self paper on type feedback even describes how one could implement type feedback in a static compiler, and even go so far as to guess that it would be more effective in a static compilation setting. My thoughts are, that for a small program that is run often, it's probably better to recompile offline, or even record information offline too, with the option to turn on runtime optimization for long running programs.

### I don't know if it's 'better'

But you can create a dictionary where the keys are object types, and the values are functions, and dispatch it that way. It makes things look more like a config-option than code (even though it's code) and should be O(1) instead of O(n) if you have an insane amount of object types.

See also this Python PEP 3124 and the pointer to an actual implementation.

Note that the type annotation syntax does not contradict with the dynamically typed nature of Python.