## Standing on each other's feet

The email below was recently forwarded to me.

From: Kent Dybvig
To: qobi@purdue.edu
Subject: Re: benchmarking Chez

Jeff,

[...]  By longstanding policy, we do not publish nor support others in
publishing benchmarks comparing Chez Scheme with different versions of
Scheme or with other languages.

Kent


Discuss.

### What's to discuss?

Lots of companies do this. What's bad is when they try to stop others from publishing such benchmarks.

### Unfortunately true...

Perhaps the classic case of this is the purveyors of the Gaussian quantum chemistry program, who banned John Pople, the Nobel laureate who invented the technology they use, from using their program because he benchmarked it!

Josh

### Good policy!

I have a policy of not helping my neighbors tie their shoes, although I think it is often a good idea for them to have their shoes tied if they are wearing laced shoes and are up and about. Reasonable exceptions to that policy occur but not often, so I may as well just point out the general policy if the question starts to come up with any frequency. I pretty much trust that they can take care of tying their shoes themselves.

Mostly it's a matter of time. Every minute I spend trying to tie my neighbor's shoes is time I could better spend otherwise.

-t

### Filling in the ...s

Perhaps I was too subtle.

(a) The Chez Scheme home page says: "Welcome to www.scheme.com, home of Chez Scheme, the world's fastest and most reliable implementation of Scheme."

(b) In biology, a policy of not assisting with replication or testing would never be tolerated. Anyone publishing a paper in any reputable biology journal is obligated to assist others in replicating and extending their work by, e.g., supplying copies of unique clones or other relevant unique materials. Often this can be quite burdensome, much more so than `tar cf - benchmark | mail`.

Some posts in this discussion have said basically "well that's okay, that was said while wearing a corporate hat and so there is real money at stake." Compared to the situation in biology, that is silly: the amount of money people in biology stand to make from preventing others from replicating and extending their work dwarfs anything to be made in PLT by orders of magnitude, but for the good of the field people feel obligated to do what is best for science.

### Without the source or even

Without the source or even descriptions of the implementation approach, there's no scientific benefit of benchmarking Chez Scheme. Even with that, it's not worthwhile.

### Barak was being too subtle

Whatever we think about the worth of benchmarking, I think Barak's point is simply that when Chez Scheme claims to be "the world's fastest and most reliable implementation of Scheme", we might hope that there's something behind that claim - but it isn't being demonstrated.

### Not publishing benchmarks

I think that there are two separate issues here. The first is whether it is appropriate to make wild claims without supporting evidence; I think most people would agree that it isn't. The second is whether a language implementor should publish, or help with publishing, benchmarks comparing different languages or implementations. Frankly, I'm beginning to think that a policy of not publishing or helping to publish benchmarks is more than acceptable.

### Microbenchmarks

What is the actual purpose of microbenchmarks, anyway?

Microarguments.

### A solid suite of

A solid suite of micro-benchmarks is worth its weight in gold, because they let you test the effects of changes in isolation, which you can't do with programs found "in the wild". For example, if an optimization doesn't help you on the micro-benchmarks designed specifically to exercise it, you can throw it out without any regret and simplify the compiler.

Think of the difference between microbenchmarks and real benchmarks as the difference between controlled experiments and observational data.
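To make the "controlled experiment" idea concrete, here is a hypothetical sketch (not from the thread, and using Python purely as neutral notation) of the kind of isolated measurement a microbenchmark performs: the same computation with and without one specific cost, so a change to that cost alone moves the ratio.

```python
import timeit

# A microbenchmark isolates one operation. Here the "feature under
# test" is repeated attribute lookup through an object, compared
# against a hand-hoisted local variable. The absolute times are
# machine-dependent; what matters is the ratio before and after an
# implementation change aimed at that one cost.

class Box:
    def __init__(self, value):
        self.value = value

def via_attribute(box, n):
    total = 0
    for _ in range(n):
        total += box.value   # the operation being measured
    return total

def via_local(box, n):
    total = 0
    v = box.value            # hoisted once: the "optimized" form
    for _ in range(n):
        total += v
    return total

box = Box(1)
t_attr = timeit.timeit(lambda: via_attribute(box, 10_000), number=100)
t_local = timeit.timeit(lambda: via_local(box, 10_000), number=100)
print(f"attribute: {t_attr:.4f}s  local: {t_local:.4f}s")
```

If an optimization intended to cheapen the measured operation doesn't move the isolated ratio, that is exactly the "throw it out without regret" signal described above.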

Next question: what's the purpose of comparing microbenchmarks between different languages?

Knowing how much things cost affects how you approach problems. This is clearly true for syntactic "cost" -- if certain operations have terse syntax, you tend to use them more. For example, in a language with compact syntax for associative arrays, you are more likely to use them. In one supporting "association lists", you will probably use those unless they are a bottleneck.

This is also true for performance cost -- all else being equal, you want to play to your language's strengths. Except unlike with syntax, performance is not reflected in obvious surface features, so microbenchmarks are important for understanding which operations to favor.
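The associative-array point above can be sketched directly; assuming Python as a stand-in notation, here is a hash-based table next to a linear "association list", showing the cost asymmetry that tends to steer a programmer's choice:

```python
# Two idiomatic representations of the same mapping. Which one a
# programmer reaches for is shaped both by syntax (the dict literal is
# terse) and by known cost (hash lookup is O(1) average, association-
# list scan is O(n)).

def alist_lookup(alist, key):
    # Linear scan, like Lisp's assoc.
    for k, v in alist:
        if k == key:
            return v
    return None

keys = list(range(1000))
table = {k: k * 2 for k in keys}        # associative array
alist = [(k, k * 2) for k in keys]      # association list

assert table[999] == alist_lookup(alist, 999) == 1998
```

A microbenchmark comparing the two lookups is precisely the kind of measurement that tells you when the association list stops being "fine" and becomes the bottleneck.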

### subjective

I think purpose is subjective, and I guess you understand that, so I guess you're asking - can someone give me an example purpose which I might consider valid?

I'd also be guessing about whether you really meant microbenchmark or just tiny program (which I've heard people call microbenchmarks, though I think they intended that to be dismissive; if I were going to be dismissive about the programs on the benchmarks game, I'd call them "toy programs" rather than microbenchmarks).

But I find blatantly invalid purposes more interesting - I've seen 3 or 4 cases where someone has stated language A is X times faster than language B with a reference to the benchmarks game - and when I've looked at the measurements shown on the website they simply don't correspond in any way with the claim that was made.

These are cases where the website has not been updated since the person made the claim, and I've been quite thorough in looking for something/anything that might support the claim. I've tried to understand how the data shown on the website might have been misinterpreted, but in these 3 or 4 cases have concluded that wasn't the case.

I think those people either knew that what was shown contradicted their claim, or they simply weren't interested whether it did or not. For them, the purpose was a discussion-stopping appeal to some independent authority, with the practical understanding that no one would check if what they claimed was in fact supported by that authority figure.

### Window into a possibility space

If you're working on a language implementation, then microbenchmarks are useful to see how your implementation stacks up against your competition in specific areas. How meaningful a specific microbenchmark is can vary significantly, though.

Sometimes, wildly varying microbenchmark results across languages or implementations may tell you something interesting about the feature being measured. For example, the performance of small programs which create and use first-class continuations heavily can vary enormously between languages and implementations. That's because there are many different ways to implement the feature being tested. In a case like that, a microbenchmark can be a good predictor of how the language is going to perform in general, on certain kinds of tasks.

If you choose your microbenchmarks carefully, they can be indispensable: a way of measuring and gathering information about the performance of a large class of applications that you couldn't benchmark individually.

### Macro-/microbenchmarks

A macrobenchmark (or "real benchmark") can tell you that something is wrong; a microbenchmark can help you figure out what.

### Microbenchmark bias

I think that the main weakness of microbenchmarks is that they don't measure abstraction penalty. A microbenchmark usually just doesn't contain any user-defined abstractions worth mentioning. This means that CPU time and memory usage microbenchmarks are strongly biased towards languages and implementations that allow efficient programs to be written through careful manual tuning, but do not necessarily have good abstraction facilities or implementations capable of eliminating abstraction penalty. Picking such a language (bad at defining abstractions or bad at eliminating abstraction penalty) for anything but the tiniest of projects is a bad idea.
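A minimal illustration of the abstraction-penalty point, in hypothetical Python rather than any language from the thread: the same computation written as the hand-tuned loop a toy benchmark rewards, and again through small user-defined abstractions. An implementation good at eliminating abstraction penalty (inlining, fusion) makes the two run at the same speed; a benchmark containing only the first version never tests that.

```python
# Hand-tuned form: the style toy benchmarks tend to contain.
def sum_of_squares_tuned(xs):
    total = 0
    for x in xs:
        total += x * x
    return total

# The same computation built from a tiny user-defined map/fold
# vocabulary -- the kind of abstraction real programs are full of.
def my_map(f, xs):
    for x in xs:
        yield f(x)

def my_fold(f, init, xs):
    acc = init
    for x in xs:
        acc = f(acc, x)
    return acc

def sum_of_squares_abstracted(xs):
    return my_fold(lambda a, b: a + b, 0, my_map(lambda x: x * x, xs))

data = list(range(100))
assert sum_of_squares_tuned(data) == sum_of_squares_abstracted(data)
```

Timing only `sum_of_squares_tuned` says nothing about what the abstracted version costs, which is the bias being described.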

Another strong bias of microbenchmarks is towards languages and implementations that come with lots of libraries. Sometimes the presence or absence of a reasonable library for a particular problem domain can determine whether it is feasible to implement a particular microbenchmark succinctly and efficiently or even (within a couple of hundred lines) at all. Languages or implementations with lots of libraries do not necessarily have particularly good abstraction facilities, but usually have only a single main/strong/non-toy implementation, a relatively large user community, or perhaps a relatively stably funded research group behind them. Picking such a language (single, stably funded implementation; large community) may not be a bad idea. OTOH, you are likely just following a bandwagon. The negative aspect of voting for the winner is that the full potential of many good languages and implementations (good abstraction facilities, good abstraction penalty eliminating implementations) will probably never be realized.

### I'll be absurd

"... main weakness of microbenchmarks is that they don't measure abstraction penalty"
Isn't that like thinking the main weakness of French language grammar tests is that they don't measure ability in Mathematics?

"... whether it is feasible to implement a particular microbenchmark ... (within a couple of hundred lines)"
a couple of hundred lines? a microbenchmark?

Thank you for confirming that I don't understand what's meant by microbenchmark :-)

### Insight, not numbers

Hmm... We may be thinking about slightly different notions of "microbenchmark". I was mostly thinking about the kind of toy programs that you see in the language shootouts, but that also includes more traditional compiler benchmarks like tak.

Isn't that like thinking the main weakness of French language grammar tests is that they don't measure ability in Mathematics?

Well, yes and no. As Hamming put it:

The purpose of computing is insight, not numbers.

Whatever benchmarking you do, you need to understand what you are actually measuring. My point is that, in general, microbenchmarks, as I understand them, are often strongly biased and simply do not measure many of the features that I find highly valuable in programming languages and their implementations.

In particular, an attempt to understand the performance profile of a language implementation based on results of microbenchmarks is severely hampered, because microbenchmarks generally do not measure abstraction penalty. For example, suppose you have two language implementations. The results of some set of microbenchmarks might say that the implementations perform equally well on average. But microbenchmarks are not typical programs. In reality, it may well be that one of the implementations performs significantly better on typical programs, because it is significantly better at eliminating abstraction penalty. This is neither supported nor contradicted by the results of the microbenchmarks.

a couple of hundred lines? a microbenchmark?

Well, in the language shootouts, some of the benchmarks have required features such as regular expression matching, sockets, etc... Most such features can be implemented in just about any language, either within the language or through FFI bindings, but it may take some amount of code.

### As a compiler writer, I use

As a compiler writer, I use microbenchmarks to measure the abstraction penalty, so that I can do something about it. For example, if you've got a functorized and de-functorized program, you can measure the possible overhead of functorization.

This is particularly valuable when you face very complex tradeoffs. For example, if you have (map f (map g list)) in ML and Haskell, lazy evaluation has profound performance effects. On the one hand, laziness faces a penalty due to the indirection imposed by thunking. On the other, the strict implementation will completely build an intermediate list before consuming it, whereas the lazy one will do it incrementally. That's going to be much more GC-friendly. So if you plot relative performance as a function of list argument size, you can estimate how these two effects interact.
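The strict-versus-lazy tradeoff above can be roughly mimicked outside ML and Haskell; as a hedged sketch, Python lists stand in for strict evaluation (the intermediate list is fully materialized) and generators for laziness (elements flow through one at a time). This imitates only the memory behaviour, not the semantics of either language.

```python
# (map f (map g xs)) two ways.

def strict_pipeline(xs, f, g):
    tmp = [g(x) for x in xs]       # intermediate list fully built
    return [f(x) for x in tmp]

def lazy_pipeline(xs, f, g):
    # Incremental: each element of g's output is consumed by f as it
    # is produced, so no intermediate list exists -- the GC-friendly
    # behaviour attributed to the lazy version above.
    return (f(x) for x in (g(x) for x in xs))

f = lambda x: x + 1
g = lambda x: x * 2

assert strict_pipeline(range(5), f, g) == list(lazy_pipeline(range(5), f, g))
```

Plotting the running time of each pipeline against the length of `xs` is the kind of experiment described above: the thunking-style overhead and the intermediate-allocation cost scale differently, and the crossover point is what the plot reveals.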

### We are talking about rather

We are talking about rather different issues here.

You are talking about using tailor made micro-benchmarks to gain insight on the tradeoffs of different implementation strategies or semantics of various abstractions. Using data from such benchmarks to decide which implementation strategy or semantics to use or provide can certainly be a good idea.

For example, if you've got a functorized and de-functorized program, you can measure the possible overhead of functorization.

I agree that such a test, when designed and implemented properly, can give insight into a particular abstraction penalty (the penalty of abstraction through functorization). That is also very different from the kind of micro-benchmarks that I had in mind. In particular, none of the toy benchmarks in the language shootouts properly measure anything like that.

This is particularly valuable when you face very complex tradeoffs. For example, if you have (map f (map g list)) in ML and Haskell, lazy evaluation has profound performance effects. On the one hand, laziness faces a penalty due to the indirection imposed by thunking. On the other, the strict implementation will completely build an intermediate list before consuming it, whereas the lazy one will do it incrementally. That's going to be much more GC-friendly. So if you plot relative performance as a function of list argument size, you can estimate how these two effects interact.

Certainly, if all you want is to gain insight on the performance tradeoffs between particular implementations of strict+impure and non-strict+pure semantics of sequences, then such micro-benchmarks may be useful. However, trying to reach more general conclusions from data obtained from such micro-benchmarks is going to be incredibly tricky and the route is full of pitfalls. Such micro-benchmarks are next to useless for predicting the performance of real programs written in real languages and implementations.

### Well, yes and no.

In your reply, I found the "well, yes" but I'm struggling to find the "well, no" ;-)

"do not measure many of the features that I find highly valuable"
I'm missing why this is more than - the French language grammar test doesn't tell us about ability in Mathematics? Are you saying that these features that you find highly valuable are unmeasurable?

I think by now it's clear that language implementors have reasons for comparing microbenchmarks between different languages, that Luke would probably think valid - the purpose is to make the implementation better.

We might anticipate a follow-on question - can someone give me an example purpose, not involving language implementors, which I might consider valid?

### The no part

I'm struggling to find the "well, no" ;-)

Sorry, I wasn't clear enough. The no part is that I see no controversy here. My point was specifically to say that interpreting data from the kind of micro-benchmarks (toy benchmarks) that I was talking about is full of pitfalls. The yes part is that, like using a French grammar test to predict real world ability in Mathematics, using data from micro-benchmarks, particularly from the toy benchmarks that you see in language shootouts, to predict performance or clarity of real programs would be meaningless and stupid.

Are you saying that these features that you find highly valuable are unmeasurable?

No. I'm just saying that you need to understand what a particular benchmark or test actually measures. I'm also saying that the toy benchmarks that you see in language shootouts basically measure nothing relevant for gaining insight on the performance and clarity of real programs.

Let me quote a couple of paragraphs (partly my emphasis) from the second edition of the book Computer Architecture: A Quantitative Approach:

There are four levels of programs used [for evaluating machines and other evaluators], listed below in decreasing order of accuracy of prediction.
1. Real programs -- While the buyer may not know what fraction of time is spent on these programs, she knows that some users will run them to solve real problems. Examples are compilers for C, text-processing software like TeX, and CAD tools like Spice. Real programs have input, output, and options that a user can select when running the program.
2. Kernels -- Several attempts have been made to extract small, key pieces from real programs and use them to evaluate performance. Livermore Loops and Linpack are the best known examples. Unlike real programs, no user would run kernel programs, for they exist solely to evaluate performance. Kernels are best used to isolate performance of individual features of a machine to explain the reasons for differences in performance of real programs.
3. Toy benchmarks -- Toy benchmarks are typically between 10 and 100 lines of code and produce a result the user already knows before running the toy program. Programs like Sieve of Eratosthenes, Puzzle, and Quicksort are popular because they are small, easy to type, and run on almost any computer. The best use of such programs is beginning programming assignments.
4. Synthetic benchmarks -- Similar in philosophy to kernels, synthetic benchmarks try to match the average frequency of operations and operands of a large set of programs. Whetstone and Dhrystone are the most popular synthetic benchmarks. [...] Synthetic benchmarks are not even pieces of real programs, while kernels might be.

I agree with Hennessy and Patterson.

Edit: To add insult to injury, many of the toy benchmark implementations that you see in language shootouts are specifically not written by beginners, but are rather extremely carefully tuned implementations written by experts. Language shootouts are so far removed from reality that they are best described as huge jokes (guess on whom).

### For what it's worth, I think

For what it's worth, I think that language shootouts have been good for the Haskell community - they've encouraged a greater understanding of performance behaviour, and the community taste for well-factored code is helping to filter it into libraries that improve everybody's performance.

That said, Haskell is something of a special case here compared to the other languages it's competing with.

### side-effects

brickbats accepted, bouquets welcomed :-)

Back to Luke's question of purpose:

• for microbenchmarks/kernels I think there's a clear direct purpose (at least for language implementors)
• for toy benchmarks, specifically for the benchmarks game, speaking for myself, there have been several less-clear less-direct purposes ...

One purpose was simply to show people programs in languages that they weren't likely to have heard of, let alone used. (We can speculate about how few people actually look at the programs, and how many fewer look at programs written in languages they haven't heard of - but I know some do, and I know some have gone on to learn that new programming language, and for me that's enough.)

Sometimes there seems to be the idea that if the benchmarks game didn't exist, people wouldn't write badly flawed toy benchmarks and wouldn't have strange ideas about the relative performance of programming language implementations. That isn't so: people would still write badly flawed toy benchmarks, they would still have strange ideas, and they still do.

One purpose was to shake-out those egregiously flawed toy benchmarks, and bring them together in one place with more attention and more criticism.

One purpose was to shake-out those truly mistaken ideas by providing some (any!) actual basis for comparison.

If, as a side-effect, we provided a vehicle for changing attitudes and expectations in the Haskell community that's very cool ;-)

### kernels = microbenchmarks ?

Pity you didn't paraphrase the description of "Kernels" - my guess is that it would correspond to what our language implementors mean by microbenchmarks.

And I think you'll agree that "toy benchmarks" is much more satisfyingly dismissive than microbenchmarks :-)

"... interpreting data from ... is full of pitfalls."
To be trite, life is full of pitfalls - that isn't an interesting criticism of anything.

"I'm just saying that you need to understand what a particular benchmark or test actually measures."
Well I don't think you can just be saying that, because it verges on tautology; my guess is that you are saying other people don't understand what a particular ... test actually measures. And you're explicitly saying "toy benchmarks ... measure nothing relevant for gaining insight on the performance ... of real programs."

I think we should expect that kind of sweeping generalization to hit against exceptional cases, and it does - someone really did make the effort of emailing me to say their work is writing programs just like a couple of those "toy benchmarks", for them they are "real programs"!

And with that we should understand that the refrain of "real programs" is itself a simplistic conceit - it doesn't matter how "real" the programs are, it very much matters how like your particular programs they are.

"... specifically not written by beginners, but are rather extremely carefully tuned implementations written by experts."
In your reality are programmers careless beginners?

### Trite

To be trite, life is full of pitfalls - that isn't an interesting criticism of anything.

Oh, come on! Right in my first post on this thread I critique toy benchmarks for a couple of their main deficiencies. My later posts use much more general statements mainly because you chose not to understand the purpose of the more specific criticism.

Are you saying that documenting fallacies and pitfalls is uninteresting? I think that you should get into contact with Hennessy and Patterson to explain to them that their decision to include a section on Fallacies and Pitfalls in every chapter makes their book uninteresting, because "life is full of pitfalls" and they are just reciting tautologies.

I'm just saying that you need to understand what a particular benchmark or test actually measures.

Well I don't think you can just be saying that because it verges on tautology

So, you must then find people like Hennessy and Patterson to be the most boring and intellectually dull people ever, because they've written quite a few pages explaining that basic issue from a great many perspectives in their very highly regarded set of books.

my guess it that you are saying other people don't understand what a particular ... test actually measures.

No, I didn't say that. In fact, I've often listened to people explaining how some other people are incapable of understanding certain things. Frankly, I don't like such an elitist attitude. People may be ignorant, but I believe everyone can learn, assuming they put enough effort into it. One thing that doesn't help to cure ignorance is labeling hard-earned lessons from experts like Hennessy and Patterson as trite.

And you're explicitly saying "toy benchmarks ... measure nothing relevant for gaining insight on the performance ... of real programs." I think we should expect that kind of sweeping generalization to hit against exceptional cases

Indeed, that is precisely why I said "toy benchmarks basically measure ..."!

someone really did make the effort of emailing me to say their work is writing programs just like a couple of those "toy benchmarks", for them they are "real programs"!

I would assume that those programs are the ones that are used for some ad hoc analysis or manipulation of raw data. Sure, practically every programmer sometimes writes short, throwaway programs for some very specific needs. I do that quite often. Doing that for a living is the exception.

And with that we should understand that the refrain of "real programs" is itself a simplistic conceit - it doesn't matter how "real" the programs are, it very much matters how like your particular programs they are.

Notice that I included the informal definition of "real programs" by Hennessy and Patterson. Certainly, real programs may be short. I think that a crucial difference between real programs, that just happen to be exceptionally short, and toy benchmarks is that real programs are run for the (valuable) outputs they produce, while toy benchmarks are used for evaluating performance and the output they produce has no value.

... specifically not written by beginners, but are rather extremely carefully tuned implementations written by experts.
In your reality are programmers careless beginners?

Now you are missing the point. The main point is that I find, like Hennessy and Patterson, little sense in using toy benchmarks for evaluating programming languages and their implementations. The other point, also important, is that I equate the sort of tuning that I'm talking about here with the kind of tricks that Hennessy and Patterson describe being used by processor and compiler vendors to circumvent the usefulness of benchmarks for comparing performance.

### because you chose not to understand

... because you chose not to understand ...
Is it possible that you find what you write understandable and I find what you write less understandable?

Are you saying that documenting fallacies and pitfalls is uninteresting?
I'm saying I don't see where you documented fallacies and pitfalls, I do see where you flatly stated that interpreting data from toy benchmarks is full of pitfalls.

a crucial difference between real programs, that just happen to be exceptionally short, and toy benchmarks is that real programs are run for the (valuable) outputs they produce, while toy benchmarks are used for evaluating performance and the output they produce has no value.
So the same program, producing the same output, can be either a "real program" or a "toy benchmark" - depending on whether we make use of the output or the time measurement?

I equate the sort of tuning that I'm talking about here with the kind of tricks that ...
Have you given a specific example of the "sort of tuning" that you are talking about?

### Have you given a specific example

Is it possible that you find what you write understandable and I find what you write less understandable?

Or maybe you are just exceptionally poor at asking questions that would help you understand. Most of your questions on this thread have been based on silly analogies (e.g. your first question) or have been "begging the answer" indirectly (e.g. your last question in this post). At best, analogies can give a hint of intuition, but intuition is often wrong, and an indirect question may not give the answer you want, for many reasons. Please learn to ask more specific technical questions.

I'm saying I don't see where you documented fallacies and pitfalls, I do see where you flatly stated that interpreting data from toy benchmarks is full of pitfalls.

My first post asserts that micro benchmarks, better referred to as toy benchmarks, do not measure abstraction penalty. It then explains that abstraction penalty is not measured because the benchmark implementations do not contain significant user-defined abstractions. It then asserts that not measuring abstraction penalty favors languages and implementations that allow efficient programs to be written by careful tuning. Finally, it asserts that using such a language for anything but the tiniest of projects would be a bad idea.

The implied fallacy (or misbelief) is obviously to believe that measurements from toy benchmarks would correlate with measurements from large real programs. The implied pitfall (or easily made mistake) is obviously to choose a language for writing large real programs based on measurements from toy benchmarks.

So the same program, producing the same output, can be either a "real program" or a "toy benchmark" - depending on whether we make use of the output or the time measurement?

In a couple of special cases, yes. In most cases, no.

Have you given a specific example of the "sort of tuning" that you are talking about?

No, I have not.

### Can we please try to make

Can we please try to make the discussion a little more civil guys? Thanks.

### assert without evidence

"The implied fallacy (or misbelief) is obviously to believe that measurements from toy benchmarks would correlate with measurements from large real programs."
Do you have corresponding measurements from large real programs?
Without those measurements how do you know there is not a correlation?

"The implied pitfall (or easily made mistake) is obviously to choose a language for writing large real programs based on measurements from toy benchmarks."
Do you have any evidence that a language has been chosen for writing large real programs based on measurements from toy benchmarks?
Maybe it is a mistake that is not easily made.

### The opposite of the scientific method

Do you have corresponding measurements from large real programs?

Without those measurements how do you know there is not a correlation?

Excellent question! I think we are finally getting to the root of the problem with your understanding.

A correlation doesn't exist just because there is no proof that it doesn't exist. The scientific method would be to first make a hypothesis that such a correlation exists and then use experimental studies to test the accuracy of the hypothesis. What I have been trying to say here for a long time now in many different ways is that such correlation does not necessarily exist for the reasons I have given (e.g. lack of correspondence to real programs, huge variations between implementations of the same benchmark, bias introduced by the rules, ...). Because there is no evidence that such a correlation exists, you should not use the data assuming that the correlation exists.

Indeed, are you claiming that results from toy benchmarks would correlate with results from large real programs?

Do you have any evidence that a language has been chosen for writing large real programs based on measurements from toy benchmarks?

Maybe it is a mistake that is not easily made.

Let's hope so. I'm trying to make sure it doesn't happen. However, the bigger problem is advocacy based on toy benchmark results from language shootouts.

### no measurements, no evidence, no problem

"Excellent question!"
- Do you have corresponding measurements from large real programs?
- Do you have any evidence that a language has been chosen for writing large real programs based on measurements from toy benchmarks?

I think this kind of remark is the reason that Ehud asked that we try to make the discussion more civil.

"you should not use the data assuming that the correlation exists"
If you ever find someone who is confused about that please tell them, until then please consider whether you're telling people something they already know.

### The thing is...

- Do you have corresponding measurements from large real programs?

- Do you have any evidence that a language has been chosen for writing large real programs based on measurements from toy benchmarks?

I don't need to prove anything about any correlation between toy benchmark data and the things that I'm advocating, because I'm NOT using data from toy benchmarks to advocate anything. People who advocate languages based on results from the toy benchmarks of language shootouts are the ones who have the burden of proof. For example, if someone goes around suggesting to people that Haskell/GHC is ready for real applications, because it performs well on toy benchmarks from language shootouts, then that someone has the burden of showing the correlation between results from toy benchmarks and the performance of real programs. The same goes for any language/implementation, like SML/MLton or C++/g++ and so on.

### burden of proof

Vesa Karvonen: I don't need to prove anything about any correlation between toy benchmark data and the things that I'm advocating, because I'm NOT using data from toy benchmarks to advocate anything.

You're correct, with a clear description nicely done. Burden of proof is on the party claiming data means something. But sadly it's often the case that folks absurdly claim the burden of proof runs the other way: that meaninglessness must be proved, or substantiated by evidence attempting to demonstrate a negative (not a good idea). You might call this tactic "proof by denial of service", because you can claim anything you want and require others to work without bound to dismiss something that was never shown in the first place.

Personally, I find it pretty irritating when folks say "show evidence my claim means nothing", which is why I'm chiming in to help. It's often done by someone who knows exactly what they're doing, with experience in working others into a lather, by choosing a random series of tactics aiming to elicit an emotional reaction. When you begin to refer to them in your responses, this tells them the emotional part is working.

It's both more effective and more civil to avoid verbal reference to antagonists. Whether they actually take offense — or it's just a pose to keep emotions up — one is better off structuring sentences to avoid "you" (except when it means "one"). In contrast, one of the most patronizing things you can do is start a sentence with "Name, " as if you're talking to a wayward child. For what it's worth, you've only been slightly careless in this department, and it shouldn't have had this much effect. But LtU threads tend to go off by themselves, so some extra care is useful. [cf]

### just a question

You might call this tactic ...
Sometimes a question is just a question: a simple "no" would tell us there was no new data to consider; a "yes" would have been interesting.

... when folks say "show evidence my claim means nothing", which is why I'm chiming in to help.

Do you have corresponding measurements from large real programs? Without those measurements how do you know there is not a correlation?

The first sentence, qualified by the italicized fragment, loosely (but not precisely) means what I paraphrased, to my ears. (One can always insist precision means only exact quotes, after which no interpretation of text is possible.) I realize I'm not actually helping, but I wanted to cooperate once.

To help folks who don't want to unsnarl the tangled graph of post references, I'm saying I agree with Vesa: unless you compare only benchmark results on different versions of the same system, benchmark data implies very little about cross-system qualities, without someone having shown benchmark data correlates with something else across systems. Vesa is saying no correlation has been shown, while you (seem to) say he has the burden of showing there is no such correlation. My apologies if I'm mistaken.

### accepted

"My apologies if I'm mistaken."
You are mistaken; apology accepted.

The questions were in response to
"The implied fallacy (or misbelief) is obviously to believe that measurements from toy benchmarks would correlate with measurements from large real programs."

Once I realized that my feeling that there would be /no correlation/ was no more than an assumption, and that correlation is such a low bar (sun-spots and penguins), I questioned the assumption, and we saw a revised statement:

"... such correlation does not necessarily exist ..." (my emphasis)

### re: burden of proof

Burden of proof is on the party claiming data means something.

Thank you!

### The thing is

The thing is, you are the person who suggested that choosing "a language for writing large real programs based on measurements from toy benchmarks" is an easily made mistake, so you do have the burden of proof for that.

### skepticism has no burden of proof

When he says that, it's just a rewording of "no correlation has been shown between benchmarks and performance of large real systems" for which he has no burden of proof. He's saying only that it's a mistake to assume benchmarks mean something significant.

We need not always prove or back up our remarks when those remarks are substantively skepticism based on the scientific method. There is no burden of proof for expressing skepticism; if there were, skepticism would not be permitted.

We disagree about what was meant.

There was a 2 part statement - "The implied fallacy (or misbelief)" and "The implied pitfall (or easily made mistake)".

I don't think it's correct to say the second part is simply a rewording of the first part. The first part points to the need for evidence of correlation. The second part goes further and claims choosing between languages for large real programs on that basis is an easily made mistake.

I have heard a story about a software company moving all their development work to C# and .NET after the CEO played golf with Bill Gates; I don't recall any stories about companies choosing a programming language because of toy benchmarks. Maybe you know some.

### we're in a rat hole

We're in tar baby territory. (cf) "Easily made mistake" is commonly understood as rhetorically hypothetical and doesn't need a factual exemplar. If someone claims to know for a fact that a mistake has been made numerous times, with dire consequences that should be a lesson to us all, then you ask for cites. I didn't feel that here. Signing off.

### hypothetical

If you're saying the problem we are being so vehemently warned against is purely hypothetical then we are in violent agreement.

### Which people?

I've looked back through this topic and haven't found where anyone advocated a language based on results from toy benchmarks - are these people who you know posted on a blog or mailing list or ... are they hypothetical people?

The thing is, you are the person who suggested that choosing "a language for writing large real programs based on measurements from toy benchmarks" is an easily made mistake, so you do have the burden of proof for that.

### What do language shootouts really measure?

Let me elaborate a little more on why I think that the toy benchmarks from language shootouts basically measure nothing and are biased.

The main problem is that the language shootouts I've seen have never been anything like controlled experiments. The requirements for the implementations are often loose, vague, arbitrary, or artificial, and the implementations are written by various authors with differing objectives. One author might sacrifice everything for performance while another might sacrifice some (or most) performance for clarity. Some optimize their programs for one measure of brevity or clarity while others use a different measure or don't care about such issues at all. The variability between toy benchmark implementations is huge. The bottom line is that (I claim) nobody knows what a typical toy benchmark from a language shootout actually measures. It would take a lot of careful analysis just to understand the tradeoffs chosen by the various authors in their implementations of the toy benchmarks.

To put it another way, language shootouts are typically bad, immature examples of the empirical method. In accordance with that method, they produce lots of data whose true meaning and significance is not understood. Yet, and this is the bad part, the data is used for making comparisons and endorsing particular products or points of view, even though understanding the data would take a lot of careful analysis, and that analysis is not done.

On the other hand, the rules of some of the language shootouts are strongly biased. Typically the rules strongly favor kitchen-sink implementations with lots of standard libraries and lots of built-in features. They also favor groups and individuals who are in a position to define what goes into the language or "standard" libraries and to make releases of the implementations they wish to endorse. This bias comes mainly from limitations on, and the measurement of, program size (amount of source code). When you happen to have (or can make it happen) a library or built-in language feature that greatly helps in writing a specific toy benchmark, you have a huge, and potentially unfair, advantage.

If the intention is to gain insight into the potential performance of programs, then source code size should not matter, and it should be only fair to allow one to include a potentially large library with the program to get good performance. On the other hand, if the intention is to gain insight into the potential for writing neat, well-factored, and abstracted code, then it should be only fair to allow one to include domain-specific libraries with the toy benchmark implementation. Typically this isn't allowed, or is punished by the measurements: implementations that are too long are often rejected by the rules, and when an implementation includes code that could well be in a library, it counts against the entry. OTOH, when an existing library is being used, the library implementation isn't measured at all. The library could include a significant amount of code and, in the worst case, even be written and released more or less specifically with the toy benchmark in mind.
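The size-measurement loophole just described can be made concrete with a tiny sketch. The file names and line counts below are invented; the point is only that a rule which sums an entry's own lines, but not the libraries it calls, rewards moving code out of the entry.

```python
# Hypothetical sketch of the size-measurement bias: the rules sum only the
# lines in the entry itself, so the same 120 lines of helper code either
# count (when inlined) or vanish (when shipped beforehand as a library).
# All names and numbers are invented for illustration.

def measured_size(entry_lines):
    # Only the entry's own source files are summed; library code is invisible.
    return sum(entry_lines.values())

inlined_entry = {"main.scm": 5, "helpers.scm": 120}  # helpers written in the entry
library_entry = {"main.scm": 5}                      # same helpers, as a library
unmeasured_library = {"helpers-lib.scm": 120}        # never enters the score

print(measured_size(inlined_entry))   # the inlined version looks 25x bigger
print(measured_size(library_entry))
```

Under such a rule, two functionally identical entries score 125 and 5, which is exactly the incentive to get helper code blessed as a "standard" library.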

In other words, the rules of language shootouts, in general, are typically vague, arbitrary, artificial, and biased. They don't really serve any well defined purpose or support the comparisons that are being made. This makes the meaning of the data obtained from the toy benchmarks vague, arbitrary, artificial, and biased.

What do language shootouts really measure? Nothing. Nobody knows.

### language shootouts, in general

When you talk about "language shootouts, in general", there isn't something concrete we can look at and say "Vesa's right about that" or "I don't understand why Vesa said that".

### Something to look at

BTW, I DO have specific examples where shootout entries use techniques similar to those that Hennessy and Patterson describe CPU/compiler vendors using to circumvent the usefulness of benchmarks.

As a very clear example, Hennessy and Patterson point out the way CPU/compiler vendors use benchmark-specific compiler settings for tuning. Here is what they say about SPEC benchmark suites (page 48, 2nd ed. of Computer Architecture: A Quantitative Approach):

[...] vendors found methods to tune the performance of individual benchmarks by the use of different compilers or preprocessors, as well as benchmark-specific flags. [...] In fact, benchmark-specific flags are allowed, even if they are illegal and lead to incorrect compilation in general! This has resulted in long lists of options as Figure 1.18 shows. This incredible list of impenetrable options used in the tuned measurements [...] makes it clear why the baseline measurements were needed. The performance difference between the baseline and tuned numbers can be substantial.

So, if you want something to look at, go look at the compiler options used on different benchmarks on a single language/implementation.
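The suggested inspection can be sketched as a small script: collect each benchmark's flag set, diff it against one fixed baseline, and see how much benchmark-specific tuning falls out. The flag sets below are invented examples in the spirit of the SPEC story quoted above, not taken from any actual shootout entry.

```python
# Hypothetical sketch: spot benchmark-specific tuning by diffing each
# benchmark's compiler flags against one fixed baseline set. The flags
# below are invented for illustration; they are not from any real entry.
baseline = {"-O2"}

tuned = {
    "nbody":    {"-O3", "-funroll-loops", "-ffast-math", "-march=native"},
    "fannkuch": {"-O3", "-fomit-frame-pointer"},
}

for bench, flags in sorted(tuned.items()):
    extras = sorted(flags - baseline)
    print(f"{bench}: {len(extras)} benchmark-specific flag(s): {' '.join(extras)}")
```

A long diff here is, as Hennessy and Patterson note, exactly why SPEC's baseline measurements were needed alongside the tuned ones.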

However, I just don't think that any specific example is all that relevant. The problem is within the rules of the shootout (and those who use the results for advocacy) and not with the people who (ab)use those rules to their advantage.

Well, I think that this discussion is over from my part. I've pretty much said what I wanted to say.

### Experts vs beginners

I have no problem with expert solutions to toy benchmarks; this is not a social experiment. Even if you wanted to treat it like one, the expert's results give you an upper bound.

If you can interpret it correctly, any measurement is better than none. I especially like it when the outcome proves me wrong. Even toy benchmarks help a lot with dismantling misconceptions.