Why determinism matters in language design.

In the recent discussion of functions vs. procedures, a reference was made to the Ada language design choice neither to ban side effects in functions nor to specify the order of evaluation of expressions.

So let me explain why this upsets me.

Thus, thanks to that indecision, coupled with the ordinary fallibility of the average programmer: if you run a large program on a different compiler, or a different version of the same compiler, or with different optimization settings.... expect different results.
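To make that concrete, here is a minimal C++ sketch, with f and g invented purely for illustration:

    #include <iostream>

    static int counter = 0;
    static int f() { return counter += 1; }   // side effect on shared state
    static int g() { return counter *= 2; }   // side effect on shared state

    int main() {
        // C++ does not specify which operand is evaluated first, so r is 3
        // if f() runs first and 1 if g() runs first -- and the choice can
        // silently change with the compiler, its version, or the
        // optimization level.
        int r = f() + g();
        std::cout << r << '\n';
    }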

If you look at large modern commercial software systems... a large chunk of the cost (and, even worse, of the schedule time) is in testing and retesting.

Mostly our test department goes hunting in areas where we have introduced change.

However, bad language design choices like the one made by the Ada design team add to the need for us to say to the test department... "Ah, we've had to change X (where X is one of the items in the list above), so anything is possible. Sorry, you have to retest everything."

This can easily result in the vast majority of the cost of a release being in the test phase, which can be devastating to the release schedule, as testing is _always_ on the critical path. (Final test comes after everything else is done and dusted.)

Determinism SHOULD feature highly as a design criterion for language designers.

Yes, programmers shouldn't write bugs.... but get real. Very large industrial systems ship with hundreds, even tens of thousands, of bugs. Thus it is important that, even if the program is just plain wrong, different compilers, compiler versions, optimizer settings, and memory layouts produce, wherever possible, the _same_ tested and accepted wrong behaviour.

Obviously it will not always be possible.

Obviously, wherever possible, defects should be prevented from being written in the first place, and those that slip through should be detected and fixed.

But simple language design choices that "prematurely optimize" by destroying determinism cost millions.

Determinism matters.


Advantages of nondeterminism.

There are advantages to nondeterminism: the compiler can potentially reorder the code to be more efficient.

For example, in Scheme, the order in which the right-hand sides of a (let ((v0 rhs0) (v1 rhs1) ...) body) are evaluated is undefined, and it really doesn't create too much difficulty; you can always let* it. On the other hand, the order in which arguments to a procedure call are evaluated *is* precisely defined, and as you point out, people unfamiliar with the standard semantics of a multi-variable (let ...) who happen to use an implementation that evaluates it top-down might get surprised when they switch implementations.

Most Schemers, though, are aware of that, thanks to (let* ...), and tend to write portable code.

But I don't really disagree. I remember one time how supremely annoyed I was to find out that C++ doesn't define the order in which arguments to a procedure call are evaluated either... I had written a really concise, beautiful, recursive descent parser only to see it get quite a bit longer when I found that out. (It wasn't too bad after I fixed that, though.)
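For the curious, here is a toy C++ sketch of the kind of trap I mean (invented names, not my original parser):

    #include <iostream>
    #include <string>

    // A toy recursive-descent reader over a shared cursor.
    struct Parser {
        std::string input;
        std::size_t pos;

        int parseDigit() {                 // consumes one character
            return input[pos++] - '0';
        }

        int parseDifference() {
            // Tempting but broken: C++ does not specify which call happens
            // first, so "73" may parse as 7 - 3 or as 3 - 7 depending on
            // the compiler.
            //     return parseDigit() - parseDigit();

            // Naming the temporaries forces the order and restores determinism:
            int lhs = parseDigit();
            int rhs = parseDigit();
            return lhs - rhs;
        }
    };

    int main() {
        Parser p{"73", 0};
        std::cout << p.parseDifference() << '\n';   // always 4
    }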

Glad that's resolved!

On the other hand, the order in which arguments to a procedure call are evaluated *is* precisely defined,

It is?

Oops

Sigh... I could have sworn. :-)

Well, honestly, I really don't use effects that much, and almost never use effects in that fashion, so I guess it's not that surprising (to me) that that particular misconception of mine can linger...

Definitely not.

Quoth both R5RS and R6RS: "The operator and operand expressions are evaluated (in an unspecified order) and the resulting procedure is passed the resulting arguments."

However, the ANSI CL standard says: "A function form is evaluated as follows: The subforms in the cdr of the original form are evaluated in left-to-right order in the current lexical and dynamic environments."

Non-deterministic, both ways, in the snow

Some context is in order.

Of course, all things being equal, guaranteed deterministic behavior is preferable to potentially non-deterministic behavior, in almost all reasonable cases.

However, the context with respect to languages like C++, Ada, Scheme, and quite a few other languages that have grappled with this issue, is that they were being designed and specified at a time when the global availability of CPU bandwidth was a ridiculous number of orders of magnitude lower than it is now. I.e., not only did every CPU run thousands of times slower, there were orders of magnitude fewer CPUs on (and off) the planet.

In that context, compiler optimizations were a much bigger deal in practice than they are now, for most applications.

In addition, one of the avenues that was expected to improve performance, even as early as the 1970s (when Scheme was invented), was parallel computing. These days, anyone looking to optimize parallel performance might look to an effects system so that they can statically ensure that parallel optimizations maintain determinism. However, back in the late '70s and the '80s, that kind of technology was science fiction, for all intents and purposes.

That left language designers with a choice: they could try to exploit the optimizations to be had by allowing evaluation order to be shuffled, or they could eschew that and go the more safety-oriented route that Java finally took, decades later. There's a reason that a language like Java didn't start to become mainstream until the late '90s. That kind of attention to safety just wasn't seen as a realistic approach ten years earlier.

Another factor that's changed in the intervening time is the size, composition, and attitude of the pool of programmers. Back then, many programmers were used to being responsible for the lowest-level details. The story of Mel is not as unusual as it might sound today. Many programmers from that time could tell similar stories, perhaps slightly less extreme, but along the same lines. Telling those programmers that some new high level language was choosing guaranteed determinism over performance to save them from themselves would have been an insult, at best.

If I were concerned about the millions (billions) being lost or saved, I might prefer to trade determinism for a better pool of programmers, but that option doesn't seem to be on the table.

Tl;dr: it's easy to say today that determinism is good, mmmkay, but it wasn't always as easy as that. The languages that still have this issue have to deal with their legacy, which often means that simply declaring an evaluation order for them today is not without its own set of problems.

Not just historical context

It's not just of historical interest, either. If a company is building a highly constrained embedded device, then the sum total of little optimizations like out-of-order evaluation might very well make the difference between one class of CPU and its next larger cousin, and that will have real costs. In fact, if the target environment has tight constraints on size, power consumption, or heat production, then the next larger cousin may not even be an option, so little optimizations might make the difference as to whether the product sees the light of day at all.

Even on the disgustingly powerful machines sitting on our desks the differences can be substantial. The wrong order of execution can cause pipeline stalls, missed branch predictions, and other CPU killers. The difference in performance at the various levels of the memory hierarchy can be several orders of magnitude and out-of-order evaluation can do a lot to relieve register or cache pressure, especially in a tight loop. That's not to say that every bit of software written should care about these things, but there are certainly large classes of software that do.

Effects systems have been brought up several times, but it's still very much open research to create effects systems that are both usable and granular.

I share John Carter's frustration at the kinds of bugs that unspecified ordering can cause. But that's the least of things that remain unspecified. In his rant he says

if you run a large program on a different compiler, or a different version of the same compiler, or with different optimization settings.... expect different results.

And then

However, bad language design choices like the one made by the Ada design team add to the need for us to say to the test department... "Ah, we've had to change X (where X is one of the items in the list above), so anything is possible. Sorry, you have to retest everything."

But that's ridiculous. Even if this one thing were guaranteed by Ada, there are still tons of things that aren't going to be specified (like details of memory allocation/freeing behavior) that can make the difference between "working fine," "too slow," and "totally borked." Even the language aspects that are specified may have bugs in a new compiler version. If changing the compiler doesn't automatically prompt a full test of your product then you are playing Russian roulette, regardless of a guarantee on evaluation order.
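As one contrived C++ illustration (invented code, not from any real system): a program whose observable output depends on nothing more than where the allocator happened to put things, which no evaluation-order guarantee will save you from.

    #include <iostream>
    #include <map>
    #include <string>

    int main() {
        // Keying a container on object addresses: it "works," but the
        // iteration order depends on where the allocator placed the strings,
        // which can change with the compiler, the runtime, or the build flags.
        auto* a = new std::string("alpha");
        auto* b = new std::string("beta");
        std::map<const std::string*, int> byAddress{{a, 1}, {b, 2}};
        for (const auto& entry : byAddress)
            std::cout << *entry.first << " -> " << entry.second << '\n';
        delete a;
        delete b;
    }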

In fact, the way effects from one area of your program can affect other seemingly "distant" parts means that testing only changed code is a recipe for disaster. The road to hell is paved by development teams that said "but we didn't change anything in that part of the system." There's a reason that QA professionals use phrases like "full regression test", "code coverage," and "automated acceptance testing."

Got a universe or two?

There's a reason that QA professionals use phrases like "full regression test", "code coverage," and "automated acceptance testing."

I have seen several commercial systems that would require way greater than 2**1000 test runs to get anywhere near "full coverage".

"Full test coverage" is the idle dream of those who haven't looked deep into the very murky bowels of large systems.

The QA professionals I speak to talk of Guerrilla testing, and testing most likely customer configurations.

Careful with terminology

The phrases used by the previous poster DO have sensible meaning in the QA world, and aren't just buzzwords:

* "full regression test" means that all logged defects have a regression test generated (and regularly run); so any reappearance of the defect can be automatically caught.

* "code coverage" means ensuring as much of the source code as possible is exercised by the test suite.

* "automated acceptance testing" is probably the most weasily of the term, as the acceptance tests (tests specified by stakeholders outside the development team) might be laughably weak (and in any case are constrained by human effort); but these can be done.

All of which differ from the unreachable goal of demonstrating a complete absence of defects.

(I might add that guerrilla testing, as I understand it--there is a dearth of formal literature on the subject--is probably more important for design validation than correctness verification--and if used in lieu of more formal testing, a sign that an organization might not take QA seriously.)

"Rapid Testing" and over and beyond const/volatile.

A less edgy name for Guerrilla Testing is "Rapid Testing" as taught by James Bach.

Rapid Testing does have a proven track record in our organization of finding more of those "bugs that matter" than the more formal methods you list above (which we also do).

I attended one of his courses and he certainly stimulated a bunch of new and useful thoughts.

Still, I'm wandering too far off topic.

Over and beyond const/volatile.

I hope I haven't "oversold" what I'm saying...

Given a choice of hard, compiler-enforced attributes on a function, or determinism... I would choose enforced attributes. (See my comment on useful gcc function attributes for some ideas.)

Personally, I wish the next round of standardization in C/C++ looked at where we can go next, "over and beyond const/volatile"... I'm sure there is a lot of mileage to be gained from it with current compiler technology.
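To give a flavour of what I mean, here is a small sketch using GCC's existing function attributes. These are promises the programmer makes (GCC trusts them rather than verifying them); a standardized, compiler-enforced version of the same idea is what I'm wishing for.

    #include <iostream>

    // "const": the result depends only on the arguments and reads no global state.
    __attribute__((const)) static int square(int x) { return x * x; }

    // "pure": the result may read global state but has no side effects.
    static const double table[16] = {1.0};
    __attribute__((pure)) static double lookup(unsigned i) { return table[i & 15u]; }

    int main() {
        // Because of the promises above, the compiler is free to fold,
        // reorder, or merge repeated calls to square() and lookup()
        // without changing observable behaviour.
        std::cout << square(3) << ' ' << lookup(2) << '\n';
    }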

Given a choice of unordered evaluation (compiler implementor's choice) or semi-ordered evaluation (any order, so long as it has the same result as the specified order)... I'd choose semi-ordered.
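Roughly, in C++ terms (a toy sketch with invented names, showing what such a rule would and wouldn't constrain):

    #include <iostream>

    static int doubled(int x) { return x * 2; }                    // no observable effect
    static int traced(int x) { std::cout << x << '\n'; return x; } // observable effect

    static int add(int a, int b) { return a + b; }

    int main() {
        // Semi-ordered evaluation would leave the compiler entirely free
        // here, because no ordering is observable: reorder away.
        int p = add(doubled(1), doubled(2));

        // Here the ordering *is* observable, so a semi-ordered rule would
        // require the observable result to match the specified (say,
        // left-to-right) order: print 1 then 2, on every compiler, at
        // every optimization level.
        int q = add(traced(1), traced(2));

        std::cout << p + q << '\n';
    }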

I bet you'd be hard pressed to find realistic cases where you pay a significant performance penalty for the very practical gain in determinism.

It's never too late to add semi-ordered evaluation to the C/C++/... standards.

I think your argument proves too much

It doesn't explain why Common Lisp, standardized around the same time as Scheme (you really can't speak of a Scheme standard until RRRS), went with a defined order and Scheme with an undefined order.

Gently down the stream

I was describing a factor that has influenced language design decisions, not a universal law.

The difference between CL and Scheme can be explained quite glibly, and I'm up to the task: a major focus of the early Scheme papers was compiling lambdas efficiently, particularly "Debunking the “expensive procedure call” myth". Allowing a compiler to shuffle argument evaluation fits right in with that goal.

One of the plethora of sub/alternate titles of that paper was "Procedure call implementations considered harmful," which most likely referred to extant Lisp implementations, with their stodgy in-order argument evaluation and semantically-challenged dynamic scope.

Which raises another way to look at this: if Lisp hadn't guaranteed argument evaluation order, the ramifications for the funarg problem would have driven half of Cambridge to drown themselves in the Charles.

Jedi programming

"Use the Force, Luke. Let go, Luke. Luke, trust me."

The world is a chaotic place; determinism is only an illusion and breaks down quickly. We write code that does exactly what we want it to do... except when a condition we hadn't considered arises, or there was a small mistake somewhere, or something crashes or breaks. A language that focuses on statistical efficiency has a much better chance of dealing with a chaotic world than one that is based on formal verification. E.g., a program that is running at 95% efficiency (consistently making the correct answer 95% of the time) is often better than a program that is either running at 100% or 0% efficiency, depending on the circumstances.

Of course we aren't there yet, but I'm seeing it in some of my projects where I have to give up on controlling everything and instead make my programs more fuzzy and robust to handle multiple unanticipated situations (say an editor interfacing with a finicky compiler and finicky IDE framework).

yeah, statistically useful is what matters

Sean McDirmid: E.g., a program that is running at 95% efficiency (consistently making the correct answer 95% of the time) is often better than a program that is either running at 100% or 0% efficiency, depending on the circumstances.

You just described my day job in WAN optimization. Most of my work involves increasing that percentage of efficiency, and making more components tolerate less than 100% when possible, which is both subtle and tricky. To evaluate new work, I must often do a lot of empirical measurement and analysis (before and after) to characterize the exact nature of the changes.

However, I like as much determinism as I can get. Lack of it makes many things very hard to study. In particular, when a system has many parts tolerating partial failure (and I add more any chance I get), it masks weak problems that are subsequently covered by adaptive algorithms filling in the blanks. It's easy to ask "is X covering for Y?" and then have a very hard time answering that question, even when it is precisely defined. I often have to tell folks I can't tell. In particular, when failure to do Y (let's say: recognize a pattern) is permitted by the rules, how can you say any given failure to do Y is wrong? But when it happens too often, it's wrong. Hmm.

Anyway, I agree with both of you: coping with fuzzy results is absolutely necessary and very much a pain. Life would be easier with just a little more simulation support, for repeatable empirical analysis of fuzzy systems.

Your problem smells like bad engineering

I do agree that languages should try to have deterministic behavior, but what you describe is simply bad engineering.

Any system of measurable size must have an automated test suite. This is the only way of getting working software today.

Get this suite and put it into continuous integration, and you have a very high bar on quality.

By the way, if your QA team is testing the same thing every time, they are doing it wrong. Automation is the key to reducing waste and shortening the testing cycle.

That said, I believe that a language design should focus much more on making programs easily testable than on detailed evaluation semantics. The former has significant cost and quality advantages; the latter can easily be overcome with coding conventions and style checkers.

Sigh! Large Engineering, not Bad.

Academics need to scale up their notions of software products to multi-millions of lines of code.

You can't scale to very large systems if you are doing Bad Engineering.

The Coverity static analysis tool's website once had one of those "too much truth in advertising" moments.

At one time it had a rave recommendation from a satisfied blue-chip customer, along the lines of "we found thousands of defects in mature code that had been shipping for years".

Your first reaction is: nah, those are just false positives.

Then when you dig deeper you realize, yes, in very large systems, they can and do find thousands of defects.... BUT...

Nobody gives a shit.

Why? Think for example about indexing 1 beyond the end of an array.

That's a defect. A bona fide bug.

But if there is just alignment padding beyond that array, nobody cares. Your test team isn't going to find it, your customer isn't going to find it, so who cares?

And that's the case with 99% of the defects remaining in mature shipping software. It's not that there are no defects, there are lots. They are just not customer impacting. (99.9% of the time)

Change your compiler, change your compiler version, change your optimizer settings, make a change that changes the memory layout....

And a large pile of brown stuff hits the big fan.
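Here's a contrived little C++ sketch of exactly that kind of latent bug (invented struct, not real product code):

    #include <cstring>
    #include <iostream>

    struct Record {
        char name[6];  // on typical ABIs, 2 bytes of padding follow before 'id'
        int  id;
    };

    int main() {
        Record r{};
        // A bona fide off-by-one: strcpy writes 7 bytes (6 chars + '\0')
        // into a 6-byte field.  While the stray byte lands in alignment
        // padding, no test and no customer ever notices.  Change the
        // compiler, the packing, or the surrounding layout, and it silently
        // tramples 'id' instead.
        std::strcpy(r.name, "ABCDEF");
        r.id = 42;
        std::cout << r.name << ' ' << r.id << '\n';
    }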

Yes, such companies _do_ unit test, they do have regression test suites, ..... you don't get that big without some pretty advanced processes.

The fact is there are products out there that literally have man-centuries of coding effort in them.... and are going to get many more man-decades of work (and subtle bug injection) before being shelved.

The indexing-one-beyond-the-end-of-an-array problem is conceptually simple and simple to fix.

However, I have come across quite a few such "no impact on customer" bugs where the fix would require largish design alterations and hence more risk to the release schedule than just leaving the bug in!

Laziness and Lenience

The advantages of "in-order" evaluation include determinism. The costs include: loss of parallelization opportunity, a binding to the syntactic form, poor semantics when dealing with macros and other forms of extensible syntax (i.e. left-to-right might not work for a macro, and might change if a function is replaced or walked by a macro), and a tendency to do work even if it isn't necessary.

I've been struggling off and on with the problem of order-of-evaluation semantics when executing procedures as arguments to a function. John's OP rant has brought the struggle back on again.

I'm somewhat partial to the evaluation of arguments to a function being some combination of lazy and concurrent. Not only is the 'order of execution' undefined, but an argument might not be executed at all if its return value isn't necessary. Conversely, an argument might be executed even though its return value turns out not to be necessary (i.e. based on anticipation of need). If programmers wish to achieve determinism at the procedure or behavior layer, they must explicitly sequence their behaviors.

Those semantics are (I suspect) easier to understand and explain to programmers: an argument to a function described by a behavior is considered a proposed means to obtain a value, should it be necessary, rather than a command to obtain it. This allows programmers, in a commonly occurring context, to treat procedures as "behavioral abstractions of value discovery" rather than "commands".
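A very rough C++ approximation of the flavour (invented names; this is just a sketch of the "proposal, not command" idea, not an implementation of the language I'm describing):

    #include <functional>
    #include <iostream>

    // An argument arrives as a *proposal* to compute a value, which the
    // callee may force once, many times, or never.
    template <typename T>
    using Proposal = std::function<T()>;

    static int choose(bool wantFirst, Proposal<int> a, Proposal<int> b) {
        return wantFirst ? a() : b();   // only one proposal is ever carried out
    }

    int main() {
        auto cheap = []() { return 1; };
        auto noisy = []() { std::cout << "noisy ran\n"; return 2; };
        // "noisy ran" never prints: the value was not needed, so its
        // proposal was never executed.
        std::cout << choose(true, cheap, noisy) << '\n';
    }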

Admittedly, I do have a deterministic pure functional core, and support for both distributed transactions and futures to help tame the concurrency. A typical large expression is purely functional at its center, with data-binding around it. One may functionally compose procedures rather than execute them at the edge. Behavior based discovery of values would only be used where truly necessary, and I suspect such discovery would usually be largely independent of other arguments to the same expression.

I can only imagine that "everything is potentially parallel unless explicitly marked otherwise" would result in one fantastic mess in a language like Ada or Scheme, which lack sufficient provision for taming fine-grained concurrency. But other languages are taking this route, Fortress among them.

How to provide determinism in an imperative language

For those interested in this topic, I'd like to mention a paper that several of my students and I wrote on the topic of how to provide determinism within an imperative language (namely, Java):

Matthew Finifter, Adrian Mettler, Naveen Sastry, and David Wagner. Verifiable Functional Purity in Java. ACM CCS 2008. http://www.cs.berkeley.edu/~daw/papers/pure-ccs08.pdf

We adopt an object capability point of view: we say that the base language should be deterministic, and that access to non-determinism should require a capability. Thus in our language one can tell from the type signature of a method whether or not it is deterministic (functionally pure, actually). We also discuss a number of applications of functional purity and determinism to computer security.
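To give the flavour in a few lines, here is a loose C++ sketch of the idea (not the paper's actual Java mechanisms): non-determinism is only reachable through a capability object, so the possibility of non-determinism is visible right in the signature.

    #include <iostream>
    #include <random>

    // A capability object that is the only available source of non-determinism.
    struct EntropyCap {
        std::mt19937 rng{std::random_device{}()};
        int roll(int sides) {
            return std::uniform_int_distribution<int>(1, sides)(rng);
        }
    };

    // Deterministic: the signature shows it has no access to non-determinism.
    int area(int w, int h) { return w * h; }

    // Potentially non-deterministic: it must request the capability explicitly.
    int jitteredArea(int w, int h, EntropyCap& entropy) {
        return w * h + entropy.roll(3);
    }

    int main() {
        EntropyCap entropy;
        std::cout << area(4, 5) << ' ' << jitteredArea(4, 5, entropy) << '\n';
    }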

So you are equating

So you are equating determinism with functional purity? I thought determinism had more to do with how well-defined the behavior is, whether that behavior is simple (functional) or complex (imperative). To me, determinism has always been an issue of semantics: are the results easily determined or not...

Determinism != functional purity

I'm not equating determinism with functional purity. Functional purity requires (a) deterministic behavior and results, plus (b) absence of side effects. So determinism is part of functional purity, but not the whole story. We argue in the paper that a deterministic language -- in particular, a deterministic object capability language -- can facilitate reasoning about functional purity and determinism properties of code.

Perhaps I should explain how I'm using the terms: a deterministic method is one whose results and behavior are a fixed (deterministic) mathematical function of its inputs. A deterministic language is one where every operation is deterministic: it does not provide a way for behavior or results to vary depending upon something other than explicitly provided inputs (e.g., it does not provide ambient access to a random number source, a clock, or any other non-deterministic value). If that's not what you meant by these terms, I apologize for any confusion I may have triggered.

The paper probably presents our argument better than I can do off-the-cuff, so I apologize in advance if I've been unclear or imprecise.

I understand what you are

[Mis-post, should be under David Wagner's Determinism != functional purity post]

I understand what you are saying, but your first post equates determinism and functional purity. Without IO, most any language, imperative, functional, or otherwise, has semantics that are deterministic. Functional purity enables easier reasoning about deterministic behavior because it limits inputs (the store), but of course is not necessary.

I'm actually interested in languages that are purposely non-deterministic, in the sense that behavior is not fixed by the code, even if no IO is involved. You can do a lot of fun stuff with such a language... think fuzzy reasoning and fuzzy programming based on natural language instructions (e.g., "draw a big dog on the screen" has many interpretations; a deterministic language is useless here). The ability of a language run-time to improvise given incomplete information or unforeseen circumstances seems like a powerful and useful concept.

Several Language Revisions of Ada

To my knowledge, neither of the revisions (1995, 2005) even addressed the facts that functions might have side effects and that the order of evaluation of expressions is undefined. Given that they did fix some arcane issues with generics and elaboration (initialization), this strongly suggests that production systems aren't being bitten by this issue.

In my experience, functions get limited use, as Ada code relies much more on procedures than on functions. Most functions don't have any side effects, and code reviews catch these kinds of style issues. The few that do have side effects do so precisely because you are hiding an entire system behind the function call and the other parts of the expression wouldn't have access to it. Most Ada compilers don't do anything too fancy in terms of reordering the evaluation of expressions that aren't of the built-in types.

Have you ever observed a problem "in the wild", or is this more of an abstract concern? If it is the latter, then I would agree that I wouldn't design a language that way today, although I would say that much tighter control over side effects is the way to go, rather than pinning down the order of evaluation of expressions.

In the wild...

Well, I mostly use gcc, and they're pretty good about it. I have seen a couple of posts go by over the years on the gcc developer list that suggest they are quietly aware of the issue.

I don't know whether their reluctance to make such changes lies in their awareness of the cost to their users... or in their awareness that any such change results in a storm of bug reports of the form "GCC is Borked! Package XXX worked for all previous versions of gcc, but not version N.M.O!".

I suspect the latter. The gcc team are also pretty well behaved about adding warnings where such changes may break dodgy code. (Think -Wcast-align and -Wstrict-aliasing.)

Even so I have certainly met and cleaned up a bunch of bugs emerging from the woodwork in all the cases I have mentioned.

However, this forum is about language design far beyond the cast-in-reinforced-concrete realm of the languages supported by gcc, which is why I raised the point.