Is it possible to write code that works the very first time?

Is it possible to write code that works the very first time (without testing it)?

Carlo Pescio, in his post, quotes C. A. R. Hoare:

"there are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult"

Then he follows the first way in a real case, applying a sort of Compositional Correctness, as he calls it.

You can read between the lines a critique of TDD, which I share, and an interesting thought:

“a real economic theory for software design” should help to minimize the cost of changes basing design on

“a number of useful, small abstractions that do not change at all, or rarely do, because they're in frictionless contact with the nature of the problem”


The Right Stuff

Figuring out the right abstractions 'the very first time' is unlikely. Even requirements are usually incomplete and ambiguous.

I have written medium-sized programs (up to 20k lines) that work correctly the first time. I can't recommend the practice, but sometimes I don't have an implementation of the API I'm developing against (e.g. due to concurrent development efforts).

I am always surprised and suspicious when this happens - I seem to find it unintuitive that code works; I'm so much more used to it breaking.

Modulo Not Knowing What You Want...

...yes. Just write strongly-specified functions in the type-theory-based proof assistant of your choice.

The joker in the deck is that writing strongly-specified functions is a lot harder than it sounds. The hope of a lot of us in the type-theory community is that at least we can automate a lot of the tedium in developing the proof parts (as opposed to computational parts) of such developments. Pretty much everything that Adam Chlipala has done is relevant here.
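
To make "strongly-specified" concrete, here is a minimal sketch in Coq, in the spirit of the examples in Chlipala's CPDT (the name pred_strong is mine, not from any particular library). The point is that the return type itself carries the correctness condition, so a well-typed definition is already a proof about its own output:

(* A strongly-specified predecessor: the subset type forces the result
   to really be the predecessor of a nonzero input. *)
Definition pred_strong (n : nat) (pf : n <> 0) : { m : nat | n = S m }.
Proof.
  destruct n as [| m].
  - exfalso; apply pf; reflexivity.  (* n = 0 contradicts the hypothesis *)
  - exists m; reflexivity.           (* for S m, the witness is m *)
Defined.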

As D. Knuth once said:

As D. Knuth once said: "Beware of bugs in the above code; I have only proved it correct, not tried it." (Sorry, I couldn't resist the temptation to reply with another quote).

Now, seriously: isn't this about reinventing the power of Scheme/Lisp macros in .NET?

I feel compelled...

...to observe that Dr. Knuth was referring to a pencil-and-paper proof of correctness of an algorithm that he then implemented in Pascal.

Today we can develop a piece of software and its correctness proof in a proof assistant like Coq or Isabelle and extract running code that is indeed, to use the technical term, "correct by construction." A small, simple example of this for Coq is this quicksort implementation. Whether it does, in fact, sort a list doesn't need to be tested.
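
The extraction step is a single command; a rough sketch, assuming a function named quicksort is in scope (the exact vernacular varies a little between Coq versions, and the file name here is just a placeholder):

(* Recent Coq versions need the extraction plugin loaded first. *)
Require Extraction.
Extraction Language OCaml.
(* Write an .ml file containing quicksort and everything it depends on;
   Prop-level proof content is erased, leaving ordinary executable code. *)
Extraction "quicksort.ml" quicksort.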

And Coq was proved sound

And Coq was proved sound by...? And the code it generated is compiled by a certifying compiler? And Coq was compiled by the same certifying compiler?

Two Things...

The first steps to the answer to the first question are found in Coq in Coq. For now, the more pertinent answer is that Coq satisfies the de Bruijn criterion: its proof checker is small and simple enough to be hand-audited. If you're satisfied that the proof checker is sound and faithful, then you can have the same confidence for all the proofs that it checks. Is your claim that Coq's proof checker is unsound? If so, demonstration, please.

With respect to the compiler, the answer is similarly "not yet," but keep an eye on OCaml Light. In the meantime, why is this even an issue? It's not as if having to trust the compiler you compile the extracted code with is somehow different from having to trust the compiler used to compile the code that isn't correct by construction. The observation is merely that you don't need to test the code you wrote, not that you can have any more faith in your compiler than before, unless, indeed, it is a certified compiler.

My point was perhaps lost in

My point was perhaps lost in the snark. You were implying that if the code was "correct by construction" you shouldn't have to try it, but the OP asked about code that "works the first time"--therefore implying one would actually execute the code. I'm not sure the post was about being correct by construction.

I didn't mean to derail the discussion with paranoia, but Reflections on Trusting Trust is instructive here...eventually humans do have to place some measure of trust in the system, whether that be in an implementation or a tool that processes the implementation. Proving this correct automatically is nice, but extensive testing is inescapable, and for sure the two are not mutually exclusive.

Definitions

Ben L. Titzer: You were implying that if the code was "correct by construction" you shouldn't have to try it, but the OP asked about code that "works the first time"--therefore implying one would actually execute the code. I'm not sure the post was about being correct by construction.

My observation reduces to: the answer to the question "Can you write code that works the first time?" is "Yes, if the code is correct by construction." That is, the post was about being correct by construction by definition. Later posts have explored some of the implications of the observation, but haven't deviated from the answer, by definition.

I didn't mean to derail the discussion with paranoia, but Reflections on Trusting Trust is instructive here...

No, it isn't, and I'd be really happy if people would quit trying to inject it where it doesn't apply. The question was whether it's possible to write code that works the first time. No one asked about malicious compilers. It's really quite simple: I can write some code in OCaml and apply the customary level of trust in the OCaml compiler, and get a runnable result with confidence-in-correctness X. The question is: is there a way that I can write OCaml code that "does the same thing" with confidence-in-correctness Y, where Y >> X and is in fact essentially 100% modulo compiler bugs, using the same compiler? The answer is yes: I can write strongly-specified functions in Coq and extract them to OCaml. Period, the end.

Proving this correct automatically is nice, but extensive testing is inescapable...

False.

...for sure the two are not mutually exclusive.

True. But you shouldn't waste your valuable time testing what's been proven correct.

Testing is always a good

Testing is always a good idea, for example to catch problems such as your extracted code relying on integers being arbitrarily large, which holds in Poly/ML but is not guaranteed by every Standard ML implementation (e.g. SML/NJ).

A way to better control problems like that is not only to do code extraction / code generation, but also to tightly control the environment in which the code runs.

What's in YOUR TCB?

That's an interesting observation. It would seem that Standard ML implementations have fewer excuses than most targets, by virtue of at least having a formal semantics.

More germane to the thread, though: this isn't an observation about proof and extraction. A human being writing what they thought was correct Standard ML code would also have to know about the distinction between the SML/NJ and Poly/ML implementations.

Nevertheless, it seems to me that your point remains, which I take to be: your extractor, target compiler, OS, device drivers... are part of your (very large, very complex) TCB. This is most certainly true, and a worthwhile long-term target for certification.

Choose your own "waste"

you shouldn't waste your valuable time testing what's been proven correct.

Why not? Testing and proofs are both means of increasing trust in code. Trust is a human value judgment and not a mathematical concept. Would you trust your life to an avionics system that has been "proven correct" but its code has never been executed? I wouldn't.

So personally, I wouldn't consider it a waste of time to test code that has been certified. Sure, maybe I wouldn't spend an inordinate amount of time testing every input permutation to said certified code, but I sure would like to see it execute correctly at least once before I trust it.

OK

Ben L. Titzer: I sure would like to see it execute correctly at least once before I trust it.

Why?

Specification for Pudding

For humans, believing often comes with the seeing or tasting.

OK

But that's an observation from psychology, not mathematical logic, so it's not obvious to me why I should care.

Your abstraction used to

Your abstraction used to prove correctness could have been wrong? Reading this thread, I think you are assuming that abstraction wasn't necessary and that the code was proven correct directly. But that would also require a perfect model of the environment that the code was running in, and in the case of an avionics system...that would be fairly impossible. Instead, you need to test a lot, and perhaps do lots of simulation.

Circles

Sean McDirmid: But that would also require a perfect model of the environment that the code was running in, and in the case of an avionics system...that would be fairly impossible.

On the contrary; such modeling, especially in avionics, is done all the time.

But remember what the question is: would I trust avionics software that's correct by construction without tests? The answer to that question is yes, with it being completely understood that the "TCB" in this instance is the plane itself, all of its physical systems, etc. which I would indeed expect to have undergone rigorous testing.

Are you assuming that

Are you assuming that simulation against some approximate model of the environment is a form of testing or of formal verification? In either case, you would still have to keep testing at each step in the composition process.

And anyway, such testing is the law.

TCB

In a given verification context, if something is part of the TCB, what you call the means by which you gain what trust you have in it doesn't really matter much. The trust isn't optional, which is why "Trusted Computing Base" is a misnomer; it's more like "Computing Base You Have No Choice But To Trust".

why I should care

why I should care

Isn't caring, fundamentally, a psychological rather than mathematical issue? Why should you care whether you care?

Anyhow, you try giving me a specification for pudding. We'll see how mathematical it is. ;-)

As Sean notes, it's easy to get the abstractions wrong. That will make it very difficult to write code that works correctly. Coq can provide verification. Testing can provide validation.

Psychology vs. Math :-)

dmbarbour: Isn't caring, fundamentally, a psychological rather than mathematical issue? Why should you care whether you care?

I guess that depends upon where you fall on the normative vs. descriptive question in economics. Do you believe you care, i.e. have a utility function? If so, why? At core, I care about software correctness for the same reason I think most people care about what they do: they wish to improve the state of the world. In software even more than most domains, the mapping from "improve" to "reduce entropy in" is quite direct. Correct software reduces entropy; incorrect software increases it. So I attach a high premium to software correctness, and like any other premium, it has to be weighed against other economic considerations. This is why I pay so much attention to efforts to improve the automation of certification, verification, etc.

As Sean notes, it's easy to get the abstractions wrong. That will make it very difficult to write code that works correctly. Coq can provide verification. Testing can provide validation.

It's an evocative turn of phrase, but what does it mean? How does this perspective account for the 2nd paragraph of Finding and Understanding Bugs in C Compilers, for example?

Compilers are a (relatively) Easy case

A compiler's job is straightforward. There shouldn't be any observed behavioral differences between the input and output. So the issue of knowing and formally specifying 'what we want' isn't as difficult - the only problem would be if we incorrectly model the behavior of the input or output language.

But, if we do incorrectly describe a language, e.g. get some subtle semantics wrong on an x86 macro, we would need testing to find out.

Language design is a bigger challenge. Languages are user interfaces, of a sort, where users happen to be application developers. Given a powerful framework, we can verify that the language has arbitrary properties. But only by using (aka beta testing) the language can we ever validate our language design - i.e. that the properties we specified are properties we actually want.

why test?

Mainly because correctness with respect to a spec doesn't necessarily guarantee intended behaviour. When that means my word processor typesets like a POS, that's one thing. When lives are in the balance, as is the case with avionics and many other embedded control systems, it's a whole different kettle of fish.

True...

...but I think misses two critical points:

  1. The proven part vs. the TCB: No one, least of all me, is arguing that the TCB doesn't need extensive testing, because it's the part you have no choice but to trust.
  2. Proof vs. testing: I can't do any better than Edsger Dijkstra here. Testing can only demonstrate the presence of bugs, never their absence. Proof, on the other hand, has the power of universal quantification: ∀x... Of course, the intention behind the spec has to be correct, but that's true whether the spec is formalized or not, and avionics is a singularly poor example of the potential problems there: the expected behavior of avionics systems is exceedingly well defined.
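
To put the asymmetry in code rather than slogans, here is a tiny Coq sketch using only the standard list library (nothing here is specific to the avionics example; it just contrasts the two quantifiers):

Require Import List.
Import ListNotations.

(* A test checks one point of the input space... *)
Example rev_one_case : rev [1; 2; 3] = [3; 2; 1].
Proof. reflexivity. Qed.

(* ...while a proof quantifies over all of it. *)
Theorem rev_all_cases : forall (A : Type) (l : list A), rev (rev l) = l.
Proof.
  intros A l; induction l as [| x l IH]; simpl.
  - reflexivity.
  - rewrite rev_app_distr; simpl; rewrite IH; reflexivity.
Qed.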

Poor choice ...

... hope you are not a nervous flyer.

Aircraft (especially high performance ones) can enter non-linear parts of the envelope where performance can't be specified; it can only be tested numerically and physically to check that it doesn't appear to do anything too stupid.

This is because the whole system including the software being developed forms a non-linear dynamical system and may exhibit chaotic behaviors in some regimes. And things like windshear, turbulence or combat maneuvering can push the aircraft into these parts of the envelope.

Of course this is an example where a specification can't be produced, and it does not nullify the argument that, if a specification we are willing to trust can be produced, then proof will always trump testing.

The requirements (or

The requirements (or informal spec) are vastly more expressive than a formal spec. For example: "the text in the word processor should be easy to read". Thus, in your language: the formal spec is part of your TCB relative to the requirements. The disconnect between the requirements and a formal specification is usually much larger than the one between the formal specification and the implementation. So the formal specification is just as likely, if not more likely, to fail to match the requirements as an implementation is to fail to match the spec.

The requirements are what we really care about, not the formal spec. Formal verification is not magic; it cannot ensure, any more than testing can, that the requirements are satisfied. The right question to ask is not "Which technique (user testing, unit testing, formal verification, model checking, etc.) is best to ensure that the software satisfies the formal specification?". Using this as a goal obviously puts formal specification at a huge artificial advantage. Compare: "Which technique is best to ensure that the software passes the unit test suite?".

The question that should be asked is "Which technique (or combination of techniques) is best to ensure that the requirements are satisfied?". As "the text should be easy to read" is already a very detailed requirement, a human testing the software is required for virtually all software. More commonly, requirements are even more vague: "the client likes the way the website looks".

The same thing applies even to avionics software. One requirement is "the plane shouldn't crash". This requirement cannot be formalized, so you have to rely on a formal specification that details exactly what the output voltage to the flap motors should be given the input voltage from the control column, or which formula to use to display the speed on the speed indicator given the voltages from several sensors. How do you get evidence that this formal specification is good? You test the implementation.

So even in the prime area where verification is used, things turn out not to be that rosy. Again, the hardest part is finding a good formula for the speed given the voltages from the sensors, not typing that formula into the software. But because the damage done by a mistake in the software is so great, it makes cost-benefit sense to also do formal verification. And even in this area I believe that they don't do full verification. They do verify some properties like "no division by zero can ever happen", but not full formal verification.

Well

... it's easier for me to judge the Haskell "quicksort" correct than it is for me to judge the Coq type correct.

Program Fixpoint quicksort (l : list t) {measure length l} :
{ l' : list t | sort le l' /\ permutation l l'}

That looks like a reasonable signature, but where are 'sort' and 'permutation' defined? By the time I go look at those, I've put just as much thought, with just as much room for error (IMO), into judging the types correct as it would take to judge the Haskell correct.

Of course once I've used the specification for Quicksort to prove the correctness of some other code in Coq a few times, I start to have pretty good confidence that I got the specification right. i.e. after I've tested it.

On Trust

I'm not suggesting you shouldn't trust your understanding of Haskell. Indeed, I'm generally happy to trust Haskell code. The observation is only that, given such an understanding of the types, Coq's guarantees are stronger than those of, say, GHC.

At least until we have a formal semantics for Haskell and a certified GHC, that is. :-)

"sort" is in the Coq standard library, here. "permutation" can be found here.

Trusting this (again) comes down to trusting the proof checker thanks to the de Bruijn criterion. Maybe you trust it more due to the "Coq in Coq" work, maybe you don't. Skepticism is a healthy thing. But there's a specific claim being made: "The Coq proof checker is sound." It's certainly possible to offer a demonstration to the contrary; that's the whole point. But I'm aware of no such demonstration, and I don't ever expect to see one.
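
For readers who don't want to chase the links: the flavor of those two definitions is roughly the following (a paraphrase written here for illustration, not the verbatim library code, which is stated more generally):

Require Import List.

Inductive sorted : list nat -> Prop :=
| sorted_nil  : sorted nil
| sorted_one  : forall x, sorted (x :: nil)
| sorted_cons : forall x y l,
    x <= y -> sorted (y :: l) -> sorted (x :: y :: l).

Inductive permutation : list nat -> list nat -> Prop :=
| perm_nil   : permutation nil nil
| perm_skip  : forall x l l',
    permutation l l' -> permutation (x :: l) (x :: l')
| perm_swap  : forall x y l,
    permutation (x :: y :: l) (y :: x :: l)
| perm_trans : forall l l' l'',
    permutation l l' -> permutation l' l'' -> permutation l l''.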

I don't think this is the point

Knuth wasn't saying "I've only proven this code correct, I haven't certified the compiler." Proof establishes confidence in a different way than testing, so I'm not saying it's a waste of time. But specifications can have bugs just like code, and in this case the spec is at least as complex as the code.

Spec as Complex as Code

In which case? The quicksort case? If so, I'm going to have to respectfully disagree. I have no trouble at all understanding that the result of a quicksort needs to be a sorted permutation of its input. :-)

The "spec is at least as complex as the code" canard comes up a lot, and always baffles me. The spec is just the type, and the code is the proof. It's very rare indeed that any given type is as complex as its proof, and the construction of the proof is subject to automation and being mechanically checked. In practice, the process is used to create software that is used in cases where its failure means someone dies. I'm curious how critics of certified software development explain its success.

I'm curious how critics of

I'm curious how critics of certified software development explain its success.

For the most part, they don't know of its success.

If people see a certified browser (e.g. with proofs of real-time and security and JS JIT) succeed in the market, that might change.

Mainstreaming

dmbarbour: If people see a certified browser (e.g. with proofs of real-time and security and JS JIT) succeed in the market, that might change.

I think this is exactly right. Most of the interesting effort in certified software development is focused precisely on making certification practical for the development of software that is real-world both in terms of semantics (side-effecting, concurrent, running on X86_64 and/or ARM v8 instruction sets...) and in scope (web browsers, JavaScript interpreters/compilers, garbage collectors...) I expect it will take roughly another decade before this becomes mainstream. For now, we can see some lighthouse beacons pointing the way, such as Adam Chlipala's work on Bedrock.

Let's compare apples to apples

I have no trouble at all understanding that the result of a quicksort needs to be a sorted permutation of its input.

Right, but as I alluded to a few posts ago, for an apples to apples comparison, you also need to check the definition of 'permutation' and 'sorted'. Judging the correctness of all of the above is in the same league as just judging that the quicksort is correct. For example, I'd argue that the spec you linked to has a bug. The documentation at the top claims that the quicksort implementation is stable, but the type doesn't record this fact.

And BTW I'm not a critic of certified software.

Tongue in Cheek

You're right; my reply was a bit too tongue-in-cheek to be helpful.

More seriously, I don't worry about "permutation" and "sort" precisely because they're in the standard library and have proofs associated with them. I've been using Coq long enough, though, that when I did look at their definitions, I didn't have any more trouble understanding/"trusting" them than I do, e.g. a body of OCaml code that I look at. YMMV, of course.

> For example, I'd argue that the spec you linked to has a bug. The documentation at the top claims that the quicksort implementation is stable, but the type doesn't record this fact.

This is the philosophically interesting bit to me. I would argue that it's perfectly fine for the comment to make an observation about the implementation, but "the type doesn't record this fact" just means that, by definition, it isn't part of the spec. In other words, there's no guarantee that the sort will be stable. The comment is just pointing out that this one happens to be.
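
For what it's worth, if someone did want stability in the spec, it could be recorded as an extra conjunct along these lines (a sketch with made-up names, assuming elements carry a nat key; the linked development would state it over its own element type):

Require Import List.

(* A sort is stable iff, for every key k, filtering the output by k
   yields exactly the same sequence as filtering the input by k. *)
Definition stable {A : Type} (key : A -> nat) (input output : list A) : Prop :=
  forall k,
    filter (fun x => Nat.eqb (key x) k) output
    = filter (fun x => Nat.eqb (key x) k) input.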

> And BTW I'm not a critic of certified software.

That comment wasn't directed at you personally. I must apologize if that wasn't sufficiently clear.

"quicksort" spec...

This is the philosophically interesting bit to me.

To me as well. I find it interesting that the so-called simple specification of quicksort above would not be able to distinguish quicksort from, say, bubble sort. I find it hard to call something a specification for quicksort that doesn't establish that it is, in fact, quicksort as opposed to any other sorting algorithm. Could such a spec really be kept simpler than the implementation?

And then, what is quicksort, anyway? Must quicksort involve in-place mutation? For instance, people continue to argue about whether the famous Haskell "quicksort" actually deserves the name. (I'd say it probably doesn't, actually.) Seems to me that flaws or gaps in the spec are at least as likely as errors in the implementation, in this case.

Intensionality vs. Extensionality

Matt Hellige: To me as well. I find it interesting that the so-called simple specification of quicksort above would not be able to distinguish quicksort from, say, bubble sort. I find it hard to call something a specification for quicksort that doesn't establish that it is, in fact, quicksort as opposed to any other sorting algorithm.

Sure, and in that respect I would agree that the title of the file could be considered misleading. But I don't care about the title of the file any more than I care about the textual comments in the file, i.e. they're both human-oriented metadata. Rather, I look at the type of the function and see that it comports with what I expect to be proved about something that claims to sort a list of some orderable type.

Now, it could be that I'm not satisfied with that for the reason you mention: I want intensionality rather than extensionality. In that case, I would probably use a verification tool like Frama-C and write the code in a conventional language like C, annotating it with the conditions I wanted to use to prove that it faithfully implemented quicksort.

So the bottom line for me, I guess, is that intensionality vs. extensionality is the pivot on which my decision hinges: whether to write code that is correct by construction, or to use a verification condition generator on hand-written code.

Re: Intensionality vs. Extensionality

Rather than treating this as a barrier or a pivot, I would suggest you explicitly model the intensional properties you need (timing, stability, space, et cetera) and provide them as part of the type and proof. I.e. there are no 'true' intensional properties, just properties you choose not to observe.

For my own work, I've been doing a lot of explicit modeling of time, and I'm somewhat interested in making my models of space more explicit too. Maybe I'll do that once I have grasped my model well enough to formalize it in Coq.

I'd also love to see what can be done with modeling 'energy' costs in a temporal semantics. I think it's an area we could do much better for mobile devices.

More seriously, I don't

More seriously, I don't worry about "permutation" and "sort" precisely because they're in the standard library and have proofs associated with them.

In this case you were lucky because exactly what you needed happened to be in the standard library. The standard library happens to have a sort predicate and a permutation predicate? Seems like something was designed to showcase how to write the spec for a sorting function.

In practice there are at least two complications:

1. Nearly all problems aren't as well or easily defined as sorting.

2. If you do have a spec, it's likely to be much more complicated than implementing the software in the first place. How would you formalize the software running LTU in Coq? How many orders of magnitude bigger is the spec than the actual software running LTU? The spec is harder to get right than the implementation: the implementation is in some ways a clearer spec than a spec in Coq would be.

Algorithmic problems are a very very small part of software development.

Baselines

Jules Jacobs: In this case you were lucky because exactly what you needed happened to be in the standard library. The standard library happens to have a sort predicate and a permutation predicate? Seems like something was designed to showcase how to write the spec for a sorting function.

From this, I can only infer that you're unfamiliar with Coq, and possibly discrete math generally. Does it really seem to you that the only reason a proof assistant for higher-order type theory would include such predicates would be to show how easy it is to declare a type that means "sorted list?"

Nearly all problems aren't as well or easily defined as sorting.

You certainly won't find me arguing that nearly all software is sufficiently specified.

If you do have a spec, it's likely to be much more complicated than implementing the software in the first place.

Why do people keep saying such trivially obvious nonsense without offering a single iota of evidence? If that were true, then most intuitionistic logical theorems would be as long or longer than their proofs, by the Curry-Howard Isomorphism. It isn't true.

How would you formalize the software running LTU in Coq?

I don't know. What properties would you like to formalize?

How many orders of magnitude bigger is the spec than the actual software running LTU?

Zero, unless what you want to do is just take the implementation and say something formal about each state transition in the implementation. This reflects the "intensional vs. extensional" post elsewhere in the thread. Your argument seems to assume intensionality, which isn't necessary to have useful formalizations in the real world.

If I wanted to write LtU today and know that it was immune to the usual range of injection attacks and adhered to the security policy I wanted, I'd write it in Ur/Web and be quite happy even though my "theorems" would "only" be Ur/Web types, and my "proofs" would "only" be Ur/Web functions. The Curry-Howard Isomorphism, being an isomorphism, works both ways. Similar thoughts apply to Opa. Yes, my TCB would still be a lot larger than I'd really like. Until Adam Chlipala finishes Bedrock, though, I'd live with it, and it would still be a significant improvement over LtU's implementation today (sorry, guys).

The spec is harder to get right than the implementation...

True.

...the implementation is in some ways a clearer spec than a spec in Coq would be.

False, assuming you actually know Coq and whatever formalization framework you'd actually use.

I'll ignore the ad hominems

I'll ignore the ad hominems, other than saying that math is my primary occupation. Yes, it does really seem to me that you got very lucky in the case of specifying quicksort. Specifying other real-world things won't be nearly as easy.

Why do people keep saying such trivially obvious nonsense without offering a single iota of evidence? If that were true, then most intuitionistic logical theorems would be as long or longer than their proofs, by the Curry-Howard Isomorphism. It isn't true.

If it's obvious nonsense, perhaps you could explain why it is obvious nonsense instead of getting upset about it? Modern web app libraries and frameworks contain massive amounts of domain-specific help for web apps that Coq doesn't contain. Hence the spec in Coq will obviously be much longer. I can write a moderately complex web app like LTU in Rails in a couple of hundred lines. Now try to formalize the same in Coq.

Libraries

That Coq doesn't have libraries for all the HTTP and Rails stuff is true, but this is not a property inherent to Coq.

How should we make a fair comparison about the general approaches? I can think of only two ways: either you assert that Coq cannot provide libraries that would allow a Rails-like app in a couple hundred lines (i.e. are there any particular functions or objects that cannot be formalized in Coq?), or you measure all the libraries and implementation details all-the-way-down to a common TCB (Bedrock, metal, OS, whatever).

The comparison I'm making is

The comparison I'm making is the most obvious one: a programmer trying to decide whether he wants to use Coq to formalize his web app, or use Rails to implement it. This programmer doesn't have to reimplement Rails or the network stack, but if he chooses to use Coq he *is* required to do all these things. That is ultimately the comparison that will determine the success of approaches like Coq. I don't think the result of this comparison is at all controversial.

I'm not saying that proving code correct can never work, just that in the foreseeable future it is restricted to an exceedingly small class of problems.

Specifications may, further in the future, have an advantage because they are easier to write than the implementation. For now that is not the case for nearly all software, and it is possible that implementation languages will stay ahead of, or sufficiently close to, specification languages that for most real-world tasks it is more productive to just do the implementation rather than formal specification plus implementation. Another way to view it is that for more and more expressive specification languages, we figure out a way to make them executable.

Somewhat Agreed

I think our languages are gradually meeting in the middle, and that this can be observed today... i.e. with verifying compilers on one side, and correct by construction on the other.

Worse-is-better languages tend to be quite effective at adopting features from their competition.

Reflection

I do owe you an apology. I was reacting to "Seems like something was designed to showcase how to write the spec for a sorting function" which, given "In this case you were lucky because exactly what you needed happened to be in the standard library" I now see I took too literally. "Seems like" is exactly right: it makes the declaration of something that returns a sorted list "seem like" those definitions are there for that purpose.

Jules Jacobs: "If it's obvious nonsense, perhaps you could explain why it is obvious nonsense instead of getting upset about it?"

I did. You even quoted it: "If that were true, then most intuitionistic logical theorems would be as long or longer than their proofs, by the Curry-Howard Isomorphism." The argument is even more straightforward from Algorithmic Information Theory: the Minimum Description Length principle wouldn't work if your claim were true.

Again, to be clear: it's certainly true that you can write a "spec" (theorem) that is no less complex (formally, whose Kolmogorov complexity is no less) than its implementation (proof). It's also true (in fact, it's one of the defining discoveries of Algorithmic Information Theory) that this characterizes most mathematical objects. That is, most mathematical objects aren't amenable to any description that's simpler than they themselves are. But it's precisely the computable domain that doesn't suffer from this problem, by virtue of the computable domain being so vastly much smaller than the totality of mathematical objects.

Modern web app libraries and frameworks contain massive amounts of domain-specific help for web apps that Coq doesn't contain. Hence the spec in Coq will obviously be much longer. I can write a moderately complex web app like LTU in Rails in a couple of hundred lines. Now try to formalize the same in Coq.

I'm sorry—I really am trying not to offer what you could reasonably interpret as an ad hominem argument—but this is just a muddle. You seem to be conflating "spec" with the entirety of a Coq development, and making no distinction between general-purpose developments and portions that are specific to a particular proof. No one is suggesting that Coq and its standard libraries are, by themselves, going to be of a lot of use in formalizing web applications, but that observation rather obviously has no bearing at all on the relative complexity of a useful specification for a web application such as LtU and, to stick with the comparison suggested, a Rails reimplementation of LtU.

I see how you could read

I see how you could read that as an accusation that the sort and permutation predicates were put in on purpose to shed Coq in a better light, but I didn't mean it like that :)

"If that were true, then most intuitionistic logical theorems would be as long or longer than their proofs, by the Curry-Howard Isomorphism."

I see your point but this is a rather theoretical claim, and it doesn't really help the real world programmer. Moreover I think I did once read a theoretical result that showed that proofs were short relative to the theorems. It was something like for every theorem there exists a proof that is at most polynomially longer than the theorem. However this result doesn't really help real world programmers either...

I'm sorry—I really am trying not to offer what you could reasonably interpret as an ad hominem argument—but this is just a muddle. You seem to be conflating "spec" with the entirety of a Coq development, and making no distinction between general-purpose developments and portions that are specific to a particular proof. No one is suggesting that Coq and its standard libraries are, by themselves, going to be of a lot of use in formalizing web applications, but that observation rather obviously has no bearing at all on the relative complexity of a useful specification for a web application such as LtU and, to stick with the comparison suggested, a Rails reimplementation of LtU.

The problem is that a programmer is going to take exactly that into account when deciding whether to use a formal specification or not. The advantage of a formal specification language is that it is potentially more expressive than an executable language. It is clear that we can devise a formal specification language for real world problems like implementing a web app that is as expressive as any executable language that we have. What formal specification needs in order to be successful is to be significantly more expressive than any executable language that we have. What I've seen in web apps is that the specifications are mathematically so trivial that it would not be hard to make nearly all of them executable. So it will be hard to come up with a specification language that is significantly more expressive than all executable languages.

This also raises the question whether we should perhaps focus our efforts in making more and more expressive specifications executable, instead of making more and more expressive non executable specifications checkable? The latter has the disadvantage that a programmer will have to do both the non executable specification *and* the executable implementation or in other words the proof. If the specification is not much easier to write than the implementation then he is doing almost double the work, and in practice he will choose to use a slightly less expressive but executable "specification" language (i.e. programming language).

Through the Curry-Howard Lens

Jules Jacobs:

This also raises the question whether we should perhaps focus our efforts in making more and more expressive specifications executable, instead of making more and more expressive non executable specifications checkable? The latter has the disadvantage that a programmer will have to do both the non executable specification *and* the executable implementation or in other words the proof. If the specification is not much easier to write than the implementation then he is doing almost double the work, and in practice he will choose to use a slightly less expressive but executable "specification" language (i.e. programming language).

I believe I see your point now, and I agree: if there isn't a significant development process in your endeavors that, as Adam Chlipala puts it, "deserves to be called 'proof,'" then maybe a proof assistant isn't the right tool, but if we don't want to give up strongly-specified functions, then we still (it seems to me) need dependent types, so maybe a switch to Agda, Epigram, ATS, Idris... is warranted.

Spec can (and sometimes must) be light

Your two points are reasonable, though I think over-pessimistic. Here's why.

1. Nearly all problems aren't as well or easily defined as sorting.

You don't need to hunt for "total specification"; a partial specification can already help a lot for confidence. When you use QuickCheck, for example, you usually just "check" simple properties of your code, not functional correctness. But checking that already helps a lot with confidence in your code. Checking "light specifications", e.g. relatively simple algebraic properties, may in practice catch all the bugs in your program.
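
As a tiny illustration of such a "light" specification, here is a sketch in Coq using only the standard List library: it says nothing about what map computes, only that it preserves length, which is exactly the kind of simple algebraic property a QuickCheck user would sample, except proved once for all inputs:

Require Import List.

(* A partial spec of map: length preservation, and nothing else. *)
Lemma map_preserves_length :
  forall (A B : Type) (f : A -> B) (l : list A),
    length (map f l) = length l.
Proof.
  intros A B f l; induction l as [| x l IH]; simpl.
  - reflexivity.
  - rewrite IH; reflexivity.
Qed.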

Let me make an even stronger claim: you *will* in practice catch all bugs in your program with a partial yet reasonable specification. Because Coq-like verification is so difficult in practice, once it's done, even for a simple specification, you have had your eyes on the code a hundred times, and you deeply understand what it does -- and not what you think it does; formal proof is *very* good at hunting delusions.

It's still true that there are problems we don't know how to specify nicely. In this case, if you want a high level of confidence, there is no way out: you need to work more on the specification side -- whether it is a formal specification that can be published in a paper and implemented in a proof assistant, or a huge set of unit tests that in practice overspecify your expected behavior. Specification is like implementation: not free, you have to work on it. But I agree that getting good specifications may not be a common competence among programmers, and is still very much a research topic.

If you do have a spec, it's likely to be much more complicated than implementing the software in the first place.

Then you can go for a partial spec. But I find your claim here quite dubious; good specifications try to abstract from the implementation detail. There may be software where the implementation complexity is so shallow that the code and the proof are equally complex -- I certainly wouldn't want to work on that, sounds boring -- but I don't really see how you could have something "more complicated".

Finally, I would like to insist on something: the real argument is not that formal proof will make your software "provably correct", but that if you have spent enough effort on formalizing your program, it will in practice run without bugs the first time. Which is actually a very reasonable argument, and avoids all that "yeah but the compiler or the runtime or ..." discussion.
However, I still find Matt M's objection about checking your specification interesting. We have developed good engineering practices to work on software, but we're not so experienced with formal specs. Maybe there are engineering principles to develop for proofs and specifications, to help gain confidence in them? I have already tried to provoke discussion about this, with very moderate success.

I agree completely, partial

I agree completely, partial specification helps a lot in catching bugs. Especially if the checking can be mostly automated, I imagine that the number of bugs caught per unit of invested effort will be high. Unit testing is a form of partial specification that has caught on in the mainstream.

But partial specification is not what people are arguing here, and you will probably have a hard time convincing them of partial specification. You lose the total correctness guarantee. You also lose compositionality. Ideally if a procedure C calls a procedure P, then C should only depend on the specification of P for C's correctness. But if you only have a partial specification of P, you also need to rely on the implementation of P. Or in mathematical language: if the proof of theorem C makes use of theorem P, this proof should only depend on the theorem P, not on the details of the proof of P. Not good from a theoretical standpoint.

There may be software where the implementation complexity is so shallow that the code and the proof are equally complex -- I certainly wouldn't want to work on that, sounds boring -- but I don't really see how you could have something "more complicated".

In an ideal world it will not be more complicated. But while implementation languages have been optimized for expressiveness for a long time, coding the same libraries and frameworks in a proof assistant still needs to be done.

Also, it so happens that most of the code written *is* boring (if that is your definition of boring); that's my point :) The effort and risk are in getting the specification right, not in implementing the specification correctly. A requirement of your application might be that it is easy to use. You can't formally specify this. Once you have written a formal specification of how your app looks and how one interacts with it, the hard work has been done. The rest, implementing the specification, is usually trivial. So instead of trying to make the step from specification to implementation less error prone, we should try to make the step from requirement to implementation faster. That way you can experiment more with different user interfaces and achieve your requirement. So you see, this kind of software doesn't have to be boring: there is interesting work to be done when going from requirements to specification or implementation.

Take Google search. The objective is to return search results that the user wants (or more pessimistically, the search results that maximize ad revenue). This is not something that Coq understands. Once you have a formal specification (that is known to be efficiently implementable) of the results that Google should return for a given input, the most difficult and risky work is done.

Finally, I would like to insist on something: the real argument is not that formal proof will make your software "provably correct", but that if you have spent enough effort on formalizing your program, it will in practice run without bugs the first time. Which is actually a very reasonable argument, and avoids all that "yeah but the compiler or the runtime or ..." discussion.

Is the metric of "is it correct when you run it the first time?" actually useful? A metric that is actually useful is which method of programming produces the best result in the amount of time available. This method will most certainly include actually running your code before you deploy it. Even if you do use formal specification, testing is still necessary to see if your formal specification actually implies the requirements. It is easy to forget to formalize something, for example forgetting to formalize performance in a video decoder. I'm sure that as the amount of time available goes to infinity, formal specification becomes the method of choice.

The standard library happens

The standard library happens to have a sort predicate and a permutation predicate? Seems like something was designed to showcase how to write the spec for a sorting function.

I actually *would* expect the stdlib to have a sort predicate if the language could express one: it'd be damning not to provide basic data structures and algorithms over them, and more so to have them uncheckable. Specifying that the algorithm is the intended one (quick, binary, etc.) -- which was a great observation -- is a funny point here.

Despite Paul's reaction, Jules isn't the only one with this sentiment. For example, Philip Wadler, in a (constructive) criticism of the functional programming community wrt mainstream adoption and commercial programming, views (or viewed) the main success story of FP as being the small domain of writing theorem provers rather than these other things that we worry about. Users help define a target, so it's been a great spiral of advances, but not oriented where many would hope.

Instead of just explaining away the technology in terms of being immature and 'coming,' it's worth considering how to actually get there. The work at MSR in another thread (F* and its many previous incarnations) is great not only in getting the various theories to work together but also in getting them to work on problems many practitioners care about (e.g., their paper about securing browser extensions).

FWIW

I agree with this. I was just thinking that F* might be the thing that makes me look at Mono again.

Apart from that, I do think there's value in pushing past what elsewhere I called the "Curry-Howard Lens" to the other side of the proof-assistant/programming language divide. That is, I believe the answer to the original question "Can we write code that works the first time?" is still "yes, by writing strongly-specified functions." It's just that as a practical matter, these strongly-specified functions (using dependent types) might be better expressed using Idris, Agda, ATS, Epigram... than Coq. I believe this is essentially Jules' point also, when he writes about "executable specifications" rather than "unexecutable specifications" that code must be "proven" to conform to.

Note that sort checks

Note that sort checks whether a list is sorted. What I meant to express is that not counting the code of the sort and permutation definitions towards the specification of quicksort gives you an overly optimistic view of how easy it is to write specifications, especially if you're not writing algorithmic code; then it is unlikely that the specification is as simple relative to the implementation as is the case for algorithmic problems.

From Proof to Program

Yes, I see your point (I'm slow, but I usually get there eventually...) :-)

I think this is a good point. Basically, there are some activities where you can reasonably expect a fair amount of your development effort to be on "proofs," which are erased at code extraction time. Formalizing programming language design is a good example of this: there are a lot of proofs involved whose nature is not computational and therefore don't survive extraction, and as Adam Chlipala points out, their structure doesn't mirror the structure of the code, so it's worth having them be "different" than the code.

On the other hand, for things like web applications, there's little that you could reasonably call "proof" that doesn't mirror the structure of the application, so you're probably better off using something that is self-consciously a programming language, albeit one with dependent types. Personally, I'm intrigued by Idris.

Personally, I'm intrigued by

Personally, I'm intrigued by Idris.

Then post about it...

Trust and formal developments

The issue of what constitutes trust, and how to convince oneself that a specification captures exactly the intended properties, let alone that a formal proof (a derivation in a calculus) constitutes a valid proof, especially when using an advanced type theory as a metalanguage, has many facets. One paper that I found particularly helpful in coming to grips with this problem is

Pollack, Robert: How to Believe a Machine-Checked Proof. In: Sambin, G. and J. Smith (editors): Twenty-Five Years of Constructive Type Theory. Oxford University Press, 1998.

It's very much a sociological/human problem in addition to being a technical/theoretical one, as ultimately truth boils down to a sufficiently large number of (knowledgeable) people accepting a specification and a proof that some other "artifact" satisfies the specification, in a way that is expected or compatible with the goal that one wants to achieve with a particular effort in certified program development.

There is also an article that deals with some of the points raised in the above paper specifically in the context of Coq, but I do not remember the exact title and authors right now. A quick search should be able to provide the answer.

Proof of quicksort

I am mindful of the celebrated formal correctness verification of quicksort, which was subsequently found to have only 20 substantial bugs...

Link?

Link?

It's sometimes possible to

It's sometimes possible to write a program that works correctly on its first run, especially if the program is small. But often you are wasting some of your time, because to produce a fully correct program you have to read it again and again, to avoid syntax errors, type errors, etc. Compilers and IDEs are able to do this faster and more reliably than you. For practical programming it's better to leave syntax checking and some of the type checking to the compiler, and use your brain to look for the higher-level problems that the compiler/interpreter can't find.

Depends on the language

My usual process was to write the code (including preconditions), write the tests, go back and reread and fix the code (and tests if necessary), then run everything.

I don't recall ever writing anything in C or C++ that worked the first time. Occasionally I would get all the algorithms correct, but there are just too many ways to make a small mistake that wasn't obvious. In Ada I wrote stuff that worked the first time fairly often, for two reasons: 1) the compiler caught a lot of inconsistencies, and 2) when you did something "clever" you often had to be very explicit, which helps in catching errors. In Java I was able to do it more often than in C but a bit less often than in Ada. Python is not quite as good as Java in my experience. DISCLAIMER: I'm talking about getting things right the first time, not productivity!

As I see it, the hardest issue is correctness: getting the software to do what it should do. But in my opinion there is a second issue, validity, that is just about making sure the program is meaningful, not that it has any particular meaning. My experience suggests to me that the biggest difference between languages, or at least compilers (in terms of getting things right the first time), is how much support they provide for checking validity.

The problems I encountered with correctness often resulted from not understanding the desired behavior, forgetting an important aspect of the problem, or failing to appreciate some interaction between parts of the system. All issues where I don't really see languages being able to make a big difference.

Software being part of a larger system

Isn't it so that the software is usually part of a bigger system that often includes one or many humans in some sort of integrating feedback system? Proving that the system as a whole will work is then dependent on knowing what a human will or won't do in any situation, or on showing that the system will work no matter what the human does.

Testing could then be seen as a way to establish trust in that the software is "compatible with human behavior/psychology".

I think doing formal verification on a full-system scale (one that includes multiple concurrent actors, possibly humans) is beyond what is possible today, no?

For example, would it be possible to write code, on the first try, that gave guarantees that this didn't happen:
http://en.wikipedia.org/wiki/Accidents_and_incidents_involving_the_JAS_39_Gripen#February_1989
http://en.wikipedia.org/wiki/Pilot-induced_oscillation

Reusable specifications

Is this what "Compositional Correctness" should amount to?

Do I need to re-specify qsort when actual running time is important for my system's overall correctness (since I assume that is not something the spec refers to)? If I move to a parallelizing compiler that maintains the semantics of the language, do I need to change the specification, if running time is an issue?

Indeed

Having seen what some of the specifications of quicksort look like for verification, I'd like to see what they look like for something like Timsort, which has a complex implementation based on informal specifications like "sorted sequences are used untouched" and generally outperforms quicksort on real-life data.

Compositional by Construction

At the moment, you'd need to adjust the specification of quicksort if you want to reason about time, space, linearity or uniqueness, stability, laziness, incremental computation, optimizability, reactive or concurrent adjustment of the list (sorting a live query?), job control or interruption (and safety thereof), various forms of parallelism, distribution, recovery after disruption or node failure, and so on.

This isn't inherently a problem of Coq, but it does severely limit the utility of the current library of proofs.

I believe that the most viable solution is to specify a 'substrate' (a language or programming model) that ensures the compositional properties you want. Then you build a 'library' of proofs for desired operations on this substrate. You get for free (though, not necessarily efficiently) anything you proved in terms of those operations.

If we want widely reusable proofs, we would need to model them in a very ‘ambitious’ but ‘simple’ substrate - i.e. one with a large number of compositional properties, yet that could (at least in some variations) be encapsulated in a pure function similar to the ST monad. Functional programming by itself isn’t sufficient, nor is monadic programming.

I would propose Reactive Demand Programming (RDP), which allows compositional reasoning about commutativity, idempotence, resilience, concurrency, eventual consistency, optimizability (continuous compilation or specialization of slow-changing behaviors via reactivity; duplicate elimination and ad-hoc content distribution networks via commutativity and idempotence), latency and predicative timing, resource management and scheduling, job control, ad-hoc coordination of independent systems and agents (via anticipation), incremental computation, reactive live queries and control, visible security (object capability model, but with automatic revocation membranes), and distribution (orchestration with point-to-point connections, no central routing through app).

I also use a 'Vat' substrate that helps me reason about concurrency, batching, local determinism, global snapshot consistency, real-time, and performance isolation. The 'Vat' concept itself is atop a 'Host' substrate to eventually support reasoning about distribution, points of disruption, serialization, and relative clock drift (at the moment, though, my Host model is just a stub with a clock, and there is no communication between hosts). These would also be decent substrates.

I plan to eventually model RDP in Coq (or maybe F*, now that I’ve heard of it) but I’m not convinced about legacy integration at the moment. I’d like to support some visible demonstrations of the model, since I think seeing is believing to more people than a Coq proof.

The true question

I think the true question raised by Pescio in his post is: how do we evaluate a design method?

Today many fans of Test-Driven Development are stating that a set of tests is more important than architecture, modularity, and so on. No one (I think) would deny the importance of a test phase, but TDD bases the entire design on tests.

If I understand correctly, Pescio's experiment wants to show that "test, test, test" is not the only way to get software that works (and each context may require a different method). So, to me, he is asking: do we have a criterion which can help us rationally evaluate the quality of a design method (and so the quality of the design it produces)?

Correctness (verifiability in some contexts) is an important aspect of quality, but quality has many aspects (extendibility for example).

So we must ask ourselves if TDD or other design methods are improving the way we address requirements, changes, etc. Does TDD help us write better software? Is “writing test first” more important than “designing architecture first”?

Pescio says that “a real economic theory for software design” should help to minimize the cost of changes basing design on “a number of useful, small abstractions that do not change at all, or rarely do, because they're in frictionless contact with the nature of the problem”

related thread, if you have time to kill

...from the software craftsmanship list. Check the whole thread if you dare. Basically, these are somewhat pie-in-the-sky arguments in favor of figuring out how to avoid having anything that has to be locked down, that can't be changed easily later, that would be 'architecture'.