Cost of provably-correct code

Some of you my find this discussion (especially the cost estimates) interesting.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.


I don't believe that provably correct code, has any useful meaning without provably correct specifications.


...provably correct specifications are called "code with passing tests," since there's no way to prove a specification correct (that is, that it specifies what the author intends)!

Specifications and correctness

I don't believe that provably correct code, has any useful meaning without provably correct specifications.

Why? It says that the code is correct relative to the spec. When there is a bug - i.e. a test fails to meet your expectations - you know what to blame. Since a spec usually is on a much higher level of abstraction, the search space is reduced significantly, potentially by several orders of magnitude.

And of course, there are domains for which provable correct specifications are possible. Compilers, for example.

To my mind it is about

To my mind it is about compartmentalizing risk. Sure, you cannot know for certain that your specification is exactly what was wanted or needed, but then you also may not know whether the CPU has a flaw in it, or whether the OS actually works exactly as intended. There are innumerable reasons why things could, in fact, go wrong. The reason to prove code correct is that you can then at least be sure that the code not doing exactly what the developers intended (where their intentions are laid out in a formal spec) is not one of them - if problems do occur you can focus your attention on things like the spec, the hardware etc.

It's also worth noting...

that most of the time when our computer systems have errors, it really is a fault in an application program, rather than the OS, or the CPU, or even the spec. Complaining about the problem analysis has a long and storied tradition in software engineering, but mostly the faults lie not within our specs, but within ourselves. Pretty much any error-reduction strategy that has a chance of working has to focus largely on coding errors, rather than analysis errors. (Plus, fixing coding errors is easy. There are so many of them, you hardly have to aim!)

I don't believe that

I don't believe that provably correct code, has any useful meaning without provably correct specifications.

The term doesn't appear to be defined in the cited conversation, either. It's an annoying term, because it facilely implies 100% confidence. But there might be something there anyway:

The desired example is this--suppose you have a compiler. The specification is a formal, mathematical, denotational semantics of the source language. Your compiler provably generates code that obeys the semantics. Of course the specification may be buggy, or the proof-checker might be buggy, but that's less likely than a coding error. Is the compiler "provably bug-free"? Of course not. Is the set of proofs better than a large set of unit tests? I don't know. I imagine proofs would be better, if the two were equally feasible (obviously not yet).

Now suppose you want to make a change, introduce an optimization in the compiler. You can make the change, update the proofs, and you are assured that your change doesn't break anything in the formal spec. Is that valuable?

If we had a way to formalize and prove high-level requirements, we could weigh the costs and risks to decide which requirements to bother proving. It could only make software reliability better. Certain very common requirements, like "it doesn't crash" and "it doesn't contain stupid type errors", happen to be mathematically tractable; many compilers now basically prove them for every program. That could be taken further. It could also be generalized and opened up so that users can define and prove their own rules.

Is the set of proofs better

Is the set of proofs better than a large set of unit tests? I don't know. I imagine proofs would be better, if the two were equally feasible (obviously not yet).

This is obviously a matter of the degree of certainty, or assurance, that you need for the unit. If running a very small selection of tests is adequate for your purposes then clearly unit testing is far more feasible. If you require a very high degree of certainty, say correctness over all possible valid inputs, then exhaustive unit testing will likely be far more time consuming and expensive than a proof. In between is a sliding scale of assurance level which variously make the addition of static types, design by contract, and formal specification attractive options as the easiest way to achieve that level of assurance. It's all a matter of how certain you wish to be.

What I would like to see is programmers actually provide, to the users, some indication of the degree of assurance they are willing to offer for code. That may be a lon time coming however.

No exhaustive testing

If you require a very high degree of certainty, say correctness over all possible valid inputs, then exhaustive unit testing will likely be far more time consuming and expensive than a proof.

In particular, since there is no such thing as "exhaustive unit testing", because the set of possible inputs typically is infinite.

Exhaustive Testing

One of the papers that reference the original QuickCheck paper do exactly that. They exhaustively test each piece of code.

I think it was done in Clean, but I'm not completely sure.

Wouldn't this be proof by exhaustion / exhaustive unit testing?


As long as you don't provide a link to that follow-up paper I cannot tell. ;-) There certainly is no such claim in the original QuickCheck paper. They only say that random testing has a high likelihood of covering all execution paths. But that is not at all the same as exhaustive testing. For starters, how do you know that you've made the right choice of execution paths?

The cost number...

...seems bogus to me -- $20,000/line? That's either crazy, or I got to get me a job writing provably-correct code. I could write a hundred lines of code, prove it correct, and retire! :)

Anyway, these days one thing I do is to do a correctness proof, and use that to guide the construction of the code. It works like "Okay, for the proof to go through, f(a,b) must equal f(a,b-1) + f(a-1, b), so that must be part of the definition of f!"


20k/line would require 200M for a brand new, 10kloc project, which doesn't do much by itself. I wonder if this is actually $20k per each line added to an existing, mammoth codebase. Still, the figure seems quite high.

Not *that* much...

There was an article on the people who write space shuttle software for NASA and IIRC the figures are not that far off -- I can't for the life of me find the article now, so I'm going by memory here. IIRC the entire shuttle code base was on the order of 10kloc, and every single change to a single line in the code base would be accompanied by a document of hundres of pages detailing the impact of the change on all the other code in the system. Such attention to detail is going to have an enormous cost in terms of man-hours. How they arrived at those 10kloc in the first place is another matter -- I don't recall the article saying anything about that.

Anyway, when you have billions and billions of dollars worth of hardware riding on the correctness of the software, $200M seems relatively cheap.

Not necessarily a good example

The process for developing shuttle software is extremely heavyweight, decidedly manual, and involves much more than just proving correctness. It wouldn't surprise me if that process cost a lot more than normal software development.

On the other hand, you have places like Praxis High-Integrity Systems, which claims to be able to to produce "correct-by-construction" code for a cost not that much higher than code produced using other techniques. Praxis relies heavily on the SPARK toolkit for automated code verification. I notice that the original post that was the source of this story included a response by someone from Escher Technologies - their Perfect Developer toolkit allows automated code analysis as well - refuting the claim that provably correct code would cost a fortune.

Some other kind of programming

If one has to supplement every line of computer source code with pages and pages of text, the main product of the programming activity is the text and the computer source code is more like manually generated object code.

Another way to drive up the costs of each line of source code is to throw away a lot of code: prototypes, cancelled projects, multiple implementations competing against each other, etc.

interestingly enough...

...nasa did not do apply proof technologies in any of the recent space missions. see this article for a non-too-technical explanation of test methods in the likes of static analysis and regression testing which have been applied.

on the other hand, automated software certification is a big topic in research there. for example see An Empirical Evaluation of Automated Theorem Provers in Software Certification for hard empirical data that shows that near 100% automation is possible with current technology, as the commercial plug in the referenced discussion claims.

the hot question might be if cost/benefit ratio is the final reason why classic techniques have been preferred over theorem proving in these recent projects.

I do agree that the price is

I do agree that the price is exaggerated, but I guess the question is then what proportion of programmers have the technical background necessary to construct provably correct code?


Although not directly related to cost, I found these quoted arguments against provably-correct code interesting:

Indeed, consider SQL injection attacks. They didn't exist 5 years ago, because no one had thought of them. Same with XSS bugs. Same with printf format string attacks. All of them are examples of processing user input without validation, but they are all really big classes of such, and they were discovered to occur in very large numbers in common code.

What Dana is trying to tell you is that some time in the next year or so, someone is going to discover yet another of these major vulnerability classes that no one has thought of before. At that point, a lot of code that was thought to be reasonably secure suddenly is vulnerable.

To me this argument sounds like it comes from the Penetrate and Patch mentality. Thoughts?


It's certainly a reasonable belief given a certain set of assumptions: software being written in non-memory-safe languages with weak type checking. Once you start addressing those assumptions, most of the point is mitigated dramatically. If I write a program in Flow Caml and deploy it on OpenBSD, I wonder what my actual exposure would be?


With good engineering practices (proper modularization and security) and tools (languages and code verification) exposure can be reduced to near zero. With good engineering, even if one part of the program breaks, the rest of the program is isolated and secure. With good tools the chance of programmer error is minimized.

not sure

The older works on INFOSEC taught us to handle arguments properly, control information flow, isolate/restrict untrusted code, properly authenticate, separate code/data, etc. Designs that allow for SQL injection require violating enough security engineering principles that they probably wouldn't have been possible if people were following such principles. I think it's more a problem of extending insecure legacy applications and development approaches, then inheriting their weaknesses.

Penetrate and patch they often try to create something, then wait to see if there's flaws, fix, and so on. I think what gives rise to SQL injection is fundamentally worse as we can be certain (based on the past) that adding garbage to garbage will result in more garbage, but someone does the INFOSEC equiv anyway.

Big Theorems

With correctness proofs, you often run into the problem that the theorem you aspire to prove is approximately as lengthy and complicated as a well-written but unproven implementation of the program itself.

For example, imagine the sort of theorem that a word processor would need to obey -- there would be sub-theorems concerning how input, display, printing, formatting, etc need to be handled.

If your theorem is as complicated as a simple but unproven implementation of a program, then the former is no better than the later, because you can just as easily have bugs in your theorem (e.g. causing the wrong things to be proven) as bugs in your program (causing unexpected runtime behavior).

Bang for your buck

When it comes to correctness proofs, I think it's important to focus on how critical correctness is, and what confidence you need in the correctness of a particular application. Having made those assessments, you are then in a position to decide what kind of proof of correctness you require. This is not unlike any other engineering project, where the level of analysis and testing that is done depends on the criticality of the system or subsystem.

For many applications, you might be able to gain sufficient confidence in the system through testing (Theorem: the system does what I want; Proof: by scientific induction on test results). But there may be some critical parts of the application that warrant more rigorous attention. In the case of a word processor the UI and display might be adequately verified by testing, but you might want to provide a much more rigorous proof that a document will never become corrupted in the event of a crash (god knows I've often wished that Word would provide such a guarantee).

Of course, automated tools such as type systems, extended static checkers (e.g. SPARK), and counter-example generators (e.g. Alloy), model-checkers, and theorem-provers can also help alleviate the burden of more rigorous proofs of correctness. Which of these tools get used, and how much of the system they are applied to, will also depend on the level of confidence you need in system correctness, and the costs involved in the use of the tools. Not that these tools are infallible either (as you rightly point out). But they can give you a lot more confidence in the correctness of a design than you might otherwise have. And that's really the point: there's no binary decision between proving correctness or not, there's a sliding scale of confidence in correctness.

Redundancy is of the essence

If your theorem is as complicated as a simple but unproven implementation of a program, then the former is no better than the later, because you can just as easily have bugs in your theorem (e.g. causing the wrong things to be proven) as bugs in your program (causing unexpected runtime behavior).

That is true. However, the redundancy alone is likely to uncover many bugs, because it is unlikely that you introduce exactly the same ones in both formulations.

Embrace and Extend

Tim Sweeney: If your theorem is as complicated as a simple but unproven implementation of a program, then the former is no better than the later, because you can just as easily have bugs in your theorem (e.g. causing the wrong things to be proven) as bugs in your program (causing unexpected runtime behavior).

Yep. This leads me to two thoughts:

  • Oleg Kiselyov and Chung-chieh Shan are on to something with Lightweight Static Capabilies. In particular, points 1 and 2 from the abstract seem relevant: construct a tiny domain-specific kernel of trust, proving its properties formally, then rely on an expressive type system to extend that proven trust safely to the rest of the application.
  • The theorem and "simple but unproven implementation" dichotomy might be artificial: the Coq theorem prover, for example, supports not only extraction of programs from proofs, but also of just modules. This would seem to dovetail nicely with the prior point. Of course, the catch is that your target language has to be an extraction target, which is great for us O'Caml/Haskell/Scheme programmers today, but not so much for us C++ programmers. Call it "an important future direction." :-)

Update: Mark Miller, characteristically, makes my first point much more eloquently here.

But there is still an advantage.

I agree with what you said, but I think there is still an advantage to provably-correct code: logical contradictions may be caught at compile-time, rather than at run-time. Of course this level of proof does not mean the specs are correct, but it will help catch yet another class of errors.

Old-fashioned proofs

Is nobody mentioning the down-to-earth correctness proofs that are fundamental to computing? Any book on algorithms or paper by Knuth or Dijkstra is chock full of practical proofs that don't even have to be verified by a machine because they make such convincing arguments by themselves. These aren't proofs of whole programs in the modern sense but they are proofs of many necessary parts of them.

Surely some celebration and cheering would be appropriate :-)

Possibly interesting link: EWD1051 "The next forty years".

Modulo Knuth's Realism

Along the lines of "I have only proven it correct, not tested it." The idea that something doesn't have to be verified because they make convinving arguments themselves misses several points: First, there can be a lot of these things, and humans have better things to do with their time than to keep checking. Second, humans can make mistakes even on very simple arithmetic. Simple human-driven proofs were a stupendously important place to start, and it is good for humans to exercise their brains and understand what is going on, but it seems to me they aren't really sufficient in the modern world.

Exact citation

The exact citation is Beware of bugs in the above code; I have only proved it correct, not tried it. See the last item in Knuth's FAQ.


Apart from the humor value in Knuth's quote, it's also important to remember how massive your Trusted Computing Base is even if you've proven an algorithm correct: the implementation in language X has to be correct; the compiler for language X has to be correct; the runtime for language X has to be correct; the OS on which the runtime runs has to be correct; the hardware on which the OS runs has to be correct... I've always understood Knuth to be making a "ha ha, only serious" comment.

Update: I meant that Knuth didn't mean to imply that the proof was useless, but that the stuff that couldn't be proven could still hurt you. If we had an automated proof with extraction to code in a language with a formal semantics and a certified compiler targeting an OS such as Coyotos, chances are we really wouldn't have to test the routine so developed, because the unproven TCB would be... um... the laws of physics?

Proving the universe correct

[...] we really wouldn't have to test the routine so developed, because the unproven TCB would be... um... the laws of physics?

It's not so much the "actual" laws of physics, but our theories about those laws, which of course we can never prove correct — although we can test that they're sufficiently reliable approximations. Proofs can be useful and important, but it all comes down to testing in the end...


I keep forgetting to remind people that in the presence of a phase distinction, e.g. compile-time vs. runtime, proof-at-compile-time and testing-at-runtime are dual: proof-at-compile-time can prove the absence of some undesirable properties that testing can't prove the absence of, while testing can get asymptotically near proving the presence of desirable dynamic (hence unprovable at compile time) properties, depending upon how good the test coverage is—and there are some nice tools to help increase test coverage, there's the success of randomizing test paths, etc. One ignores either phase at their peril. So my semi-tongue-in-cheek comment above obviously ignores the testing-for-dynamic-semantics aspect. One of the more interesting aspects of using a type system with dependent types is: how much of your semantics is still dynamic in such a system, assuming that either you've got inference of principal typ(ing)s or have written out very specific annotations? Still, some of it obviously will be, else your program would simply normalize to a value (42!) at compile time.

"Dynamic properties"?

I'm afraid I don't get it. How can a property - in the logical sense (and that's what you care for wrt testing) - be dynamic? It may be context or time dependent, but that just turns it into some form of implication or modal statement, it does not make it "dynamic".

Language Design Terminology

Good point. I meant "dynamic semantics" in the sense that the term is used in programming language design, where the design is broken into "static" and "dynamic" semantics. This definition is admittedly informal, and you're quite right that it's possible to model the dynamic semantics in an appropriate logic. I just haven't seen a language design that actually does, and even if I had, the point remains that this semantics exists on the runtime side of the phase distinction, so testing for desired results, particularly, e.g. in the presence of data from external sources (user input, etc.) remains necessary.

Actually I would say that

Actually I would say that acording to Gödels incompletness theorem there would always exist properties in a programing language that you cant prove, and you might want to test those. (in theory)

Real Programming Languages vs. Algorithmic Information Theory

Practical programming languages are designed not to run into the incompleteness theorems: they can compute anything that can be computed. But the question as to what a language that was designed to run into the incompleteness theorems—indeed, to be able to say new things about them—would look like is interesting. Thankfully, we don't have to guess; Gregory Chaitin developed a dialect of Lisp to support his books on Algorithmic Information Theory, which can be found here. Basically, Chaitin works with self-delimiting bit strings and has a "try" operator with a time-limit to cut off evaluation. He then constructs proofs out of these blocks, among others. Fascinating stuff.

The programmers

It's not so much the languages, but the programmers, though languages usually do encourage "safer" practices, e.g. structured control constructs. General recursion, a feature most languages have, and the only (looping) control construct of Haskell is about as "free" as you can get, so I would not say that languages are designed not to run into incompleteness. As far as I can tell, short of language implementation, the problem domain itself usually is rather distant from incompleteness. Albeit, Stephen Wolfram suggests that in reality universality is much more prevalent than we normally think.

rather obviously...

The TCB would then be the OS, the drivers, the hardware and any grey areas between the hardware and software on the target platform. Good news is that there's been plenty of work on verifying each of these. ;)

how about a realistic estimate?

Scavenging wisdom from old IT and INFOSEC papers is a hobby of mine. I have no idea why these cost estimates are that high. Smith's "Cost profile of a highly assured secure operating system" regarding LOCK project. LOCK combined type enforcement, assured pipelines, controlled capabilities, software security kernel, trusted software, and a hardware enforcement mechanism. There was also COTS integration and emulation work.Pretty complicated compared to other early efforts.

"Tasks associated with the A1 formal assurance requirements add-
ed approximately 58% to the development cost of security critical software. In exchange for these costs, the
formal assurance tasks (formal specifications, proofs, and specification to code correspondence) uncovered
68% of the security flaws detected in LOCK’s critical security mechanisms."

That's a 58% increase. (Flaw detection varies per project.) Read the paper for more specifics down to labor hours per project activity. (Great data!) The A1 costs are pretty reasonable considering they were using very primitive technology with few worked examples of A1 security engineering.

The GEMSOS A1 platform, which I think OP's article implies, cost $15 mil to develop and NSA said they spent $50mil evaluating it. I ignore the other as it's probably govt waste. DO-178B type software is supposedly around $10,000 per line of code (incl evaluation IIRC). I think LOCK was around 20-25mil total but don't quote me. Productivity (LOC/hr) was 9 regular and 1.8 for TCB code. The VAX security kernel doesn't give costs: their continuing to fund the project and release additions at least annually implies it was acceptable. However, Lipner did say one of the reasons they cancelled it was that it took 2-3 quarters to do a major change on it due to A1 requirements. Dramatically slow time-to-market often= death in commercial IT space.

Another reason the quoted comparison is meaningless is that high assurance coders write their code differently, keeping costs and verification in mind. Code designed for verification was usually low lines of code, simple enough to apply formal verification, modular/layered, and designed for reuse in future projects where possible. Even today, with DO178B and things like CompCert, they keep many of these principles with the result is a bunch of components that end up in other projects, further reducing costs.

The final point some other commenters made already. Modern techniques can give you many of the benefits with few of the costs. Typesafe languages, isolation tricks like NaCl, autogenerated parsers/protocol libraries, usable functional programming, productive verified programming environments like SPARK, advanced static analysis of low-level code and sometimes problem specific platforms/components all make it SO MUCH EASIER to design software/hardware with few issues. Especially if you keep it simple.

Conclusion: (1) the often high claimed estimates were inflated and included evaluations most firms can ignore; (2) with right strategies, real products were made with reasonable cost vs assurance tradeoffs, but horrible time to market (see Lipner on Shipping vs Security); (3) we have much better tech to work with, much of it free. (C) People looking to build robust systems with plenty of "from scratch" code should consider current "Correct by Construction" and "Provably Secure" research, tools and products.

Secure vs. Evolvable

Crispin's cost numbers are a bit misleading. The misleading part is that there is a shift of costs that the number doesn't account for: lifecycle cost is reduced by verification. While the development cost per line is higher, the lifecycle cost is not. As with any other cost comparison, you can win the discussion by shuffling the money into appropriately selected buckets and then applying a naive test. Crispin certainly doesn't do this intentionally, but that is, in effect, what is happening.

That said, there are a couple of legitimate issues hiding in what he is saying. The first has to do with where cost is allocated in the development process and how that relates to risk. An awful lot of software products either never see the light of day or fail early. One effect of formal (or rigorous) methods is to front-load the cost of development, with the effect that failure costs a lot more. That increases risk.

The other point is that verification relies on a relatively fixed specification. During my stint at Microsoft, I had a brief conversation with Eric Rudder on the problems that arise from this. Eric opined that verification was irrelevant to Microsoft operating systems and software. The problem, in Eric's view, is that these systems need to be able to evolve nimbly in order to be commercially successful. Verification produces high-reliability results, but at the cost of "freezing" the specification. To the extent that this impedes change, the price of "winning" the verification may be loss of commercial success. While it's true that Microsoft operates by churning standards rather than by excelling at them, and that this makes a fixed specification counter-strategic, there is a lot of truth in what Eric is saying.

When we enumerate successfully verified systems, it's worth noting that almost all of them are fixed-function systems whose requirements have involved very slowly - perhaps even glacially. While those systems have been highly reliable in the field, the security results are much less clear - those systems can be broken. The reason for this is simple: reliability is something you can write a spec for. Security vulnerabilities are what happens when somebody else looks at your spec in a way you failed to consider, or takes advantage of something that you failed to notice. If you didn't notice the question, you can hardly verify that you have the right answer.

We need a middle ground. Something that lets us build more accurate software without giving up the capacity for nimble change.

Secure and Evolvable

Security isn't as much about 'correct' code as it is about comprehending and controlling propagation of potential damage (whether it be malign or due to error). This still provides room for evolution, for nimble changes, albeit constrained to securable frameworks and architectures. With security, the code isn't necessarily verifiable as correct, but at least it isn't broken in terrible ways.

Correctness often requires a more global view - more 'context' because code can only be correct in a given context. Unfortunately, the techniques that support verifiable code often require a white-box global view, which hinders evolution for intermediate components.