Finding and Understanding Bugs in C Compilers

In Finding and Understanding Bugs in C Compilers, Xuejun Yang, Yang Chen, Eric Eide, and John Regehr of the University of Utah School of Computing describe Csmith, a fuzzer for testing C compilers. The hard part was generating test programs that avoid undefined and unspecified behavior, so that any difference in output between compilers signals a real bug.

Compilers should be correct. To improve the quality of C compilers, we created Csmith, a randomized test-case generation tool, and spent three years using it to find compiler bugs. During this period we reported more than 325 previously unknown bugs to compiler developers. Every compiler we tested was found to crash and also to silently generate wrong code when presented with valid input. In this paper we present our compiler-testing tool and the results of our bug-hunting study. Our first contribution is to advance the state of the art in compiler testing. Unlike previous tools, Csmith generates programs that cover a large subset of C while avoiding the undefined and unspecified behaviors that would destroy its ability to automatically find wrong code bugs. Our second contribution is a collection of qualitative and quantitative results about the bugs we have found in open-source C compilers.
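
To see why avoiding undefined behavior is the hard part, consider signed overflow: a generated program that overflows an int has no defined meaning, so two compilers that disagree about its output are both conforming and the disagreement proves nothing. The sketch below (my illustration, not Csmith's actual code) shows the kind of wrapper a Csmith-like generator can emit around arithmetic so that every test program has a single, well-defined result.

```c
#include <limits.h>
#include <stdio.h>

/* Hypothetical "safe" signed addition: check the operands first and fall
   back to a well-defined result instead of overflowing, so the generated
   program behaves identically under every conforming compiler. */
static int safe_add_int(int a, int b)
{
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b))
        return a;   /* arbitrary, but defined */
    return a + b;
}

int main(void)
{
    /* A generator would emit calls like this instead of a raw 'x + y'. */
    printf("%d\n", safe_add_int(INT_MAX, 1));
    return 0;
}
```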

Two bits really stuck out for me. First, formal verification has a real positive impact

The striking thing about our CompCert results is that the middle-end bugs we found in all other compilers are absent. As of early 2011, the under-development version of CompCert is the only compiler we have tested for which Csmith cannot find wrong-code errors. This is not for lack of trying: we have devoted about six CPU-years to the task. The apparent unbreakability of CompCert supports a strong argument that developing compiler optimizations within a proof framework, where safety checks are explicit and machine-checked, has tangible benefits for compiler users.

And second, code coverage is inadequate for ensuring good test thoroughness for software as complex as a compiler.

Because we find many bugs, we hypothesized that randomly generated programs exercise large parts of the compilers that were not covered by existing test suites. To test this, we enabled code coverage monitoring in GCC and LLVM. We then used each compiler to build its own test suite, and also to build its test suite plus 10,000 Csmith-generated programs. Table 3 shows that the incremental coverage due to Csmith is so small as to be a negative result. Our best guess is that these metrics are too shallow to capture Csmith’s effects, and that we would generate useful additional coverage in terms of deeper metrics such as path or value coverage.

No software, alas

Unfortunately, the link to the software is not active.

Yes, sorry about that.

Yes, sorry about that. We'll get it released sometime in April. The paper on the web is a preprint and we didn't anticipate so many people seeing it before PLDI!

Wow. This is very

Wow. This is very interesting stuff.

Blog

I reckon that John Regehr should use this as an excuse to post a link to his blog, as it's been an interesting read for months with updates on their results. If he doesn't do it quickly someone will beat him to it... ;-)

Many of the bugs we found

Many of the bugs we found cause a compiler to emit incorrect code without any warning. Most of our reported defects have been fixed, meaning that compiler implementers found them important enough to track down, and 25 of the bugs we reported against GCC were classified as release-blocking.

I had a fun chat with Moh Haghighat about his fuzzer for Intel's compiler a year or two back. It seems that such tools will indeed generate many bugs that demonstrate buggy compilers... but compiler writers aren't interested in them (and I remember reading an LLVM thread saying the same thing, I think to someone in John's group: they didn't want dumps but triageable bug reports). Finding many bugs and finding fixable bugs are pretty different! (Unclear meaning, unclear defect, social considerations, etc.).

The paper discusses providing shorter programs, but that, according to at least those two cases, is not sufficient. Moh demonstrated one big step towards automatic triage: automatically guessing the problematic pass / settings.

This matches my short

This matches my short experience working in this area for a company. The biggest challenge was noise (false positives); the next biggest was demonstrating that the bugs found were malignant and worth the price of the tool. You need a really good sales team to convince management that these bugs are scary; the engineers are often just going to roll their eyes and go "so what."

I had a much better experience fuzz testing the early JVMs with the Kimera project. In that case, we found cases where the JVM verifier would crash, and were able to turn those into exploits (e.g., dumping out the contents of the client's environment variables to the web).

Sean and Leo, I think you

Sean and Leo, I think you exaggerate the situation. >95% of the hundreds of bug reports we've sent to GCC and LLVM have been fixed. With respect to these compilers there is no significant problem with false positives or unfixable bugs.

Commercial compiler teams approach bug reports a bit differently, asking the question: Do we have strong evidence that an important customer is impacted by this bug?

What are your thoughts on AI that can heal broken programs?

Like Stephanie Forrest's work.

Actually I wrote a blog post

Actually I wrote a blog post on this exact topic :)

http://blog.regehr.org/archives/383

Quick summary is that it's cool work but compiler bugs are typically nasty enough that I don't buy that their technique will work. I'd love to be proven wrong.

Vague

I think your example of regulating buffers is vague.

For example, is it okay for an AI program to understand the notion of "place" in a spatial programming language and optimize the placement of parts of a program across the Internet to minimize propagation delay if it notices two data centers at two ends of the globe talking to each other and requiring the obviously-many hops for round-trip communication?

Machine learning algorithms for data center load balancing already do this at a much smaller scale.

Quick summary: It's cool that you are reading closely related work to what you are doing, but disappointing that you are closed-minded about wanting someone else to prove you wrong. Prove yourself wrong.

My own conclusion is that I would rather write a machine learning system that solves the problem at the right level of abstraction, guided by human expertise. But then again, I am a fan of what I call the Kasparov Principle: Weak human + machine + better process is superior to a strong computer alone and, more remarkably, superior to a strong human + machine + inferior process.

I can't fix the vagueness

I can't fix the vagueness since I don't understand what's vague about what I wrote. Certainly machine learning for placing computations across a network is a great idea, if it works.

I generally do try to prove myself wrong. But it doesn't work when I'm right.

Okay

How about you e-mail Stephanie your thoughts and suggest a collaboration of some sort? This is what I meant by "Prove yourself wrong." Why read a paper, write a blog post about it, and go all the way up to the point of saying it would be great if these tools worked together, but not actually follow through on that idea? It's like your blog post culminated by singing that Tal Bachman song, She's So High.

They have continued to work on the idea since the ICSE 2009 paper you referenced. She gave some highlights in her SPLASH 2010 keynote, including a student who optimized assembly programs using the techniques.

As for spatial computing, I think my example needs to be sharper so that you can better understand it. I apologize for my own vagueness here.

Point taken. I'll think

Point taken. I'll think about emailing them.

John, I don't see much of a

John, I don't see much of a difference in the bar you need to reach, and perhaps you have reached it. In an OSS project, developers have more discretion about what to fix and perhaps there are many other developers who are able to fix bugs. Also, since everything is out in the open, they might have more motivation to fix those bugs.

But still, a bug report that comes from a user directly is probably going to have an impact on at least one user. A bug report that comes from a tool will more likely not have significant user impact, although such reports do provide good ideas of where to look for user vulnerabilities. In a world where perfection is unachievable and bug fixing is expensive, you must prioritize.

We often feel that even one bug is one too many, but I'm much more partial to Rinard's view on this. Rather than try to achieve the lofty goal of perfection, we should learn how to deal more gracefully with imperfection. However, this approach is difficult to imagine for a compiler: you'd like your successfully generated code to be what you intended it to be. It would be interesting to see what would happen if bugs were intentionally injected into a compiler: what would break, and how?

Rectifiers are a good idea with a bad communication protocol

Rectifiers are a good idea with a bad communication protocol.

What about rectifying the

What about rectifying the programs that were generated by a bad compiler? Anyway, it seems rectification would not care whether the bug was produced by a human or a compiler.

This is interesting but

This is interesting but sounds hard. I seem to need either a model of bugs or a model of bug effects, and it's hard to come up with these in the case of compiler bugs.

Model of bugs

These exist. Are you familiar with a popular model called Delta Debugging?

Or perhaps I misunderstood what "model of bugs" and "model of bug effects" entails.

We use Delta heavily in

We use Delta heavily in testcase reduction; in fact we have done (I think) four separate Delta implementations for C code, each with different strengths and weaknesses. Our PLDI paper had no space for this material but it's fair to say that about 40% of the overall problem we tried to solve (making C compilers better) is about testcase reduction.

My view of Delta is that it works well because it has no models. Delta is just a fancy name for a simple search algorithm.
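
To make "simple search algorithm" concrete, here is a minimal sketch of a greedy, Delta-style reducer (an illustration, not any of the four implementations mentioned above): it repeatedly tries to delete chunks of lines and keeps a deletion whenever an external interestingness script, here the placeholder ./interesting.sh, still reports the bug.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LINES 10000

static char *lines[MAX_LINES];
static int nlines;

/* Write the current candidate to disk and ask the oracle whether it is
   still "interesting" (e.g., still triggers the compiler bug).
   ./interesting.sh is a placeholder for a user-supplied script. */
static int still_interesting(void)
{
    FILE *f = fopen("candidate.c", "w");
    if (!f) { perror("fopen"); exit(1); }
    for (int i = 0; i < nlines; i++)
        fputs(lines[i], f);
    fclose(f);
    return system("./interesting.sh candidate.c") == 0;
}

int main(void)
{
    char buf[4096];
    while (nlines < MAX_LINES && fgets(buf, sizeof buf, stdin))
        lines[nlines++] = strdup(buf);

    /* Greedy pass: try ever smaller chunks; keep any deletion that
       preserves interestingness (removed lines are simply leaked). */
    for (int chunk = nlines / 2; chunk >= 1; chunk /= 2) {
        for (int start = 0; start + chunk <= nlines; ) {
            char *saved[MAX_LINES];
            int saved_n = nlines;
            memcpy(saved, lines, sizeof(char *) * (size_t)nlines);

            memmove(&lines[start], &lines[start + chunk],
                    sizeof(char *) * (size_t)(nlines - start - chunk));
            nlines -= chunk;

            if (!still_interesting()) {
                /* deletion broke the bug: undo it and move on */
                memcpy(lines, saved, sizeof(char *) * (size_t)saved_n);
                nlines = saved_n;
                start += chunk;
            }
            /* on success, retry the same position against the shorter input */
        }
    }
    for (int i = 0; i < nlines; i++)
        fputs(lines[i], stdout);
    return 0;
}
```

Real reducers iterate to a fixpoint and use smarter, syntax-aware deletions, but the core loop is exactly this kind of greedy search.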

Could make for a good blog post, then!

Thanks for your thoughts.

I have a draft blog post

I have a draft blog post sitting around somewhere called something like "Delta Debugging is Just Greedy Search". Hopefully I'll find time to finish it up.

I think it might make more

I think it might make more sense to rectify the compiler input, as opposed to its output. For example, if side-effecting expressions are buggy, we could use CIL to rewrite the code to avoid these.

It would be pretty easy to do the experiments here: take Csmith output and run it through various source-to-source transformations and see how this affects the observed incidence of mistranslations.
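
A tiny, hypothetical example of the kind of source-to-source rewrite meant here: if a compiler mishandles side-effecting subexpressions, a CIL-style normalization pass can hoist each side effect into its own statement before the code ever reaches the compiler. Both variants compute the same thing, but only the first exercises the risky construct.

```c
#include <stdio.h>

/* Hypothetical "before" code: a side-effecting subexpression (*p += 1)
   buried inside a larger expression, the kind of construct a buggy
   compiler pass might mistranslate. */
static int before(int *p)
{
    return (*p += 1) * 2;
}

/* The same computation after a CIL-style normalization: the side effect
   gets its own statement, and the arithmetic only reads a temporary. */
static int after(int *p)
{
    *p += 1;
    int t = *p;
    return t * 2;
}

int main(void)
{
    int x = 3, y = 3;
    /* both variants must agree; only the second avoids the risky construct */
    printf("%d %d\n", before(&x), after(&y));
    return 0;
}
```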

Sean I think you are saying

Sean, I think you are saying that compilers are somehow different from other programs in the sense that (at least in some respects) users and developers have a very low bug tolerance. Maybe not for floating point, for debugging information, or for diagnostics, but we really do expect well-formed integer code to be translated successfully.

Compilers are mathematical objects like crypto functions -- they need to just work.

Compilers are buggy as get out

I would love to have a fuzzer like the one you describe for any of the compilers I've worked on. I'm sure it would have helped with the overall robustness.

On every single one of them, there were plenty of cases where I ran across code that was fundamentally and horribly broken. Small things like calling lub instead of glb or testing lt(a,b) rather than lt(b,a), and big things like building a set that is potentially infinite. Such bugs were present despite all of them being in production use by many users. Martin Rinard is right, including for compilers.
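
The lt(a,b)-versus-lt(b,a) class of bug is easy to picture; here is a made-up C example (not from any of the compilers mentioned) where a comparator with its arguments swapped still compiles, still "works", and silently sorts in the wrong order.

```c
#include <stdio.h>
#include <stdlib.h>

/* Correct ascending comparator for qsort. */
static int cmp_ascending(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* The lt(a,b)-vs-lt(b,a) slip: identical code with the arguments swapped
   quietly sorts in descending order instead. */
static int cmp_swapped(const void *a, const void *b)
{
    int x = *(const int *)b, y = *(const int *)a;
    return (x > y) - (x < y);
}

int main(void)
{
    int v[] = {3, 1, 2};
    qsort(v, 3, sizeof v[0], cmp_swapped);   /* prints 3 2 1, not 1 2 3 */
    printf("%d %d %d\n", v[0], v[1], v[2]);

    qsort(v, 3, sizeof v[0], cmp_ascending); /* prints 1 2 3 */
    printf("%d %d %d\n", v[0], v[1], v[2]);
    return 0;
}
```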

Could you give more detail

Could you give more detail about "Martin Rinard is right, including for compilers"?

I'm just trying to figure out if you're saying something more interesting than "compilers are buggy."

Compilers are mathematical

Compilers are mathematical objects like crypto functions -- they need to just work.

Crypto functions are pretty straightforward when compared to compilers. A compiler will invariably have state and multiple passes, and the language compiled will have a non-trivial type system; even if the type system can be proved correct, proving a reasonable implementation correct will be impossible.

Compilers will have lots of bugs no matter how hard we try to be careful coders. And yet, we will continue to use those compilers. Yes, we need them to be perfect, but they are never going to be perfect, now what?

I suggest doing some kind of empirical study on this. Insert some serious worst-case flaws into a compiler that don't crash it but do generate executable code, and just document how the compiled programs fail. Does the world end? Or can life go on even when the compiler is obviously broken? What kind of flaws need to exist before real programs are significantly affected? I've always been surprised by Rinard's results when they do this kind of study; I don't think the conclusion is inevitable.

Compilers are mathematical and mechanically verifiable

A compiler will invariably have state and multiple passes, and the language compiled will have a non-trivial type system; even if the type system can be proved correct, proving a reasonable implementation correct will be impossible.

Compilers will have lots of bugs no matter how hard we try to be careful coders.

I think the CompCert project tells a different story. It is possible to have provably correct not-so-simple compilers for not-so-simple languages.

There is a huge scientific literature on the semantics of program transformations, compilation, etc. It is a well-specified problem with decades of verification work; IIRC, the initial ML project led by Robin Milner applied the LCF system to compiler verification.

It is reasonable to ask for programming languages that are also well-defined, in a sense that is amenable to formalization. I think that providing a formal semantics for a programming language will become a growing trend, in academia at least.

Of course, languages that were not designed with formal precision in mind can be horribly difficult to formalize. The present paper makes this point about C++0x (is it a gentle criticism of compiler verification, or of C++0x?). Yet people are working on formal specifications of various parts of C++, and making reasonable progress.

For a wider view on your question, I think current languages and programs are extremely fragile with respect to compilation errors. I can't remember a single program I've written where the failure of a boolean test couldn't completely break the program (all programs using an "assert or die" approach have this property). I would guess that a *randomly* chosen boolean test failing would also seriously endanger the overall correctness. Most programs don't have the redundancy mechanisms that "physical" engineered systems tend to use, even implicitly, to support partial failures. And it has been shown that even the redundancy techniques, when applied naively, do not increase confidence that much (cf. the Knight & Leveson experiment).
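
A hypothetical illustration of the "assert or die" property: the whole computation hinges on one boolean test, so a compiler that miscompiles just that comparison makes every run abort immediately, or worse, lets it continue into a division by zero.

```c
#include <assert.h>
#include <stdio.h>

/* One boolean test guards the whole computation. If a buggy compiler flips
   or drops this single comparison, every run either aborts ("assert or
   die") or proceeds to divide by zero. */
static int checked_div(int num, int den)
{
    assert(den != 0);
    return num / den;
}

int main(void)
{
    printf("%d\n", checked_div(10, 2));
    return 0;
}
```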

There is a niche where

There is a niche where formal precision is desired, and I guess those compilers are perfect. However, most languages that people use are not formally specified nor do they have formally correct compilers.

A standard compiler should achieve some amount of accuracy so that it works in practice; maybe it is something like 99% correct, with the remaining 1% covering corner cases and underused dark alleys of the language. My question, then: what breaks in practice when you have serious bugs in an otherwise reasonable compiler?

What breaks?

Lots of stuff breaks, but mostly sanity. :) That, and a lot of people can lose money.

Way back when I was a professional coder, my manager and I made a contest out of finding bugs in the Java compiler and VM. I started late, so we ended up 4 to 1.

The most maddening thing about these bugs is that they're very hard to pin down. Most programmers implicitly trust their compiler. It wouldn't even occur to them that something might be wrong with it.

For example, when we had a certain insurance claim routing problem reported, it took our team a week to figure out that it was the Java compiler's fault. The report started at the grunt programmers, who spent four days on it. Frustrated, they passed it to my manager. It took him two hours to figure out that it was a string concatenation problem that only showed up in very specific circumstances, some of which were apparently unrelated to the problem code. It took him another half day to verify it was a compiler bug, work around it, and report the problem to Sun.

Those grunt programmers weren't our most productive, but they could have spent their four days doing something more useful. Moreover, a member of our support staff and a member of our clearinghouse's staff spent days hunting down the misrouted claims. All were at least delayed; most misrouted claims were surely lost. We couldn't account for the ones that were misrouted more than a few weeks before the first report.

Certain kinds of software need to be more trustworthy than other kinds. Compilers should be the most trustworthy, because 1) they build all the other software; 2) we put so much trust in them already; and 3) it's so hard to figure out that some weird problem is the compiler's fault.

We fuzz tested the JVM

We fuzz tested the JVM verifier back in 1998 and found bugs that got us (well, at least my boss at the time, Brian Bershad) into the New York Times. Actually, it was very easy back then (I'm sure they fuzz test now, it's very cheap), but I would guess there are still bugs there to find today. Our approach was just reimplementing our own verifier according to the spec, then using comparison testing to find bugs in our implementation and eventually in the reference implementation.

Bugs are definitely bad, but they are kind of inevitable. We still should find them, but we also need to think about how we can make compilers (and the generated programs) resilient to bugs in general. This means making it easier to identify and diagnose problems when they occur.
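
As a hedged sketch of the comparison-testing idea (not the Kimera code, which I have never seen): implement the same function twice, drive both implementations with random inputs, and report any disagreement. The two popcount routines below stand in for the reference verifier and the independent reimplementation.

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* "Reference" implementation: naive bit loop. */
static int popcount_ref(uint32_t x)
{
    int n = 0;
    while (x) { n += x & 1u; x >>= 1; }
    return n;
}

/* Independent reimplementation: classic parallel bit counting. */
static int popcount_alt(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    return (int)((x * 0x01010101u) >> 24);
}

int main(void)
{
    srand(1234);
    for (int i = 0; i < 1000000; i++) {
        /* rand() may be only 15 bits wide, so assemble a full 32-bit input */
        uint32_t x = ((uint32_t)rand() << 17) ^ ((uint32_t)rand() << 2)
                   ^ (uint32_t)rand();
        if (popcount_ref(x) != popcount_alt(x)) {
            printf("disagreement on input 0x%08x\n", x);
            return 1;
        }
    }
    puts("no disagreements found");
    return 0;
}
```

The same structure scales up to whole compilers: compile the same random program with two compilers and compare what the resulting binaries print.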

Yes, we need them to be

Yes, we need them to be perfect, but they are never going to be perfect, now what?

"Never" seems like a pretty strong proposition. All this work under discussion is in service of making compilers perfect. We seem pretty far along, so how can one conclude "never" given the current state of affairs?

Historical evidence does not necessarily imply this conclusion any more than the historical evidence implied that human flight was impossible before the Wright brothers did it. CompCert may very well be to the Wright brothers what a future verifiable compiler construction suite will be to 747s.

All this work under

All this work under discussion is in service of making compilers perfect.

I assume that it is impossible to ensure that any non-trivial system is perfect; there will be bugs either in the spec or in the implementation. I don't think any researcher in the systems field believes that perfection is attainable. So I stand by "never" as being pretty accurate in this case.

CompCert may very well be to the Wright brothers what a future verifiable compiler construction suite will be to 747s.

Some problems are fundamentally unsolvable. I think a more accurate comparison would be with the halting problem, which is well known to be unsolvable. So we admit perfection is impossible to achieve and resort to heuristics.

Some problems are

I assume that it is impossible to ensure that any non-trivial system is perfect; there will be bugs either in the spec or in the implementation.

What does "bugs in the specification" even mean? What does "perfect" mean? I wouldn't call a verified spec and program that doesn't quite solve the desired problem buggy. Rather the programmer's mental model was buggy, and that's a class of bug we'll never eliminate through an external tool.

Some problems are fundamentally unsolvable. I think a more accurate comparison would be with the halting problem, which is well known to be unsolvable. So we admit perfection is impossible to achieve and resort to heuristics.

No, a statement analogous to the halting problem would be, "it's impossible to build a verifiable compiler construction tool for ALL possible languages". I agree with this statement.

You are saying, "it's impossible to build a verifiable compiler construction tool for ANY set of languages". That's a much stronger statement, and I disagree with it.

You are saying, "it's

You are saying, "it's impossible to build a verifiable compiler construction tool for ANY set of languages". That's a much stronger statement, and I disagree with it.

No, I thought I was clear on meaning any language that is useful. Of course, you can build a very formalized simple language for tasks where you want full assurance, like say for stuff that NASA does. However, even NASA probably doesn't do that very often; it's just too expensive and it's not really where their major pain point is (rather, bugs in software are).

I believe we will achieve a

I believe we will achieve a verifiable compiler for a useful language within the next 20 years--one that really is correct for all inputs (though we may not be able to prove it for all inputs) and also performs nontrivial optimizations. The progress on this front in just the last five years is tremendous.

Note however, that does not mean people will use this useful language or this compiler.

No, I thought I was clear on

No, I thought I was clear on meaning any language that is useful.

I disagree also that useful languages are not part of the set of languages this tool could handle. Certainly not ALL "useful" languages, but enough. The rest would require extending the proof search manually.

Of course, you can build a very formalized simple language for tasks where you want full assurance, like say for stuff that NASA does. However, even NASA probably doesn't do that very often; it's just too expensive and it's not really where their major pain point is (rather, bugs in software are).

But you're shifting the goalpost here. The context was "never", so not just the current context, but no future context could possibly have a toolset where a verified compiler could be built with reasonable effort. You really think NASA will *never* end up verifying their code, and there will *never* be a toolset which makes verification a reasonable effort?

Given the pace of evolution in verification and lightweight formal systems, this seems overly pessimistic. 10 years ago we would never have considered formalizing web interactions or web UIs that ensure semantically correct markup, yet you can now do both with reasonably simple Haskell and OCaml libraries. These require little to no additional effort on the user's part over writing HTML or PHP directly.

CompCert

I don't understand. Are you not aware of the [CompCert] work? Are you claiming that C is "not useful"? Are you implicitly considering that the tiny unspecified parts of CompCert (mostly parsing the input text, and printing the ASM instructions) make it not formalized enough?

[CompCert] http://compcert.inria.fr/compcert-C.html

I certainly don't think C is the best language we can have, nor the most useful for current tasks. But I'm a bit startled to still read, now that we have a fully certified compiler for a widely used language, that certifying a compiler is not realistic.

I agree that specifications may have bugs. Hopefully they're much shorter and simpler than the specified program; this is at least true for compilers.

Remark: I don't claim that CompCert will solve every problem C programmers may have with respect to their toolset. It does not support longjmp, for example, and doesn't say anything (for now; this is being worked on) about multi-threaded programs. I also don't think CompCert is the end of it all with respect to compilers: there are other alternatives developing quickly, and neat techniques, rather than final products, still being worked on (e.g., Adam Chlipala's work). Still, that's one very serious proof of concept.

Specification Bugs

What does "bugs in the specification" even mean?

Specifications can have all the same bugs software can have - including the trivial ones: syntactic, typographical, copy-and-paste. Specifications may be contradictory, contain design bugs. Large systems might be 'specified' by four-hundred page manuals with contradictions over 200 pages apart.

There is something of a regression problem in software verification... i.e. who watches the watchmen? The notion of 'executable specification' would not avoid need for software verification; instead, you would end up needing a meta-spec.

The only real goal is to minimize non-essential complexity and semantic noise, and to support just enough redundancy in how you express concepts that the computer can help you discover errors (whether those errors are in your software or your spec).

Sean mentions 'non-trivial systems'. To me, that says you've got large or deeply layered specifications with a LOT of essential complexity, and most likely: your specifications are partial, your requirements analysis is incomplete, your foresight is imperfect, and nothing but the version history is set-in-stone. This creates a non-trivial problem that your software and specifications must be somewhat 'adaptive', and is the whole reason we seek loose coupling and modularity.

However, languages are often 'trivial' systems. I don't mean 'trivial' in the sense that anyone can think of them - we are often bound by invalid assumptions that only seem invalid in hindsight. Rather, I mean 'trivial' in the sense that languages have a small, finite number of primitives with well-defined semantics. That isn't true for all languages - Perl or PHP, for example - but a spec for Brainfuck or Whirl or SKI combinators is trivial enough that you could prove it contains no contradictions, no 'specification' bugs, and prove an implementation correct. (The much more troublesome part is proving the verified language 'valid' - that it solves a useful problem. Or 'competitive' - that it's sufficiently better than the incumbent to justify learning a new language and reworking the old code.)
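
For concreteness, the complete operational specification of the SKI calculus fits in three standard rewrite rules (written here in LaTeX; this is the textbook definition, not a formalization from any particular paper):

```latex
\begin{align*}
S\,x\,y\,z &\;\rightarrow\; x\,z\,(y\,z) \\
K\,x\,y    &\;\rightarrow\; x \\
I\,x       &\;\rightarrow\; x
\end{align*}
```

A specification this small can realistically be audited by eye, which is the sense of 'trivial' intended above.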

So while I agree with Sean for software systems in general, I'm not convinced his argument applies to language design and compiler verification.

Aside: John Regehr's blog provides 6 levels of bug.

the programmer's mental model was buggy, and that's a class of bug we'll never eliminate through an external tool

Maybe not eliminate, but as language designers or API developers we can mitigate developers' 'buggy' mental models and help teach them more effective mental models that have been proven.

Constraints can help. For example, one might develop a compiler under a constraint of maintaining type-safety after each optimization pass. (Andrew Megacz's Generalized Arrows are an interesting study here.)

I've experimented with a capability-oriented approach where one of the early transformations is to provide all the ambient 'beneath notice authorities' via capabilities. These authorities include resource consumption - allocation, scheduling, energy. They also include effects such as indeterminism or delay. Then I am constrained to maintain these capabilities across optimization passes. To actually run the program, I can 'link' these capabilities into the virtual machine, and specialize based on the language implementation.

The use of capabilities in this manner forces me to more explicitly consider issues of time, space, etc. during optimization passes, which helps my mental models. It should also allow me to take better advantage of non-functional annotations in the code, e.g. to achieve real-time embedded programs.

Constrained by implementation

Specifications can have all the same bugs software can have - including the trivial ones: syntactic, typographical, copy-and-paste. Specifications may be contradictory, contain design bugs. Large systems might be 'specified' by four-hundred page manuals with contradictions over 200 pages apart.

Contradictions in the specification are not much of a risk with formal verification - if you can produce an implementation of the specification, there were no contradictions.

The other sorts of errors are more of a risk, especially if those local errors are introduced in translating from an informal to a formal specification.

Gaining confidence in specifications

It is clear that formal specifications (in the sense of: written in the form of some source that a formal theorem checker can verify) may have "bugs"; that is, they may in some situations fail to reflect correctly the intent of the specifier. This is inherent to the informal-formal gap between our image of the system and what we precisely describe to the computer.

It is exactly the same problem as bugs in the source code, only it has different effects. A buggy specification, most of the time, won't be "provable": you won't be able to prove that your reasonable program satisfies the specification, or (in the general case of theorem proving) you won't be able to write the proof of the formal statement. Happily, specifications and implementations tend to be written in totally different ways, which means that specification bugs and implementation bugs rarely coincide in a way that would let the buggy software still verify against the buggy specification.

Admitting that we are working in a domain where the specification is well-specified (we know well what we want to achieve) and an order of magnitude simpler than the certified implementation, what should we do to gain confidence in the *specification*?

For software, we have testing, which is a reasonable way to gain confidence. For specifications, I think some form of "testing" would be possible. As you mentioned, providing a satisfying implementation is a kind of "coherence test" for the specification. In the other direction, I think that providing modified versions of the software that (provably) do *not* verify the specification is a useful kind of "hardness test" (we check that the specification isn't trivial, i.e., doesn't accept everything). Finally, if we have devised a lighter specification in the past (for example, instead of total correctness, we have proved that the program terminates, or returns a positive result, etc.), we can show that the final specification implies the lighter specification; this means that we can trust the final specification to provide at least all the guarantees of the simpler, lighter specification.
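
A compact restatement of these three checks, in my own notation (not from any particular paper), writing \varphi(p) for "program p satisfies specification \varphi":

```latex
\begin{align*}
\text{coherence:}  \quad & \exists\, p_0.\; \varphi(p_0)
  && \text{(some implementation satisfies the spec)} \\
\text{hardness:}   \quad & \exists\, p_1.\; \neg\varphi(p_1)
  && \text{(a deliberately broken variant is provably rejected)} \\
\text{refinement:} \quad & \forall\, p.\; \varphi(p) \Rightarrow \varphi_{\mathrm{light}}(p)
  && \text{(the spec implies an earlier, lighter spec)}
\end{align*}
```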

Are there other kinds of tests that could be performed on specifications to gain confidence?

I think we could also say that having two independent implementations verifying the specification improves confidence in the spec. We could also hope for specifications to be shared and reused, provided in abstract libraries that would be used by different people for different purposes (more eyeballs, etc.). But those come with time, and are difficult to achieve for a single (group of) developer that wants to improve confidence in his verification work.

Usability

For any code which has further clients, in the context of a larger verification effort, the most natural test is simply to get on with writing the rest of the program, and see whether clients of this code can be proven to meet their own specification using the exported specification.

This should uncover trivial things like providing a "commutativity" lemma which says "f(x,y) = f(x,y)", and if a "wrong" specification is implementable and sufficient to verify the rest of the program, it's a weak sort of "wrong" (perhaps performance problems?).

For a complete program, you can do the usual sorts of acceptance tests.

Contradictions in the

Contradictions in the specification are not much of a risk with formal verification

Agreed. If you can express your program (or formal spec) in a language designed to either syntactically prevent or statically analyze for contradictions, then (barring some soundness problem of the expression framework) you'll not have any contradictions in your result.

Of course, there will be valid specifications that you cannot effectively express or verify in this language. Fortunately, language designs and optimization passes are unlikely to be excluded.

You don't have consistency for free, but when exhibiting a model

I don't think that was Brandon's point, and I don't think that e.g. Coq could be described as "statically analyzing for contradictions in a specification": a specification in Coq may be a tautology (it is true of all programs) or just false (it is false of all programs), and it's the latter case that would make a "contradictory specification".

What Brandon meant is that, once you have at least one implementation verifying the spec., you have shown that the specification is not "false" (contradictory). This is similar to building a model of a theory to prove its consistency.

(I have stayed a bit fuzzy in this discussion of what a "specification" means. I'm considering a system where you can have both programs and proofs. I consider a specification to be a predicate about some programs (a mathematical statement parametrized by programs), and a certification of a program, or correctness proof, to be a proof that the predicate is true of this program. Of course, more general notions could be considered (specification => theorem, certification => proof), but I think this special case is a useful starting point for a discussion of the "confidence" we have in specifications, as we can draw intuition from the "confidence" we have in the underlying specified programs.)

a bit fuzzy what a

a bit fuzzy what a "specification" means. I'm considering a system where you can have both programs and proofs

Compilers and Coq and human developers ask the same question: "Can I build a program that does what is described in this input text?"

The fuzzy, human classification of one blob of text as 'specification' and another as 'program' is based on heuristic estimates of purpose and context. We might mix program and specification together with techniques such as assertions, preconditions, postconditions, type annotations. We may combine program and specification more intrinsically via use of partial functions and dependent types. The distinction of specification and program is subjective and relative (but still useful).

The relevant point is: some text may contradict other text. Contradiction requires at least limited redundancy - e.g. two expressions describing or constraining the same property. Can we discover these contradictions? can we prove we found them? We need formality to do so.

Brandon was perhaps thinking 'Coq' when thinking of formalism, but that's just one language. My own point, to which Brandon was responding, is that 'formal specification' is just another word for 'program', albeit in a language designed to discover contradictions early in the software life-cycle (rather than at runtime).

I understand, thanks for

I understand, thanks for clarification.

The reason, I think, why a distinction between "program" (meant to be executed, i.e. interesting for its operational behavior) and something else (specification, types, or even a naive implementation that you don't execute but that you prove equivalent to the program) is useful is that it mirrors the distinction between the program and the implicit intent of the programmer(s). There is a discontinuity between the executed program and the intent, but all techniques that allow you to shrink the gap will increase confidence.

I reported two concrete

I reported two concrete instances. You somehow overcame the hurdle I believe I witnessed for yours or another fuzzing tool for LLVM, but there's a hurdle nonetheless. I also witnessed the triage problem for Microsoft and Macromedia/Adobe compiler teams -- even without fuzzing. Not having it is more of a surprise: fixing such a magnitude of bugs takes a lot of effort (unless they're shallow, redundant, etc.).

Introspecting on *why* the bugs got fixed is fascinating (and shouldn't be considered a critique of your work -- I don't know how it compares to other compiler fuzzing tools). The security and concurrency communities have benefited from such methods, in part, because of their critical nature; even then, they struggle with the hurdle, such as the preference for dynamic analyses that provide concrete examples (as in your approach). Even then, showing the bugs (with a concrete trigger) isn't enough in itself. Jeremiah Grossman's keynote at W2SP 2010 was eye-opening: clients paid his firm to fuzz their sites for security holes, and even after reporting them, they weren't fixed for years, if at all. Research in the line of Philip Guo's examination of Windows kernel bug triage provides insight: were the bugs simple (arguably what fuzzing would hit), were they bundled, was there a relationship between submitter and reviewer, did they arrive at the right time in the schedule, etc.?

As another concrete example, consider the memory-model community, which has been finding, and will likely keep finding, huge swaths of bugs in compilers using some really great techniques (Madan's PLDI paper is particularly great!). However, a common reaction is indifference. If you fuzzed for those bugs and found 1000... how much effort would we expect to be devoted to fixing them? Do we hook an open-source developer interested in memory models (e.g., due to prestige), or make a case for, say, ARM developers to provide patches, as they're more likely to get hit by these bugs? Or should we expect the same reaction as to your bugs?

In contrast with the

In contrast with the experience you report with C, random compiler testing seems to be effective and practical for CL compilers. Paul F Dietz's test suite [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.108.5188] (but particularly its random tester) has proven very useful for many implementations. As an SBCL dev, I know that it has provided us with reasonably useful test cases.

The fuzzer may be more useful than usual because it also includes a pruning pass. Of course, it also helps that a compiler bug rarely results in an actual core dump, since compilers tend to be written in CL and can thus actually be rewritten piecemeal, at runtime.

Blah

Too many excuses from Intel.

It boils down to provenance, right?

2010 LLVM Developers' Meeting

There are slides and videos [Desktop (180 MB) | Mobile (140 MB)] available of the presentation (Hardening LLVM with Random Testing) given by Xuejun Yang, one of the authors of the paper. There are also other slides/videos available from the 2010 LLVM Developers' Meeting page that people might find interesting.