A few billion lines of code later: using static analysis to find bugs in the real world

Al Bessey, Ken Block, Ben Chelf, Andy Chou, Bryan Fulton, Seth Hallem, Charles Henri-Gros, Asya Kamsky, Scott McPeak, and Dawson Engler. "A few billion lines of code later: using static analysis to find bugs in the real world", Communications of the ACM, Volume 53, Issue 2, February 2010, Pages 66-75.

How Coverity built a bug-finding tool, and a business, around the unlimited supply of bugs in software systems.

This is a fascinating piece by Dawson Engler & co. on their experiences in commercializing their static analysis research through Coverity. It's an entertaining read, with many interesting anecdotes from various customers. But it also contains a number of useful insights about the difference between a research tool and a commercial product, the kinds of static analyses that do and don't make sense in a commercial context, and the multitude of problems caused by the lack of programming language standardization:

Checking code deeply requires understanding the code's semantics. The most basic requirement is that you parse it. Parsing is considered a solved problem. Unfortunately, this view is naïve, rooted in the widely believed myth that programming languages exist.

There's a lot of useful information in there for anyone interested in industrial-strength static analysis. There are also a lot of worthwhile things to keep in mind if you're designing a programming language, and want to make sure it's as friendly as possible to future static analysis tools.


The illusion of languages and the reality of implementations

Unfortunately, this view is naïve, rooted in the widely believed myth that programming languages exist.

Ha, I've heard that quote before, maybe in some slides from a Coverity presentation or something. It is quite fitting! I wanted to try Coverity on a large program I worked on, to see if it was at least possible to write checks for a locking-related issue that introduced crash bugs into every second release. However, I found out that even Edison Design Group's front end (which Coverity uses) doesn't support Borland C++, and certainly not ancient versions of Borland C++.

Not necessarily

They mention in the article they transform the input before feeding it to EDG because there are a lot of compilers EDG doesn't handle well.

binary analysis?

The article is really funny and probably anyone who has ever attempted code analysis on larger scales could sympathize :-)

I wonder if, in light of these problems, other tool-makers have attempted binary analysis? I know the security analysis community has tried this route with some success (but then, a buffer overflow is probably easier to detect in assembly than in code ;-)

Re: binary analysis?

I wonder if they have tried to transform object code or assembly back to C code and then analyze it?


The article does mention that they've had far fewer problems with analyzing Java and C#,

...since we analyze the bytecode they compile to rather than their source

Whether or not you'd get the same advantages by analyzing object code or assembler is difficult to say. Once you get outside the desktop, there are so many different architectures out there to contend with that you might just end up back in the same boat you were in with C. In contrast, the JVM and CLR make for a pretty limited set of targets.

that's me

anyone who has ever attempted code analysis on larger scales could sympathize

The article had me cracking up. I worked on a multi-faceted static analysis system, a little before Coverity started up. I and one or two others put together a bug reporting system similar in application to what the article describes. Of the difficulties mentioned in the article, we had encountered or addressed several of the basics.

It was gratifying to read that Coverity has a team of engineers dedicated to one aspect (parsing language extensions) of what we were attempting.

[edit] Also, it's quite impressive how they tailor their parse system and bug identification to meet customer needs. It's not easy to cut out a checker that has a tendency toward false positives when you know it is also good at pointing out true bugs.

Thanks for the nice

Thanks for the nice endorsement!

Tom Reps' company does binary. Which is cool for sidestepping syntax issues. The flip side, unfortunately, is that you lose a lot of semantic information (not always easy to reconstruct), making error reporting/finding more difficult. For example, we even hate the C preprocessor cpp since it systematically boils away useful semantics. E.g., MAXBUF_SIZE becomes 255, etc.


Interesting article indeed. I'm impressed by how calmly they seem to cope with all the absurdity they encounter.

Needless to say (in this forum at least), it's still depressing what incredible amounts of intellectual and monetary resources are wasted on problems many (most?) of which wouldn't even exist if people just used civilized languages. Case in point: I counted about 13 examples of bugs in the article, and arguably only one of them could occur or go unnoticed in a language that is properly specified, typed, and memory-safe. And even this one bug (the dead for loop) would probably be avoided in a language encouraging more high-level abstractions (like HOFs).

Safety v. convenience

problems many (most?) of which wouldn't even exist if people just used civilized languages

These tools are intended to analyze existing code with little or no additional effort. To the extent that untyped, memory-unsafe languages are more convenient than languages with strong typing and safety guarantees, static analysis tools are useful.

Of course, at some point it pays off to refactor/rewrite code in order to take advantage of stricter features: as the article points out, the output of static analysis tools can be overwhelming to the point that customers refuse to upgrade to newer versions which will report more issues.

Prolonging life

Yes, agreed. But my worry is that seemingly helpful tools like the one described may have the perverse effect of prolonging the life of broken code bases and languages - especially if these tools are pragmatically tuned to fit the mindset of the more ignorant among their users, as described in the article. Maybe it would be better if these artifacts collapsed sooner rather than later?

Also, obviously, people continue starting new projects in utterly inadequate languages, and the availability of tools like this will likely feed the belief that that is a good idea.

inadequate tools

I rather doubt that things like Coverity will make much difference in either direction. The societal preference for inadequate tools is rather deep, I'm afraid, and a static analysis tool here or there barely makes a ripple.

I wonder if the coming mainstream arrival of F# will do something to change the world. In the meantime, I console myself by using better languages in my own work.


Fifteen years ago, the dominant programming language for commercial use was type-unsafe, memory-unsafe, concurrency-unsafe, and had a specification filled with words like "implementation dependent" and "undefined". A bunch of people hated it.

Today the dominant programming language for commercial use is type-safe, memory-safe, has a reasonable-if-not-great concurrency model, and a set of very solid and reliable specs. A bunch of people hate it.

Progress is occurring. It just takes a bit.


The cynic in me says that, given the invariant, we must have been fixing the wrong problems for fifteen years...

Andrew, I think you are

Andrew, I think you are right (and this comes from least cynical part of me).


Ehud, Andrew, any thoughts on how to fix the right problem?

Some rambling thoughts (no guarantee of quality)

Given that many smart people have been trying to "fix" a recurrent / constant Software Crisis for many decades I would suggest that there are only two reasons why the wrong problems have been fixed:

1. It is very hard to establish what the right problems are.
2. When we do find the right problems they don't appear to have a solution.

Personally I fall partly into category one, and partly into category two I'm afraid :) I read Dave's comment as a description of the move from C to Java for commercial systems. It does occur to me (now writing this second comment) that in this specific scenario there is a third, much more likely reason:

3. The "bunch of people" is different in the two cases.

Many of the issues that people had with certifying C code have been solved, as Dave pointed out, by trying to certify Java code instead. In exchange for greater safety we seem to have sacrificed performance (in a shrinking set of cases) and conciseness (although possibly it just seems that way because I write bad Java code).

So you could see Java as the logical evolution from C that allows people to write imperative code in a familiar style, have some inbuilt primitives for organisation and modularity, without sacrificing the "simplicity" of an imperative / procedural language. As a result of picking a different tradeoff between performance and safety, a different set of people dislike the language. In some sense this evolution is predicated on several assumptions:

1. A mainstream language must be imperative otherwise people won't understand it.
2. A mainstream language must be imperative otherwise it won't be fast enough.

Accepted wisdom on the first point has definitely shifted over fifteen years, and will continue to shift over the next fifteen years. This is partly what I meant by the "wrong set of problems". Giving somebody access to mutable state is like handing them a chainsaw and telling them to trim the flowerbed borders - it's a powerful tool, but not necessarily the best one for every job. This is now a commonly recognised axiom for language design. There has been substantial progress in increasing the set of problems that have a practical solution in a declarative language, and in finding new ways to hide state in imperative languages using declarative constructs. We can hope that these two approaches meet in the middle somewhere.

The second point is harder to nail down, although there are results in specific contexts where declarative languages can win. Our last paper showed that wrapping a functional language in "imperative camouflage" for the programmer actually beats a real imperative language hands down. But we are only looking at small and simple programs at this stage, so it is very unclear how the approach would scale to "real" programs.

different take

i think it is to a large degree a result of the economics of the situation: there isn't enough pressure to do it better the vast majority of the time.

2 cents

There are many issues, and I am certain I don't have a very original insight to add. But fwiw, I think one thing is that the problems are usually related to programming-in-the-large (which is not sexy to study, and the solutions that have been suggested tend to be even worse than the original problems), while at the same time it seems more and more functionality can be achieved by programming-in-the-small. This leads to approaches that are based on extending what can be done on this level (~expressiveness), while real issues having to do with the challenges of programming-in-the-large remain to a large extent unaddressed. I think a lot of the advances that were mentioned are tremendously important, by the way, but they distract from SE issues that must be addressed. One great thing that happened, and should not be overlooked, is that more and more we can assume garbage collection. That helps both in-the-small and in-the-large programming.

can assume garbage collection

Garbage collection is a great advance for some application areas. I've been a happy user of GC'd languages for 35 years where appropriate. But this GC comment disappointed me... My day job is embedded systems, mostly medical devices. I thought the thread was pointing PL researchers in directions that would help me!

The development tools and languages for deeply embedded processors are abysmal. These systems often can't justify using dynamic heap memory allocation at all, never mind GC. Training programmers in Ada, even when it is available, seems to be a non-starter. The processor manufacturers are not behind it; they promote C, and my programmers know C.

So, we use rules and tools to try to tame C. We tried to use Coverity, but after several days of effort with on-site help we could not get it working with our (very standard for embedded ARM development) IDE.

Research languages for safe embedded software don't seem to be making it to market, whether open source or commercial. PL researchers should be paying attention to these failures. Lives are at stake, literally. Embedded software is everywhere: just wait until you wake up in an ER or ICU and see all the devices around you programmed in C.

Just use Ada.

Just use Ada.

I'm not that pessimistic

I'm not that pessimistic. I think if we were still using the same language as 15 years ago, things would be much worse. So at least a few of the "right" problems must have been fixed. And in 15 years' time we will hopefully be using even better languages on a broad scale, and still there will be people hating them (for good reasons). I think we are eating up any kind of progress in this area fast (and I don't mean research progress), and will hit boundaries again, but on a new level.

totalitarian principle

Couldn't agree more with Ingo — this is the funniest piece that CACM has published in years. I saw someone reading it on the bus the other day. Waited for the guy to start cracking up. He never did.

Hilarity aside, serious students of applied murphology will immediately recognize this paper as an important contribution to their field of study. The entire piece can be summed up in two words: Gell-Mann's Dictum. (Co)verily, I say unto you, Whatever isn't forbidden is required. The paper is chock full of examples demonstrating the (co)verity of this principle. Exhibit A:

Many (all?) compilers diverge from the standard. Compilers have bugs. Or are very old. Written by people who misunderstand the specification (not just for C++). Or have numerous extensions. The mere presence of these divergences causes the code they allow to appear. If a compiler accepts construct X, then given enough programmers and code, eventually X is typed, not rejected, then encased in the code base.

I wonder what bus that was...

I wonder what bus that was...

the bus

Not the enterprise kind. Just a regular MBTA one.

I think I can "meta" that...

If a compiler accepts construct X, then given enough programmers and code, eventually X is typed, not rejected, then encased in the code base.

If a compiler^H^H^H^H^H^H^H^Hstatic analysis tool accepts construct X, then given enough programmers and code, eventually X is typed, not rejected, then encased in the code base.

So I think we need to tweak Gell-mann's dictum slightly...

Given any construct sufficiently complex, then the variant, no matter how wrong, that passes all checks in the build system becomes The Pattern to Copy and Paste throughout the system.

Why? Because any time something vaguely similar recurs, someone will say, "We fought that one for ages, the build system kept rejecting it. Eventually we got the tool to shut up... copy the code in file X."

Moral of the Story: The Silence of the Tool after a long battle does not necessarily imply correctness of the code. Often it merely implies someone has found a way to switch off or sidestep all warnings.

alternative summary

Re: comment-56514:

The entire piece can be summed up in two words: Gell-Mann's Dictum.

An alternative two-word summary would be "social processes".

One of which...

...is to observe, with Jim Horning, that "a proof is a repeatable experiment in persuasion," and to leverage the de Bruijn criterion so that the code that the society has to come to terms with (however they go about doing that) is small and simple enough to be tractable.

That is to say, we shouldn't be surprised when even one person elects not to trust a completely opaque decision procedure, let alone when a society of such people do. But that tells us nothing about the value of proofs, and a great deal about the value of opaque decision procedures.

having to explain

The article says,

At the most basic level, errors found with little analysis are often better than errors found with deeper tricks.


we have to explain what [the checker] did to the user

This made me wonder whether similar criteria should apply to compiler messages regarding type errors. What would the consequences be for type systems? Full H-M type inference would be a "deeper trick" in my opinion.

However, the social structure regarding type errors is different, because the developer deals with them on the spot. Static analysis errors often show up later, when the build manager or QA person checks the code.

All too true...

... I have experienced this phenomenon myself. I found myself spending days explaining why thread races my tool had uncovered were indeed races.

Alas, the fact that the tool user cannot understand the bug does not make it any less of a bug.

Removing the deeper analysis merely sells more tools, it does not remove the bugs.

It's one of the reasons I'm passionately _against_ multi-threaded programming.

I have proven time and time again, humans (myself included), are simply not smart enough to do it right.

Tools should deliver an

Tools should deliver an example that tickles the bug, and makes it show itself. I guess this makes building tools even harder, though...

Sigh! No...

... typically it would require the ability to get some fairly exotic hardware to trigger events in a certain sequence.

Sadly, it is easier to spend a day drawing diagrams.

PetraVM has a tool called

PetraVM (http://petravm.com/) has a tool called "Jinx" which slips underneath the operating system and does real-time analysis to find shared data, then forces programs onto those paths. The net effect: turn on Jinx on a regular Windows box, and suddenly concurrency errors start manifesting on that box, e.g. MySQL usually dies within three minutes. The most important property IMHO is that there are no false positives--any outcome that happens on a Jinxed box must be a real potential outcome. All of this is according to a presentation I saw last month (I haven't used the tool myself, unfortunately), but perhaps a static analysis tool could do something similar in its repro without needing exotic hardware.

what alternatives

what do you see as the least-unviable alternatives/approaches?

Unfixable in a static analysis tool...

...perhaps fixable in the Language.

It's much easier to write code, especially C/C++, especially multi-threaded code, than to understand it.

The only hope I see is to create languages / libraries that deemphasize power and emphasize understandability / simplicity.

Which is why I hang out on LtU.

Because I really believe if anything is going to dig us out of the mess we poor sods in industry are in, it's going to come from you guys.

Which reminds me, I must spend a bit more time with Pestov's Factor language again.

Not explicitly stated in the

Not explicitly stated in the article, but somewhat interesting: I was told Coverity mostly uses lightweight methods, as opposed to the symbolic execution techniques the authors and groups like Koushik Sen's have been pioneering for the past ~decade. This feeds into the comment above that many of the checks are irrelevant for most languages.

There is a bit on this

There is a bit on the analysis "weight" buried towards the end: basically, if the user sees the workings of your analysis in some form, then there is a harsh pressure to keep things simple. If the analysis is completely invisible, you can go wild. As one result, the analysis for bug finding is very constrained but the analysis for false positive suppression is much more advanced.

To make this more specific: when you find bugs you can't just report "bug on line xxx", your analysis tool (obviously) has to say in some intuitive way why it believes the bug is valid, what has to happen to cause the bug, etc. What this means is that each time you call some complicated subroutine in your tool to find an error you're going to have to articulate what this routine did to the user. This articulation is not always trivial. And manually replicating your subroutine in their head makes many users tired and increases the probability they mark a true bug as a false positive.

On the other hand, suppressing an infeasible path is invisible since the user never sees it. You can get as intense as you want for such things, with the caveat that your tool has to be fast enough to eat the code base in an overnight run (typically). E.g., Scott McPeak, Andy Chou (and others) have done some very nice work on scalable, path sensitive analysis to remove infeasible paths that is much beyond anything we did in the lab.

As another problem: even simple analysis can be difficult if it is too weird. E.g., there is a bunch of statistical stuff we could do to find (I believe many) more errors, but don't since users often don't understand what is going on. Hopefully some progress can be made on this front.

But, you are right that the approach is not as heavyweight as dynamic symbolic execution à la our later KLEE/EXE work, Sen's work, and the many Microsoft projects. On the other hand, I much prefer diagnosing static reports as compared to test cases. Seconds vs minutes (or more).

I'm definitely taking this

I'm definitely taking this from the perspective of research directions. I thought Patrice Godefroid was getting his stuff used -- curious as to what the tipping point was.

At a higher-level, this suggests we need better metrics for an analysis. E.g., does a tool increase time-to-bug for a typical deployment, or only find atypical edge cases? Such a perspective is common in cloud/cluster/server systems (e.g., nine nines) but the analysis community uses either simplistic or more archaic but formal metrics.

There's been a lot of focus on underlying tools in this space, but I don't get a good feel for the true problems, just local minima of particular tools. Even finding bugs by categories is not necessarily useful: it would be interesting to compare one of Ben Liblit's collaborative bug logs to bugs turned up by various tools.

Build Extraction Tool

The work that the article described on extracting knowledge of the build processes sounded like it could be of practical interest. I mean, a tool that could look at a source tree and generate, for example, a list of the sequence of compiler invocations with the associated flags, arguments, and environment variables used in each one would be nice to have even if it *only* covered the most common 90% of practice.

ptrace / strace

I got the impression it was based on some variant of strace (which I believe is based on ptrace). (Which, mind you, is a very smart idea!)

Build Extraction Tool

In about May last year I read about Vesta, which is a software configuration management system that includes a purely functional programming language for specifying build rules, with an interesting execution model that avoids unnecessary rebuilds. At its lowest level it has a neat mechanism for automatically extracting dependencies. I thought it would be useful to have this feature without Vesta's other baggage, so I posted a sketch of how to do it to my blog. There followed an interesting discussion including at least two examples of existing implementations of my idea.