How to Build Static Checking Systems Using Orders of Magnitude Less Code

How to Build Static Checking Systems Using Orders of Magnitude Less Code, by Fraser Brown, Andres Nötzli, and Dawson Engler:

Modern static bug-finding tools are complex. They typically consist of hundreds of thousands of lines of code, and most of them are wedded to one language (or even one compiler). This complexity makes the systems hard to understand, hard to debug, and hard to retarget to new languages, thereby dramatically limiting their scope.

This paper reduces the complexity of the checking system by addressing a fundamental assumption, the assumption that checkers must depend on a full-blown language specification and compiler front end. Instead, our program checkers are based on drastically incomplete language grammars (“micro-grammars”) that describe only portions of a language relevant to a checker. As a result, our implementation is tiny—roughly 2500 lines of code, about two orders of magnitude smaller than a typical system. We hope that this dramatic increase in simplicity will allow developers to use more checkers on more systems in more languages.

We implement our approach in µchex, a language-agnostic framework for writing static bug checkers. We use it to build micro-grammar based checkers for six languages (C, the C preprocessor, C++, Java, JavaScript, and Dart) and find over 700 errors in real-world projects.

Looks like an interesting approach with some compelling results; it should make a good tool for the toolbox. See also the Reddit thread for further discussion.
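
To give a flavor of the idea, here is a toy sketch of a micro-grammar-style check -- in Python, and purely illustrative; it is not the authors' µchex code. The "grammar" knows only about if-conditions and a handful of operators, and skips everything else in the token stream:

import re

# Tokenize just enough: comparison operators, '=', parens, words;
# any other character becomes an opaque single-character token.
TOKEN = re.compile(r"==|!=|<=|>=|=|\(|\)|\w+|\S")

def check_assignment_in_condition(source):
    """Flag `if (x = y)` where `==` was probably intended."""
    tokens = TOKEN.findall(source)
    for i, tok in enumerate(tokens):
        if tok == "if" and i + 1 < len(tokens) and tokens[i + 1] == "(":
            depth = 0
            for j in range(i + 1, len(tokens)):
                if tokens[j] == "(":
                    depth += 1
                elif tokens[j] == ")":
                    depth -= 1
                    if depth == 0:
                        break
                elif tokens[j] == "=" and depth == 1:
                    print(f"suspicious '=' in condition at token {j}")

# Flags the likely ==/= mixup without a full C grammar or compiler front end.
check_assignment_in_condition("while (1) { if (x = next(q)) use(x); }")

The real micro-grammars are of course more expressive than a regex tokenizer, but the spirit is the same: parse only what the checker cares about and skip the rest.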

Simple checkers; bad tooling culture?

I think the results in this paper are interesting, but I suspect that they also highlight something problematic about the state of the software projects they checked (Linux, OpenJDK, the Dart SDK, LLVM, Firefox, Node.js). If such simple checkers can find so many bugs, those bugs should already have been found by the checkers the projects use. This means that the projects have an inadequate development process when it comes to static analysis tools.

There could be several reasons why that would be the case:

- impracticality: maybe no (open source) tool actually exists that is robust enough to handle these "real world projects", or the tools are currently too hard to use or to integrate into development workflows
- bad culture: maybe developers don't think of using static analysis tools to improve their software, or rely on others to do it after the fact

We have evidence that some software houses use static analysis tools as part of their development workflow -- Facebook uses Infer, the Windows kernel team has home-grown annotations and some static checking, and John Carmack has mentioned using static analysis tools in-house.

I think that both "impracticality" and "bad culture" are in play here. For impracticality, one can wonder who is to blame. There seems to be a communication/cooperation failure between academia and the FLOSS community here; where does it come from? Maybe researchers are not good enough at ensuring that their tools are actually usable and at making contact with developers. Maybe developers should not expect tools to be usable from the get-go, and should invest a bit more effort giving feedback to make the tools usable on their project -- I know some people do excellent work on that front.

Static analysis tools should provide constant feedback during development, just like continuous integration, which is widely used today, and fuzzing, which is becoming an established practice in some communities (e.g., the Linux kernel).

This means that the projects

This means that the projects have an inadequate development process when it comes to static analysis tools.

Not necessarily. This paper describes how the authors had trouble reproducing their own results due to the complexity of the analyses involved. The projects exhibiting so many bugs could just be victims of overcomplicated and unreproducible analyses.

Sort-of echoing naasking's

Sort-of echoing naasking's comment here, but...

If such simple checkers can find so many bugs, those bugs should already have been found by the checkers the projects use.

I think one of the points of this research was that this type of check a) could discover things not usually discovered by "traditional checkers"[1], b) wouldn't force you into a particular coding style required by the checkers (see [1]), and perhaps most importantly, c) is easy enough to customize that you can do a lot of project-specific checking (see the sketch after the footnote below). That last bit is a huge deal because -- as most of the people around here know -- static type checking can only do so much. There are always a lot of project- or company-specific bits of coding "style" (really, correctness) that should be enforced.

(Btw, at the risk of being a bit of a party-pooper: this looks incredibly similar to what Coccinelle has already done to a large degree -- and Coccinelle even supports "fuzzy patching". Granted, Coccinelle doesn't support "any language". Am I missing something?)

[1] The point being that with traditional checkers you have to follow absolutely insane rules to actually have them check stuff for you rather than just yielding spurious false positives.
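
To illustrate point (c), here is a hypothetical project-specific rule in the same toy style (the spin_lock/spin_unlock names are placeholders I picked, not something from the paper): every lock acquired in a function body should be released somewhere in that body. The checker only has to recognize the two call forms and can ignore the rest of the language:

import re

# Recognize only the two call forms we care about; skip everything else.
CALL = re.compile(r"\b(spin_lock|spin_unlock)\s*\(\s*(\w+)\s*\)")

def check_lock_pairing(function_body):
    held = set()
    for fn, arg in CALL.findall(function_body):
        if fn == "spin_lock":
            held.add(arg)
        else:
            held.discard(arg)
    for arg in sorted(held):
        print(f"possible missing spin_unlock({arg})")

# Flags q: the lock is taken but never released in this body.
check_lock_pairing("spin_lock(q); if (err) return -1;")

With no path sensitivity this will be noisy on some code, which is exactly why it matters that the check is a dozen lines you can tune to your own project rather than part of a monolithic analyzer.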

A lot of these problems also

A lot of these problems also wouldn't even exist in a more modern language.

Unit consistency checking

Hmmm. This approach, because it's so easily customizable, may be applicable to unit consistency checking -- which ought to encode universals, but tends to grow slightly different hair in almost every program I've applied it to. It's usually pretty simple, and I've discovered that doing this (by hand) catches MANY bugs in simulations of systems that have unit consistency properties.

Mostly these are physical systems or simulations. For example, if someone multiplies a distance by a velocity and calls the result a time, the unit consistency check ought to cry 'foul', catching a semantic error where someone mistakenly multiplied instead of dividing. Or when a distance in meters is divided by a velocity in centimeters per second and the result gets called a time in minutes, the unit consistency check ought to cry 'foul', catching an error (or two errors) where different units of the same measured quantity were mixed.

More abstract unit consistency checks (such as the dimensionality of vector and matrix products, cross products, etc.) are usually already caught by existing type systems, but common or physical unit consistency checks are uniformly neglected by programming languages. In fact, I can't think of any exceptions right now, so in most programming languages it would require extralinguistic annotations to do formal checking.

When I do it by hand I usually tag variable names with expected units.
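
A sketch of what that hand-tagging convention looks like (Python, with hypothetical names; the unit tags are just part of the identifiers):

# Unit tags embedded in the names; a reviewer -- or a simple checker --
# can verify consistency by inspection.
distance_m = 1500.0        # meters
velocity_m_per_s = 12.5    # meters per second

# Consistent: m / (m/s) = s, matching the _s suffix.
time_s = distance_m / velocity_m_per_s

# Inconsistent: m * (m/s) = m^2/s, not a time. The tags make the
# multiplied-instead-of-divided error visible:
# time_s = distance_m * velocity_m_per_s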

F# unit of measures

Common or physical unit consistency checks are uniformly neglected by programming languages. In fact I can't think of any exceptions right now.

To my knowledge, the easiest-to-use system for unit checking in a programming language is F#'s support for units of measure.

others

Used in practice?

Is the feature usable in practice in these languages, and is it actually used? I ask because I read on Stack Overflow that the simplest way to convert a distance from miles to meters with Boost.Units is:

quantity<si::length> distance_in_metres = static_cast<quantity<si::length>>(
    distance_in_miles * boost::units::imperial::mile_base_unit::unit_type()
);

Compare that with the F# code

(* assuming [<Measure>] type m (meters) and [<Measure>] type mi (miles)
   are declared; this definition is in the library, not user code *)
let meter_per_mile = 1609.34<m/mi>

(* given distance_in_miles : float<mi>, miles * m/mi = meters *)
let distance_in_meters = distance_in_miles * meter_per_mile

I could see managers forcing their teams to use Boost units for additional safety guarantees, but would people have the willpower to make this trade-off on their own? (Well I guess they're used to programming in C++ already.)

P.S.: I'm not just trying to make fun of Boost here. In my experience there is a large difference in practice between what is *encodable* in a given programming language and what actually gets used -- the former makes for interesting blog posts full of tricky ideas, while the latter is actually indicative of what really raises the quality level in a language community. It is noticeably harder to get reliable information on the latter than on the former, so it makes sense to ask explicitly.

Are F# units of measure used

Are F# units of measure used in practice?

Hm

It's a bit hard to tell, but a GitHub search at least returns some examples of many different people defining their own units in their programs.

(I cannot search for `boost::units` this way, as GitHub's search interface is too limited: it won't accept literal strings and ignores my colons. Debian Code Search is very cool, but merely tells us that there is essentially no code using boost::units or F#'s units packaged in Debian.)

word

Right on. I dunno.

In my experience there is a

In my experience there is a large difference in practice between what is *encodable* in a given programming language and what actually gets used

+1.

This is a typical "expert" local minimum, and it's a shame that it's not called out more often, because it's such an easy trap to fall into if you don't actively search for other languages/paradigms. (Cf. "Turing Tarpit".)