In Finding and Understanding Bugs in C Compilers Xuejun Yang, Yang Chen, Eric Eide, and John Regehr of University of Utah, School of Computing describe Csmith, a fuzzer for testing C compilers. The hard part was avoiding undefined behavior.
Compilers should be correct. To improve the quality of C compilers, we created Csmith, a randomized test-case generation tool, and spent three years using it to ﬁnd compiler bugs. During this period we reported more than 325 previously unknown bugs to compiler developers. Every compiler we tested was found to crash and also to silently generate wrong code when presented with valid input. In this paper we present our compiler-testing tool and the results of our bug-hunting study. Our ﬁrst contribution is to advance the state of the art in compiler testing. Unlike previous tools, Csmith generates programs that cover a large subset of C while avoiding the undeﬁned and unspeciﬁed behaviors that would destroy its ability to automatically ﬁnd wrong code bugs. Our second contribution is a collection of qualitative and quantitative results about the bugs we have found in open-source C compilers.
Two bits really stuck out for me. First, formal verification has a real positive impact
The striking thing about our CompCert results is that the middleend bugs we found in all other compilers are absent. As of early 2011, the under-development version of CompCert is the only compiler we have tested for which Csmith cannot ﬁnd wrong-code errors. This is not for lack of trying: we have devoted about six CPU-years to the task. The apparent unbreakability of CompCert supports a strong argument that developing compiler optimizations within a proof framework, where safety checks are explicit and machine-checked, has tangible beneﬁts for compiler users.
And second, code coverage is inadequate for ensuring good test thoroughness for software as complex as a compiler.
Because we ﬁnd many bugs, we hypothesized that randomly generated programs exercise large parts of the compilers that were not covered by existing test suites. To test this, we enabled code coverage monitoring in GCC and LLVM. We then used each compiler to build its own test suite, and also to build its test suite plus 10,000 Csmith-generated programs. Table 3 shows that the incremental coverage due to Csmith is so small as to be a negative result. Our best guess is that these metrics are too shallow to capture Csmith’s effects, and that we would generate useful additional coverage in terms of deeper metrics such as path or value coverage.