Vellvm: Formalizing the LLVM Intermediate Representation for Verified Program Transformations

Vellvm: Formalizing the LLVM Intermediate Representation for Verified Program Transformations by Jianzhou Zhao, Santosh Nagarakatte, Milo M. K. Martin, and Steve Zdancewic, POPL 2012

This paper presents Vellvm (verified LLVM), a framework for reasoning about programs expressed in LLVM's intermediate representation and transformations that operate on it. Vellvm provides a mechanized formal semantics of LLVM's intermediate representation, its type system, and properties of its SSA form. The framework is built using the Coq interactive theorem prover. It includes multiple operational semantics and proves relations among them to facilitate different reasoning styles and proof techniques.

To validate Vellvm's design, we extract an interpreter from the Coq formal semantics that can execute programs from LLVM test suite and thus be compared against LLVM reference implementations. To demonstrate Vellvm's practicality, we formalize and verify a previously proposed transformation that hardens C programs against spatial memory safety violations. Vellvm's tools allow us to extract a new, verified implementation of the transformation pass that plugs into the real LLVM infrastructure; its performance is competitive with the non-verified, ad-hoc original.

This obviously represents huge progress in marrying the theoretical benefits of tools like Coq with the practical benefits of tools like LLVM. We can only hope that this spurs further development in practical certified software development.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

What exactly can it do?

This sounds like a very good approach.

I've never heard of any of the people in the project, or ever thought of UPenn as a formal methods place. I see this is from this year's POPL: did anyone attend their talk? "[R]easoning styles and proof techniques" suggests to me that they are interested in assertions/contracts/whatever for C, but besides the use of "SoftBounds", which I don't recall having read about and which shares three researchers, in the heart of the paper (section 6), I didn't see any real discussion of this in the paper. The theory developed in the paper seems fu=ocussed on the notion of "well-formed machine state", so might not much generalise in terms of what it can prove about LLVM code much beyond the memory model of SoftBounds.

Re: What exactly can it do?

What exactly can it do? It can marry the theoretical benefits with the practical advantages.

UPenn

Charles Stewart: I've never... thought of UPenn as a formal methods place.

That's interesting! Home to Benjamin Pierce, Steve Zdancewic, Stephanie Weirich, and no doubt others I'm forgetting? Authors of this Software Foundations course? Developers of this locally-nameless mechanized metatheory library for Coq? :-)

Ribbing aside, UPenn is huge in the PLT space.

New England ++

Somehow, I thought of Benjamin Pierce as a New England sort of person, I suppose because of NEPLS. Yes, UPenn is a significant formal methods and theory of programming place.

My penance should be to make a map of formal methods around the globe or something.

I for one would welcome the

I for one would welcome the result of such a penance!

No Worries

It was actually a helpful reminder that those of us who have read TAPL, use Coq, and follow mechanized metatheory developments religiously are an even rarer breed than I had realized. :-)

LLVM gets even more attractive

It looks like LLVM is getting some great attention from formal methods enthusiasts. That makes LLVM more attractive as an implementation technology; if you ever decide to do a formal check of part of a compiler, you can potentially use some of the UPenn Coq files.

One thing I found odd in the formalization is the treatment of undefined memory reads. The authors go out of their way to allow inlining code that reads from undefined memory multiple times, giving the specific example that xoring a register with itself does not necessarily yield zero. If I were going to check an optimizer I had written, I think I would want the opposite checker results. I'd want such an inliner to be considered unsafe, and I'd want xoring a register with itself to reliably yield zero.

Undefined semantics

Nit: please post author names with paper references.

Several LLVM operators have platform-defined or even undefined semantics for some values, e.g. shifting an int of width K by K or more bits. How does this work handle those situations?

POPL Paper

Oops

I believe you've reminded me of the desire for attribution in stories before. Thanks for the gentle reminder. :-)

While you're at it, you

While you're at it, you could also give the year (this is very useful) and, for the few people that take that into account, the conference/journal that accepted it. That would here be "POPL 2012".

Done

Thanks!

One often sees undefined

One often sees undefined memory reads intentionally occurring in optimized code where there is a speed premium for doing things in machine word sized units on machine word boundaries - e.g. computing a hash function of some arbitrary length string. In these cases it might be faster to always read the last "word" of data and then mask off the invalid part.

Right, but you get a definite value back

Undefined reads make sense. What I don't see is the treatment of them, especially the way they infect temporary variables. The authors give the example of (x xor x) not necessarily being zero.

I'm apparently not reading

I'm apparently not reading the linked paper the same way as you. I think the xor example was being held up as an example of original llvm semantics getting something wrong, and they basically proposed to fix it by substituting the operation "initialize to zero" for an undefined read. (this was a reply to lexspoon - apologies for the mis-threading).

LLVM Undef

Thanks for the interest in our paper! The intent was not to say that the LLVM is "wrong", but to give a semantics that adequately models the intended behavior of the LLVM. We based our semantics (in part) on the discussion of 'undef' values as described in the LLVM documentation. We emphasized the xor example in the paper because it does seem counterintuitive.