Milawa: A Self-Verifying Theorem Prover for an ACL2-Like Logic


Milawa is a "self-verifying" theorem prover for an ACL2-like logic.

We begin with a simple proof checker, call it A, which is short enough to verify by the "social process" of mathematics.

We then develop a series of increasingly powerful proof checkers, call them B, C, D, and so on. We show that each of these is sound: they accept only the same formulas as A. We use A to verify B, and B to verify C, and so on. Then, since we trust A, and A says B is sound, we can trust B, and so on for C, D, and the rest.

Our final proof checker is really a theorem prover; it can carry out a goal-directed proof search using assumptions, calculation, rewrite rules, and so on. We use this theorem prover to discover the proofs of soundness for B, C, and so on, and to emit these proofs in a format that A can check. Hence, "self verifying."

This might help inform discussions of the relationship between the de Bruijn criterion (the "social process" of mathematics) and formal verification. I think it also serves as an interesting signpost on the road forward: it's one thing to say that starting with a de Bruijn core and evolving a more powerful prover is possible in principle; it's another thing for it to actually have been done. The author's thesis defense slides provide a nice, quick overview.


I'd like more theorem provers like Milawa

I am a close follower of this work. I learned of the project when I was still an undergrad. The best aspect of this work is the strategy for proving soundness at the different layers. Every complicated justification step that was added to the proof checker had a slow and a fast version: one constructed a justification out of the current layer's allowed derivation steps, and the other was proved to produce the same results. The existence of these slow versions allows any skeptic or curious George to inspect a more rudimentary proof. More importantly, they guide others who wish to build their own self-verifying theorem provers in proving their methods sound. The next step here, I think, is to write a self-verifying theorem prover for a higher-order computational logic, preferably in a language that people would want to write code in, such as Scheme.
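The slow/fast discipline described above can be sketched in a few lines of Python (a toy of my own, not actual Milawa code): the fast function computes its answer directly, the slow function emits an explicit trace of primitive steps, and a small checker replays the trace.

```python
def add_fast(m, n):
    # Fast version: just compute the answer.
    return m + n

def add_slow(m, n):
    # Slow version: a Peano-style trace of n successor steps starting at m.
    trace = [m]
    for _ in range(n):
        trace.append(trace[-1] + 1)
    return trace

def check_trace(m, n, trace):
    # A skeptic's checker: the trace must start at m, take exactly n steps,
    # and each step must be a single successor application.
    if trace[0] != m or len(trace) != n + 1:
        return False
    return all(b == a + 1 for a, b in zip(trace, trace[1:]))

t = add_slow(3, 4)
print(check_trace(3, 4, t), t[-1] == add_fast(3, 4))  # → True True
```

The point of the pattern is that a skeptic only needs to trust `check_trace`; the fast version earns its keep by being proved to agree with the checkable one.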

Why isn't first order enough?

Could someone with more experience than I have explain why it is that we can't embed our subtheorems as theorems? For instance, the paper uses commutativity as an example. Would proving (IMPLIES (por A B) (por B A)) not give a usable commutativity for some reason? It seems to me as though one application of substitution is always enough for this.

Higher order logics can

Higher order logics can provide arbitrarily shorter proofs of the same theorems, and proofs that are more similar to the ones that people use in mathematics.

Re: Why isn't first order enough?

The approach of proving a useful theorem then just instantiating it is frequently used in Milawa to build shorter proofs. The simplest examples of this can be found at the start of Chapter 6 of my dissertation.
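The prove-once, instantiate-cheaply pattern can be illustrated with a toy term representation in Python (my own sketch, not Milawa's actual proof format): variables are strings, applications are tuples, and instantiating a proved schema is just substitution.

```python
def subst(term, sigma):
    # Apply a substitution to a fully-parenthesized term.
    # Variables are strings; applications are tuples (fn, arg1, ...).
    if isinstance(term, str):
        return sigma.get(term, term)
    return (term[0],) + tuple(subst(t, sigma) for t in term[1:])

# A proved schema, e.g. associativity of a hypothetical append function APP:
thm = ('equal', ('app', 'x', ('app', 'y', 'z')),
                ('app', ('app', 'x', 'y'), 'z'))

# Instantiating it for particular terms costs one traversal, not a new proof.
instance = subst(thm, {'x': ('cons', '1', 'nil'), 'y': 'nil'})
```

The saving is that the (possibly long) derivation of the schema is done once; each instance needs only a single substitution step.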

Unfortunately, the commutativity example you give is not a well-formed formula in Milawa, where the only valid atomic formulas are equalities.

Since Milawa's logic is both first-order and sentence-style, it is relatively difficult to work with. As Josh mentioned, higher-order logics typically permit much richer theorems and don't have to worry about the "shape" of formulas as much. A good resource might be Harrison's "Metatheory and Reflection" paper; see in particular the section that discusses metatheorems versus higher-order theorems.

the no-meltdown theorem…


While resident heavyweights are working their way through 554 pages of your dissertation, can I engage you in a bit of cocktail-party–style small talk?

Donald MacKenzie, in Mechanizing Proof, described Boyer's ambition like so (p.273):

Boyer's dream — not a hope for his lifetime, he admitted — is that mathematical modeling of the laws of nature and of the features of technology will permit the scope of deductive reasoning to be far greater than it currently is.  “If you have a nuclear power plant, you want to prove that it's not going to melt down…. One fantasizes, one dreams that one could come up with a mathematical characterization … and prove a giant theorem that says ‘no melt-down.’  In my opinion, it is the historical destiny of the human race to achieve this mathematical structure….  This is a religious view of mine.”

If that's the end game for formal verification, where does Milawa fit? For a possible answer, let's turn to Software Verification and System Assurance — a paper mentioned by your comrade-in-arms Steven Obua — where Rushby lays out the following points as the main hazards to the soundness of formal-verification guarantees:

  1. … the basic requirements or assumptions for the system may have been misunderstood or established incorrectly.

  2. … the requirements, assumptions, or design may be formalized incorrectly or incompletely:

    • Elements of the specification may be inconsistent: this renders the specification meaningless and it becomes possible to prove anything.
    • Elements of the specification may be just plain wrong: although logically consistent, they do not correctly formalize the requirements, design, or assumptions.
    • The formal specification and verification may be discontinuous or incomplete.
  3. …a theorem prover, model checker, or other mechanized formal analysis tool employed in the verification may be unsound.

Upon examining all these hazards, Rushby concludes (emphasis mine) that

… the possibility of flaws in the verification system itself is the least of the three concerns: although all the major verification systems have had some flaws, no false claim has “escaped” and been used in a larger context, to my knowledge.

On p.460 of your dissertation, you present a (rather short) list of “reasoning errors fixed” in ACL2 in the past 12 years. Do these fit Rushby's characterization of most such errors as flaws that never resulted in a false claim escaping into the wild?

(I have a couple more questions of a similarly non-technical nature. Not sure if I'll find the time to write them up.)

In the meantime, let me say that although I'm very unlikely to slog through the Milawa-specific nitty-gritty details of the dissertation, I really appreciate the work you put into discussing the state of the art. It's a really nice overview of who's doing what in the world of formal verification. (I couldn't help but notice that MacKenzie, Obua, and Rushby are all mentioned in your References.)

It seems subtle

Hi El-Vadimo,

At Centaur, where I am now working, two hard problems are (1) how to ensure that our mathematical model of a hardware module is accurate w.r.t. its RTL description, and (2) how to ensure that our specification of this module is correct. These problems will be hard no matter which theorem-proving system is used.

Because of these sorts of things, I largely agree that a theorem prover is unlikely to be the weakest point in any typical formal verification effort today. But I also think that, in no small part, I have this opinion because modern theorem provers really tend to be quite reliable.

  • In LCF-style systems like HOL Light, very little code must be trusted (well, beyond the operating system and ML compiler, anyway). Here soundness is a fundamental feature, paid for in the run-time overhead of constructing theorems with primitive inference rules. To be fair, this cost is often not prohibitive.

  • Meanwhile, given the large amount of code in its reasoning functions, the ACL2 system has had remarkably few soundness bugs over the years. This is certainly a sign of the experience, skill, and care of the system's authors. Even so, there is a real cost to this reliability: ordinary users can't change the theorem prover's code, and all ACL2 development must be carried out by the system's authors, creating a kind of bottleneck. A notable consequence of this is that theorem-proving research can be difficult in ACL2: there may be no easy way to tell whether your changes and extensions are sound, or to convince others that they are.

  • (So, what about Milawa? It is similar to LCF-style systems in that it has a small kernel and soundness is a basic feature. But Milawa's proof representation can be understood and reasoned about, making it possible to "grow" the system by adding more powerful rules. Eventually you can achieve a somewhat ACL2-like ability to rewrite clauses and so forth, without needing to justify your work in terms of primitive inferences.)
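The LCF point in the first bullet can be made concrete with a toy kernel in Python (a hypothetical illustration, not HOL Light's actual API): theorem values can only be produced by the primitive rules, so the trusted code is just those few functions.

```python
# A toy LCF-style kernel: formulas are nested tuples like ('->', p, q),
# and the only ways to build a Thm are the primitive rules below.

class Thm:
    def __init__(self, concl):
        self.concl = concl
    def __repr__(self):
        return f"|- {self.concl}"

def axiom_k(p, q):
    # Primitive rule: the K axiom schema, |- p -> (q -> p).
    return Thm(('->', p, ('->', q, p)))

def modus_ponens(pq, p):
    # Primitive rule: from |- p -> q and |- p, conclude |- q.
    assert pq.concl[0] == '->' and pq.concl[1] == p.concl
    return Thm(pq.concl[2])

t1 = axiom_k('a', 'b')          # |- a -> (b -> a)
t2 = axiom_k(t1.concl, 'c')     # |- (a->(b->a)) -> (c -> (a->(b->a)))
t3 = modus_ponens(t2, t1)       # |- c -> (a -> (b -> a))
```

The run-time overhead mentioned above is also visible here: even this trivial theorem requires a chain of kernel calls, one per primitive inference.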

Above, I say "today" because, today, when I prove theorems, I am usually thinking through the proof in my head or on paper, and then getting ACL2 to follow my lead. In such a mode of interaction, I might actually notice if the proof seems to be too easy. (And I do not know of any case where some kind of "false theorem" was proven in ACL2 and believed.)

But a lot of work now seems to focus on getting more automation into interactive theorem provers, whether SAT and SMT solvers, arithmetic procedures, or what have you. Perhaps tomorrow I will just "mash buttons" to get my proofs to go through, and once that begins to be the case, I am probably much less likely to see problems. Indeed, how do you know your SAT solver is right?

I also say "typical" because, well, some folks might be worried about whether the user of the theorem prover is honest. If you think the user is actively trying to find and exploit soundness bugs, you will naturally care more about the soundness of the prover. But this all seems rather far-fetched.

Glad to hear you are enjoying the dissertation so far,


trusting your SAT solver

Re: comment-60093: … how do you know your SAT solver is right?

It depends, I suppose. If you're making an assertion about your model and the SAT solver finds a refuting counterexample, then it is exceedingly likely that the SAT solver is right. And if the model is small enough (and it has to be, for otherwise your SAT solver would choke on it) and the SAT solver's front end can visualize the counterexample for you, then you can probably convince yourself fairly quickly that the counterexample is correct.

On the other hand, if the SAT solver says, Everything you asserted about your model is correct, then yes, you're right, you don't know for sure. You may not have asserted enough. Or there may be a bug that prevents the solver from refuting your assertions. Or the finite model is not big enough. So no, you don't get certainty, but you do get some degree of increased confidence, do you not? Sure, the lack of certainty is a little disturbing to me, but not to the same degree that it is to, say, Boyer.

(Of course, if the counterexample is too difficult to understand, it may not be believable despite being correct.)

Instead of asking, "how do you know your solver is right?", let's ask, If your prover says you're wrong, what do you make of it?  How well does Proof-Directed Debugging work? (Thanks to Frank for the reference.)


I'm afraid I have not said things well. I was really not trying to suggest that SAT solvers are unreliable or anything like that. I agree that it is easy to convince yourself that a counterexample is valid. In fact, before showing them to the user, we usually automatically double-check any counter-examples reported by our SAT solvers when we connect them to ACL2. And, although I am not very familiar with the area, I think some SAT solvers can even emit "proofs" in various fashions when they make UNSAT claims, and that these proofs can be checked separately, ideally by programs which can themselves be verified.
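That double-check is conceptually simple, which is part of why counterexamples are so trustworthy: evaluate the original formula under the reported assignment. A sketch in Python, using DIMACS-style integer literals (my own toy, not the actual ACL2 connection's code):

```python
def check_model(cnf, model):
    # cnf: a list of clauses, each a list of nonzero ints (n means variable n
    # is true, -n means it is false); model: the set of literals assigned true.
    # A CNF formula is satisfied iff every clause contains a true literal.
    return all(any(lit in model for lit in clause) for clause in cnf)

cnf = [[1, 2], [-1, 3], [-3, -2]]
print(check_model(cnf, {1, -2, 3}))   # → True
print(check_model(cnf, {1, 2, 3}))    # → False (clause [-3, -2] is falsified)
```

Note the asymmetry: this one-liner fully validates a SAT answer, while an UNSAT answer has no assignment to replay and needs a proof object instead.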

Here is what I was trying to say: When I use a tool like ACL2 to prove a theorem, really what I am doing is (1) figuring out for myself why the theorem is true, and then (2) programming the theorem prover (mainly via rewrite rules) to find the proof. Because of this, I am to some degree "aware" of what I want the system to do and how difficult the proof ought to be. I know how I want to decompose the problem, and what the big steps are. And I can evaluate how well the theorem prover is doing what I want it to do. This is, to me, almost a definition of traditional, interactive theorem proving.

In contrast, when I think about more highly automatic approaches, like SAT, I think about not understanding---indeed, not even wanting to understand---many details of the system I am verifying that would be crucial in an interactive proof effort (e.g., what are the pipe stages in this floating-point adder? what do its registers mean?). Instead, I just state the property I want to establish and ask whether the tool thinks the implementation satisfies it. In this more automatic style of theorem proving, I probably have no idea how the proof ought to go. Really this is a wonderful feature, but a potential downside is that I may be less likely to notice if something goes wrong.

I think it is entirely possible to develop reliable SAT solvers and resolution provers and other automatic procedures. Indeed, several approaches are apparent: (1) you could try to make your procedure build LCF-style theorem objects like the rewriter in HOL, (2) you could try to reflectively verify your procedure like the rewriter in Milawa, (3) you could try to check the work of your procedure with an external checking tool which you then verify, like ivy, or (4) you could just try to be really careful and write something that you think works, like the rewriter in ACL2.
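Option (3) is worth making concrete. Below is a minimal propositional resolution checker in Python (my own sketch; real tools like ivy use richer formats): it accepts a claimed refutation only if every step is a legal resolution and the empty clause is eventually derived.

```python
def resolve(c1, c2, pivot):
    # Resolve clause c1 (containing pivot) with c2 (containing -pivot);
    # clauses are sets of nonzero integer literals.
    assert pivot in c1 and -pivot in c2
    return (c1 - {pivot}) | (c2 - {-pivot})

def check_refutation(clauses, steps):
    # clauses: the input CNF; steps: triples (i, j, pivot) resolving the
    # i-th and j-th known clauses. The proof succeeds iff it derives the
    # empty clause, certifying that the input CNF is unsatisfiable.
    known = [frozenset(c) for c in clauses]
    for i, j, pivot in steps:
        known.append(frozenset(resolve(set(known[i]), set(known[j]), pivot)))
    return frozenset() in known

# {x} and {-x} resolve on x to the empty clause: an UNSAT certificate.
print(check_refutation([[1], [-1]], [(0, 1, 1)]))   # → True
```

The appeal of this architecture is that the (large, optimized) prover can be arbitrarily clever, while only the (tiny) checker needs to be trusted or verified.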

I think my main point is still that, at least in part, we agree "theorem provers aren't the weakest link" because they have been designed to be reliable.


On p.475 of the dissertation, we find the following discussion (emphasis mine):

We now consider the expressivity of our programming language. With the exception of ACL2, most general-purpose theorem provers use logics which are considerably more expressive than ours. Notably, our logic lacks static typing, quantifiers, and higher-order functions.

In Computer Aided Reasoning: An Approach, Kaufmann, Manolios, and Moore assert that these limitations “can be overcome without undue violence to the intuitions you are trying to capture,” and indeed the ACL2 and NQTHM systems have been successfully used in many wide-ranging hardware and software verification projects, as mentioned at the beginning of Chapter 2. It may also be that the relative simplicity of our formulas has played a role in our success in reasoning about proofs.

And yet, doing without higher-order functions has not been particularly easy. Our tactic system is quite convoluted and is much less flexible than that of a true LCF-style system. This is largely due to our inability to dynamically produce validation functions. A great number of our theorems and functions—for introducing new “types,” for recognizing the validity of steps in traces and proofs, and so on—have a boilerplate feel. It seems like much of this could be made easier using higher-order functions and a typed logic.

Furthermore, according to MacKenzie,

At Austin, Boyer and Moore had joined a computer science department becoming prominent in formal methods: in 1984, for example, Edsger W. Dijkstra joined the faculty. Boyer and Moore began to gather around them a group of graduate students who used their prover to perform a series of increasingly impressive proofs. This group's frequent face-to-face interaction in Boyer and Moore's graduate course, in their weekly logic seminar, and in other activities was key. Using their prover was hard, and few people were able to pick up the requisite skill without personal contact with the prover's designers: “Almost all of the successful users of [the prover] have in fact also taken a course from us … on proving theorems in our logic.”

Is it fair to say then that an aspiring verificationist who has chosen Milawa (or ACL2) would find herself in a particularly dimly lit recess of the de Bruijn corner of the Turing tarpit — a place where everything is formally verifiable in principle but only simple systems are formally verified in practice? For example, a very bright fellow named Peter Dillinger found it hard to prove that the following function computes the Fibonacci sequence:

  (define (fib n)
    (f 1 0 0 1 n))

(Here, f is the helper function defined in this post.)
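The helper f is not shown in this thread, but the calling shape (f 1 0 0 1 n) matches the fast-Fibonacci iteration of SICP exercise 1.19, so here is a plausible reconstruction, transliterated into Python (an assumption on my part, not necessarily the actual f from the linked post):

```python
def f(a, b, p, q, count):
    # Iteratively apply the transformation T_pq : (a, b) -> (bq+aq+ap, bp+aq),
    # squaring T via p <- p^2 + q^2, q <- 2pq + q^2 when count is even.
    # Starting from (a, b, p, q) = (1, 0, 0, 1), this computes fib(count)
    # in O(log count) transformation steps.
    while count > 0:
        if count % 2 == 0:
            p, q = p * p + q * q, 2 * p * q + q * q
            count //= 2
        else:
            a, b = b * q + a * q + a * p, b * p + a * q
            count -= 1
    return b

def fib(n):
    return f(1, 0, 0, 1, n)

print(fib(10))   # → 55
```

If f looks anything like this, the proof difficulty is plausible: the correctness argument needs a loop invariant relating (a, b, p, q, count) to the Fibonacci sequence, and that invariant is not obvious from the code.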

If, as you say, “Milawa's logic … is relatively difficult to work with,” then most people would be better off going with Coq, no?

Lots to think about


I'll try to respond to everything, though not quite in order...

Most people would surely be better off using Coq instead of Milawa. (The same
could probably be said of any mainstream prover.) Coq and its libraries are
the product of years of effort by many talented people, whereas Milawa is a
single student's dissertation project. It's probably best to think of Milawa
as a prototype that demonstrates a new approach to developing trustworthy
theorem provers. As far as its actual capabilities, it is at best an
"academic-strength imitation of the industrial-strength ACL2."

But you also seem to ask whether Coq is a better choice because, in my own
words, "Milawa's logic is relatively difficult to work with." I think the
answer to this is no. The main reason I regard Milawa's logic as difficult to
work with is that it is a sentence-style logic, and when writing derivations by
hand, sentence-style systems seem harder to work with than natural-deduction-style
systems. I think it was probably a mistake to use a sentence-style logic, and
it has probably caused me to do a lot of extra work. If I were to start over,
I would consider using a sequent-style logic instead.

Even so, I don't think this has any real impact on the end user, because he is
not trying to build proofs with primitive inferences. Instead, he calls upon
the rewriter and other verified tactics that do not deal with formulas. This
is actually one of the neat things about Milawa: it's true that we start off
using a stupid sentence-style formulation with very limited rules, but we can
eventually operate on clauses (which are kind of like sequents) and we can
treat customized, high-level objects (e.g., equivalence traces, rewrite traces,
and ultimately whole proof skeletons) as proofs.

A related question might be, would one be better off going with Coq instead of
ACL2? Here I am the wrong person to ask, since my background is almost
entirely with ACL2. For my part, I really like how simple ACL2's logic is, and
how it resembles functional programming in Lisp, and I find Coq's type system
to be pretty complex. But a good Coq user would probably have the opposite
opinion. (A possibly useful paper is Lamport and Paulson's "Should your
specification language be typed?")

The right logic and system to use probably depends a lot on what you want to
verify. To try to illustrate this starkly, some newer logics seem to have all
kinds of special support for reasoning about bindings, and this is apparently
very useful when your goal is to prove something about the type system of a
programming language. But these features are probably not very useful when
your goal is instead to show a circuit implements its specification.

It seems to me that higher-order logics are probably a better choice if you
want to prove theorems from traditional mathematics, and I certainly found it
difficult to implement a "proper" tactic system in a first-order setting. But
I think I would hesitate to make any claims beyond what I say on p. 476,
"Perhaps, similarly, ... the greater flexibility of higher-order systems is
desirable when writing theorem provers." I also think I should point
out and try to re-emphasize the important caveat that you quoted: "It may also
be that the relative simplicity of our formulas has played a role in our
success in reasoning about proofs."

Regarding learning to use theorem provers, I began learning ACL2 on my own, but
only for a few months before going to UT. At UT, I learned a lot from the
authors and other students. I think "Computer Aided Reasoning: An Approach"
(Section 2.7) offers a good analogy for the effort involved in learning to use
ACL2. "How long does it take for a novice to become an effective C programmer?
It takes weeks or months to learn the language but months or years to become a
good programmer... How long does it take for a novice to become an effective
ACL2 user? It takes weeks or months to learn to use the language and theorem
prover, but months or years to become really good at it." The full quote
carries the analogy further.

I have not tried to prove that F implements FIB. I know that Peter is very
capable, so if he found it to be a difficult proof, then I will take on faith
that it is difficult. Just glancing at F, I find that I have no intuition
about what the variables of F are supposed to be, or what invariants are
necessary as it recurs. I suspect if I were to attempt this proof, most of my
time would be spent trying to figure out how F works and the right invariant to
prove. I would be surprised if the "extra step" of having ACL2 check my proof
would be significant compared to the cost of doing the proof "by hand," and
therefore I would be surprised if this is really any easier to prove in Coq or
some other prover than it is to prove in ACL2.

I think a "Turing tarpit" is the wrong conclusion to reach from F. Whatever
the particular difficulties of this proof were, it does not strike me as
something that threatens the success of real ACL2 projects. After all, even in
several years of working with ACL2 on a wide range of projects, I have never
needed to prove F is FIB. (And, had I needed to, I think it would have been at
worst an annoying distraction, rather than some kind of insurmountable obstacle.)

It is very common to struggle when doing a new kind of proof in ACL2. But
these are also good opportunities to learn how to improve your proof
automation. When I was working on Milawa, once my reasoning strategy had come
together, I was often able to prove new, complex theorems without needing to do
anything except type them correctly.