Static Typing for a Faulty Lambda Calculus

Static Typing for a Faulty Lambda Calculus. David Walker, Lester Mackey, Jay Ligatti, George A. Reis, and David I. August.

A transient hardware fault occurs when an energetic particle strikes a transistor, causing it to change state. These faults do not cause permanent damage, but may result in incorrect program execution by altering signal transfers or stored values. While the likelihood that such transient faults will cause any significant damage may seem remote, over the last several years transient faults have caused costly failures in high-end machines at America Online, eBay, and the Los Alamos Neutron Science Center, among others. Because susceptibility to transient faults is proportional to the size and density of transistors, the problem of transient faults will become increasingly important in the coming decades.

This paper defines the first formal, type-theoretic framework for studying reliable computation in the presence of transient faults. More specifically, it defines lambda-zap, a lambda calculus that exhibits intermittent data faults. In order to detect and recover from these faults, lambda-zap programs replicate intermediate computations and use majority voting, thereby modeling software-based fault tolerance techniques studied extensively, but informally.
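The replicate-and-vote idea can be made concrete with a small sketch. This is not the paper's formal semantics for lambda-zap, just an informal illustration (with made-up helper names) of triple replication and majority voting tolerating a single transient data fault:

```python
def majority(a, b, c):
    """Return the value at least two of the three copies agree on."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise RuntimeError("no majority: more than one copy was corrupted")

def replicate(thunk):
    """Run a computation three times, yielding three (ideally equal) copies."""
    return thunk(), thunk(), thunk()

# One copy suffers a simulated bit-flip, but the vote recovers the result.
x1, x2, x3 = replicate(lambda: 2 + 2)
x2 ^= 1 << 3          # simulated transient fault flips a bit in one copy
assert majority(x1, x2, x3) == 4
```

Any single fault corrupts at most one of the three copies, so the other two always outvote it; two or more faults can defeat the scheme, which is why the guarantee is stated for a single fault.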

To ensure that programs maintain the proper invariants and use lambda-zap primitives correctly, the paper defines a type system for the language. This type system guarantees that well-typed programs can tolerate any single data fault. To demonstrate that lambda-zap can serve as an idealized typed intermediate language, we define a type-preserving translation from a standard simply-typed lambda calculus into lambda-zap.


Wow! One thing I'm not clear on...

What an interesting paper. I hadn't even considered using type theory to ensure fault-tolerant (FT) computation, though in retrospect it seems obvious. I had previously thought of FT as achievable only via independent machines and network protocols, so seeing how a simple type theory can transparently carry this abstraction right down to the language level, running on a single CPU, is an eye-opener.

One thing I'm not clear on: why do the authors say that there is more work to be done to harmonize compiler optimizations with the FT algorithm? It seems to me that FT must simply be the last stage of the compilation pipeline, after all optimizations that remove *unnecessary* redundancies are complete, since the FT pass only adds the *necessary* redundancies. How could you possibly do better than that while maintaining the FT properties? The computation needs to be run identically three times for the final comparison to be valid.

Focus on the wrong thing

I think this is a really cool paper, but I would have liked it to have a somewhat different focus. To explain what I mean, let me first give some context. The authors are developing a compiler that generates fault-tolerant code. The problem they are trying to solve is compiler correctness: they want to increase their confidence that the programs the compiler spits out really are fault tolerant. To this end they develop a calculus for use as an intermediate language, together with a type system that ensures the intermediate programs maintain fault tolerance as compilation proceeds.

Given this context, the authors focus very much on the type system. Every theorem and lemma (OK, maybe not lemma 11) says something about the type system, even the ones that are independent of it. For me the interesting thing is the compilation function that translates a lambda calculus term into lambda-zap. This translation preserves the semantics even if one fault occurs, and that holds whether or not the programs are well typed. But the paper only proves that the semantics is preserved for well-typed terms, which is a much weaker statement.

I still like the paper, but the end-to-end reliability result they have is really independent of the type system. The type system just helps them develop a correct compiler. I wish they had focused less on that and more on the actual end-to-end reliability result.

Anti-Moore's law?

(tongue in cheek)

Does this mean that as core densities go up, redundant computation needs to as well, so there's really no progress? :)