archives

Failure-oblivious computing

There've been a couple of threads recently about safety-critical code (Rules for, and in real-time Java). Safety-critical code necessarily includes careful handling of failure situations. Or does it? Here's another take on failure handling, from the opposite direction:

Enhancing Server Availability and Security Through Failure-Oblivious Computing (Rinard et al., 2004) was originally presented at OSDI '04, but hasn't previously been featured here:

We present a new technique, failure-oblivious computing, that enables servers to execute through memory errors without memory corruption. Our safe compiler for C inserts checks that dynamically detect invalid memory accesses. Instead of terminating or throwing an exception, the generated code simply discards invalid writes and manufactures values to return for invalid reads, enabling the server to continue its normal execution path.

We have applied failure-oblivious computing to a set of widely-used servers from the Linux-based open-source computing environment. Our results show that our techniques 1) make these servers invulnerable to known security attacks that exploit memory errors, and 2) enable the servers to continue to operate successfully to service legitimate requests and satisfy the needs of their users even after attacks trigger their memory errors.

The paper includes descriptions of how this technique was applied with good results to servers such as Apache and Sendmail, as well as to client programs such as Pine, Mutt, and Midnight Commander.

The paper also raises concerns about the potential for such techniques to create a bystander effect (although the term moral hazard might be more appropriate here), influencing programmers to be less careful about error handling because they have a safety net.

This work was performed on programs written in C, and there's a temptation to think that the approach is only applicable to memory-unsafe languages. However, there's a connection here to the approach used by some of the classic shell scripting languages and their descendants such as Perl, in which certain kinds of failure are silently tolerated in the interests of keeping the program running. The approach could also have potential applications in other memory-safe languages, providing the potential for higher-availability programs, as noted in the paper's conclusion.

While most language implementors aren't going to rush to incorporate failure-oblivious approaches in their languages, the positive results obtained from this work are thought-provoking, and could inspire other less traditional but effective ways of handling failure.