Failure-oblivious computing

There've been a couple of threads recently about safety-critical code (Rules for, and in real-time Java). Safety-critical code necessarily includes careful handling of failure situations. Or does it? Here's another take on failure handling, from the opposite direction:

Enhancing Server Availability and Security Through Failure-Oblivious Computing (Rinard et al., 2004) was originally presented at OSDI '04, but hasn't previously been featured here:

We present a new technique, failure-oblivious computing, that enables servers to execute through memory errors without memory corruption. Our safe compiler for C inserts checks that dynamically detect invalid memory accesses. Instead of terminating or throwing an exception, the generated code simply discards invalid writes and manufactures values to return for invalid reads, enabling the server to continue its normal execution path.

We have applied failure-oblivious computing to a set of widely-used servers from the Linux-based open-source computing environment. Our results show that our techniques 1) make these servers invulnerable to known security attacks that exploit memory errors, and 2) enable the servers to continue to operate successfully to service legitimate requests and satisfy the needs of their users even after attacks trigger their memory errors.
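
To make the mechanism concrete, here's a minimal sketch of the semantics the instrumented code implements. This is my own illustration, not the authors' compiler output; assume base and size describe an object's bounds as a bounds-checking compiler would track them:

    #include <stddef.h>

    /* An invalid write is silently discarded. */
    static void fo_write(char *base, size_t size, size_t index, char value)
    {
        if (index < size)
            base[index] = value;   /* in bounds: perform the write */
        /* out of bounds: discard the write and keep executing */
    }

    /* An invalid read returns a manufactured value. */
    static char fo_read(const char *base, size_t size, size_t index)
    {
        if (index < size)
            return base[index];    /* in bounds: perform the read */
        return 0;                  /* out of bounds: manufacture a value */
    }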

The paper includes descriptions of how this technique was applied with good results to servers such as Apache and Sendmail, as well as to client programs such as Pine, Mutt, and Midnight Commander.

The paper also raises concerns about the potential for such techniques to create a bystander effect (although the term moral hazard might be more appropriate here), influencing programmers to be less careful about error handling because they have a safety net.

This work was performed on programs written in C, and there's a temptation to think that the approach is only applicable to memory-unsafe languages. However, there's a connection here to the approach used by some of the classic shell scripting languages and their descendants, such as Perl, in which certain kinds of failure are silently tolerated in the interests of keeping the program running. The approach could also be applied in other memory-safe languages, offering the potential for higher-availability programs, as noted in the paper's conclusion.

While most language implementors aren't going to rush to incorporate failure-oblivious approaches in their languages, the positive results obtained from this work are thought-provoking, and could inspire other less traditional but effective ways of handling failure.


Oblivious code or oblivious designer?

I think this paper makes a common mistake when discussing the handling of software errors and failures: it assumes that an error handling policy is global and independent of the problem domain.

If you are designing a handler for individual HTTP requests or mail transmissions, where the cost of botching or failing to complete each individual operation is low, and where human awareness of the problem and intervention are not required, this oblivious strategy has some merit.

If you are working on a part of your system where failure has more severe consequences, or that needs to let someone know what happened (even if only for debugging purposes), this approach won't work.

Ultimately, how individual error conditions are handled should be part of the requirements and design considerations, and this requires individual decisions to be made.

Designing a compiler that makes a blanket policy decision for you robs the software designer of significant design flexibility and a rightful responsibility.

Not the paper

I'm not sure that the paper makes a mistake regarding error handling policy; it's just that any reasonably simple version of the approach is necessarily global and independent of the problem domain. Note that the emphasis of the paper is on applying these techniques to existing programs, and it specifically points out that using the technique during development may be a bad idea. But if you have an existing program which has memory errors, and if this technique helps with those, it may be far more practical to apply this technique than to locate and fix the memory errors.

The idea of failure-obliviousness as a built-in feature of languages is something which the paper is careful to advise against. However, a point I made in the story post is that we've already seen similar failure-obliviousness in some languages, which can have both positive effects (robust programs) and negative ones (unpredictable programs). I think it's interesting to consider that approach from a slightly different angle.

The idea that error handling "should be part of the requirements and design considerations, and this requires individual decisions to be made" is certainly the conventional wisdom. Thorough error handling is one of the less tractable aspects of programming, though, and reaching beyond conventional wisdom in some cases could be fruitful.

Moral Hazard

Where I work, we have millions of lines of C code compiled for the HPUX environment, one that is incredibly forgiving of common memory errors. Needless to say, we're still running increasingly dated HPUX servers at greater cost than the equivalent Linux/x86 counterparts.

In production, moral hazard sometimes equals costly mistake.

Non-termination

One possibility for handling an error is certainly to ignore it, fix the state, and try again immediately or later. However, this might simply lead to another type of error, such as non-termination.

It's in the paper

The authors are aware of that.

Essentially, they perform bound checks and discard out-of-bounds writes, and for out-of-bounds reads they have a stream of predefined values to return. Bounds checking + termination prevents some security problems, but termination itself could be a security problem (e.g. denial of service). With invalid reads honored by the read-pool, there is a probability that the program may reach a stable state again.
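
As an illustration, a generator for the manufactured read values might look something like the sketch below. The exact sequence is my assumption about the flavor of the paper's scheme, not a quote of it; the idea is to return values that tend to break loops out of error states (zeros to terminate string scans, plus a growing value so a loop can't lock onto a repeating answer):

    /* Illustrative only; not the paper's actual generator. */
    static int fo_next_value(void)
    {
        static unsigned step = 0;
        static int growing = 2;         /* grows so loops can't cycle forever */
        int v;
        switch (step % 3) {
        case 0:  v = 0; break;          /* zeros end many string/loop scans */
        case 1:  v = 1; break;
        default: v = growing++; break;  /* 2, 3, 4, ... on each third call */
        }
        step++;
        return v;
    }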

(The term "memory error" in the abstract made me think of hardware memory errors. Now, that would have been something...)

Oh;

Oh; should have read the paper.

[Community Standards] A reminder

Oh; should have read the paper.

Not to single you out, Hank, but I think in light of other discussion going on about the health and future of LtU, it is worth making a broad reminder for everyone.

It is a good rule of thumb to assume that actually having read the content of the referenced item in a story, at least at a superficial level, is a pre-requisite for responding to that item.

A response off the cuff is most likely going to be a non-sequitur, and will likely just contribute to more noise and less signal on LtU.

Thanks for future consideration of this principle!

is this level of policing

is this level of policing really necessary?
[later clarification: the poster apologised and is hardly dragging the tone down with repeated questions / refusal to listen. this constant lecturing is tedious and i don't think it helps. we've listened to your various lectures, we've learnt that we are not worthy. now can you return the favor and shut up for a while?]

You have to start somewhere

now can you return the favor and shut up for a while?

Andrew, as we embark on our experiment in community standards, we are bound to occasionally disagree about what is OK and what is not.

However, I would hope that the tone I used in expressing my opinion was more civil and collegial than yours.

If you interpreted my message in some other spirit, my apologies for my part in the misunderstanding.

Re: You have to start somewhere

Marc,

I appreciate your efforts. I know policing is a painful thing to do, but necessary if the community isn't going to degrade into comp.lang.misc. Leaving it up to just Ehud and Anton will not scale.
I thought your remarks were very polite, kind of a "gentle nudging".

I recently read an interesting online book about running an open-source project. This section, where it talks about "setting the tone", is very appropriate. It talks about reining in rude behavior, but I think it applies equally well to any standard you want to maintain.

Thanks

Thanks for the vote of confidence, Jeff!

look, pal, i didn't choose

look, pal, i didn't choose you to be community police officer. and my original post was rather more direct.

again - i think you're stepping over the bounds of what is necessary.

Rights = responsibility

look, pal, i didn't choose you to be community police officer

Guess what, Andrew, if you have been paying attention to what's going on, you may have noticed that we are ALL expected to speak up when we think things have gone off the rails.

You may disagree with my position on any given call, but being uncivil is explicitly not countenanced by the new policy, and I'm not sure what you hope to accomplish by it.

If you have issues, take them up with Ehud and Anton, who work very hard to provide this medium for us, or at least express them cogently here so that we can discuss them.

I would just as soon be a lazy lurker who takes advantage of all the excellent references and does nothing in return for the community.
However, if we all do that, we won't have a community to take advantage of.

Your call...

Redirect

If this discussion needs to continue, please let's move it to the Community enforcement discussion topic I've just created, or to a new thread in the Site Operation Discussion forum.

(We now return you to our regularly scheduled programming...)

hardware memory errors

The term "memory error" in the abstract made me think of hardware memory errors. Now, that would have been something...

There's actually been some work done at Stanford on so-called Software-Implemented Hardware Fault Tolerance. The technique involves inserting various kinds of checks (runtime signatures that get checked at branch points, redundant instructions that get cross-checked against each other, and periodic scrubbing of memory protected by error-correcting codes) automatically during the compilation process. An overview of the SIHFT approach can be found here (previously linked in the thread on safety-critical java).
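
As a crude illustration of the duplicated-instruction idea (the technique behind EDDI, if I remember right), here's a hand-written sketch of what the inserted redundancy amounts to, under the simplifying assumption that duplication happens at function granularity; real SIHFT transformations are applied automatically at the instruction level during compilation:

    #include <stdlib.h>

    /* Compute a result twice and cross-check before using it, so that a
     * transient hardware fault in either copy is detected. */
    static long checked_sum(const long *a, size_t n)
    {
        long s1 = 0, s2 = 0;
        size_t i;
        for (i = 0; i < n; i++) s1 += a[i];  /* primary computation */
        for (i = 0; i < n; i++) s2 += a[i];  /* redundant shadow copy */
        if (s1 != s2)
            abort();  /* mismatch: a fault corrupted one of the copies */
        return s1;
    }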

Addressing different problems

Safety-critical code necessarily includes careful handling of failure situations. Or does it?

I think it's important to differentiate between safety in the face of failure, and availability in the face of failure. Safety-critical code is, by definition, code where safety is paramount. That is, bad things must not happen. In fact it may be preferable to do nothing, rather than to do the wrong thing. The failure-obliviousness approach seems to be more focused on availability (or "liveness") - doing the wrong thing every now and then is OK, so long as something happens. The authors explicitly address this point in their paper:

One of the reasons that failure-oblivious computing works well for our servers is that they have short error propagation distances — an error in the computation for one request tends to have little or no effect on the computation for subsequent requests.

In a safety-critical system an error in the computation of a single output value may be sufficient to cause the whole system to fail.

Or does it?

Or does it?

Those three words were intended entirely tongue-in-cheek. Sorry not to have made that clearer. The connection to safety-critical code is simply that failure obliviousness is roughly at the opposite end of a spectrum of failure-handling strategies from the careful failure handling required by safety-critical code.

Related paper: "Exploring the Acceptability Envelope"

I read and discussed a related, more recent paper in a seminar class this spring. I'll relate here a couple of points I made about it at the time. Obviously, these are just my opinion:

  • This approach might be considered unethical according to the ACM Code of Ethics, section 1.2, especially paragraph 3. "Manufacturing" legal memory reads seems particularly suspect.
  • The essential claim of the paper is that writing correct programs is too hard and, therefore, we should abandon this as a goal; furthermore, we should redefine success to make ourselves feel better about our previous failure.
  • I believe there are still some non-inherent difficulties we impose upon ourselves by using broken programming languages.
  • So, while there's no silver bullet, reliable program correctness is still possible; we just need to abandon languages and techniques that make programming more difficult than it has to be.

As you probably gathered, I'm opposed to what I view as the underlying assumptions of the paper. OTOH, I do think there are some good ideas to be picked out. Continuing in the presence of an unexpected situation (rather than aborting) may be the "right" thing to do in some cases, especially situations involving human interaction, i.e., GUIs. I'm personally very skeptical about failure-oblivious computing in a non-interactive setting, however.

creating a problem for the solution

the paper fails to recognize that an existing technique like exception handling provides all the advantages of failure-oblivious programming.

it is secure given a consistent throwing policy, provides high availability given consistent exception handling, and also carries a near-minimal adoption cost if the desired net result is safety to the extent of the proposed scheme.

more so, exceptions are superior because they still allow for controlled error paths, and at only a small increase in adoption cost if the system design is well adapted to exceptions. the decision between uncontrolled continuation and abort remains in the hands of the designer, which is critical for many applications where denial of service is less fatal than faulty operation.

the conclusion that aggressively thrown exceptions, in contrast to failure-oblivious programming, may decrease availability is plain wrong, and shows that the authors did not consider the potential of established error handling methods.

that being said, i do see the successful application of such a technique to the vast body of existing legacy code.

i may have misread the paper's implication, but encouraging failure-oblivious programming by design over proven, superior techniques seems wrong to me.

In addition, not in the stead of?

(Before I get started, I'm pretty new to LtU, although I've been reading for a long time; also, I *think* I understood the paper, but I might not've.)

I don't think the authors are proposing that memory errors be solved solely by making up values (indeed, they specifically point out that this would really work only in applications where each high-level 'thing' a server does is fairly independent of the others, HTTP requests being one such case and numerical calculations not), but that this technique be used as a last resort, in addition to all the regular error checking, bounds checking, and exception handling that is usually done, so that if something *does* slip by, at least there's a decent chance it will be taken care of.

[...] but that this

[...]
but that this technique be used as a last resort, in addition to all the regular error checking, bounds checking, and exception handling that is usually done, so that if something *does* slip by, at least there's a decent chance it will be taken care of.

the paper specifically suggests that failure-oblivious programming has the potential to allow ignoring memory access faults.

the analysis is put in the context of posix c systems, which traditionally fail hard on illegal and soft on out-of-bounds accesses. it ignores the fact that other systems (e.g. any bounds-checked array implementation, windows SEH [to a certain degree], java [yuck] NullPointerException, etc.) do not fail at all in such circumstances but allow for controlled error handling.

in systems consistently applying such techniques, there is no way for such failures to 'slip' through. failure-oblivious programming seems to be orthogonal to other techniques which already serve this purpose.

by the way, there are a lot of existing scripting languages that have long implemented failure-oblivious programming by providing fake values on error conditions. none of these are known as particularly fit for the paper's claimed applications, like high availability or security. just some food for thought...

bystander effect? ==> risk homeostasis

What is here called the "bystander effect" or "moral hazard" is discussed under the term "risk homeostasis" @ Damn Interesting. In summary, people have a level of acceptable risk, and adjust their behavior based on what they know of their environment to be at that level of risk. Antilock brakes "cause" people to drive faster in bad weather, etc.

it's true

It's also known as the "Law of Conservation of Pain". The big question is: would such moral hazard be beneficial?

Security and new attacks

The enhanced security was all against attacks on the original programs -- I missed any mention of new attacks this change might open up.

using out-of-bounds pointers

My commentary on the paper is too long for my time now, so I'll just note an odd opinion in one section. I strongly disagree with the following if it means what I think it does:

We note that two of our servers (Pine and Midnight Commander) use out of bounds pointers in pointer inequality comparisons. While this is, strictly speaking, an error, the intention of the programmer is clear. To avoid having these errors cripple the Bounds Check versions of these servers, we (manually) rewrote the code containing the inequality comparisons to eliminate pointer comparisons involving out of bounds pointers.

Are the authors claiming it's an error to compare two random pointers? Comparing pointers is never an error just because of the pointer values. If I read this right, the authors don't like server code that compares pointers when either pointer is technically not in mapped space.

There's nothing wrong with that. Until you dereference a pointer (to read or write) you can use any pointer value you like and treat them almost like integers for arithmetic purposes. Does the paper suggest this isn't so? I really hope not. (I've had other programmers make the same objection about my code in the past, based on the faulty belief it's not legal to make a pointer to space before or after allocated space, even if you don't access the memory location.)

My standard boilerplate for looping over arrays almost always uses a pointer to the element following the last in an allocated array; the pointer (cursor) for an item in the array must be less than this value, so a pointer comparison occurs once per loop test. Even if dereferencing bytes after the array would cause a segfault, it's still legal and normal to use the address in arithmetic. Do the authors object to this usage?
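
For concreteness, the idiom looks like this (a sketch; C89 and C99 both explicitly permit forming and comparing the one-past-the-end pointer, so long as it is never dereferenced):

    #include <stddef.h>

    void increment_all(int *arr, size_t n)
    {
        int *end = arr + n;   /* one past the last element: legal to form */
        int *p;
        for (p = arr; p < end; p++)  /* pointer comparison once per test */
            (*p)++;                  /* only in-bounds dereferences */
    }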

C89 did not define what

C89 did not define what happens with general out-of-bounds pointers, but explicitly did allow pointing just past the end of the array, so your code there is OK. (I got this from P.J. Plauger's fine book The Standard C Library, not the standard, which I haven't read.) Apparently all bets are off outside those ranges in portable code because there are some weird pointer representations and other architectural quirks, e.g. in embedded systems.

Comparing pointers is never

Comparing pointers is never an error just because of the pointer values. ... There's nothing wrong with that. Until you dereference a pointer (to read or write) you can use any pointer value you like and treat them almost like integers for arithmetic purposes. Does the paper suggest this isn't so? I really hope not.

I believe this is the case (see this post to comp.lang.c.moderated with message-id <clcm-20020802-0022@plethora.net>, searchable on Google Groups).

Essentially, if you have

    type *foo, *bar;

and stuff happens so that foo is invalid and bar is not, then the equality tests

    foo == NULL, foo != NULL, foo == bar, foo != bar

are all valid code.

Arithmetic tests invoke undefined behavior, so foo++, foo >= bar, etc. are all undefined. On some architectures (e.g. ones with a nonlinear address space) pointers can have a very odd format and cannot be treated as integers.

A pointer to one past the end of an array is always valid for comparison, though it cannot be dereferenced.

-Ed

Pointer as object identity?

Until you dereference a pointer (to read or write) you can use any pointer value you like and treat them almost like integers for arithmetic purposes.

I suspect there might be cases where comparing pointers as integers leads to an error.
Consider an object system: if you create an object, squirrel away a pointer to it, then the object is deallocated and a new one created at the same address (all this possibly over the execution of a big program), is it an error to compare the first pointer with the second and infer that they point to the same object?

I realize this scenario is based on a dangling pointer, and is impossible with a garbage collector or other memory-tight mechanism, but then would we still be talking about pointers and not about references?
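
A sketch of the scenario (note that even comparing the stale pointer is technically undefined behavior in C, which is part of the point):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int *first = malloc(sizeof *first);
        free(first);                  /* first is now a dangling pointer */
        int *second = malloc(sizeof *second);
        /* The allocator may reuse the address, so this can be true even
         * though the two pointers never identified the same object. */
        if (first == second)
            printf("same address, different object\n");
        free(second);
        return 0;
    }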

might sound complex (sorry)

I know I let myself into this by putting the word almost at the start of "almost like integers" without clarifying how they are different. (In my defense, I was lazy and it's a can of worms. :-) Launching into a discussion of alignment and scaling, etc, made me blanch. My point was really that objects referenced by pointers in C don't need to exist for legal use of the pointers, if you don't try to use the objects at the nominal addresses.

Andris Birkmanis: I suspect there might be cases where comparing pointers as integers leads to an error.

Yes definitely (since the critical almost proviso is now gone :-). It's not sensible to compare all pointers, even if it might be legal and non-crashing. This might be a good time to bring up a notion of two (or three) different classes of error:

  • statically illegal (doesn't compile)
  • dynamically illegal (compiles, but runtime crashes)
  • always legal, but nonsense results

I'd interpreted the paper's definition of error as not legal (one of the first two listed above) based on their runtime enforcement of acceptable pointer values.

So when I said it was okay to compare pointers I meant always legal even though nonsense can still result when comparing some pointers, to which I assume you're alluding.

Besides a desire to refer to the (virtual) array member just after the last in a vector, I also sometimes have need to calculate addresses of things where they might be, even if they are not yet or not currently in memory, or not currently assigned to any definite range of address space.

In other words, it should be both legal and sensible to do address arithmetic on object addresses if the addresses are those where objects might feasibly be placed, even if they are not currently placed in those locations.

valid pointers

Rys David McCusker: My point was really that objects referenced by pointers in C don't need to exist for legal use of the pointers, if you don't try to use the objects at the nominal addresses.

In other words, it should be both legal and sensible to do address arithmetic on object addresses if the addresses are those where objects might feasibly be placed, even if they are not currently placed in those locations.

Actually, i don't believe that it is (and i think Edward made this point earlier). The c89 standard has this to say:

3.3.6 Additive operators
...
Unless both the pointer operand and the result point to a member of the same array object, or one past the last member of the array object, the behavior is undefined.

That is, pointer arithmetic is only valid within an array, or one past the array. Anything else invokes undefined behaviour (potentially dynamically illegal, by your hierarchy), although i believe that it is unlikely to cause any trouble on common architectures.

[on edit: apologies if i came across as being too prescriptive, i was just trying to point out that what seems sensible in c may not be entirely legal]

okay I'm done

I've no vested interest in commenting on the c89 standard, or on motivation for such phrases in the standard to preserve air for odd architectures, or on variance between standards and practice, or on expected similarities in capabilities in assembler and C.

Thanks for making clear where the paper authors were probably coming from. (Now I'm going back to my own knitting.)

Very bad approach

A reasonable way to deal with unexpected failure is to use a safe language, let it throw exceptions, and put exception handling and logging around self-contained complex sections so an unexpected error doesn't abort the whole program if it doesn't have to (i.e. if it didn't corrupt shared data fatally).

The worst approach to unexpected errors is to pretend that nothing happened and just return garbage data. An error should be logged and shown to the user, so it can be fixed later, without causing more immediate inconvenience than it has to. Ignoring errors lets them remain undetected, and from time to time the corruption will manifest and cause loss of data.

Fixing the C language to become safe is doomed to failure.

CERT

Fixing the C language to become safe is doomed to failure.

Secure Coding in C and C++, Robert C. Seacord, 2006

Fixing the C language

Fixing the C language is not going to happen. The primary goal of the C standards body is not to break existing code. The idea behind "Secure Coding in C and C++" is fixing C language programmers.

There is also a new effort at www.securecoding.cert.org to define and establish secure coding practices for C and C++. Please take a look, and perhaps contribute a rule or recommendation.

These rules are for C/C++ beginners

Many of the recommendations and rules (e.g. the whole section about arrays) are very basic C/C++ knowledge (e.g. that sizeof(arr) is the size of a pointer rather than the size of an array if arr is declared as a pointer, or that you may not free the same pointer twice).
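
For example, the sizeof pitfall mentioned above (a minimal sketch):

    #include <stdio.h>

    static void f(char *buf)            /* array argument decays to a pointer */
    {
        printf("%zu\n", sizeof buf);    /* size of a pointer, e.g. 8 */
    }

    int main(void)
    {
        char arr[64];
        printf("%zu\n", sizeof arr);    /* 64: the size of the array */
        f(arr);
        return 0;
    }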

I expected more coding standards that would make C and C++ somewhat safer by unlearning some bad habits, rather than facts that every C and C++ programmer must know anyway and should never have thought otherwise.

Beginners who learned C/C++ by trial and error or by looking at existing programs, and who might have inferred false assumptions, may find such a place valuable. But I suppose that when learning systematically, these rules are already known by the time the respective language constructs become familiar.

I still claim that promoting safer languages leads to a better quality of code than trying to write correct software in C.

safer languages vs. writing safe software

Some of these rules are very obvious, many are not. Take a look at the integer section. I've been giving a three hour presentation on integers for about a year now and I can tell you that most C/C++ programmers do not understand the rules that govern their behavior.

Now the surprise... I am not going to disagree with you that promoting safer languages leads to better quality code. There are, however, many reasons for selecting a programming language, and safety and security are not always at the top of the list. As a result, we need to help developers who are required to write in C and C++ follow safe/secure coding practices.

I also think that the most convincing argument for using a safer language is "sure, you can write safe/secure code in C or C++... just read these 6 volumes and apply these 2,000 rules and recommendations and you should be better off".

Failure Oblivious or Crash Only?

Which would be safer?

My instinct is that the sheer brute-force simplicity of Crash-Only Software would win.