Why do computers stop and what can be done about it?

by Jim Gray, via Joe Armstrong and apropos Crash-only software.

An analysis of the failure statistics of a commercially available fault-tolerant system shows that administration and software are the major contributors to failure. Various approaches to software fault-tolerance are then discussed -- notably process-pairs, transactions and reliable storage. It is pointed out that faults in production software are often soft (transient) and that a transaction mechanism combined with persistent process-pairs provides fault-tolerant execution -- the key to software fault-tolerance.
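To make the abstract's last claim concrete, here is a toy, single-process sketch of the idea: a transaction either commits all of its writes or none, so when the primary hits a transient fault, a backup that starts with no memory of the failed attempt can simply redo the work. This is only an illustration under stated assumptions, not the mechanism of the system analysed in the paper; `Store`, `Worker`, `transferWithPair` and the simulated fault counter are all hypothetical names invented here, and a real process pair would run on separate processors.

```java
import java.util.HashMap;
import java.util.Map;

public class ProcessPairSketch {

    // Stand-in for reliable storage: only committed state lives here.
    static class Store {
        private final Map<String, Integer> committed = new HashMap<>();

        int read(String key) {
            return committed.getOrDefault(key, 0);
        }

        // Apply all writes of one transaction together.
        void commit(Map<String, Integer> writes) {
            committed.putAll(writes);
        }
    }

    // A worker buffers its writes privately, so a crash before commit loses them.
    static class Worker {
        private final Store store;
        private int faultsLeft; // assumption: number of simulated soft faults

        Worker(Store store, int faultsLeft) {
            this.store = store;
            this.faultsLeft = faultsLeft;
        }

        void transfer(String from, String to, int amount) {
            Map<String, Integer> writes = new HashMap<>();
            writes.put(from, store.read(from) - amount);
            if (faultsLeft > 0) {
                faultsLeft--;
                // Crash mid-transaction: nothing has reached the store yet.
                throw new IllegalStateException("simulated soft fault");
            }
            writes.put(to, store.read(to) + amount);
            store.commit(writes);
        }
    }

    // Try the primary; on failure the backup redoes the transaction from scratch.
    static String transferWithPair(Worker primary, Worker backup,
                                   String from, String to, int amount) {
        try {
            primary.transfer(from, to, amount);
            return "primary";
        } catch (IllegalStateException fault) {
            backup.transfer(from, to, amount);
            return "backup";
        }
    }

    public static void main(String[] args) {
        Store store = new Store();
        store.commit(Map.of("a", 100, "b", 0));
        Worker primary = new Worker(store, 1); // fails once
        Worker backup = new Worker(store, 0);  // never fails
        String servedBy = transferWithPair(primary, backup, "a", "b", 30);
        System.out.println(servedBy + " a=" + store.read("a") + " b=" + store.read("b"));
    }
}
```

The point of the sketch is that the backup needs no checkpoint of the primary's in-flight work: because the aborted attempt left the store untouched, a retry is safe.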



Kinds of problem

What I find interesting here is considering which kinds of problems scare you the most and then focusing on how to reduce those. In some applications the scariest thing could be breaking invariants on complex data structures (e.g. a compiler generating the wrong code), and then a static type system could be a really big winner. In other systems the worst problems will be mistakes by users, and in that case it might be best to focus on writing a good user interface. There doesn't have to be only one answer, of course.

What kind of programs do you write and what are the problems that worry you the most?

Static vs Dynamic

As a coder, I personally worry more about static or nearly-static constraints (e.g. loop/object invariants, lazy linking or lazy evaluation, etc.).


As a researcher, I am concerned about problems arising in big (distributed) systems: say, making sure that components which do not necessarily fully trust each other adhere to a communication protocol. In this case, purely static checking is just not possible. On the other hand, it is possible to perform most checks statically and to add assertions, manually or at compile time, that dynamically enforce type contracts wherever necessary. I believe that the best known examples of combined dynamic and static typechecking are

  • Proof-Carrying Code
  • Typed (un)marshalling, as implemented in Java.

Typed (un)marshalling in Java

> Typed (un)marshalling, as implemented in Java.
What did you mean by that, RMI? I am honestly surprised by the statement, as I have stumbled upon class-loading idiosyncrasies more than once.

Ah, now I have re-read your post, and it seems that you gave the best-known examples, and not the best of the known examples.

(Un)marshalling

I must admit that I do not know the mechanisms of RMI in Java, so I do not know whether we understand the same thing by marshalling.

Marshalling = serialization. Java's (un)marshalling is typed insofar as the type of objects is written to the stream along with the content of the objects. When an object is unmarshalled, its type is checked, resulting in a ClassCastException if the type is wrong.
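A minimal sketch of that behaviour using the standard `java.io` object streams. The helper names `marshal` and `unmarshal` are made up for this example; the explicit `Class.cast` stands in for the cast a caller would normally write after `readObject()`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

public class TypedUnmarshalDemo {

    // Serialize a value; the stream records its class along with its content.
    static byte[] marshal(Object value) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(value);
        }
        return buffer.toByteArray();
    }

    // Read the value back and check it against the type the caller expects.
    static <T> T unmarshal(byte[] bytes, Class<T> expected)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            // Class.cast throws ClassCastException if the object has another type.
            return expected.cast(in.readObject());
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = marshal(Integer.valueOf(42));
        System.out.println("as Integer: " + unmarshal(bytes, Integer.class));
        try {
            unmarshal(bytes, String.class);
        } catch (ClassCastException e) {
            System.out.println("rejected as String: " + e.getMessage());
        }
    }
}
```

Note that the check happens at the cast, after the object has already been rebuilt from the stream; `readObject()` itself only returns `Object`.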

In OCaml, by contrast, the binaries contain no type information, so no such thing happens. If the program believes that it is reading a character string while it is actually reading an object, well, type safety is lost and anything can happen (as in C++).

If I recall correctly, Acute, on the other hand, offers typed (un)marshalling.

As for the notion of "best known", you got it right the second time. I'll try to be more understandable next time :)

(De)serialization

> When an object is unmarshalled, its type is checked, resulting in a ClassCastException if the type is wrong.
The default (standard) (de)serialization in Java writes/checks only the name of the class as the type of the object. While better than nothing, I find this to be inconsistent with the usual[1] notion of type in Java (a pair of class name and class loader). Not sure what would be a better solution, though.

I mentioned RMI because, in its narrower context, Java provides more checks than it does for serialization in general.

> I'll try to be more understandable next time :)
Don't worry, I will not use this as a proof of the Sapir-Whorf Hypothesis :-)

on edit:


[1] I mean usual at runtime; at compile time the class name alone is used. I regard this inconsistency between runtime and compile-time type checks as a major smell.

I see

Ok, thanks for the lecture :)

What should the presence of the class loader change, though, once the object is in memory?

Several things

Every object has a reference to its class, which can be thought of as a pair of class name and class loader. No two distinct classes can share both the same name and the same class loader, so this influences type equality, for one. Also, class loaders are units of security. Finally, the class loader of an object's class is the default one used to load the classes it refers to, so it does influence the future of the computation even after the actual loading is finished.
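The type-equality point can be seen in a small experiment: load the same class file through a second loader and compare the results. This is a sketch under assumptions: `Payload` and `IsolatingLoader` are hypothetical names made up here, and it relies on the compiled class file being reachable as a resource through the original loader.

```java
import java.io.IOException;
import java.io.InputStream;

public class LoaderIdentityDemo {

    // A trivial class that will be loaded a second time by another loader.
    public static class Payload { }

    // Defines the target class itself instead of delegating to its parent.
    static class IsolatingLoader extends ClassLoader {
        private final String target;

        IsolatingLoader(String target, ClassLoader parent) {
            super(parent);
            this.target = target;
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve)
                throws ClassNotFoundException {
            if (!name.equals(target)) {
                return super.loadClass(name, resolve); // usual parent-first path
            }
            synchronized (getClassLoadingLock(name)) {
                Class<?> loaded = findLoadedClass(name);
                if (loaded == null) {
                    String resource = name.replace('.', '/') + ".class";
                    try (InputStream in = getParent().getResourceAsStream(resource)) {
                        if (in == null) {
                            throw new ClassNotFoundException(name);
                        }
                        byte[] bytes = in.readAllBytes();
                        loaded = defineClass(name, bytes, 0, bytes.length);
                    } catch (IOException e) {
                        throw new ClassNotFoundException(name, e);
                    }
                }
                if (resolve) {
                    resolveClass(loaded);
                }
                return loaded;
            }
        }
    }

    // Load a second copy of Payload through a fresh IsolatingLoader.
    static Class<?> loadIsolatedCopy() throws ClassNotFoundException {
        String name = Payload.class.getName();
        ClassLoader original = LoaderIdentityDemo.class.getClassLoader();
        return new IsolatingLoader(name, original).loadClass(name);
    }

    public static void main(String[] args) throws Exception {
        Class<?> copy = loadIsolatedCopy();
        Object instance = copy.getDeclaredConstructor().newInstance();
        System.out.println("same name:  " + copy.getName().equals(Payload.class.getName()));
        System.out.println("same class: " + (copy == Payload.class));
        System.out.println("instance of original Payload: " + Payload.class.isInstance(instance));
    }
}
```

The two classes share a name but are distinct types at runtime, so an instance of one is not an instance of the other. Since the serialization stream carries only the class name, this is the gap between the two notions of type discussed above.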

PS: I feel my contribution to this thread is becoming more Java-specific than I wished. Sorry.

until they upgrade

> with persistent process-pairs provides fault-tolerant execution

What scares me about these architectures is all the difficult-to-handle cases that come up when doing a live upgrade.

The persistent process pair is also constrained by the speed of the hard drive, which means it isn't really usable in its full glory by a lot of systems.