Rules for Developing Safety-Critical Code

In the June 2006 Issue of IEEE Computer (Volume 39, Number 6) Gerald J. Holzmann of the NASA/JPL Laboratory for Reliable Software authored "The Power of 10: Rules for Developing Safety-Critical Code" on pages 95-97.

I don't have an online link to the article text, but it can be summarized as:

Rule 1: Simplify Control Flow by Banishing Recursion.

Rule 2: Set a fixed upper bound on all loops, excluding non-terminating event loops.

Rule 3: Banish Dynamic Memory Allocation and Garbage Collection.

Rule 4: Restrict each function's size to around 60 lines of source code.

Rule 5: Make liberal use of assertions to test any condition that can't be statically guaranteed.

Rule 6: Use static scoping to hide "data objects" to the greatest extent possible.

Rule 7: Check all return values and caller supplied parameters.

Rule 8: Banish any significant macro use (like "token pasting", "variable argument lists", and "recursive macro calls") beyond header file inclusions and simple definitions.

Rule 9: Banish handles, macro-driven pointer manipulation, and function pointers while restricting pointer use to one level of dereferencing.

Rule 10: Continuously recompile all code with all compiler warnings turned on and ship no code until all warnings are eliminated and it passes strongly typed static analysis.

One gets the sense that these strictures are informed by life in the C/C++ discourse community, but they do raise deeper questions of whether the dynamic world and functional programming in general can support Safety-Critical Code. Could literate programming techniques be leveraged to further improve the reliability of such code?

Could we in effect replace these 10 rules with:

Rule 1: Code in Scheme, Haskell, or F#.

Rule 2: Embrace Literate Programming.


Reducible to One Rule

1) Quit programming in C.

Amen, brother!


Um, no

Could we in effect replace these 10 rules with:

Rule 1: Code in Scheme, Haskell, or F#.

Rule 2: Embrace Literate Programming

It looks like the authors are looking to enforce hard termination, progress, and resource usage guarantees, which vanilla Scheme and Haskell certainly can't give (don't know about F#).

Come to think of it, I don't really believe that Scheme has any of the features I would expect in a language for safety critical programming, other than a highly-skilled average programmer level (which can admittedly cover a multitude of sins). Worse, it includes a bunch of features I wouldn't let anywhere near a nuclear reactor monitoring system or anti-lock brake controller (continuations first and foremost).

On a meta-level, when someone says that proper programming practices can be reduced to simply "use language X", they're showing precisely the sort of obliviousness that indicates that they shouldn't be working on a safety-critical project.

I hope I didn't just feed a troll.


It looks like the authors are looking to enforce hard termination, progress,...

Any time you look at the debate going on within a PL (as opposed to across PLs), you get arguments that are applicable only within that language. Take the first rule about recursion. That particular recommendation sounds ridiculous if you were talking about languages like Scheme, ML or Haskell. But then they have a little thing called tail call optimization, so recursion can be done without causing the resource usage to explode. From a C perspective, recursion can cause some unpredictable results. So, even though some of these recommendations might not make sense for other PLs, they have to be judged on how they impact code written in C.

I agree that Scheme and Haskell might be out of place in the real time domain, but as Ehud says, Ada is probably the competitor in this space for getting both the guarantees and the required safety.
I hope I didn't just feed a troll.
Just as a meta discussion point, I think Peter did the proper thing, just like the recent post that asks the question of why FP matters. Specifically, an external reference was given that could act as the focal point of the discussion, independent of the opinions that are expressed on the subject. In other words, even if the opinion might be controversial, the thread has some grounding that goes beyond that. If he had just started a thread and expressed an opinion that C should be abandoned in favor of some other language, without a reference to an article, then I would interpret it differently.

Online version, Targeted at C

I've located an online version of the article.

The advice is targeted at C. From the introduction:

At many organizations, JPL included, developers write most code in C. [..] For fairly pragmatic reasons, then, the following 10 rules primarily target C and attempt to optimize the ability to more thoroughly check the reliability of critical applications written in C.

Also I'm not sure about the definition of safety-critical. The advice given seems to lean towards real-time systems. For example, this remark:

Memory allocators, such as malloc, and garbage collectors often have unpredictable behavior that can significantly impact performance.

I can imagine safety-critical systems where a delay for garbage collection is acceptable.


On a more serious note than my first reply, these are all good points. It's perhaps worth reminding people of MLKit, which is explicitly intended for real-time programming and is the poster child for "region inference" as a stratagem for avoiding GC. However, more recent work suggests that region inference can lead to up to a 5x (!) increase in working-set size, so "real world" systems tend to need both region inference and an infrequently-used real-time GC to keep the working-set explosion manageable.

In any event, unsurprisingly, even with tongue removed from cheek, my reaction remains that it's well past time to think about alternatives to C for "safety-critical" work (indeed, my reaction to the thought of C for "safety-critical" work is a combination of horror and revulsion).

Hokay that puts a different slant on things...

  • The OP stated that these were "Rules for Developing Safety-Critical Code". Reading the original article, it is clearly "These are the minimal, best-bang-for-your-buck, easily verifiable rules for writing (not designing) safety-critical code in C".
  • the OP misstated and severely truncated several of the rules from the article.

Within the severely narrower scope of the original article, I agree with it.

Within the larger scope of "Rules for Developing Safety-Critical Code" there is, as I said, very little that I can agree with.

The Definition of Safety-Critical Code & A Modest Proposal

In the OP I was trying to summarize the full article without outright copying. I tried to compress the rules as much as possible and apologize for any potential inaccuracies or perceptions of a desire to present my personal opinions as having any greater intrinsic merit than those of anyone else here.

I agree with your characterization of how the original article should be described.

I think what tweaked me the most was the implicit notion that Safety-Critical Code = Severely Resource Constrained Real Time Code.

Consider an application to determine the correct dosage of radiation to administer to a patient. Would this be Safety-Critical? I would say yes. But the application could probably afford to take hours of real time and be run in an environment with virtually unlimited primary and secondary storage.

In this sort of scenario, it is getting the right answer that matters most, and IMHO a higher level language could greatly improve the odds of achieving this end. Indeed, depending on the time scales at play, even an airplane, flood-management, or nuclear reactor control system with hundreds to thousands of lives in the balance might fall into this category for some classes of decisions.

Likewise, I am not yet convinced that a more feature rich language like Scheme couldn't be used with sufficient rigor and self-discipline under an appropriately optimizing implementation to perform acceptably and reliably in a Safety-Critical Resource Constrained setting.

Indeed, I get the sense that some people don't believe it is possible for a program NOT to have errors, and that they take the insolubility of the halting problem as proof that no program can be shown to halt. When dealing with Safety-Critical Code, the question shouldn't be whether it was developed in a language in which something could go horribly wrong, but rather whether the actual code in question could go horribly wrong, which could happen under C even if we adhered to the rules in the article, and which might not happen under Scheme even if we violated them.

So although I do recognize that it may be harder to perform a static analysis of a higher level language than a lower level one, I don't think that a higher level language will lead to code that is inherently less trustworthy than that written in a lower level language.

But every interesting implementation of a higher level language seems to have a license caveat that it not be used for anything Safety-Critical!

So here is a modest proposal:

What I really would like to see would be side by side comparisons. Couldn't we identify a Safety-Critical Task and invite teams of programmers from LtU to compete at solving it using *any* tools of their choice. Then see which team's implementation in which language of choice has the fewest bugs and best worst case resource consumption.

Otherwise, we will never get beyond opinion.

Then if, as I suspect, a team using a higher level language like Haskell or some cutting edge Ada variant wins, we might want to rethink whether feature-restricted C variants are the right tools for all such tasks, which seemed to be the position espoused in the article.

I think that's a good idea,

I think that's a good idea, but it's hard to do in practice. For toy problems, there's likely not to be a difference. What's needed is a large enough problem that we can see if Haskell (for example) hangs on when you're dealing with a dozen programmers and hundreds of thousands of lines of code. Right now, it looks like Erlang is one of the few higher-level languages to be beaten on in that way. I guess it's worth mentioning that Erlang came out of exactly the kind of testing you describe.


..some cutting edge Ada variant...

It's probably worth noting that Praxis High-Integrity Systems, one of the UK's major developers of safety-critical code, actively uses and develops the SPARK variant of Ada.

Tried and tested

some cutting edge Ada variant wins

I think it is important to realize that the Ada world has been doing these things for many years now, and the result is many successful production systems. Even SPARK isn't a new cutting edge idea, but a tried and tested set of tools.

This is highly important, since you wouldn't want your mission critical system to be a test case for new and untested technology.

Feature restriction

Others have mentioned SPARK, which is one of the leading languages for safety critical software. It's worth noting, however, that, far from being a "cutting edge variant of Ada", SPARK is a feature-restricted subset of Ada95. Similarly, the likely candidate for a good safety-critical Haskell-type language would be HasCASL, which uses a feature-restricted subset of Haskell. How easy the code is to analyse really matters for safety critical code. As you move up the scale of criticality, higher level languages, which allow for much more efficient, clear, and succinct expression, look better and better. There is a point (and I can't say where that is) where the curve tips and the rigour required in terms of analysis means feature-rich languages start getting harder to analyse. When you get to real safety critical code, where you expect completely rigorous machine-assisted mathematical analysis, having a feature-light language looks ever more appealing.

How about the Van Roy and Haridi Approach?

Could we have our cake and eat it too if we use a kernel language approach like Van Roy and Haridi advanced in Concepts, Techniques, and Models of Computer Programming?

In other words, could we achieve rigorous analysis for Safety-Critical Code by translating higher level features into a smaller kernel language more amenable to automated reasoning instead of simply dispensing with them altogether?

It seems to me that restricting features at the language level will just force programmers to re-implement some of them at the program level on an ad hoc basis, which might conceivably lead to more rather than fewer errors. The comment on the radar trace example would be a perfect case in point.


Actually, the Spark Ada restrictions (if I recall correctly) make the language strictly less expressive than full Ada, so it's more than just having a minimal core that can encode all the other constructs (precisely the opposite, in fact).

Of course. The point is that

Of course. The point is that some constructs are problematic if you need to be able to reason about timing and/or space. That's why OO features are problematic, for example: a call depends on the type of the argument (because of dynamic dispatching), and so reasoning about timing becomes a problem. BTW, we had a couple of threads on SPARK which probably contained more details.

Just going over my little list of Rules for Safety Critical....

...and tool choice didn't feature near the top at all.

Sort of like a competition to design the best, safest heating system for the Titanic.

My rules may well apply in terms of choosing or designing and building safety critical tools.

I think the answer they are

I think the answer they are looking for is Program in Ada. (see this for example)

And this for an entertaining

And this for an entertaining reading.

Haskell, Scheme, etc., don't solve all problems

If your goal is correctness of the source code above all else, then Haskell makes a lot of sense. But in engineering, especially for realtime embedded systems, you need more than that. You need guarantees about things like:

* The system will not spend nondeterministic periods of time in system routines (especially dynamic memory allocation).
* You won't run out of memory because of hazy issues, like fragmentation in a system without virtual memory. (And with Haskell you have to be very careful to avoid memory blowups caused by lazy evaluation.)

In many cases you *can* use higher-level languages, but there are lots of cases where you can't.

Rule #1: provide good complete spec

And start by writing the tests from the specification, fixing the specification when writing the tests shows that it is unclear.
Review the tests as you would a program, to ensure that everybody agrees on the tests and the specification.

And then, and only then, start the programming phase, reviewing the code on each part as soon as possible to ensure that no sloppy code is accepted.

The language doesn't really matter as long as you have developers experienced in the language and the review ensures that the readability of the code is good.

The rules given above are just implementation rules, the process matters more than the implementation rules.


Experience of using a lightweight formal specification method for a commercial embedded system product line

These guys talk about how they could deliver an embedded product in record time, using the strategy of getting a great spec first. They consider the problem of doing specs that are actually consistent and correct, not only complete. Then they show a method for doing good specs, a method that led to a tool: Statestep


Recursion is a big problem too. In pure functional programming it seems impossible to abandon recursive code. But since it is hard or impossible to calculate and verify upper memory bounds for general recursive algorithms, recursion can't be allowed in safety-critical code.

With tail recursion/optimization this can be avoided but is it really feasible to disallow all non-tail-recursive code? Or are there other loopholes out of this dilemma?

And, when do you know it is/not tail-recursive?

It has been a while since I've used FP full time, but I sure don't recall getting nice warnings when I wrote something that was stupidly not amenable to tail-call optimization.

If it was a big issue with FP...

...then it'd be easy enough to have the compiler spit out whether TCO was applied or not to any particular function. That is, tail call is something that can be statically analyzed, so getting the tools to report that property is well within technical feasibility. As for why that's not done yet, well, FP is not in big demand in the real time embedded space, so it's not a big priority.

More generally, inductive reasoning is the main form of logic for reasoning about program reliability. Although the tools for formal verification have a way to go, it's still within reason to expect that the ability to reason about code is perhaps our number one criterion for judging reliability. In the sense that most inductive proofs are stipulated in terms of recursion, I find that the call to limit recursion may give us higher reliability in terms of resource usage, at the cost of making the code harder to follow.

FP and embedded systems

This is a probably a good time to mention Malcolm Wallace's PhD thesis on functional programming and embedded systems. In particular, Wallace says in his conclusion that

In the introduction (§1.2) it was mentioned that a functional language may need to constrain all recursion to be tail recursion. All the looping combinators we have used are indeed tail recursive, so although the language has not enforced the constraint, the applications have. The question of whether to formally require such a constraint remains open however.
For safety critical systems I would assume that relying on programming conventions that use only tail-recursive constructs would be feasible (reliance on convention seems to be the approach taken when using C), but the ability to provide guaranteed enforcement of tail-recursion would be preferable. It's interesting that Erlang encourages, but doesn't require, tail recursion.

Tail call with respect to what?

An expression can be in a tail position with respect to some other expression, most importantly wrt. a function body. But with nested functions only whether it's a tail call wrt. the innermost function is decidable statically. For other functions it depends on how the nested function is used. But the innermost function is not necessarily the most important function in this respect; it might be a humble lambda which participates in a loop only indirectly, and being in a tail call position wrt. the lambda doesn't imply that the outer loop runs in constant space.

This matters when higher order functions are used a lot. Which happens to correlate with the importance of tail calls—that is, with functional style. At least in my style in my language annotating tail calls would be useless. It's usually easy to see which calls are in a tail position, and in cases where it's not, it often depends on the semantics of functions used so the compiler would not know that either.

Not to mention that forcing tail recursion at the cost of building a temporary structure on the heap is advantageous only in runtimes which put artificial limits on the stack size.

Unless I'm mistaken...

...TCO is a compiler function, so I'm assuming that just as the compiler can figure out which return paths can be optimized, it can also figure out which ones it can't optimize. The HOFs are also compiled, as they are not dynamically generated in languages like ML and Haskell, so I would assume that the compiler also has static knowledge. Of course, not being familiar with the internals of these compilers, I'm just engaging in speculation. I'm guessing that TCO in these static languages is not something that is figured out at runtime. For the dynamic languages like Scheme, the proposed solution is to put a contract on the return that would throw an exception if the return is not TCO. Just like contract-by-design, you'd probably want to disable the exception throwing in the final production code, but at least you have something that could be tested.

Not to mention that forcing tail recursion at the cost of building a temporary structure on the heap is advantageous only in runtimes which put artificial limits on the stack size.
The problem domain at hand is in real-time embedded software where the constraints on response time and resource usage (like the stack) must have some sort of limits.

I wonder if it's generally

I wonder if it's generally possible/feasible to move recursion altogether into a relatively small set of certain functions. Functions like map, fold, etc. capture recursion in iteration over certain data structures, while the function which does the work doesn't need to be recursive anymore (only in respect of the iteration, of course).

If this is possible, the writer of those functions could supply information about the memory usage depending on the maximum size of the incoming data, and this could allow the compiler to statically calculate an upper bound for the memory usage.

Well out of my domain...

...but I know that such things are being investigated in the Hume PL which attempts to combine FP and static resource analysis.

Tail call property dependent on context?

Qrczak: An expression can be in a tail position with respect to some other expression, most importantly wrt. a function body. But with nested functions only whether it's a tail call wrt. the innermost function is decidable statically. For other functions it depends on how the nested function is used.

I don't understand what you're getting at. Can you give an example? Why should the tail call property depend on a function's use? It is a simple syntactic property of expression structure, and nothing outside the innermost lambda should be relevant.

More than tail call needed

If the question is whether a given function can execute in bounded stack space (as implied by the context), it's not enough to show that the function is only called tail-recursively in its own body; you have to show that it is only called in tail position in the bodies of all functions it might call, directly or indirectly. That implies either showing it's only called in tail position anywhere, or doing an interprocedural flow analysis.

on purpose vs automatic

I like the idea of (IDE) support to analyze recursion done on purpose, for programmer feedback on their intent. So a UI with this in mind can expect user direction to simplify the task. For example, optimistic Moe can ask: what's the worst stack cost for these three mutually recursive functions?

I don't like the idea of automatic analysis of risk exposure due to recursion. I suspect detecting recursion (in the worst case, waiting to see if a return address ever appears again earlier in the call chain) becomes asymptotically similar to deciding whether code halts. It doesn't seem fair to hang the halting problem on recursion.

You realise Moe'll keep

You realise Moe'll keep asking for the best case cost anyway, right? Sure helps for pointing out to him that bad stuff can happen though.

Yes, but...

Dave Griffith: you have to show that it is only called in tail position in the bodies of all functions it might call, directly or indirectly

Actually, you have to potentially check any call down the chain for being a tail call, including those to other functions. If that's all that Marcin Qrczak meant then OK, although I can hardly extract that from his writing, and don't understand the business with nested functions, because it seems rather irrelevant to this argument. Marcin, maybe you could elaborate?


Here is some actual code I've written (in my language Kogut). It traverses environments through parent links, searching for the first non-null function name of an environment:

let GetLoc exprS {
   loop Env ?env =>
   let functionName = env.functionName;
   if (~IsNull functionName) {exprS.loc->Change context:functionName}
   else {env.parent->IfNull {exprS.loc} again}
}

Transliterated to Scheme:

(define (get-loc expr-s)
  (let again ((env *env*))
    (let ((function-name (env-function-name env)))
      (if function-name
          (change (expr-loc expr-s) 'context function-name)
          (if-false (env-parent env) (lambda () (expr-loc expr-s)) again)))))

The crucial function is IfNull (if-false in Scheme). The actual implementation of IfNull x f g enters f() or g x depending on the value of x. This means that the loop being tail-recursive relies on IfNull calling its third argument in a tail position. Considering the source of the loop alone, without looking inside the behavior of IfNull, it's impossible to say whether it's tail recursive.

Here the recursive function is again, and it's passed to IfNull as-is. It could be eta-expanded, so the actual call would be visible in the source. Is this a tail call? Yes, when considered wrt. the lambda of the eta-expansion. But whether it participates in a tail-recursive loop, depends on the behavior of IfNull. In principle it could even apply its argument in a tail position or not depending on some external factors, or store it in some data structure and let it be applied later.

Primitive recursion

Rule 2 seems to say that only primitive-recursive functions can be computed by SCC. How severe a limitation is that in practice?

Real Time software

It could just be the nature of the real time embedded software that I used to work on, but I think that the domain is more varied and complex than most give it credit for. When many think of real time, they assume that it's all about servicing interrupts, timers, controls, etc... The problem with that thinking is that many apps in this domain go way beyond that level, and start doing stuff that is much more general in nature. The recommendations assume that RTS applications are monolithic creatures, wherein the only thing going on is purely clock cycle driven. This will be true for certain critical parts of the code, but it will not scale well as the application surrounding this critical section becomes larger. For example, only about a third of real time software is done without an operating system in place (and many real time apps are using Java).

These rules may make sense for certain parts of the code, but they don't necessarily scale to the outer regions of the apps that surround these sections. As you get further out, things like integrity in concurrency make a bigger difference in terms of reliability, than do things like minimizing resource usage.

[Edit Note: added link for the usage stats on operating systems]


Just for completeness:

Guidelines for the use of the C language in critical Systems.

The site seems broken, but I hear a lot of MISRA in the automotive industry.

tail call annotation

(Sorry: meant to reply to raould. I do this all the time.)

raould: ...nice warnings when I wrote something that was stupidly not amenable to tail-call optimization.

That's a good idea. One might add a special form for tail calls that does nothing but emit an error if it's evaluated anywhere but in tail position. This lets a programmer say, "I think this is tail position, so yell at me if it's not."

For folks unfamiliar with TCO, this would be a compile time error (not runtime), because tail position is easily decidable by static analysis, and every compiler performs this analysis when tail call optimization is present, since this is how tail dispatch is done.

[As a Lisp example (only to make this concrete) imagine a new special form like (quote x), that looked something like (.tail x). This special form would evaluate to the value of x; it would be a no-op except for causing a compiler to complain if it doesn't appear in tail position. You could add a reader macro to shorten the syntax; where 'x means (quote x), you might use #^x to mean (.tail x).]

Consider this also a reply to raould

This idea has been discussed previously. In my opinion, this whole issue (the issue of "knowing" which calls are tail-calls) is a non-issue, and this is why there are no warnings for this (that, and the compiler can't read your mind). However, controlling or removing general recursion from an FP language intended for real-time is probably a good idea. See Zhanyong Wan's page for some ideas on real-time coding in a Haskell-like FP language. You can no doubt find other such resources.

I'm a johnny come lately

Whoa, I'm breathing Luke Gorrie's exhaust fumes. Feedback on tail calls from a language seems good when tail call optimization is semantically significant, like it often is in Scheme. (Your name next is a link to the cited opinion.)

Derek Elkins : Also, thinking about it now, 'return' almost seems to do a better job than Luke Gorrie's tailcall operator in ensuring that an expression is a tail call and facilitating learning.

Languages with an explicit return operator (eg Smalltalk uses ^) do clearly show tail calls already. But some languages don't -- for example, a number of expressions in Scheme return the value of the last expression in sequence (eg the last in an 'and' special form) -- and might benefit from a change.

I guess you prefer keyword 'return', thus (return E) instead of (.tail E). Keyword choice is almost arbitrary, but a more declarative term might be better; 'return' implies something is being done, which would be wrong in a spot where the keyword was a no-op. I guess I prefer weakest correct terminology.

(After raould's remark I planned to add such a special form to my next Lisp (gyp), but Luke beat me to the idea months ago. I should have been reading LtU all along. :-) Since I was going to integrate with Smalltalk, I thought the Lisp side of things would look good using #^ to go with Smalltalk's ^. I'm flirting with an idea of making all special forms look distinctive, such as starting with a period, so non-evaluation semantics stand out more.)

I suppose I should respond to the original post. Um, I think the idea of removing recursion from FP languages is silly, but probably not what the author was intending; I assumed it was aimed at C and C++ and the like. The desire to banish recursion and garbage collection, etc, seems motivated by a desire to get an iron grip on what the runtime might do, to ensure safety by ruling out strange flow of control. I can totally relate; I was going to get the iron grip by writing my own runtime, if existing dynamic languages won't let me be more paranoid than the language.

For real (draconian) safety contexts, though, I'd be inclined to use an FP language to generate code for a safer runtime model, rather than use the FP language as the runtime by default. Assuming the final runtime context might be frustratingly narrow to work in, generating the code from a nice environment like FP would be a lot more pleasant than working closer to the metal the entire time.

What a load of unmitigated cwap.

The first three should look like so...

Hoare's Rule 0:
Simplicity is the price of reliability.

Want it safe? Make it simple.

There is no other choice.

Ganssle Rule of Thumb : If it's resource constrained, multiply development time by factor of three.

Carter's Corollary to the above Rule: If it took you a factor three more time to develop, it probably isn't simple, and hence by rule 0 isn't safe.

Now into the blow by blow deconstruction...

Rule 1: Simplify Control Flow by Banishing Recursion.
Unless algorithm is more naturally expressed in tail recursive form anyway.

Number of bugs I have seen that this rule would prevent? Very very few.

Change it to "design your system so it has no compile time and very few link time cyclic dependencies" and I will believe it.

Rule 2: Set a fixed upper bound on all loops, excluding non-terminating event loops.

Yup. It's called The Watchdog Timer. ;-)

Never test for a condition you don't know how to handle. So your loop exceeds your fixed upper bound? What are you going to do? Call Ghost Busters? Nah. Exactly what your watchdog does. Reset. So leave it to the system that knows how to do it properly. The Watchdog.

Either that or this is merely an instance of rule 5, use assertions.

Rule 3: Banish Dynamic Memory Allocation and Garbage Collection.

Having seen the abortions C programmers produce to avoid using STL... I have grave doubts about that one. I'm not advocating use STL in safety critical apps, I'm saying rewriting large chunks of STL in C makes things way less safe.

Rule 4: Restrict each function's size to around 60 lines of source code.

Ignore this; instead, think about a clear and simple partitioning of responsibilities across classes (modules if you are doing C). Read Rebecca Wirfs-Brock on Responsibility-Driven Design.

Rule 5: Make liberal use of assertions to test any condition that can't be statically guaranteed.

At last, one I agree with. Assertions are like Testing from Inside with full knowledge. Think your approach to assertions in production code through ahead of time. Test like you fly, fly like you test.

Rule 6: Use static scoping to hide "data objects" to the greatest extent possible.

+1 for data hiding
-10 for introducing lots of static hidden state.
-100 if you are also doing multithreaded code.

-1000 on your safety rating if you are doing multithreaded anyway...

Rule 7: Check all return values and caller supplied parameters.

Make sure you know where the "trust" boundaries in your system are. Assertions are for checking "I am sane", not for sanitizing user / external system inputs.

Rule 8: Banish any significant macro use (like "token pasting", "variable argument lists", and "recursive macro calls") beyond header file inclusions and simple definitions.

Unless it's the X macro trick. The X macro trick makes C safer.

Rule 9: Banish handles, macro-driven pointer manipulation, and function pointers while restricting pointer use to one level of dereferencing.

Unless what you do instead is much more complicated. In which case either apply Rule 0 or ignore rule 9.

Rule 10: Continuously recompile all code with all compiler warnings turned on and ship no code until all warnings are eliminated and it passes strongly typed static analysis.

Yah! At last one I (mostly) agree with.

Mostly, because I will bet that unit testing at each layer boundary will get you much more safety than static analysis and warnings. However, I would say do both.


For those who didn't already know, Gerard Holzmann is the guy responsible for the SPIN model-checker. He spent a number of years at Bell Labs before moving to JPL. Not surprisingly, JPL has recently started making use of SPIN for verifying the concurrent aspects of some of their flight software (including the Remote Agent autonomous executive, various chunks of communications software, and software for recovering from spacecraft faults).

Rule 3 is impossible to comply with.

Rule 3: Banish Dynamic Memory Allocation and Garbage Collection.

Trouble is, many real-time applications keep track of things that come and go. I've had this conversation several times over the years:

RT developer: We don't need GC because we don't use dynamic memory.

Me: OK, suppose you are doing a radar system. It watches multiple tracks, and you don't know how long each track is going to be there.

RT developer: Oh thats easy. You just statically allocate an array to hold the maximum number of tracks.

Me: So then presumably you have a flag for those array slots that currently contain a live track, or maybe a freelist or something?

RT developer: Yes.

Me: What happens when, due to a defect, a slot doesn't get flagged as empty?

RT developer: Well, you just have to make sure that doesn't happen.

Me: And what happens when, due to a defect, a slot gets flagged as empty but some other part of the system tries to access the track that was in it?

RT developer: Well, you just have to make sure that doesn't happen.

Me: So how is this better than dynamic memory? Surely all you have done is to rename the problem?

Actually there was one neat wrinkle someone suggested, which was to have an allocation count in the slot and keep a copy of that count with a slot reference, so you could tell if you were trying to follow a stale slot reference. In some cases this made for a nice clean semantic where a stale slot reference triggered an abort of whatever was to be done with it. But it's not a general solution.


Well, there is the obvious

Well, there is the obvious benefit that you have statically defined a hard resource limit that you can guarantee. So, an "allocation" of one kind of thing cannot impact the success of an "allocation" of a different, more important kind of thing. The system can then attempt to degrade gracefully, or choke and reboot, as the situation requires, and it's almost certain to be able to do useful things like log the error before it reboots.

So it's not so much a case of renaming the problem as naming the real problem incorrectly. The real problem is: how do you ensure that you always have the resources around to recover from failure? Statically allocating your resources, in a system with known, fixed resource limits, is a useful solution.