Tolerant vs. Demanding Software

In chapter 11 of Bertrand Meyer's excellent book "Object-Oriented Software Construction", there is a discussion of Tolerant vs. Demanding software. In tolerant software, the program tries to handle conditions which should never occur rather than asserting that they do not occur. In demanding software, each routine has a set of preconditions and postconditions, and the body of each routine is written as if the preconditions have been met.
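
To make the distinction concrete, here is a minimal C++ sketch of the demanding style (my own illustration, not Meyer's code; item_at is a made-up name): the routine states its precondition and its body is written as if the precondition holds.

#include <cassert>
#include <cstddef>
#include <vector>

// Precondition: index < items.size(). The body is written as if that holds; a
// violation is a bug in the caller, and the assert makes it loud during
// development instead of quietly "tolerating" it and returning a made-up value.
int item_at(const std::vector<int>& items, std::size_t index) {
    assert(index < items.size() && "item_at: precondition violated");
    return items[index];
}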

Meyer's thesis is that tolerant code should be avoided because it unnecessarily increases the complexity of the program. My view is that his argument is rock solid: it is logical, and it agrees with my intuition and experience. Yet a significant number of programmers seem to disagree with me:

- According to Code Complete, the Microsoft Word team asserts that unwanted conditions are not true and then proceeds to handle them.

- In the recent thread on Go's panic mechanism, someone made an argument for tolerant software (http://lambda-the-ultimate.org/node/3896#comment-58329).

- I often find myself working with programmers who are writing tolerant code.

Because of this, I am trying to gather a collection of papers and articles making arguments for and against the demanding style of coding. I'm having trouble finding said papers and articles; do LtU'ers know of any good ones?

To provide some context, my background is programming in the video game industry using C++.

Armstrong's thesis

Joe Armstrong's PhD thesis, Making Reliable Distributed Systems in the Presence of Software Errors, contains a nice overview of Erlang's error-handling philosophy and its rationale. That philosophy is on the "demanding" side of things, although Armstrong doesn't quite phrase it that way. Chapters 4 and 5, in particular, seem to cover the kind of ground you're interested in.

well it is a bit of both

In general it is demanding: don't try to recover from error cases; fail.

But it is tolerant at the level of process failure, with supervision policies.
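
Roughly, the shape of that arrangement, sketched here in C++ on POSIX rather than in Erlang (run_worker and the restart policy are purely illustrative): the worker is demanding and simply dies on error; the tolerance lives in the supervisor that restarts it.

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical worker body: written in the demanding style, it asserts its
// invariants and simply aborts on a violation instead of trying to recover.
void run_worker() { /* ... do the work, assert invariants, abort on violation ... */ }

// The tolerance lives one level up: the supervisor notices a crash and restarts.
void supervise() {
    for (;;) {
        pid_t pid = fork();
        if (pid < 0)
            return;                   // could not fork; give up at this level
        if (pid == 0) {               // child process: run the demanding worker
            run_worker();
            _exit(0);
        }
        int status = 0;
        waitpid(pid, &status, 0);     // parent: wait for the worker to exit or crash
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
            break;                    // clean exit, nothing left to supervise
        // abnormal exit: loop around and restart (a real supervisor would rate-limit)
    }
}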

Post

I wrote a blog post a couple of days ago that's close to this topic. My belief is that both tolerant and demanding software are equally valid, but each is tailored towards different domains. Historically, we leaned towards demanding software because the hardware was so limited that we couldn't afford for the software to be tolerant, and because the problems we were solving were stable enough (i.e. simple programs running on a single machine in one thread, talking to reliable hardware) that we could get away with it. Today, software runs in an environment so much less reliable (on the web, on multi-core machines, on a variety of differently-specced machines) that it has to be tolerant to be at all useful.

Tolerant /= Robust

There's a difference between handling cases that violate the assumptions of the code, such as type errors in dynamically typed code, method not found, or invariant violations in an abstract data type, and handling cases that are exceptional but well within the purview of the application domain such as closed connections, missing files, or malformed input.* What you are talking about is in the latter category and certainly should be handled somehow, though usually crashing and restarting is still the thing to do. An (admittedly extreme) example of "tolerant" code would be having readSocket return zeroes when the network connection went down rather than failing. A simple, actual example of one such case is the following C# extension method:

public static bool IsFoo(this C c) {
    if(c == null) return false;
    return c.HasBar && c.IsBaz;
}

This code, which I've actually seen, is ridiculous. An extension method invocation in C# looks like a normal method invocation, which would immediately throw a NullReferenceException before the method body ever ran. In this case, though, we treat a null argument as not having the Foo property even though being passed a null in this situation almost certainly indicates a program error.
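
For contrast, a rough sketch of the demanding alternative, transposed to C++ since extension methods are a C# feature (the type C and its members are just stand-ins): the signature itself rules out the null case instead of quietly absorbing it.

struct C {                          // stand-in for the real type from the example above
    bool HasBar() const { return true; }
    bool IsBaz() const { return true; }
};

// Taking a reference states the precondition in the signature: there is no null
// case to "tolerate", and the bug then lives loudly at the call site rather than
// being silently absorbed here.
bool IsFoo(const C& c) {
    return c.HasBar() && c.IsBaz();
}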

Robust programming, which is what you are referring to and largely what Armstrong talks about, more or less requires demanding programming. Tolerant programming presses on in the face of violations of expectations even though it no longer has any way to know what to do. Robust programming handles the known failure cases and also recognizes and provides for the unexpected case by giving a plan to return to a known state, again, usually by crashing and restarting though sometimes a recovery action is possible.

* There is also a difference between recognizing and reporting exceptional cases, and silently guessing at what to do next.

I've read that post of yours

I've read that post of yours with interest, and I can mostly agree with it, yes. IMO, you used quite good analogies, not too far-fetched (to me, anyway), to make your point. I also believe that the overall "software paradigm entropy" (e.g. FP vs. imperative, stateful vs. stateless, concurrent vs. single-threaded, etc.), if such a thing can be defined beyond merely coining the phrase, likely follows laws similar to Jacobson's analogous notion of entropy in software engineering, or to the other, better-studied ones in thermodynamics.

Hence I also think that, in an ideal world, pursuing external software quality factors (e.g. composability, correctness, performance, ...) would automatically mean first ensuring that the internal quality factors they partly or completely rely upon are met (e.g. type safety, DbC, expressiveness, modularity, provability where feasible, ...). That would have us always privileging demanding software, with the tolerant kind reduced to very rare exceptions, both in foundations and in scope.

But I fear that the aforementioned sort of "entropy principle" (if we accept that it holds) dooms us to stay aware of, and on the lookout for, tolerant software "from time to time", no matter what, as new problems and inventions are fed into an ever-expanding set of use cases for global computing and the competition of ideas it intrinsically engenders (mostly "because of" Gödel's findings, by the way).

I don't think tolerant

I don't think tolerant programming should be considered a part of the biology-style programming that you describe. The example provided by Derek would not be any less ridiculous if it were written in Lua rather than C#.

As Derek pointed out, this may just be a case of confusing tolerant programming for robust programming. Robustness isn't just a desirable property for ultra-high-level scripts, though; it's desirable for software written at all abstraction levels. Hence, neither tolerance nor robustness should be considered a defining characteristic of biology-style programming.

Preconditions can be timing dependent

In concurrent systems, preconditions can be changing while they are being tested. So how is the following possible:

In demanding software, each routine has a set of preconditions and the body of each routine is written as if the preconditions have been met.

By the time that a precondition has been tested to be true, it may already be false! So the body of a routine needs to be prepared for trouble.

If the software is

If the software is concurrent and nothing is done to protect the relevant shared state from concurrent modification during the test and the body of the method, then a concurrent modification during that window hardly falls under "conditions that should never occur."
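
A minimal sketch of what "doing something to protect the relevant shared state" can look like in C++ (Pool, free_bytes, and reserve are illustrative names, not anyone's real code):

#include <cstddef>
#include <mutex>

// Holding the lock across both the test and the update means the "enough free
// bytes" precondition cannot be invalidated between checking it and acting on it.
// Running out of storage is still reported to the caller: it is an expected
// outcome of sharing, not a "condition that should never occur".
struct Pool {
    std::mutex mutex;
    std::size_t free_bytes = 0;

    bool reserve(std::size_t n) {
        std::lock_guard<std::mutex> guard(mutex);
        if (free_bytes < n)
            return false;     // expected failure: another computation got there first
        free_bytes -= n;      // the precondition still holds here, thanks to the lock
        return true;
    }
};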

Timing dependent preconditions are the norm

Timing dependent preconditions are the norm because shared resources are the norm. For example independent computations need to be able to concurrently deposit and withdraw funds from a shared account in a way that produces indeterminacy.

A procedure might test ahead of time that there is plenty of storage available, but find partway through that there is not enough, because some other computation has already used it.

What do you mean when you

What do you mean when you say that shared resources are the norm? Most programs access shared memory somewhere? Okay. This does not refute Derek's point, though. There will be a few shared values, and it would be nonsensical to set any preconditions on those. The vast majority of the fields are going to be exclusive, though, so you will be able to have preconditions for those.

meta object protocol

In the general case, not every piece of code can have preconditions, and so Carl is right.

Derek is arguing for a methodology that avoids this problem. But he hasn't explained why his methodology actually works, so we can't predict whether we can build systems that actually follow Derek's methodology. Since Hewitt's position is straight logic, we know the consequences, and Hewitt is arguing that we build systems robust enough to deal with those consequences.

There.

Both sides have merit. Hewitt's frightens most people, and Derek's is often used by people who don't understand its wisdom or why it appears to work. But those who follow Derek's approach seem to be getting along just fine using it.

I wrote about this

I wrote about this subject in Debug It!. The text isn't online, so I've reproduced the relevant section here:

Defensive programming is one of the many terms in software development that means different things to different people. What we’re talking about here is the common practice of achieving small-scale fault tolerance by writing code that operates correctly (for some definition of correctly) in the presence of bugs.

But defensive programming is a double-edged sword—from the point of view of debugging, it just makes our lives harder. It transforms what would otherwise be simple and obvious bugs into bugs that are obscure, difficult to detect, and difficult to diagnose. We may want our software to be as robust as possible in production, but it’s much easier to debug fragile software that falls over immediately when a bug manifests.

A common example is the almost universal for loop idiom, in which, instead of writing this:

for (i = 0; i != iteration_count; ++i)
  «Body of loop»

we write the following defensive version:

for (i = 0; i < iteration_count; ++i)
  «Body of loop»

In almost all cases, both loops behave identically, iterating from zero to iteration_count - 1. So, why do so many of us automatically write the second, not the first?

The reason is that if the body of the loop happens to assign to i so that it becomes larger than iteration_count, the first version of our loop won't terminate. By using < in our test instead of !=, we can guarantee that the loop will terminate if this happens.

The problem with this is that if the loop index does become larger than iteration_count, it almost certainly means that the code contains a bug. And whereas with the first version of the code we would immediately notice that it did (because the software hung inside an infinite loop), now it may not be at all obvious. It will probably bite us at some point in the future and be very difficult to diagnose.

As another example, imagine that we’re writing a function that takes a string and returns true if it’s all uppercase and false otherwise. Here’s one possible implementation in Java:

import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

public static boolean allUpper(String s) {
  CharacterIterator i = new StringCharacterIterator(s);
  for (char c = i.first(); c != CharacterIterator.DONE; c = i.next())
    if (Character.isLowerCase(c))
      return false;
  return true;
}

That’s a perfectly reasonable function—but if for some reason we pass null to it, our software will crash. With this in mind, some developers would add something along these lines to the beginning:

if (s == null)
  return false;

So, now the code won’t crash—but what does it mean to call this function with null? There’s an excellent chance that any code that does so contains a bug, which we’ve now masked.

Assertions provide us with a very simple solution to this problem. Wherever you find yourself writing defensive code, make sure that you protect that code with assertions.

So, now our protective code at the start of allUpper() becomes the following:

assert s != null : "Null string passed to allUpper";

if (s == null)
  return false;

And our earlier for loop becomes the following:

for (i = 0; i < iteration_count; ++i)
  «Body of loop»
assert i == iteration_count;

We now have the best of both worlds—robust production software and fragile development/debugging software.

Doubts about the robustness of the both worlds approach

Thanks. I've never thought of using the less-than in a loop guard as tolerant before. Maybe it is slightly, but I don't see it as a huge deal.

I'm not sure if your best of both worlds approach is a good idea. As I mentioned, it is the approach taken by the Microsoft Word team, but it has the unwanted effect of unnecessarily bloating your code by handling conditions that never should have happened in the first place.

A common pattern I've seen with tolerant code is that functions which should return void are given boolean return values. You end up with code that looks like this:

bool C::DoSomething()
{
  bool success = true;
  
  success = Foo();
  assert(success);
  if (!success)
    return false;

  success = Bar();
  assert(success);
  if (!success)
    return false;

  success = Baz();
  assert(success);
  if (!success)    
    return false;

  //success!
  return true;
}

Yes, I have actually seen this.

You end up with a function which theoretically should do one thing and do it well, but has four different return points spread evenly throughout the function body. Glancing at this function might lead one to believe that it could do one of eight possible things.
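
For contrast, a sketch of roughly how the same function reads in the demanding style, assuming Foo, Bar, and Baz are not actually expected to fail in a correct program:

#include <cassert>

// Same shape as above, but demanding: Foo, Bar, and Baz are required to succeed,
// so a failure is a bug to be caught by the assert, not a condition to route
// around, and there is no boolean plumbing for callers to check (and ignore).
void C::DoSomething()
{
  bool ok = Foo();
  assert(ok && "Foo failed");

  ok = Bar();
  assert(ok && "Bar failed");

  ok = Baz();
  assert(ok && "Baz failed");

  (void)ok; // asserts compile out under NDEBUG; avoid an unused-variable warning
}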

There are two questions

There are two questions here, I think.

Firstly, is defensive programming a good idea?

Secondly, if you write defensive code, how should you do so?

The section that I posted above primarily addresses the latter. Given that many people will write defensive code (the for-loop idiom is so ingrained that I doubt many programmers even think of it as defensive code), I'd far rather see it protected with assertions.

As for the first question, I suspect that the "right" answer depends very much on the details of the particular problem you're working on, the tool-chain that you're working with and the robustness (or otherwise) of any third-party code you have to interact with. The tradeoffs will be very different for C versus Erlang, for example, as will those for embedded software versus server-side or safety-critical versus "throw-away" utility code.

Undertested production code

We now have the best of both worlds—robust production software and fragile development/debugging software.

You also have a situation where the "defensive" parts of the production code haven't been tested, or haven't been tested as much as the rest of the code, because during development the failed assertions would prevent that code from being reached. How confident can you be that the production code will properly deal with the s == null case, or that returning false will actually generate the right behavior? Surely it's better to "test what you fly, fly what you test", isn't it?

Another way to look at it is that assert s != null says "s will never, ever be null, and if it ever is then I have no idea how to respond anyway", while if (s == null) ... says "calling this function with s set to null is in some way meaningful, and I know how to respond to it." Combining the two is like saying "I don't know how to meaningfully respond... wait, just kidding, actually I do."

See my earlier response to

See my earlier response to kclancy. Given that developers will write defensive code, whether we like it or not (and whether it's justified or not), I would far rather see any such code protected with assertions.

There's then the thorny question of whether it's ever justified, and I don't think that this has a simple "yes or no" answer. My general inclination is definitely towards the "demanding" end of the spectrum, but I've certainly run into instances where that hasn't seemed to be the best compromise.

A couple of examples:

When writing long-lived server code in C, in an environment where we had no good way to detect or handle memory errors (dereferencing a null pointer, for example), we chose to aggressively check pointers, even when we "knew" they had to be valid.

At the moment, I'm working on SwiftKey, an Android input method. The Android framework is dreadfully badly documented, exists in many different (and often mutually incompatible) versions and new versions (with their own new incompatibilities) are released frequently. In this instance, we tend to assert things like "onDestroy should never be called without a preceding onCreate", because they're probably true (and match our understanding of how things should work). But we also code defensively because experience has demonstrated that this kind of thing sometimes isn't true. And, in our judgement, it's better for things to "kinda work" on a new device than to be completely broken. Consumers have very little patience if the text-entry method they've grown to rely on breaks when they update their OS, and Google and Android licensees' update policies mean that very often consumers see new releases before we do.

I did

I had already read your earlier post when I wrote my comment. My point remains that if you "protect" your "defensive" code with assertions you end up either (a) leaving assertions enabled in the production code, in which case the "defensive" code is useless because it never gets invoked, or (b) disabling assertions in the production code, in which case your "defensive" code is likely to trigger relatively untested code paths (assuming assertions were enabled throughout development).

Out of curiosity, in your long-running server code, if you had "no good way to handle memory errors" what did you do (aside from crashing the program) when your aggressive pointer check found a null pointer?

In most instances, we failed the individual request

Out of curiosity, in your long-running server code, if you had "no good way to handle memory errors" what did you do (aside from crashing the program) when your aggressive pointer check found a null pointer?

In most instances, we failed the individual request that triggered the error and (did our best to) tidy up everything associated with that request, but allowed other ongoing requests (and future requests) to continue. In a few instances, we decided that things were so broken that restarting the whole server was the right thing to do.

Indeed the inequality test used to be commonplace

... and was stamped out by people who believed in defensive programming. Strictly, though, documented behavior is never defensive programming. If I have a function to return a substring of a string, and it returns the empty string when the arguments are out of bounds, and I don't say so, that's defensive programming. But if I do say so, then it's part of the API, behavior that people may safely rely on, and no indication of a bug.

In one of Mitch McConnell's books on C he complains about realloc() doing too much: if passed NULL, it will allocate the specified amount of memory de novo, and if passed a zero size, it will deallocate the old memory block (if there is one) and return NULL. He therefore wants us to have four functions: AllocateMemory, DeallocateMemory, GrowMemory, and ShrinkMemory. That's fair for some uses. But if you have a dynamic array and you want to change its size without regard to what size it had previously, then realloc() does exactly the right thing (modulo perhaps that if the size is 0 it should return an area of size 0 rather than returning NULL).
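
For that dynamic-array case, here is a rough C-style sketch (my illustration, not McConnell's code; resize_buffer is a made-up name) of a resize helper that leans on realloc() doing the right thing regardless of the previous size:

#include <cstdlib>

// Resize buf to new_size bytes, whether buf is NULL (fresh allocation), smaller,
// or larger than before. Requesting zero bytes keeps a valid tiny block alive
// rather than falling into realloc's "free and return NULL" behavior.
// (Production code would keep the old pointer around in case realloc returns NULL.)
void* resize_buffer(void* buf, std::size_t new_size) {
    return std::realloc(buf, new_size != 0 ? new_size : 1);
}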

Sorry for being dumb

but who is Mitch McConnell? Searching amazon for books by him turns up squat. A general google search for "mitch mcconnell" realloc yields some results, but none that point to a book I can read for this discussion.

I suspect that johnwcowan

I suspect that johnwcowan meant Steve McConnell.

Indeed.

Brain fart.

practice at work

kclancy: papers and articles making arguments for and against the demanding style of coding.

I don't write papers -- just internal-use informal memos by email and wiki. Here's a summary of what folks consider obvious where I work. (We accept the following as obvious subtext without discussion.)

We like software as demanding as possible that doesn't bring down too large a scope on failure. We're most intolerant when debugging, so an assert killing a process under debug is usually a good idea ... unless that stops another developer from debugging too.

Intolerance requires a specific scope when failure occurs. Do you fail the operation? Kill the process? Reboot the machine? Restart a testbed? The whole local network? Reboot every machine connected to the internet? (The last is added only to demonstrate you do indeed have some maximum scope affected by invariant failure.)

For "impossible" conditions, I start with asserts that aim to kill the process. If they ever occur in a live system, it's sometimes necessary to soften the result. (In fact, many asserts are over-eager and insist normally occurring conditions are impossible. You find out by stopping when they happen.)

Every undesirable but possible condition is counted in statistics, and these statistics are made visible somewhere. You never want bad news to disappear from view entirely.

When software has layers, API contracts require that embedded software not kill host software. If you have a module inside a larger host, that module typically is not allowed to kill the host process, unless debugging. In such cases, we typically use debug asserts that do nothing in release builds, followed by code that handles the error, so the host layer gets told "failure" with respect to a request. All failures must be counted where those counts can be reviewed.
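
A small sketch of that layered pattern (the names are illustrative, not our actual code): the assert kills the process in debug builds, while release builds count the failure where it stays visible and report "failure" to the host instead of taking its process down.

#include <atomic>
#include <cassert>

struct Request; // stand-in for whatever the host hands the module

std::atomic<unsigned long> g_bad_request_count{0}; // surfaced in periodic stats

bool module_handle_request(const Request* req) {
    if (req == nullptr) {
        assert(false && "module_handle_request: null request"); // debug builds die here
        ++g_bad_request_count;  // release builds: never let the bad news disappear
        return false;           // tell the host "failure"; do not kill its process
    }
    // ... normal request processing ...
    return true;
}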

In short, when something goes wrong, there's some local or global scope you abort. But there's always a limit to that scope, outside of which things go on as normal. When debugging, you abort a larger scope so failure has bigger impact, so you fix it sooner. (And so the problem is obvious to as many folks as possible, to avoid head-in-the-sand thinking.)

There's always some layer above you that can't be aborted. Maybe it's outside your machine. When you tolerate failure in some larger scope, that tolerance should never hide the fact that the problem happened. At minimum, statistics about the failure rate must be maintained.

I suspect you posted because you see naively optimistic folks who hope tolerance can replace handling errors, and you need help dissuading them from malfeasance in professional coding practice. Okay, tell them it's dereliction of duty to hide evidence of poor code or bad data, especially when this promotes an overly rosy view of schedule progress.

Absolutely spot on

This is a remarkably clear explanation which summarises exactly how I feel about this subject, but haven't had the eloquence to put into words.

Thank you!

off topic: wondering whether junior devs still learn C

Glad you liked it. For me clarity is just a game, and a source of cheap virtue thrills. Normally I would not reply, but I have a question which may be inappropriate for this venue. Do young folks just out of college know how to program in C anymore? (As in: having actually written thousands of lines of C code to do something non-trivial.)

I would like my company to receive resumes from folks who actually know how to code in C. (If you can code in C++, this is the same as being able to code in C as far as I'm concerned, unless you can't cope with pointers and memory management because you only use standard libraries.) My manager wishes to receive resumes from recent graduates who are clever and can code in C; note "clever" probably means I would think so.

But I don't want to say where to send resumes. :-) I don't want to associate my name with that of my company, since I don't represent them. Also, I no longer read email sent to any publicly visible address. (I only read work email.) So I'm almost unreachable unless you guess my phone number. Maybe I should start a new email account just to process pings from this post.

I'm very certain folks who run this site don't want job listings and discussion along those lines. So I'm abusing social etiquette here, unless any reply is mainly about C as a programming language, or something along those lines. (If anyone says something cleverly non-trivial about C, and also inserts an unobtrusive link to more personal info, that might be forgivable.) It wouldn't take much to get this thread killed for administrative reasons.

C

For one thing you might want to mention where you are located :-)

Junior developers don't learn low-level languages nearly as much as they used to. Modern GUIs fit event-driven programming models, and OO is really good for event-driven programming. And quite simply, efficiency isn't as much of a driving factor as it was when C was hot.

Even here, how often do you see papers implementing simple algorithms on fixed hardware in Forth, Assembly, and C and doing comparisons? How often do you see shootouts of simple algorithms? When was the last time you saw someone count clock cycles in the inner loop? 5 years, 10 years, 20 years ago?

Kids graduating know surprisingly little given that most of them grew up around computers their whole lives.

As an aside I also thought your "practice at work" post was a very good summary of that POV.

san jose networking

We're in San Jose, California. A mid-sized networking company specializing in load balancing, but I work in a WAN optimization unit. Maybe bigger than mid-sized since the stock just changed indexes, but I have trouble thinking of it as a big company.

I have a guy who counts clock cycles for me. :-) He does profiling at a granularity letting him identify where cache line stalls occur. And he writes inline assembler for me when I request special treatment on specific operations -- typically ones I want to occur atomically in lockless effects. Sometimes I write C sample code and ask whether he can write it faster in hand assembler; typically he says you can't do better than what the C compiler emitted. (I have good intuition about what's actually going to be the bottleneck.) My code tends to look at all traffic passing through, so minimizing cost matters. Another thing I do is clever lockless shared hashmaps tuned to minimize cache line misses.

They don't like it when I write in C++ because some folks can't read it. (But amusingly, when I use C++ with carefully chosen inlines, the result often runs faster than my C version. I think this is because scope is identified better than the C compiler can figure out.)

It's backend networking server work aiming to eke out max performance from the hardware given. Several of my peers have 20 to 30 years of coding experience, like me. It's lopsided without more junior folks. My work often drops into pure computer science jargon; for example, some things are equivalent to condition variables without actually being official condition variables. There was a long multi-threaded phase, but now it tends to be more single-threaded with hand scheduling and event processing. I do exotic stuff that makes people's heads hurt in middleware. I don't know what new junior folks might do.

I plan to start a programming language to make some of this easier, with an early focus on testing in simulation. Then I actually want to emit what I write by hand now by compiling from a higher level language. Some of my peers with long experience say, "Every time someone did that before, something unfortunate happened; are you sure that's a good idea?" That's the only part of this on topic here.

cycle counting

Depending on what I'm doing, I count cycles. And bytes, come to that.

Whilst in many cases it's not easy to do better than a C/C++ compiler, there are cases where it's necessary to go down to the metal. For example, I'm doing an embedded lispish language at the moment, symbol hashing and table lookups need to be fast and, above all, small (target hardware has 128K of flash for language and runtime, 8K of RAM for dynamic programs); my hand coded assembler versions are 1/5 the size and at least 3 times as fast as the C versions compiled with maximal optimisation. On the other hand, I am only comparing myself to gcc, so that might well not be much of an achievement.

Cycle counting's not dead, it's just drowned out by those who consider resources to be essentially infinite.

wow, what for

Wow, that's rather cool. You are doing hardware, however, and on hardware I don't think it's dead. Still, that's an interesting use case.

I just looked this up and 128k flash 8k ram ran $7.32, while 256k flash 64k ram ran $8.61. The bigger chip is 2x as fast and uses 1/3rd as much power (not a typo). I don't know what you are paying, this was at quantity 1000 so my guess is that the $1.29 spread is a worst case. I can see in sufficient quantity beating this spread is worth the trouble.

Certainly with your hardware platform and trying to get a LISP to work you need to cycle count. 1970s hardware (effectively) demands 1970s methods. The problem I'm having is trying to see why you would want to get a LISP to work in this environment. Are you OEMing this to people running all sorts of other programs in low quantity? Why do they want a dynamic language? Very interesting use case here.

What for?

As you say, my use case is "special", and it probably deserves a little more explanation, so here goes.

TL;DR version - it's just me scratching an itch.

I'm working on a concept I've sorta christened "ubiquitous computing". It's based on the idea that most people are carrying around more computing power than appeared in an early 90s developer workstation, but at best it's being used for playing Angry Birds or emulating flatulence at the push of a button. I figured it might be nice to use that processing power for something else. A loosely coupled, heterogeneous computing platform able to pick up and throw away nodes as and when they "appear". It's actually a dream I've had for some time, but the ubiquity of smartphones, PDAs and other mobile computing platforms, coupled with the ease of proximity networking using Bluetooth and other wireless tech, means that it's more or less in reach. Modulo the need to change everyone's operating system and allow anyone to run untrusted code on your device, of course.

The concept is based around the idea of farming out bits of a given computing process as continuations, and then harvesting the results. The continuations themselves might migrate from node to node as required to get hold of resources, but the results should end up back with the user.

Continuations screamed Lisp (or something lispish) to me, and I have a protolanguage running over CCL that passes continuations from one instance to another and back fairly happily.

Platform wise, I decided to go with ARM, if only because I have this funky little ARM-based tablet that's not doing much, and a bunch of little ARM-based µC boards, also not doing much. If I decide to really hurt myself, I've got a few 8-bit µCs as well :). So what I needed was something that could act as a Lispish runtime on something as stripped down as an STM32F100 (the aforementioned 128K/8K platform, well under a dollar a piece in quantity); this is, in itself, an "interesting" exercise.

Where I'm at with the µC side of things is that I have an almost-lisp in under 1K (cons, car, cdr, cond, lambda, lookup, various predicates, all written as tightly as I can in assembler, although I'm certain there's a good few bytes to shave off here and there as I'm a relative beginner with ARM and Thumb2 in particular), a very memory-miserly type system, and a bunch of support code. It's almost at the stage where the runtime can bootstrap itself (and eat all available RAM, as it's without GC at the moment).

If nothing else, I'll end up with a Lispish system for programming ARM µCs; given the state of C compilers for ARM, that might be a decent goal taken in isolation. When I said 1/5 the size before, I meant it - my hashing routine compiled with -Os from C was ~1400 bytes, pure asm comes in at 206 bytes without any particularly "clever" coding.

And it's fun to play with "stuff".

Ubicomp

Note that the term ubiquitous computing was coined in the late 1980s, and refers to a slightly different (although related) concept to what you've described. To avoid confusion, especially since you're operating in a similar space, you may want to find a different name for your concept.

Ah, knew it felt familiar...

...should have guessed that a term as generic as that would already have been used. Indeed, I should have remembered it being used.

In my defence, that terminology only existed in my head until the point where I wrote my previous post. Still, you're right, best I come up with something different.

Examples

Such things exist today - for example, Folding@home and SETI (the Search for Extraterrestrial Intelligence).

Usually it doesn't make much sense for these nodes to have constant communication with the home base server. The computations take several days.

Also, in scientific computing, there is Condor and the Condor-G job scheduler. If you've got computers at work, you can set it up such that whenever a computer is idle for more than X minutes, it fires up a Condor job. You've basically got a cloud at night in your office. Not necessarily the most energy-efficient strategy for large-scale computing, of course!

Not really the same thing

SETI and friends aren't the same thing at all - they are a massively (and flexibly) parallel execution of a single task, with each node working on different (but potentially redundant) data. Interesting, but not what I'm looking at.

Condor schedules tasks to vampirise spare cycles and run generic applications, but all it does is give you a big compute farm. A cloud in your office, as you say.

What I want is perhaps closer to SETI, although SETI treats nodes as equivalent; I don't want to. One piece of processing might require a node that has access to the global internet, another might require access to a particular database server, another might require a large amount of free scratch memory, a vector processor, etc etc. A particular program might require any number of these, and can "throw" all or part(s) of itself between nodes, on the fly, in order to carry itself out. It might farm parts of itself out, SETI-like, to multiple nodes in parallel, whilst sending other parts of itself off to a database server to extract information to feed to those nodes. A visualisation part might be fixed to the user's node, and collect data fed from the others, and so on.

I'm a long way from being there at the moment.

counting bytes

simon stapleton: Cycle counting's not dead, it's just drowned out by those who consider resources to be essentially infinite.

Glad to hear you count cycles and bytes when appropriate. Also, your personal project can help you get jobs, so you should note it on your resume. I often ask candidates: do you do any coding at all for fun? The usual answer is no, coupled with a look saying 'Why would anyone ever do unpaid work?' While this is a bad sign, it doesn't disqualify candidates when they never code for fun.

I'm one of those folks who count bytes, since even given many gigabytes of RAM I must turn out the largest numbers I can, so marketing can quote them. (Forward looking designs all need projections of cache size per resources given, for example. I warn folks they can't attach arbitrarily large amounts of secondary storage via disk or SSD, without a RAM budget making those resources work. Any resource too much larger than a target ratio can cause other resources to go to waste when they don't pay off.)

Actually, I resort to counting bits in bitfields, bitmaps, and checksums. Otherwise probabilities won't come out exactly right. Probability of failure must be estimated everywhere, since handling this must be covered by budgets for cycles and bytes. (Cost to handle failure must come out negligible by design.) Bits per indexed entity is sometimes an important number.

Since no resource lies fallow, there's no way to treat them as infinite. For example, even if a system is i/o bound in terms of traffic, you can still spend more than all free cycles doing something clever. So to avoid becoming cpu bound, there's a cycle budget, too. I often catch folks trying to spend the same resource twice; I play bad cop there since my system is the one that underperforms if shorted. It's hard to keep folks honest within a factor of two when it comes to resources.

Static allocation is an interesting discipline. An async product can process a request later if resources are exhausted now, which gets you into blocking, scheduling, condition variables, and latency management. (Deadlock can't be permitted, but a tiny bit of latency is okay.)

Anyway, there's no resource so plentiful you can't spend it all, and more than once by unintended consequence, so keep counting. Incidentally, I think I read a page you wrote about a new hash function a while back. I recommend you use either crc32 or crc64, depending on how many bits you need, provided no attack scenario is present that requires cryptographic strength. You can measure collision rates empirically by writing a unit test to measure the standard deviation between holes in closed (probing) hashmaps; a smaller standard deviation, derived from a sum of squares, defines a better hash function (and any non-empirical argument is just hot air).
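
A rough sketch of that kind of empirical test, simplified to bucket counts rather than probe-hole gaps in a closed hashmap (names are illustrative): feed real keys through each candidate hash and compare how evenly the slots fill; a smaller standard deviation means a better hash for that data.

#include <cmath>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Distribute the keys into `buckets` slots using `hash` and return the standard
// deviation of the bucket occupancy counts (lower means a more even distribution).
double bucket_stddev(const std::vector<std::string>& keys,
                     std::uint32_t (*hash)(const std::string&),
                     std::size_t buckets) {
    std::vector<std::size_t> counts(buckets, 0);
    for (const std::string& k : keys)
        ++counts[hash(k) % buckets];
    const double mean = static_cast<double>(keys.size()) / buckets;
    double sum_sq = 0.0;
    for (std::size_t c : counts)
        sum_sq += (static_cast<double>(c) - mean) * (static_cast<double>(c) - mean);
    return std::sqrt(sum_sq / buckets);
}

Running it twice over the same key set, once with crc32 and once with the candidate hash, gives a direct, data-driven comparison.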

build not buy

Well, the good thing is that people are plentiful. But yeah, you are definitely well outside the mainstream. Hmmm... I think I'd go to colleges and search for interns in the low-level programming classes; I haven't met a kid who knew assembler. The other possibility would be to look at computer engineering majors, who I think do have that sort of training.

Sounds like an interesting place to work where code quality really matters.

Off-T for the site, but On-T for this thread

Like McCusker, I know this isn't the purpose of LtU, but like McCusker, I hope I'll be given some leeway, just once. I'm also looking for the same kind of programmers (basically, people like McCusker, or perhaps some junior people who hope to be like him eventually). They're hard to find, and sadly, they do seem to be getting scarcer. I'm in Chicago, and unlike McCusker, I'm pretty easy to find. I promise not to advertise here again.

To add some LtU content, one occasionally hears laments that in the move to higher-level languages, "the kids" just aren't learning how to program like they used to. It's interesting to me that often the people who are interested in the very "highest-level" languages (Haskell, ATS, ...) are also very cognizant of the metal and in fact are often quite competent with C/C++. It seems like it's mostly the languages somewhere in the middle that indicate the opposite (Java, PHP, Ruby...). I suppose I have theories about why that is.

From what I gather

Some form of assembly is usually taught in most CS curricula. But you might want to look at electronic engineers. Here in the Netherlands, they are the most likely to have real exposure to assembly, since they invariably do some course with small ARM processors. However, most of them are usually not as good at reasoning about algorithms (everything is a device driver and every complex data structure boils down to something on arrays).

But it is a diminishing art form. I grew up with a C64, which at the time shipped with BASIC, a block diagram of all components, and a list of all machine instructions; my brother grew up with a Pentium, JavaScript, and Flash. Life progresses.

NULLs in relational databases

I think a good example to work through is the issue of handling of NULLs in relational databases and the arguments pro and con.

Basically, if k is a key and item X has a 1:1 relationship with k, obviously X should be in the same table.
If X has a 1:n relationship with n greater than 1, it will be in a separate table.
But what if n is only 0 or 1?

You end up with trinary logic (True, False, NULL) and its complexities, versus having to introduce 5x as many tables and all the complexity of being intolerant.

6NF enough

The highest normal form I know of is 6NF. 6NF would reject your claim that X should be in the same table as k simply because each key has one instance of X. Why? Effectively, because each independent group of attributes associated with the key needs its own meta-data: temporal or bi-temporal information, possibly update source and an authenticity signature or certificate, potentially some secrecy information, and so on.

As an added advantage, you should never need nulls in the schema, though you might need to inject nulls for open joins.

A relvar R [table] is in sixth normal form (abbreviated 6NF) if and only if it satisfies no nontrivial join dependencies at all — where, as before, a join dependency is trivial if and only if at least one of the projections (possibly U_projections) involved is taken over the set of all attributes of the relvar [table] concerned.[Date et al.]

I'm rather fond of 6NF in theory, especially from security and extensibility points of view (i.e. you can always extend a 6NF database with new information without introducing compatibility issues - just add a new table).

I would really appreciate if our databases were better at 'denormalizing under the hood' such that we could get decent 'join' performance from 6NF in practice. But the practice today seems to be 3NF (and, except in rare pathological cases, 3NF is 5NF), and that is what RDBMS's are optimized for.

6nf and de-normalization in practice.

6NF would reject your claim that k should be in the same table simply because each key has one instance.

I'd agree I was being strictly 3NF. My context was an example of Tolerant vs. Demanding not a more general point about database design. Sorry if I was being too flippant.

Not being flippant, but I'm not sure that 4NF, 5NF, or 6NF really changes things. I'm assuming there is a relationship between the key and the item; 4NF and 5NF are really about finding better keys when the key by itself is not the determiner.

I would really appreciate if our databases were better at 'denormalizing under the hood' such that we could get decent 'join' performance from 6NF in practice.

They do; it's called clustered tables. You can de-normalize under the hood:
http://www.dba-oracle.com/oracle_tip_hash_index_cluster_table.htm

There are actually two types, hash and index; this has been around for years. Oracle invented it, but now Postgres, MySQL, Ingres, and others have it. If you want your physical and logical design to be totally different, almost every RDBMS will let you.

I would really appreciate if

I would really appreciate if our databases were better at 'denormalizing under the hood' such that we could get decent 'join' performance from 6NF in practice. But the practice today seems to be 3NF (and, except in rare pathological cases, 3NF is 5NF), and that is what RDBMS's are optimized for.

No, no, no. For very large information data banks, enterprise-grade databases today, like SQL Server Enterprise Edition and Oracle 11g, handle normalized schemas quite well, and it almost never makes sense to "denormalize" (which has no mathematically sensible meaning, by the way). The one "gotcha" for modern relational databases is systems like Reddit that have navigational access requirements, and this has to do (mostly) with how the heap files are structured in relational engines today. That leads to poor locality of reference for sites that consistently aggregate data in only a few ways but have to deal with massive transaction processing and data mirroring requirements simultaneously.

The reason you don't need to denormalize is quantum tunneling. Flash memory and multi-level memory hierarchies have removed the rusty moving parts from the equation. Even the RAID 10 configurations used on traditional 15,000 RPM RAID arrays are obsolete. The only thing standing in the way of this succeeding in a wide market is the cost. SQL Server Enterprise Edition is $25,000 per core, and a 1TB database will cost you over $1,000,000 in storage. The upshot of this expensive storage is that you arguably need fewer DBAs and consultants to maintain the solution, because you have far fewer disks in your RAID configurations to maintain, and your strategies for partitioning database files across those disks are much simpler to reason about and to analyze the performance of.

But if your goal is to do everything in an idealistic manner, the technology is there today to do it. The question is simply a business-level decision of whether you are willing to pay for the high-end of the performance curve.

My last experience working

My last experience working with bi-temporal data in 6NF was not impressive, performance-wise. I was paying O(M*N*log(N)*log(M)) per join, when an index that combines multiple tables for common join patterns could feasibly achieve O(M+N) per runtime join (the same as a projection of a 3NF table). This was a serious problem because, in 6NF, one very often joins 3 to 5 tables on a shared primary key.

OTOH, I last tried it 7 years ago, on whatever version of Oracle my university had. I plan to look into those 'clustered' tables Jeff mentioned above.

Typo?

> Basically if k is a key and item X has a 1:1 relationship with k obviously it should be in a separate table.

Uh? I'm not very good with relational databases, but I think you made a typo here; it should be in the same table.

thx

Thanks, you were right; I flip-flopped that, which was bad for the analogy. Corrected the original.

related reading

Why do computers stop? Jim Gray's classic Tandem report, "Why Do Computers Stop and What Can Be Done About It?":

http://bnrg.eecs.berkeley.edu/~randy/Courses/CS294K.S05/tandem_ft1.pdf