language handling of memory and other resource failures

My idea here is to introduce a place to discuss ideas for handling memory exhaustion and related resource limit management. The goal is to have something interesting and useful to talk about, informing future designs of programming language semantics or implementation. Thoughtful new solutions are more on topic than anecdotes about old problems. (Feel free to tell an anecdote if it is followed by analysis of a workable, nice solution.) Funny failure stories are not very useful.

Worst case scenarios are also of interest: situations that would be very hard to handle nicely, as test cases for evaluating planned solutions. For example, you might be able to think of an app that would behave badly under a given resource management regime. This resembles thinking of counterexamples, but with emphasis on pathology instead of contradiction.

In another discussion topic, Keean Schupke argued the idea that failure due to out-of-memory is an effect, while others suggested it was semantically more out-of-scope than an effect in some languages. I am less interested in what is an effect, and more interested in how to handle problems. (The concept is on topic when the focus is what to do about it. Definitions without use cases seem adrift of practical concerns.)

Relevant questions include: After partial failure, what does code still running do about it? How is it presented semantically? Can failed things be cleaned up without poisoning the survivors afterward? How are monitoring, cleanup, and recovery done efficiently with predictable quality? How do PL semantics help a dev plan and organize system behavior after resource failures, especially memory?


resilient systems

If you want to prevent resource exhaustion, it is sufficient to determine a bound for the use of that resource and dedicate the needed resources. Conveniently, this holds even when our computation uses dynamic allocation of resources internally, because we already know the maximum dynamic allocation will not surpass the bound. However, many computations don't have a bound on space or time usage. Further, even if we do have a bound, we might not be willing to dedicate sufficient resources for it; for example, when the worst case bound is much larger than the average case, dedicating resources for the worst case would cause utilization to plummet. In such cases, we must take our risks with the quota-hammer and accept possible failure.

We might consider a few cases upon failure.

The best case is that we simply stall. The computation is left in an incomplete state, but we haven't lost ground either. If conditions change, we may continue the computation. Slightly worse is full failure: we lose ALL our work and need to recompute from the start. And much worse is the 'partial failure' case, where we've lost some work, but other work has been performed wholly or partially, and now we need to recover from this rather ad-hoc state.

It's nice to simplify partial failure by turning it into full failure (e.g. with transactions), or at least into a more predictable failure state (e.g. with cascading failures). But it is feasible to push the other direction, towards stalling without losing work when we hit a quota limit.
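
For illustration, here is a rough Python sketch of the 'turn partial failure into full failure' idea (names like run_transactionally and OutOfQuota are invented for the example, not from any particular system): the computation works on a private copy of the state, and the new state is adopted only if the computation runs to completion, so hitting a quota mid-way loses only the attempt and never leaves survivors with a half-updated state.

```python
import copy

class OutOfQuota(Exception):
    """Raised when a computation exceeds its resource quota (hypothetical)."""

def run_transactionally(state, step):
    """Run `step` against a private copy of `state`.

    If the step completes, commit by returning the new state.
    If it fails (e.g. quota exhaustion), the original state is untouched:
    partial failure has been turned into full failure of this one attempt.
    """
    scratch = copy.deepcopy(state)   # private working copy
    try:
        return step(scratch)         # commit: caller adopts the new state
    except OutOfQuota:
        return state                 # roll back: keep the old state, lose only this attempt

# usage sketch
if __name__ == "__main__":
    ledger = {"balance": 100}

    def risky_update(s):
        s["balance"] -= 30
        raise OutOfQuota("allocation denied mid-update")  # simulate hitting a limit

    ledger = run_transactionally(ledger, risky_update)
    print(ledger)  # {'balance': 100} -- no half-applied update survives
```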

One option is to push side-effects away from the center of our application model, and to the edges, such that our 'application' is really a pure data object in a multi-agent system. This isn't a new idea, cf. tuple spaces, functional-relational programming, REST architectural style. I've recently been pursuing a variant where applications (and their databases) are purely functional objects represented in the codebase.

By shifting effects to the edge, we separate the problems of computation failure due to resource limits (which may cause observers of application state to stall) from the problems of performing the effects and injecting results. In general, this allows us to continue computation if we ever acquire sufficient resources for the computation to proceed.
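
As a toy sketch of what 'effects at the edge' might look like (the step signature and the request shapes here are my own invention, not any particular system): the application core is a pure function from (state, input) to (new state, requested effects), and an outer loop at the edge owns the actual effects, so the pure core can be stalled, resumed, or re-run without repeating side effects.

```python
# Pure core: no I/O, just data in and data out.
def step(state, message):
    """Return (new_state, effect_requests). Both are plain data."""
    count = state.get("count", 0) + 1
    requests = [("log", f"handled message #{count}: {message!r}")]
    return {**state, "count": count}, requests

# Edge: the only place effects are performed.
def run(inputs):
    state = {}
    for msg in inputs:
        state, requests = step(state, msg)      # pure, restartable
        for kind, payload in requests:          # effectful, at the edge
            if kind == "log":
                print(payload)
    return state

if __name__ == "__main__":
    final = run(["hello", "world"])
    print(final)
```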

Pre-allocation and checkpoints

I think our ideas of solutions are similar. In general, for a bounded computation the best strategy is to request all the required resources at the beginning, and fail early, without doing any work, if there are insufficient resources. I would want to do this by inferring effects in the type system to track resource requirements.
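
A minimal sketch of 'request everything up front, fail early' under the assumption that the bound is known (the Arena class below is hypothetical, not any real allocator): the reservation either succeeds in full before any work is done, or the computation is refused with no work lost.

```python
class Arena:
    """Toy fixed-size allocator: reserve the whole bound up front."""
    def __init__(self, bound_bytes):
        self.buffer = bytearray(bound_bytes)  # fails here, early, if memory is short
        self.used = 0

    def alloc(self, nbytes):
        if self.used + nbytes > len(self.buffer):
            raise MemoryError("bound was wrong: internal allocation exceeds reservation")
        start = self.used
        self.used += nbytes
        return memoryview(self.buffer)[start:start + nbytes]

def bounded_computation(bound_bytes):
    arena = Arena(bound_bytes)        # all-or-nothing reservation, before any work
    chunk = arena.alloc(1024)         # dynamic allocation inside the known bound
    chunk[:5] = b"hello"
    return bytes(chunk[:5])

if __name__ == "__main__":
    print(bounded_computation(64 * 1024))
```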

For unbounded computation, I would seek to factorise into bounded chunks. Each bounded chunk could pre-request the required resources, and if there are insufficient resources for the next chunk, save a checkpoint to allow the computation to be resumed later. Reserving the resources for the checkpoint would be part of the initial setup for that operation.

In this way interactive programs would ensure the main loop cannot fail due to lack of resources, but individual actions selected by the user might.

Edit: Interestingly, checkpointing an unbounded computation strongly suggests a stateful approach, as you need to store the current state in the checkpoint. Does this imply resilience requires an imperative language? Certainly, with state being so important to resilience, you want it to be a first-class citizen of such a language. My preferred approach would be something like a finite state machine, so that the state is clearly partitioned from the pure functions that do the transitions. Each transition would be one 'chunk' of work, with bounded resource requirements. If you consider that you might have several resilient nodes arranged in a network to carry out parallel tasks (like a grid computer where nodes are connected by Ethernet), it starts to sound a lot like the actor model. Perhaps the actor model is nothing more than FSMs communicating on a network, something electronics engineers have been doing for a long time.
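
To make the FSM idea concrete, here is a small sketch with invented details: the state is explicit, serializable data; each transition is a pure function doing one bounded chunk of work; and the driver checkpoints between chunks, so it can stop when resources (a toy 'fuel' budget here) run out and resume later from the saved state.

```python
import json

def transition(state):
    """Pure: one bounded chunk of work. Here, sum integers 1..n in chunks of 1000."""
    i, total, n = state["i"], state["total"], state["n"]
    end = min(i + 1000, n + 1)
    return {"i": end, "total": total + sum(range(i, end)), "n": n}

def done(state):
    return state["i"] > state["n"]

def run(state, fuel, checkpoint_path="checkpoint.json"):
    """Drive transitions until done or out of fuel; checkpoint between chunks."""
    while not done(state) and fuel > 0:
        state = transition(state)
        fuel -= 1
        with open(checkpoint_path, "w") as f:   # resumable point between chunks
            json.dump(state, f)
    return state

if __name__ == "__main__":
    s = {"i": 1, "total": 0, "n": 10_000}
    s = run(s, fuel=3)            # stalls partway: out of fuel, checkpoint on disk
    s = run(s, fuel=100)          # resumes later and finishes
    print(s["total"])             # 50005000
```

Because the state is plain data rather than a live call stack, the checkpoint format stays independent of the transition code, which matters for the later point about rewriting the algorithm.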

re: resilient systems

I disagree with none of your remarks, but don't follow the point of your last two paragraphs. It sounds like you are saying purely functional results are simply not present, or stalled, when memory is exhausted.

it is sufficient to determine a bound for the use of that resource and dedicate the needed resources.

At the end of the 80's, I had this conversation with one of my professors: a fellow who informally preferred to be addressed as Stan, and who knew what he was doing. (Reference to Steve Carell's "NO GOD NO" reaction in The Office, now a meme.)

Me: How do you properly deal with running out of dynamic heap memory?
Stan: Don't do that. Statically allocate the most you need up front.
Me: NO GOD PLEASE NOOOO!

But as far as I know, he was correct, and it figures into making systems behave under stress, by refusing work when handling it would surpass available resources. There's even a discipline of credit-based flow control, which amounts to accepting data inputs only after telling producers you have room for them, so back pressure among connected components helps avoid overstepping resource bounds.
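
A small single-threaded sketch of credit-based flow control, with toy classes of my own invention: the consumer grants credits only for buffer space it actually has, and the producer may send only while it holds credits, so connected components cannot overrun each other.

```python
from collections import deque

class Consumer:
    def __init__(self, capacity):
        self.buffer = deque()
        self.capacity = capacity

    def initial_credits(self):
        return self.capacity          # grant only as much as we have room for

    def receive(self, item):
        assert len(self.buffer) < self.capacity, "credit protocol violated"
        self.buffer.append(item)

    def process_one(self):
        """Consume one item and return 1 replenished credit to the producer."""
        self.buffer.popleft()
        return 1

class Producer:
    def __init__(self, credits):
        self.credits = credits

    def try_send(self, consumer, item):
        if self.credits == 0:
            return False              # back pressure: hold the item, do not overrun
        consumer.receive(item)
        self.credits -= 1
        return True

if __name__ == "__main__":
    c = Consumer(capacity=2)
    p = Producer(credits=c.initial_credits())
    print(p.try_send(c, "a"), p.try_send(c, "b"), p.try_send(c, "c"))  # True True False
    p.credits += c.process_one()       # consumer frees a slot, returns a credit
    print(p.try_send(c, "c"))          # True
```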

The general idea is to impose quotas, enforce them with accounting, and punish infractions before running out of resources. And if you do run out, whack somebody according to current policy, to get resources back. But whacking something implies an architecture where you can cleanly kill a component, because it is sufficiently independent. (As far as I know, if you kill the holder of a mutex on a critical section, none of the waiters on that mutex can be assumed valid afterward, because an invariant involved may no longer hold. It's a mutable state problem.)

If resources can show up later, stalling is a good idea, but you can live lock pretty easily. Freeing a resource might require a reply to a message you cannot send because you already hit your quota. This is not easy to type for static analysis. Other options include killing the offender, or notifying a watcher to do something (like attach a debugger to the stalled code when a dev is waiting for this to happen).

Limits can be expressed as a smaller soft limit and a bigger hard limit. When a soft limit is hit, you send a friendly notification to an offender saying, "You have a problem: you are at your soft limit." Maybe they flush an optional cache when this happens. At the hard limit, you cannot let them run without permitting them to go over. What happens can be policy-based: stall until someone increases the hard limit, kill for fast fail, or attach an agent to evaluate what's up.
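
Here is a rough sketch of that soft/hard limit scheme, with invented names and a pluggable hard-limit policy: crossing the soft limit triggers a friendly notification (maybe the component flushes an optional cache), while crossing the hard limit invokes whatever policy is configured, such as stall, kill, or attach an agent.

```python
class QuotaTracker:
    def __init__(self, soft, hard, on_soft, on_hard):
        assert soft <= hard
        self.soft, self.hard = soft, hard
        self.used = 0
        self.on_soft, self.on_hard = on_soft, on_hard
        self.warned = False

    def charge(self, amount):
        if self.used + amount > self.hard:
            return self.on_hard(self, amount)    # policy: stall / kill / attach an agent
        self.used += amount
        if not self.warned and self.used > self.soft:
            self.warned = True
            self.on_soft(self)                   # friendly notification, e.g. flush caches
        return True

# example policies
def warn(tracker):
    print(f"soft limit hit at {tracker.used} units: consider flushing optional caches")

def kill(tracker, amount):
    raise MemoryError(f"hard limit: {tracker.used}+{amount} > {tracker.hard}")

if __name__ == "__main__":
    q = QuotaTracker(soft=80, hard=100, on_soft=warn, on_hard=kill)
    q.charge(50); q.charge(40)   # crosses the soft limit -> warning
    try:
        q.charge(20)             # would cross the hard limit -> policy (here: fail fast)
    except MemoryError as e:
        print("killed:", e)
```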

Keean's comment about checkpoints is a good one. In the early to mid 90's, the OpenDoc team had to advise how third-party developers could cope with running out of memory, which was almost a certainty in some situations where at most a couple megs of RAM were present for application use. When you hit a hard limit, it was too late to recover. So we said: take checkpoints. Instead of wasting compute resources spent before a failure, particularly for long-running code, you can take snapshots at clear points of partial progress where you can resume if needed. It's not clear how that discipline might be given an affordance in a programming language, though.

re: don't follow

but don't follow the point of your last two paragraphs. It sounds like you are saying purely functional results are simply not present, or stalled, when memory is exhausted

I could have organized it better. But my point regarded "it is feasible to push the other direction, towards stalling without losing work when we hit a quota limit". Nothing in those two paragraphs was specific to 'purely functional' programming or results. (None of the listed examples was purely functional.) Though, I wouldn't exclude purely functional computations.

If resources can show up later, stalling is a good idea, but you can live lock pretty easily.

Livelock? You seem to be making assumptions about implementation or architecture. Consider: Stalling can be implemented by shifting resource consumption, e.g. by backing the computation to file. Providing more resources is a rather flexible idea in a world with cloud computing, virtual machines, and hot swappable memory. Providing input that leads to state updates can be lockless and even independent of the application that processes the inputs into a state, e.g. via append-only logs.
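
One concrete reading of the append-only log point, with details invented for the sketch: producers append to the log regardless of whether the application is running, and the application derives its state by a pure fold over the log from its last recorded offset, so a stalled or restarted application simply picks up where it left off.

```python
import os

LOG = "inputs.log"   # hypothetical log file for the sketch

def append_input(line):
    """Producer side: lockless with respect to the application; just append."""
    with open(LOG, "a") as f:
        f.write(line + "\n")

def fold_state(offset=0, state=0):
    """Application side: a pure fold over log entries past `offset`.

    If the application stalls or dies, nothing is lost: rerun the fold
    from the last (offset, state) pair it managed to record.
    """
    with open(LOG, "rb") as f:
        f.seek(offset)
        data = f.read()
    for line in data.splitlines():
        state += int(line)           # example update: keep a running total
    return offset + len(data), state

if __name__ == "__main__":
    if os.path.exists(LOG):
        os.remove(LOG)               # start the sketch from an empty log
    append_input("3"); append_input("4")
    offset, state = fold_state()              # state == 7
    append_input("5")                         # producer keeps going while the app is "stalled"
    offset, state = fold_state(offset, state)
    print(state)                              # 12
```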

Operating beyond our resource limits is, of course, problematic. A stalled computation is still a failure state. It's just a relatively nice failure state.

Checkpoints

I am not so sure a stalled computation is a failure state. If a computation is designed to checkpoint, then it is actually in a valid non-failure state, where it can be resumed later.

It is surely just a suspended, or paused computation, which could be continued when sufficient resources are available.

To me something is only a failure if it cannot be resumed later. Maybe whether you view suspension as a failure depends on whether you are taking a stateful or pure functional view of the world?

re: checkpoints and failure

I consider it a failure even if my computation runs to completion despite VM thrashing, livelock, etc., if it does not return within whatever I consider an acceptable time frame. OTOH, if a computation checkpoints, fails, resumes, and completes all within my time frame (e.g. via dynamic cloud resources), I would still consider that a success.

I suppose I see it this way: ALL computations are either hard real time or soft real time. My lifespan certainly factors in the latter case... but so does my attention span. And not just time. If I'm paying cash for computation or space (like on a VM cloud) there will be other budgetary limits, too.

There are many computation styles that support more or less transparent checkpoints: term or graph rewrite systems (and pure FP implemented that way), blackboard-metaphor systems like tuple spaces, the OOP command pattern, or anything operating on an append-only log, etc. Your preference for explicitly modeling checkpoints is certainly doable, but not something I consider remarkable.

However, handing me a checkpoint after stalling on quota and calling that success feels rather like a "Well, son, I didn't finish your house, ran out of lumber and budget you see, but I did throw a tarp over it so construction can be resumed later if you feel like paying for it. So I'm calling this a success."

Close down and continue

I think what is more likely to happen is that you close down some other applications to resume the computation, or place an order (Amazon Prime Now?) for some more memory, then resume the computation. So I would not consider a computation that I cannot complete a failure; a suspended computation is a bit like Schrödinger's cat: I can't know if it will eventually fail or succeed until I resume it (open the box). Your view that it is a failure seems like trying to call the result of a football match at half time.

Both tuple spaces (which seem a lot like FSMs) and append-only logs are stateful methods, and I am completely fine with these as ways of achieving this. I think having something built into the language to allow suspension and resumption of computation is the way to go. I don't really want to rely on programmers writing this for every algorithm, because I know they won't, so language support is necessary if I want my applications to behave like this normally.

Of course if resources are near the limit you probably want to make sure the algorithm is making the most of what you have which may mean manually managing the resource.

Edit: it would be good if you could rewrite the algorithm to save memory and re-run from the checkpoint, as very long calculations might not be re-runnable. To me this suggests the checkpoint state should be explicit; otherwise there is no guarantee a different algorithm would result in the same checkpoint format.
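
A toy illustration of why the explicit checkpoint format matters (the format and both algorithms are invented for the example): because the checkpoint is plain data with a declared meaning, a rewritten, more memory-frugal algorithm can resume from a checkpoint written by the original one.

```python
# Checkpoint format is explicit and algorithm-independent:
# {"i": next index to process, "total": partial sum so far, "n": upper bound}

def sum_v1(ckpt):
    """Original algorithm: materializes a whole list (memory hungry)."""
    values = list(range(ckpt["i"], ckpt["n"] + 1))
    return {"i": ckpt["n"] + 1, "total": ckpt["total"] + sum(values), "n": ckpt["n"]}

def sum_v2(ckpt):
    """Rewritten to save memory: streams instead of materializing a list."""
    total = ckpt["total"]
    for x in range(ckpt["i"], ckpt["n"] + 1):
        total += x
    return {"i": ckpt["n"] + 1, "total": total, "n": ckpt["n"]}

if __name__ == "__main__":
    # Pretend v1 ran half way, was suspended, and saved this checkpoint:
    ckpt = {"i": 500_001, "total": sum(range(1, 500_001)), "n": 1_000_000}
    # Resume with the rewritten algorithm, from v1's checkpoint:
    print(sum_v2(ckpt)["total"] == 1_000_000 * 1_000_001 // 2)   # True
```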

Relevant: Memory Accounting

Time

So far no one has mentioned time. So I will. Managing it is extremely hard, and stalling and checkpoints are often NOT an option. The usual method is to attempt to calculate maximum resource requirements (as dmbarbour says), perhaps experimentally.

A real case from my past: a lighting control system running a phase-locked loop, which is managed by measuring time intervals between certain interrupts. The timer reset interrupt is used to schedule a subsequent sequence of interrupts controlling triac devices that turn lights on for a few ms (the time determining the brightness). The system is hyper real time and uncontrolled failure can be disastrous. Note that 90% of the code runs in interrupts.

One problem is that when many lights are at the same illumination level, many interrupts occur at once. If the system is fading one light up and another down, the fading must be smooth when the illumination levels cross, which means the interrupts for each light must be issued very precisely. Crossing interrupts are merged into the same interrupt service routine to effect this.

A second problem is that the timing input (actually detecting when the AC voltage equals zero) is unstable and noisy, which is why a phase-locked loop is being run.

Some of these systems must have an emergency lighting mode.

The only possible response on loss of synchronisation is to reboot the processor, which means the lights have to be turned on by hardware to full (or some appropriate level). This also means there must be a way of detecting a failure. Loss of synchronisation can leave people in the dark if not detected (dangerous), or it can blow up spotlights (expensive!), and instability can lead to serious financial loss by disturbing patrons at a venue (e.g. a luxury hotel restaurant).

This kind of issue can arise in ALL hard real time systems and also soft ones. As an example of a total collapse, the recent Australian Census was conducted almost completely online. It crashed completely around 2000 (8 pm), when most people decided to log in to the website and fill out the forms. The Census in Australia is compulsory, so basically in one short period the WHOLE COUNTRY had to access the website.

Provisioning networks is another time related aspect of resource management. This one didn't cause a nuclear meltdown but it did cause a bit of a political one. Someone didn't do their modelling right.

Time as a resource

This requires evaluating the worst case time that a function can take, and making sure it is less than a certain time interval. I remember doing this kind of thing in 68000 assembly to make sure a video interrupt could not take longer than the flyback blanking period, or that audio processing takes place before the next sample is needed by the DAC.

Again my solution to this is effects. The type system would infer timing effects of functions, and you can use the type system to constrain these effects. If you want smooth interactive graphics you might constrain the redraw handler to run in 300ms or less. The compiler would fail the typechecking statically if there exists an execution path that could take longer than that on the target hardware.
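
The statically checked version needs compiler support, but for a flavor of the constraint itself, here is a purely dynamic analogue (all names invented): a deadline wrapper that treats 'redraw must finish within 300 ms' as an enforced budget and fails loudly when an execution path exceeds it, which a timing-effect type system would instead reject at compile time.

```python
import time

class DeadlineExceeded(Exception):
    pass

def with_deadline(budget_seconds, fn, *args, **kwargs):
    """Run fn and check (after the fact) that it stayed within its time budget.

    A static timing-effect system would reject offending paths at compile time;
    this sketch can only detect the overrun at run time.
    """
    start = time.monotonic()
    result = fn(*args, **kwargs)
    elapsed = time.monotonic() - start
    if elapsed > budget_seconds:
        raise DeadlineExceeded(f"{fn.__name__} took {elapsed * 1000:.1f} ms "
                               f"(budget {budget_seconds * 1000:.0f} ms)")
    return result

def redraw():
    time.sleep(0.01)    # stand-in for real drawing work
    return "frame"

if __name__ == "__main__":
    print(with_deadline(0.300, redraw))   # within budget
```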

Modern processors are complicated

This has gotten much harder to do in recent years since worst case instruction times are often orders of magnitude greater than typical case times, due to caches, etc.

Concurrency = Unpredictability

I would hazard a guess that your 68000 experience was gained on an Amiga where the executive had been disabled and a known set of custom interrupts was being used - and custom hardware reduced the latency of the video interrupt. That was a system with no memory hierarchy (apart from fast/slow memory), and the main target area for those techniques (games/demos) ruled out multi-tasking.

Without multi-tasking WCET is much more predictable, although as Matt observes above the hot-path with cache hits may be orders of magnitude faster than the cold-path waiting for main memory. If the cold-path causes a page-fault then it is even worse, from repopulating the TLB up to waiting for a disk to be serviced.

The source of a program is not sufficient to predict whether the memory will remain hot or cold. After a context switch we don't know how much of the old state is still hot, and how much has been invalidated by running unknown code. The problem with WCET is that the answer distribution is largely bi-modal: there is almost no resemblance between the modes in the distribution. The choice of mode is a property of the overall system, rather than the particular program being run.

Obviously a single program device is easier to analyse, but even then the presence of a cache can mean that WCET will give statistical answers with a large variance - not too useful for realtime systems.

Multiple processors

Actually the blitter, copper and CPU were all capable of parallel operation, and you could select whether the blitter triggered a CPU interrupt or resumed the copper.

I also have written a complete operating system in x86 assembly language complete with segment based message passing, a display, disk and network driver, and capability based permissions.

WCET is what it is, so you have to assume nothing is in the cache when the program starts. You can only realistically bound CPU time, as other programs may use the CPU for indeterminate amounts of time. In my experience, time-critical code normally runs at a higher privilege level, and most CPUs do not allow interrupts to be interrupted; instead you normally have to poll the interrupt flags before you exit, to see if any interrupts have happened whilst you were processing the current interrupt.

when vs how much

Time, as a resource, has a couple of different aspects for accounting purposes: duration sum (how much) and span interval bandwidth (when). Your examples seem to be the latter, where you want to schedule execution during target spans of the real-time clock, to suit hardware active at that time. But you can only squeeze so many things into hardware bandwidth around a target clock time. Usually I think about the former, in terms of total consumed cycles (possibly synthetic) toward a short or long term budget, such as timeslice remaining before yield, or an upper bound on cycles permitted an untrusted component doing a job.

In other words, your use cases seem real-time with sync issues, but my use cases are soft budget limits with overspending issues. The soft budget limits only turn into a real time bandwidth problem when you cannot squeeze inbound work streams into available cycle bandwidth. Resolving it the easy way is to say no, once at capacity, to additional work until capacity replenishes. The fancy way to resolve it involves priority, which, as you note later, is easy to mess up. I like priority as favoritism only, giving a larger budget slice of the total pie to higher priority items. (Pretend I said something about priority inversion here too.)

A couple of practical problems I want to address with budgets are containing service damage due to in-house and third-party bad actors. If an in-house tool should complete in N units, maybe there's a problem, like an infinite loop, when it hits 50*N units. (I guess that's artificial, since infinite loops can be noticed via heartbeat failure. Maybe you add heartbeats to things observed to behave badly, which you notice by measuring.) Third-party tools might run in a sandbox, after you bind to them, because they advertise, "I will do J units of job work for you in exchange for C cycle units of execution", with error bars provided too, for quality assurance. If they grossly underperform, you sack them. But what if you host multiple tenants, Alice and Bob and their buddies, and they decide to trust third-party tools? You might require Alice and Bob to administer a budget of cycles, where they have to re-up bad actors if they don't want to sack them. That would be an example of stalled tools until a limit is increased. Chains of trust have an irritating mushy quality.

Soft measurement doesn't need to be precise; it can use coarse synthetic units. For example, Erlang schedules in units called reductions. For a programming language to help with accounting, it only needs to be observable with some predictability in unit scale. You could let users spend cycles in thalers, where a thaler is whatever you want. It would help devs avoid thinking of time as irrelevant and unbounded if there is an event stream of data to perceive about it.
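
A sketch of accounting in coarse synthetic units, loosely inspired by Erlang's reduction counting but with everything else invented: each task yields once per 'reduction', and the scheduler charges each yield against the task's budget (call the units thalers if you like), stalling tasks that overspend until someone tops them up.

```python
def counting_task(name, n):
    """A task yields once per unit of work; each yield costs one synthetic unit."""
    for i in range(n):
        yield f"{name}: step {i}"

def run_with_budgets(tasks):
    """tasks: dict of name -> (generator, budget in synthetic units)."""
    stalled = {}
    for name, (gen, budget) in tasks.items():
        spent = 0
        for _event in gen:
            spent += 1
            if spent >= budget:
                stalled[name] = gen      # out of budget: stall, don't kill
                break
        else:
            print(f"{name}: finished within budget ({spent} units)")
    return stalled                       # these need a top-up (or a sacking)

if __name__ == "__main__":
    tasks = {
        "well_behaved": (counting_task("well_behaved", 5), 10),
        "overspender":  (counting_task("overspender", 1000), 10),
    }
    stalled = run_with_budgets(tasks)
    print("stalled until budget increased:", list(stalled))   # ['overspender']
```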

More on Time

There is another method of managing time in certain kinds of real time systems: priorities. If you run out of time, just drop lower priority items. So if you don't mind an interesting story about that and how it can go wrong:

When the Apollo 11 Lunar Lander was descending towards the Moon, the computer overloaded. So it did what it was programmed to do: drop lower priority systems.

Including the guidance system :)

A more detailed story in Wikipedia: https://en.wikipedia.org/wiki/Apollo_11