How Useful is Erlang Hot-Swapping of Code?

Please discuss. I am interested.

hot swapping vs. fix-and-continue

As a meta-comment, there is a fundamental difference between Erlang's hot swapping and Smalltalk's fix-and-continue. The former is a production capability that allows systems to be upgraded while they are still running in the field. The latter is a development capability that allows code to be updated while the program is running during debugging. Although the capabilities are technically similar, the contexts are completely different! Hot swapping has to work reliably, and you might spend a lot of time ensuring that a patch can be plugged into the running system; fix-and-continue is just something you do to save time in your debug-edit loop.

As a result of the different goals, the literature on both mechanisms is fairly segregated.

hot swapping

I always thought that the ability to change classes and functions midstream in a program was similar to, and much harder than, changing schemas in a database. There are so many features and cases you want:
1) the ability to make many changes atomically
2) questions about whether existing objects have to be migrated, and when
3) questions about whether old objects see the changes or continue to use old versions of classes or functions
4) questions about concurrency
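The four points above can be made concrete with a toy versioned function table. This is a minimal Python sketch under loud assumptions, not how any real runtime works: `CodeTable`, `Obj`, `upgrade` and `call` are hypothetical names, an upgrade swaps the whole table under one lock (points 1 and 4), and existing objects are migrated lazily on first use, after which they only see the new code (points 2 and 3).

```python
import threading
from dataclasses import dataclass
from typing import Any, Callable, Dict

Fn = Callable[[Any], Any]


@dataclass
class Obj:
    version: int  # code version this object's state was last written for
    state: Any


class CodeTable:
    """Hypothetical versioned table of named functions (illustration only)."""

    def __init__(self, funcs: Dict[str, Fn]):
        self._lock = threading.Lock()  # (4) serialise upgrades against calls
        self.version = 1
        self._funcs = dict(funcs)
        self._migrations: Dict[int, Fn] = {}  # v -> converts state from v-1 to v

    def upgrade(self, new_funcs: Dict[str, Fn], migrate: Fn) -> None:
        # (1) every replaced function becomes visible in one atomic step
        with self._lock:
            self._funcs = {**self._funcs, **new_funcs}
            self.version += 1
            self._migrations[self.version] = migrate

    def call(self, name: str, obj: Obj) -> Any:
        with self._lock:
            funcs = self._funcs
            # (2) + (3): old objects are migrated lazily, on first use,
            # and from then on run only the newest code
            while obj.version < self.version:
                obj.state = self._migrations[obj.version + 1](obj.state)
                obj.version += 1
        return funcs[name](obj.state)


table = CodeTable({"describe": lambda s: s["name"]})
acct = Obj(1, {"name": "alice"})
table.upgrade({"describe": lambda s: s["first"] + " " + s["last"]},
              migrate=lambda s: {"first": s["name"], "last": "?"})
print(table.call("describe", acct))  # alice ?
```

Lazy migration keeps the upgrade itself cheap at the cost of a version check on every call; an eager variant would walk all live objects inside `upgrade` instead.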

It's not just debugging, it's ANY change.

What features does hot swapping expose?

What features does hot swapping expose?

I am not sure. In the context of Erlang, I gather it means being able to continue serving requests with updated code while the running requests on the old code die off.

I gave up on types; basically I thought: if not types, then functionality. So I am looking into this.

I guess you need it for a data center language. At Facebook, they patched a GHC version to enable hot-swapping in the context of a Haxl application. Google is looking into Haxl too.

So, anecdotal evidence suggests you need it. But I would think that in a server farm you could just have one of the load-balanced servers die off and then be rebooted with updated functionality. And I read that Erlang only allows two versions of a module to be loaded at the same time (that looks a bit broken to me).
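The Erlang rule, roughly: a module can have at most two versions loaded at once, "current" and "old". A process keeps running old code until it makes a fully qualified call (`Module:function(...)`), which enters the current version. Before a third version can go in, the old one has to be purged, and `code:purge/1` kills any process still executing it. The Python below is a toy model of that rule only, with made-up names (`Module`, `Proc`, `load`, `local_call`, `remote_call`); it is not how the BEAM code server is implemented.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Handler = Callable[[str], str]


@dataclass
class Version:
    number: int
    handler: Handler


@dataclass
class Proc:
    running: Version  # the code version this process is currently executing
    alive: bool = True


class Module:
    """Toy model of one module slot holding 'current' and 'old' code."""

    def __init__(self, handler: Handler):
        self.current = Version(1, handler)
        self.old: Optional[Version] = None
        self.procs: List[Proc] = []

    def spawn(self) -> Proc:
        proc = Proc(self.current)
        self.procs.append(proc)
        return proc

    def load(self, handler: Handler) -> None:
        # Purge the old version first: any process still running it is killed.
        if self.old is not None:
            for proc in self.procs:
                if proc.alive and proc.running is self.old:
                    proc.alive = False
        self.old = self.current
        self.current = Version(self.old.number + 1, handler)

    def local_call(self, proc: Proc, msg: str) -> str:
        # A local call stays in whatever version the process is running.
        self._check(proc)
        return proc.running.handler(msg)

    def remote_call(self, proc: Proc, msg: str) -> str:
        # A fully qualified call moves the process to the current version.
        self._check(proc)
        proc.running = self.current
        return proc.running.handler(msg)

    @staticmethod
    def _check(proc: Proc) -> None:
        if not proc.alive:
            raise RuntimeError("process was killed when its code was purged")


m = Module(lambda msg: "v1:" + msg)
a = m.spawn()                      # starts on v1
m.load(lambda msg: "v2:" + msg)    # v1 becomes 'old'
b = m.spawn()                      # starts on v2
print(m.local_call(a, "x"), m.local_call(b, "x"))  # v1:x v2:x
m.load(lambda msg: "v3:" + msg)    # purges v1, so a is killed
print(a.alive, b.alive)            # False True
```

So the limit is on versions coexisting, not on the number of upgrades: a long-running process whose loop keeps making fully qualified calls moves forward and survives any number of loads.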

only pain relief features come to mind

Hot swapping should mainly enable finer-grained updates without a restart, so it functions as relief for a pain point. Unless there is some pain, the feature's value is low. But pain is typical, due to restart latency or to service interruption in long-running entities like tunnels.

I gather under Erlang services are organized around (lightweight) processes acting as message-switching actors. If you restart all processes in an Erlang runtime just to upgrade behavior in a single service, this could interrupt a lot of things that didn't need a logical reboot. It violates a model of process independence, that unrelated things don't interfere with one another.

The issue is similar to that of upgrading data describing configuration without a restart. It's reasonable to apply new policy to new traffic without breaking things currently underway. But data is expected to be somewhat dynamic, while code is often treated as relatively static in comparison.

If reachable code is registered in some namespace, you can edit the namespace to cause new code to run in response to new inbound activity. That's a kind of data-to-code control system, usually implemented in file systems at the OS level, though some application architectures ignore it.
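A minimal sketch of that data-to-code idea, assuming a single in-process table (`Registry`, `register`, `resolve` and `handle_request` are hypothetical names): each inbound request looks its handler up once on arrival, so editing the table changes what new requests run, while work already underway keeps the code it started with.

```python
import threading
from typing import Callable, Dict

Handler = Callable[[str], str]


class Registry:
    """Hypothetical name -> handler namespace, consulted once per request."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._table: Dict[str, Handler] = {}

    def register(self, name: str, handler: Handler) -> None:
        with self._lock:
            self._table[name] = handler  # only later lookups see the new code

    def resolve(self, name: str) -> Handler:
        with self._lock:
            return self._table[name]


def handle_request(reg: Registry, service: str, payload: str) -> str:
    handler = reg.resolve(service)  # bound once, when the request arrives
    return handler(payload)         # runs to completion with that code


reg = Registry()
reg.register("greet", lambda p: "hello " + p)
underway = reg.resolve("greet")              # a request already in progress
reg.register("greet", lambda p: "hi " + p)   # edit the namespace
print(underway("bob"))                       # hello bob
print(handle_request(reg, "greet", "bob"))   # hi bob
```

The same table would serve equally well for configuration data, which is the point of the comparison: code becomes just another value the namespace points to.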

OS and languages

If you restart all processes in an Erlang runtime just to upgrade behavior in a single service, this could interrupt a lot of things that didn't need a logical reboot. It violates a model of process independence, that unrelated things don't interfere with one another.

Thanks, this crisply explains why this feature is absolutely pervasive among operating systems, yet rare in programming language runtimes.

I am regularly surprised by the degree of overlap between OS and language runtimes.


Yah. That's what I thought too. Because Erlang started out as a telecom language, I imagine those 'switches' didn't do much more than boot into the Erlang runtime and run applications on top.

I.e., Erlang was the OS. All it needed to do was schedule processes and take care of networking.

That makes it suited to data center applications, but given the legacy it carries along, I suspect it could be done better.

(I was thinking of writing a toy language in the direction of data center applications, but in all honesty, I don't think I can engineer it better than Scala. It's too much work to implement all that functionality myself. But, ah well, you learn a lot by toying around.)

OS and languages: more examples

I am regularly surprised by the degree of overlap between OS and language runtimes.

You might have been thinking of the Haskell-on-bare-metal and OCaml-on-bare-metal projects as other examples of a language runtime being the OS. One of the earliest examples was probably the Burroughs machines, which used Algol-like languages, and the Elbrus computer with Algol 68 as a machine language. I once saw a listing of the Pascal compiler written in that language. What I remember are extremely long identifier names and no spaces.

Perhaps the clearest example of a language system serving as an OS is Forth. I admire its declaration in the first sentence of "FORTH - A Language for Interactive Computing" (published in 1970):

FORTH is a program that interfaces keyboards with computer. It provides all the software necessary to time-share users and manage core and disk memory. Its key is a dictionary that divides memory into entries that identify character strings, code and data. The resulting language is sufficiently powerful to describe FORTH itself, and sufficiently flexible to make inquiries with. It may be readily extended to handle as many, and as complex, applications as hardware permits. On the B-5500 FORTH uses 2K of core and can express a complex application in each of the 30 1K regions of core that remain.


I was thinking rather of more specific concerns where operating systems and language implementations overlap. Implementing an OS in a language with a rich runtime is one possible way to reduce the duplication of concerns, but I suspect that there is a more general unification story yet to be found.

For example, there are advanced implementations of the idea of copy-on-write in both language runtimes (or fairly elaborate standard libraries) and operating systems. Same thing for reference counting, memory page tables (there is one in the OCaml runtime for example), various threading models, context switching, self-modifying code, etc.

On a related note, it is sometimes enlightening to present language features in terms of OS features we are more familiar with. I don't know if that was the original goal, but this is what I got from your (Oleg's) 2007 article "Delimited Continuations in Operating Systems" with Chung-chieh Shan, an older version of which was discussed on LtU.

Finally, some tools and ideas feel completely agnostic of whether they are used at the programming language level, the operating system level or even the hardware level. One example would be "correctness proof by iterated refinements". I have seen this applied to actual machines, operating systems, virtual machines and elaborate computer programs, and it is exactly the same approach in each case.

It is stunning that there is not a lot more interaction between programming language and systems research than there is today.

It is stunning that there is

It is stunning that there is not a lot more interaction between programming language and systems research than there is today.

Galen Hunt once told me that the goal of systems research today is to come up with good abstractions to build modern software systems. Hey, wait a second! Truth is, we are doing the same things (abstraction design) while working at different levels (linguistic vs. systems).

In the late '90s, PL and systems were fairly co-located; I started off in a lab that was even titled "systems PL" (and before that got my first big break under Brian Bershad and Emin Gün Sirer... Spin was an OS with a language trick via Modula-3). But somehow the communities have drifted far apart, solving similar problems in completely separate solution spaces.

Runtime vs OS

This is a general comment on this sub-topic, not a specific reply to the last post (I was having trouble deciding where it would fit best).

The problem with a language-specific OS, where the runtime is the OS, is that the language needs to be capable of writing its own runtime. The best example is 'C': in many ways Unix and 'C' are complementary; Unix is the 'C' runtime, and it is written in 'C'.

I think if you want to have a Haskell OS/Runtime, then you need to be able to write Haskell's runtime in Haskell. Otherwise you can't escape from writing the OS in 'C', and that means Haskell will always be second class, having to use imported 'C' function bindings, and type-system escape-hatches to call OS functions.

This means the language needs to support a non-GC unmanaged memory mode, as well as direct memory access and bit manipulation (for hardware drivers). The problem with languages like Haskell is that they build imperative abstractions (monadic state) on top of a functional core, so you can't escape the garbage collector. I think a better way is to build the functional abstraction on top of the imperative one, so that the type system enforces the functional abstraction. This comes down to proving that objects behave like values, which can be summarised in a few axioms, plus a way to deal with references.

As I stated in my previous

As I stated in my previous post, see Spin. They dealt with exactly these problems (Modula-3 was garbage collected) from a very aggressive OS perspective.

But again, the communities have drifted increasingly far apart given philosophical differences.

Not sure

Spin looks like an interesting project, but I am not sure that it addresses the problem. Yes, you could write kernel modules in Modula-3, but the kernel depended on the runtime, so there was no way of writing the Modula-3 runtime in Modula-3. So the core of the kernel (the Modula-3 runtime) was probably written in 'C'. To me the runtime is like a micro-kernel on which the rest of the system is running, and that means you are not really writing the kernel in Modula-3.

Well, Jalapeño/JikesRVM

Well, Jalapeño/JikesRVM bootstraps everything in Java.


There is certainly programming language design to be done for languages able to go very low-level, but it is important to note that this is *not* a prerequisite for the kind of fusion of language and systems concerns mentioned in this sub-thread.

Just as most OSes are written in C, most language runtimes are also written in C. You can develop radical PL or OS ideas on top of a C codebase, so we should be able to mix them in radical ways in this setting as well. (In particular, the final result does *not* need to run/program its own runtime/kernel)

You can write your language

You can write your language runtime in whatever is convenient. I haven't touched C in years, but C# is great with the DLR and such. I guess that is written in C/C++ at some level, but as a "radical" language designer I never see it.

Ah, I forgot about Midori....

Runtime and OS

Perhaps I can put it like this: the language runtime can be the micro-kernel. However, most languages cannot write their own runtime/micro-kernel.

This reinforces the similarity mentioned above between operating systems and runtimes: developing either one requires certain language features.

I would argue that for exactly the same reasons micro-kernels are considered a good idea, language runtimes should be minimised. Ideally the operating system kernel should be the runtime.

So to unify OS and runtime, the design philosophies of the two must align. I can see how a language runtime might align with a micro-kernel philosophy, and get to a unified system. The language I am working on is one approach to this. I think it's the right way to go, but I am biased :-)

The other approach would be to persuade the kernel people that there is merit in including a garbage collector in the kernel. Adding extra kernel functionality is not compatible with the ideas behind micro-kernels, but might work with monolithic kernels like Linux. I guess the approach would be to write an in-kernel garbage collector, probably a pauseless one that runs in its own thread, and persuade the Linux kernel people to include it, along with some new binary format for 'managed' code.

I think it is viable to develop a new micro-kernel/runtime, maybe targeting embedded systems and the Internet of Things, precisely because it is very limited in features by design. Conversely, I think introducing a new 'big' kernel and operating system is prohibitively complex, so that route would require extending an existing OS kernel. Replacing Windows or Linux/Unix (including OS X as a Unix variant) does not seem a realistic approach.

Nothing wrong with running a

Nothing wrong with running a garbage collector as a driver/server under a microkernel. Just because parts of it have to run in kernel space doesn't mean it has to be in the kernel.

Writing the Kernel in Language X

I was more thinking that the GC has to be in the kernel, if the kernel is going to be written in the managed language. The minimal micro-kernel for a Haskell system would be the Haskell runtime plus core OS services. If you have the GC as a service outside the kernel, you are back to writing the kernel in 'C' again.

I guess my priorities are to get away from 'C' and to write more reliable kernels in better languages. I am uncomfortable that all this wonderful pure, secure Haskell, relies on the integrity of the 'C' runtime.

I was more thinking that the

I was more thinking that the GC has to be in the kernel, if the kernel is going to be written in the managed language.

This wasn't the case with the JikesRVM approach. And you can always go meta, which allows all rules to be broken (JikesRVM does some of that also).

If you have the GC as a

If you have the GC as a service outside the kernel, you are back to writing the kernel in 'C' again.

That really doesn't follow.

You have a choice - either, pre-GC, you use a non-garbage-collected allocator, or you use an allocator that will be picked up by the garbage collector accepting that, before your GC service is started, no memory will be garbage collected.
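The second option can be sketched with a toy heap, under obviously simplifying assumptions (a Python dict stands in for memory; `Heap`, `alloc`, `collect` and `gc_running` are made-up names): boot code bump-allocates and never frees, but every cell is recorded in a form that a mark-and-sweep collector can trace once it has started.

```python
from typing import Dict, Iterable, List, Set


class Heap:
    """Toy heap: bump-allocated cells that a later-started GC can still trace."""

    def __init__(self) -> None:
        self.cells: Dict[int, List[int]] = {}  # address -> addresses it points to
        self.next_addr = 0
        self.gc_running = False  # flipped once the GC service has started

    def alloc(self, refs: Iterable[int] = ()) -> int:
        # Bump allocation: never reuses an address and never frees anything.
        addr = self.next_addr
        self.next_addr += 1
        self.cells[addr] = list(refs)
        return addr

    def collect(self, roots: Iterable[int]) -> int:
        """Mark from roots, sweep the rest; returns how many cells were freed."""
        if not self.gc_running:
            return 0  # before the GC service starts, nothing is reclaimed
        marked: Set[int] = set()
        stack = list(roots)
        while stack:
            addr = stack.pop()
            if addr not in marked:
                marked.add(addr)
                stack.extend(self.cells[addr])
        dead = [addr for addr in self.cells if addr not in marked]
        for addr in dead:
            del self.cells[addr]
        return len(dead)


h = Heap()
boot = h.alloc()
scratch = h.alloc()             # garbage, but allocated before the GC exists
kernel = h.alloc(refs=[boot])
print(h.collect([kernel]))      # 0
h.gc_running = True
print(h.collect([kernel]))      # 1
```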

You can use any language of your choice, including your managed language (or a sub / superset of it).

Non GC allocator

So what language can I use to write a kernel and runtime to run on the bare-metal of a microcontroller? You seem to be suggesting I can find something off the shelf to do this, or am I misunderstanding?

Well, there is obviously

Well, there is obviously MicroJava, but this is all really old news.

A language that targets bare

A language that targets bare metal, obviously. There aren't *that* many non-C ones out there, but Forth, eLua, and even Andrew Sorensen's xtlang (which compiles via LLVM IR) are some that spring to mind straight away (it's early, and I have to dash off to work).

Mostly, though, you'd have to write your own. But you were talking about writing your own managed language anyway, no?

Roll your own.

That's right, but it's good to check that I haven't missed anything obvious: if something already does what I want, I can just use it and move on to the next project. My list of things I would like to do is already long enough that I don't think I can finish it all. Also, if it doesn't do everything I want, it might provide some good ideas to include.

I assume Erlang works well in the embedded environment (routers being one of the first uses), but is there still a 'C' runtime? I appreciate that a small amount of 'C' is easier to get correct than a large amount.


The Rust guys producing one toy OS after another
Scheme (Gambit or some other variant from )

External resources

Maybe it is so obvious that nobody has mentioned it, or perhaps it is one of those obvious things that people overlook: hot-swapping means that you can retain access to external resources. Suppose you are running a server that should last forever, but you want to upgrade its code during its lifetime. In this case the resources external to the process are probably file handles, and in many environments the process will not have sufficient privileges to create those handles itself. Servers launched by daemons tend to be short-lived, but the daemon controls the server's privilege level by handing over file handles for the server's lifetime.

In general, if upgrades are handled by serialising "all the state" and then running a new version that starts by deserialising everything, then external handles have to be closed and reopened. When those handles (and their state of openness) model something external to the process, that close/open cycle may have undesirable semantics. Hot-swapping code solves this problem. In the original telecoms target domain it meant that the exchange did not have to drop calls during an upgrade, whereas the stop-and-restart approach would have killed all of the traffic in progress.
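To make the contrast concrete, here is a minimal Python sketch in which `socket.socketpair()` stands in for a connection the process could not reopen by itself, and `Server` is a hypothetical class: the handler code is replaced while the same connected socket keeps serving, something a serialise-and-restart upgrade could only do by closing and reopening it.

```python
import socket
from typing import Callable

Handler = Callable[[bytes], bytes]


class Server:
    """Hypothetical server whose connection outlives any version of its code."""

    def __init__(self, conn: socket.socket, handler: Handler) -> None:
        self.conn = conn        # external resource: opened once, never reopened
        self.handler = handler

    def swap(self, handler: Handler) -> None:
        self.handler = handler  # new code; self.conn is left untouched

    def serve_one(self) -> None:
        data = self.conn.recv(1024)
        self.conn.sendall(self.handler(data))


server_end, client_end = socket.socketpair()
srv = Server(server_end, lambda b: b.upper())
fd_before = srv.conn.fileno()

client_end.sendall(b"ping")
srv.serve_one()
print(client_end.recv(1024))    # b'PING'

srv.swap(lambda b: b[::-1])     # hot-swap the handler mid-connection
client_end.sendall(b"ping")
srv.serve_one()
print(client_end.recv(1024))    # b'gnip'
print(srv.conn.fileno() == fd_before)  # True
```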

alternate realities

For example, Erlang probably has a process cluster per telephone call, and those processes probably die off when the call ends. You could upgrade their code individually, only as they die off; that way there's little or no state to worry about. But of course the top-level process might be harder to hot-swap. Instead, design the system to allow another top-level process to run at the same time and take on all new calls, while the old one dies once all its child conversations die off. Etc. I don't know whether Erlangers tried any of the other options, so I don't know that anybody can claim which is the best way to go, and under what circumstances.
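The drain-and-replace alternative for the top-level process can be sketched as a toy model in Python. All names here (`Switch`, `Dispatcher`, `new_call`, `upgrade`, `hang_up`) are made up, and a real Erlang system would use processes and messages rather than shared objects: new calls always go to the newest dispatcher, and an old dispatcher is dropped once its last call hangs up.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Dispatcher:
    """Hypothetical top-level process that owns the calls it accepted."""
    version: str
    calls: Set[int] = field(default_factory=set)
    accepting: bool = True

    def finished(self) -> bool:
        return not self.accepting and not self.calls


class Switch:
    def __init__(self, version: str) -> None:
        self.active = Dispatcher(version)
        self.draining: List[Dispatcher] = []
        self._next_id = 0

    def new_call(self) -> Tuple[Dispatcher, int]:
        # New calls always land on the newest dispatcher.
        self._next_id += 1
        self.active.calls.add(self._next_id)
        return self.active, self._next_id

    def upgrade(self, version: str) -> None:
        # The old dispatcher stops accepting but keeps serving its calls.
        self.active.accepting = False
        self.draining.append(self.active)
        self.active = Dispatcher(version)

    def hang_up(self, dispatcher: Dispatcher, call_id: int) -> None:
        dispatcher.calls.discard(call_id)
        # An old dispatcher goes away once its last call has ended.
        self.draining = [d for d in self.draining if not d.finished()]


sw = Switch("v1")
d1, c1 = sw.new_call()
sw.upgrade("v2")
d2, c2 = sw.new_call()
print(d1.version, d2.version, len(sw.draining))  # v1 v2 1
sw.hang_up(d1, c1)
print(len(sw.draining))                          # 0
```

No call is ever moved between versions here, which is what keeps the per-call state problem away; the cost is that two versions of the top level run side by side for as long as the longest old call lasts.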