Termite: a Lisp for Distributed Computing

(via Patrick)

In short: take Scheme, remove mutations, add isolated processes with mailboxes, add message sending and receiving operations and an addressing mechanism.

Termite is a Lisp for distributed computing (PDF paper and PDF presentation), providing an Erlang like model on top of Scheme.

As the presentation says, the powerful abstraction facilities provided by Scheme made impelementing Termite rather easy and the implementation doesn't require much code.

[Edit: here's a working link for the PDF paper]

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Here's to hoping...

..that the link to the paper will work.

Works for me...

The link is working for me. Great paper!

Poor me

I can still only access the slides. The paper (94kb file) isn't valid PDF (both ghostview and acrobat reject it). Can someone mirror the paper, or email it to me? Thanks.

xpdf

xpdf worked for me but acrobat didn't.

I fixed the pdf for you

Check your email.

Thanks!

Thanks!

questions about migration

Anyone know some papers explaining how to implement process migration? I don't know how you would best implement things like updating the world's address books so everyone knows where to send messages to a migrated process. Or how to handle the getting messages that got sent to the wrong place while the migration was in progress to be received in the correct order with ones that arrive after the migration, but potentially earlier in time due to being closer.

Migration implementation

Hi James.

Currently migration is implemented in the following way: a process, running on a node A, that wants to migrate to a node B calls the procedure (migrate node-B). This sends a message containing the continuation to the migrate call to a "special" process running on node B. That special process is responsible for dealing with immigration requests. On node B, the continuation received is invoked in a new process that has the same pid as the original process. The original process isn't dead yet: during the migration it continues to receive messages. Once the acknowledgement that migration is complete is received on node A, the local reference to the process is updated to reflect the new location. Then the original process relays all the messages left in its mailbox to the migrated process, then stops executing.

Now, if after migration a message is sent from a process on node C to the process that used to be on A, node A will forward that message to node B, and it will also send a message back to node C to ask it to adjust its reference to that process. Another possible solution would be for migrating processes to keep track of the nodes they've been executing on, and warn each of those nodes of the new process location after every migration.

Also, "correct order" of message reception is not guaranteed. Messages might not get to their destination in the same order that they were sent. This would be difficult to do meaningfully in a distributed system. Note that you can use selective message reception to easily deal with the possibility of getting message out of order.

Hope that answers your questions.

Order guarantees

Also, "correct order" of message reception is not guaranteed. Messages might not get to their destination in the same order that they were sent. This would be difficult to do meaningfully in a distributed system.

Yes and no. It would certainly be hard to guarantee that, if A sends message X to C before B sends message Y to C, C will receive X before it receives Y. However, it ought to be feasible to guarantee that, if A sends message X to C before A sends Y to C, then C receives X before Y.

There would be two ways to do this. One would be sequence numbers: if C receives Y out of sequence, the runtime buffers it up and doesn't present it to the application until it receives X. The other would be to change the migration model so that, rather than forwarding the message and then sending a redirect, the original node just sends the redirect, and the sender is required to retransmit the message. I'd probably use sequence numbers, because the other solution has the drawback that every message must be buffered until confirmed, even if no migration ever happens. With sequence numbers, the only node that has to buffer is the node to which a process has migrated.

Very good

Erlang fans will recognise their favourite language in here and hopefully it will make the most important ideas understandable to a wider audience. Well done!

Notes

Just some notes from reading through:
  • Termite links are unidirectional but Erlang links are bidirectional. There seems to be consensus in Erlang-land that it would be better to have unidirectional links, though you still need to make sure that bidirectional link-pairs can be created "atomically" enough (maybe easy).
  • Erlang propagates exceptions between linked processes immediately. Why does Termite wait for a receive operation?
    Aside: An important fault-tollerance principle of Erlang is that any process can and will crash at any possible point of execution. This has the incredible benefit that you can do the moral equivalent of kill -9 with impunity to all but the most fundamental Erlang processes (just like on Unix).
  • Erlang doesn't do process migration as Termite does but the Erlang spawn primitive lets you say which node a process is to be created on.
  • Possibly related to the previous point: an interesting error case in Erlang is that when my node loses connectivity with a remote node X (TCP connection goes down) it will assume that X has crashed and send an exit signal (exception) to each of my local processes that was linked to a process on node X. It is possible that connectivity will later be reestablished and these same processes that have been observed to "crash" will reappear.

    I wonder how Termite treats node failures and whether it has implications for process migration?

  • Section 5.1: What is this dict-set! that looks so much like a destructive operation, contrary to section 3.2 saying set! is not available? Of course it can be done this way using a process (or ets table) to represent the dictionary but as an Erlang programmer I'd write the inner part of meter-supervisor in functional style:
    (let loop ((meters (make-dict)))
      (recv
       (('load-report from load)
        (loop (dict-set meters from load)))
       ((from tag 'minimun-load)
        (let ((min (find-min (dict->list meters))))
          (! from (list tag (pid-node (car min)))))
        (loop meters))))
    
  • Section 5.2: I like the way they kept the boilerplate of Erlang's gen_server. This would probably be easy to cut out with macros but in this way it's much easier to relate to Erlang code. (Joe Armstrong would probably like to see !? being used directly instead of these wrappers but he's a bit outside the Erlang mainstream :-)).
Once again, excellent paper.

Exceptions, migrations, mutations...

Hi Luke.

I'd say that the fact that exceptions are only signaled at the moment of a receive operation is a bit of an implementation issue. I couldn't figure exactly how to deal with asynchronous exceptions and figured that this should be good enough. Now I realize I'll have to re-examine that point. Note though that there's an existing kill -9 equivalent named terminate!.

Currently Termite doesn't do much for node failures: your messages will not get to the process on the failed node and that's about it. How this should interact with migration is a question we are considering currently. The thing is that there are a few different ways to implement migration that have various implications on failure semantics and efficiency. We are considering various solutions, but we're still in the reflexion phase...

The exemple 5.1 should indeed be as you suggest. It's an artefact of a different implementation where meters were writing their current average load directly in the dict. I agree that in this case it's in better style to use a functional data structure instead of a process.

Many thanks for your comments!

More on directionality of links?

I'm interested in hearing/learning more about that issue. I noticed in this presentation mention of bi-directionality perhaps being a mistake. I'm continuing to google around, but any poignant and direct references would be appreciated.

one-way links would be handy

One-way links would simplify a lot of everyday Erlang programs in small but significant ways. For example, suppose you write a webserver process that spawns one child process per client's socket. You would probably want termination of the webserver (parent) to also terminate each socket-handling child but you wouldn't want the crash of a single child to terminate the server. You could arrange this by linking one-way such that crashes propagate only from parent to child.

But since we don't have one-way links it's slightly convoluted. One common option is to use a two-way link and make the parent "trap exits" to avoid being crashed by the child (i.e. to receive the crash indication as a normal message instead of a signal). Another is that instead of using a link you have the child call erlang:monitor(Parent, process) to be informed by a normal message if the parent crashes. Both of these options have the disadvantages of a requiring extra (boring) code and of introducing latency since normal messages are only processed in a receive whereas exit signals cause an immediate interruption.

Thanks for the info

Any thoughts on how likely one-way links are to appear in Erlang any time soon? Or perhaps a standard library to abstract out the boring code as much as possible.

Ok

Look into "monitors." I believe they're what you want.

monitors

The difference between monitoring and linking is the way the crashes are reported. If you're monitoring a process that crashes then you receive a normal 'DOWN' message that you have to receive, whereas if you're linked to it then you receive an exit signal that will terminate you immediately.

Termite released

More more more

Seems to me that Canadian schemers stop hacking the moment they have something they can benchmark. :-)