Asynchronous calls and error handling

I would like to hear your experience and opinion about error handling as related to asynchronous messaging.

If I read correctly, Erlang uses a separate thread for error handling, which supports custom handlers. I find it difficult to see how such a scheme can be used for error recovery in concurrent programs. Does the handler match the sender of the message that caused an error? How does that sender put things together in a coherent state after its handler is called?

I am thinking of a completely different approach for dodo which uses the traditional exception handling mechanism.

My idea is to delay error reporting until the caller stops and waits for a reply from the callee. The main problem with that is that the caller may never wait for a reply, and the error needs to be cleared before another message is sent (new session). Effectively the context has changed and the previous error is no longer relevant.

To address that, reuse of the same channel for a new session should be discouraged. That looks like a problem that does not need to exist, so I am hoping you can help me here.

Discussions I think may be relevant on LtU:
Error handling strategies
Erlang concurrency: why asynchronious messages?


Garbage collector is your friend

My idea is to delay error reporting until the caller stops and waits for a reply from the callee.

Sounds a lot like using futures for asynchrony.

The main problem with that is that the caller may never wait for a reply, and the error needs to be cleared before another message is sent (new session). Effectively the context has changed and the previous error is no longer relevant.

I do not see any problem, as the usual approach is to create a fresh future for each asynchronous invocation, and when the caller (or some other thread for that matter) forces the future, it will get either a result, or an error (or will wait forever). And, of course, unreachable futures can be garbage collected - paired with allocation of fresh futures for every invocation this pretty much frees the programmer from many headaches.
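For concreteness, here is a rough sketch of that discipline using Python's standard library (Python rather than dodo; the `read_file` function is made up for illustration):

```python
# Sketch of "a fresh future per invocation": forcing a future yields
# either the result or the callee's error.
from concurrent.futures import ThreadPoolExecutor

def read_file(name):
    if name != "foo.txt":
        raise FileNotFoundError("File not found " + name)
    return "contents of foo.txt"

pool = ThreadPoolExecutor()
ok = pool.submit(read_file, "foo.txt")    # fresh future for this call
bad = pool.submit(read_file, "bar.txt")   # fresh future holding an error

print(ok.result())                        # forcing yields the value
try:
    bad.result()                          # forcing re-raises the error
except FileNotFoundError as e:
    print("caller sees:", e)
pool.shutdown()
```

Unreferenced futures are reclaimed by the garbage collector along with the errors stored in them, which is exactly the "headache-free" property described above.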

Reporting error in next forced future

Thank you for your help. Futures are all well and good (I definitely plan to use them), but what if the caller does not keep one?

What I am considering is something like a protocol exchange, for example:

  def foo files.File("foo.txt", fRead)  #file foo.txt, allow read
  files!foo.Open             #open file message
  files!foo.Read(10)         #read 10 bytes message
  files!foo.Close            #close file message

In my idea, if there is an error in the Open operation, that could be reported when reading the result of Read. Since it is all part of the same protocol it stays quite manageable.

But what if I don't store the result in a future as shown above? I could reuse the callee foo for a new protocol exchange, and get very confused when I see an error that was raised in the first session. Hence the need to clear errors between two sessions.
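To make the confusion concrete, here is a toy Python sketch (the `Channel` class and its whole API are hypothetical, not dodo) of holding an error back until the next receive, and why it must be cleared between sessions:

```python
# Toy sketch of delayed error reporting: an error raised by a sent
# operation is held back and raised only on the next receive.
class Channel:
    def __init__(self):
        self.pending_error = None
        self.replies = []

    def send(self, op):
        try:
            self.replies.append(op())   # a real system would run this async
        except Exception as e:
            if self.pending_error is None:
                self.pending_error = e  # hold the error for later

    def receive(self):
        if self.pending_error is not None:
            e, self.pending_error = self.pending_error, None
            raise e                     # the delayed report happens here
        return self.replies.pop(0)

    def new_session(self):
        # without this reset, a reused channel reports stale errors
        self.pending_error = None
        self.replies = []

chan = Channel()
chan.send(lambda: 1 / 0)                # say, Open fails
chan.new_session()                      # start a fresh protocol exchange
chan.send(lambda: "10 bytes")           # Read succeeds
print(chan.receive())                   # no stale ZeroDivisionError
```

Without the `new_session` reset, the second exchange would surface the first session's error, which is the confusion described above.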

Maybe delaying the error reporting that way is just a Bad Idea.

Pipelining of promises?

My (recently acquired) instinct would be to send the Read operation to the result of Open, thus getting the error before requesting the read, not after it.
Your approach reminds me of the pipelining of promises as done in E (though I've read about it a long time ago and may be confused) - e.g., Concurrency Among Strangers (look for pipelining).

[on edit: oops, actually pipelining is closer to what I described - sending the next message to a promise from the previous send. anyway, it may be useful to look at how E approaches error handling]

Re: Pipelining of promises

Yes, I like that approach. Will look into it.

Pipelining of promises in E

I can see a serious limitation in that:

Because the promise starts out as an eventual reference, messages can be eventually-sent to it even before it is resolved. Messages sent to the promise cannot be delivered until the promise is resolved, so they are buffered in FIFO order within the promise. Once the promise is resolved, these messages are forwarded, in order, to its resolution.

This means that a message is sent only after the operation in response to the previous message has completed. That is not the way many protocols work (e.g. HTTP); it may be desirable to send a message whether or not the previous operation failed.
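For illustration, the buffering behaviour quoted above can be sketched in a few lines of Python (the `Promise` class and the `FileStub` target are invented, not E's actual machinery):

```python
# Sends to an unresolved promise queue up in FIFO order and are
# forwarded, in order, once the promise resolves.
class Promise:
    def __init__(self):
        self.resolution = None
        self.buffer = []

    def send(self, method, *args):
        if self.resolution is None:
            self.buffer.append((method, args))      # buffered, FIFO
        else:
            getattr(self.resolution, method)(*args)

    def resolve(self, target):
        self.resolution = target
        for method, args in self.buffer:            # forwarded in order
            getattr(target, method)(*args)
        self.buffer = []

class FileStub:
    def __init__(self):
        self.log = []
    def Open(self):
        self.log.append("Open")
    def Read(self, n):
        self.log.append("Read %d" % n)

p = Promise()
p.send("Open")       # not resolved yet: buffered
p.send("Read", 10)   # buffered behind Open
f = FileStub()
p.resolve(f)
print(f.log)         # ['Open', 'Read 10']
```

Note the sends return immediately; only delivery waits on resolution, which is the point being debated here.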

I think you're confusing the

I think you're confusing the reference you're sending to with the promise for the result. If you wish to send two messages in sequence to the same target, this works fine, with no delay in between:

  a <- foo()
  a <- bar()

The situation you seem to be thinking of is

  (a <- foo()) <- bar()

(Incidentally I kept hoping I could enter the < character without resorting to &lt;. That was a pain.)

Error chaining

I still hope to receive comments from people who use different asynchronous mechanisms.

My current thought on the subject is that dodo should use error chaining: if there is an error and it is not reported, it is just piled up with previously unreported errors. If a new error occurs which is reported, then the error comes with a tail of unreported errors.

That way, even if previous errors are not very relevant to the current context, at least they do not prevent the programmer from getting a relevant result (the new error).

In the file protocol example above, if an error occurs in Open then the file may not be ready for Read. A new error will be reported, along with the old error in Open. Something like:

Read: File foo.txt is not open
+ Open: File not found foo.txt
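Incidentally, Python's built-in exception chaining behaves much like this proposal: the unreported earlier error travels as the cause of the error that finally gets reported. A sketch (the file protocol functions are invented for illustration):

```python
# The new, relevant error is raised with the old one attached as its tail.
def open_file(name):
    raise FileNotFoundError("File not found " + name)

def read_file(name):
    try:
        open_file(name)
    except FileNotFoundError as earlier:
        # report a new error with the unreported one chained behind it
        raise RuntimeError("File %s is not open" % name) from earlier

try:
    read_file("foo.txt")
except RuntimeError as e:
    print("Read:", e)              # Read: File foo.txt is not open
    print("+ Open:", e.__cause__)  # + Open: File not found foo.txt
```

The handler gets a result relevant to the current context, and can still walk the chain for older errors.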

more details might draw more suggestions

Denis Bredelet-jido: I would like to hear your experience and opinion about error handling as related to asynchronous messaging.

I read your post a couple of times; I'm still trying to identify specific concerns. Use cases (where you describe a problematic situation you see) might clarify it for me. It's hard to suggest solutions without knowing a target problem. I'm a little like the accountant who asks what you want the answer to be. Details can be worked out if it's feasible.

From what perspective does observation of errors matter? Is it from multiple places/perspectives? A caller has the right to get an error response, if an error becomes known (as opposed to a simple hang which times out). You mention reporting errors, so perhaps you're thinking of a global overview perspective as well, communicating out-of-band with some kind of monitor. Are there any other cases or perspectives that need to know about errors?

In a session, can you have multiple independent interleaved requests? Or are they all serialized? Is it okay to pipeline requests for performance, and only tell you about earlier errors later? (Say, if you're writing, is it okay to lie to you and say writes are working, until later you get an error when you try to commit?) All the details affect how you might do things. The finer you specify the granularity of constraints, the easier it is to tailor something specific.

It sounds like it might help a lot to know what kind of application is involved, to understand constraints on traffic, and whether transformations will break application logic. It might help to think of async problems as similar to quantum mechanics problems, where you can't resolve whether something is true or not until something forces an observation providing evidence.

A more detailed question

Thank you for helping me formulate my question better. I am asking about it in the context of building a programming language where asynchronous calls are a primary mechanism for concurrency. That is why I am seeking input from users of existing languages/environments that rely on asynchronous calls.

In linear or synchronous parallel programs, error handling is simple: a call returns either a value or an error, and the caller is expected to handle both or bail on error.

However, I realised that this model does not translate well to an asynchronous context. Since the error may occur at a time when the original context of the call is gone, it is difficult to recover from it and take appropriate action.

From what perspective does observation of errors matter?

I would like to cover the main usual cases:
- An error is returned instead of a value
- An error occurs and corrective action needs to take place (no return necessary)
- An error occurs and simply needs to be logged, the program continues
- The error needs to be reported to the user of the application, who can choose the next action

In a session, can you have multiple independent interleaved requests?

I want to use a client-server model, where requests are placed concurrently and processed when the resources are available. The server can decide to serialise, process in parallel or delegate, that would not be fixed by the language. Placing a request is allowed to block the caller if necessary.

Say, if you're writing, is it okay to lie to you and say writes are working, until later you get an error when you try to commit?

What are the advantages and disadvantages of this approach? Do you have experience with it?

async in language runtime

(I ended up writing more than I planned. I took the day off to spend with my kids, and now I'm spending time on this; I probably shouldn't spend a lot of discussion time on async support in languages, since that's fairly involved and I have my own version to code myself. I just don't want to leave you hanging.)

To tackle parts separately, I split this quoted sentence in two:

Denis Bredelet-jido: I am asking about it in the context of building a programming language...

That's a hard case. Async support in a bottom (or middleware) platform layer is tough because you must avoid constraining what apps do on top of your layer. You need to define a thread or process execution model, or else your async call mechanism won't have a context for defining what you mean.

Your language needs a concept of independent flows of execution control, so async messages between flows can be observed to not always be in sync; otherwise you won't have async messaging. (Unless you only want it observable from outside.)

Denis Bredelet-jido: ...where asynchronous calls are a primary mechanism for concurrency.

It sounds like you want async messages to be more primary than a process or thread model, which is normally what defines concurrent activity. I'm not sure you can architect an app on top of a language without letting a programmer decide what activities go in what process, which then message one another (either directly or through some intermediary like a message box or a queue, or sockets, etc).

To add async support, all you really need is a form of messaging that doesn't block on send, so a sender can keep going while a receiver does something in parallel (potentially, if it's in another process or thread). The complex part is getting a response later, by polling or by blocking, where blocking is going to be more performant under scaling.

You can define a messaging system as queues without requiring threads or processes. The runtime could timeslice available cycles, and this is often done these days with event oriented systems. But if you don't want one main thread to block when using a blocking system call, you'd need some model that allows some threads or processes to keep going, to use resources productively in the meantime.
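A minimal sketch of that queue-based, threadless model in Python (all names are invented, not any particular runtime's API):

```python
# "Messaging as queues without threads": send never blocks, and a
# single loop timeslices delivery of pending messages.
from collections import deque

inboxes = {"worker": deque(), "logger": deque()}
log = []

def send(target, msg):
    inboxes[target].append(msg)           # enqueue and return at once

handlers = {
    "worker": lambda m: send("logger", "worker got " + m),
    "logger": lambda m: log.append(m),
}

def run_loop():
    # deliver messages until every inbox drains
    progressed = True
    while progressed:
        progressed = False
        for name, inbox in inboxes.items():
            while inbox:
                handlers[name](inbox.popleft())
                progressed = True

send("worker", "ping")                    # the sender keeps going
run_loop()
print(log)                                # ['worker got ping']
```

The caveat in the paragraph above applies directly: if any handler makes a blocking system call, this single loop stalls, which is what real threads or processes avoid.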

I think my thread of thought just degenerated into totally generic OS speak, so I'm not helping unless I say something you see relates to async. Before I forget: try to draw as many diagrams as you can of what you're thinking about comms, because it helps show possible inconsistencies in your ideas. Use the classic style of diagrams in client/server protocol interactions where messages go laterally and time goes vertically. (Text mediums like this one aren't good for drawing pictures.)

The main rule in async is that messages don't arrive before they are sent. So receiving a message is evidence it was sent. You won't have a lot of other evidence besides received messages. So infer facts about messages sent only in terms of what you receive. You might get crazy-sounding app models if you hope apps are able to know more than what can be inferred from received messages. Throw out use cases that try to keep globally synchronous knowledge of what's happening.

Denis Bredelet-jido: An error is returned instead of a value

At the runtime level, an app-layer error is just another value in a message that cannot be distinguished from a non-error message. It's up to the app layer to divide values into error and non-error types. How the runtime responds to runtime errors is separate from how errors get passed in app messages.

If an error happens in the runtime, that's something else entirely. Logging and user notification probably get done in terms of app level messages, but some of the senders can be runtime level entities. (Notification that a runtime error was handled would appear to come from something not defined by an app.)

Denis Bredelet-jido: What are the advantages and disadvantages of this approach? Do you have experience with it?

Yes, I have experience I can't be too specific about.

Write-ahead optimization can be used in protocols (a lot of them, it seems) that wait unnecessarily on send, instead of writing more and getting an error later. Saying 'okay' immediately upon receipt and then actually writing concurrently with more client writes will decrease the overall elapsed latency to get multiple writes done. But it uses more resources at one time, which might compete with lots of other sessions being serviced trying to use the same resources.

It's an optimistic strategy assuming things normally go well, with a low frequency of errors seen. You'd waste resources if errors were frequent and you committed resources too eagerly, too early. You can delay error reporting until a client needs a sign that everything is fine, say on close or commit.

Good luck with your async support in a programming language. If you try to separate what a runtime must know and what apps must know, you might write yourself less confusing requirements. [cf]

Application-level error handling

That's a hard case. Async support in a bottom (or middleware) platform layer is tough because you must avoid constraining what apps do on top of your layer.

In fact, I don't mind that much if I limit what the programmer can do. It is more of a research language, and my philosophy is to offer opportunities for parallel execution while sticking as much as possible to known patterns.
While the thread model is widespread, I don't get the feeling it is that well understood. That is why I am looking at the client-server model as a subset of the former, keeping in mind that it has existed at least as long as there have been "clients" and "servers".

Now the issue I have is with error handling, and if I need to choose, it would be application-level error handling (not runtime).

asynchronous fail hard / fail over

erlang encourages asynchronous error handling. trying to synchronize all error handling in a distributed system leads to serious trouble, as it imposes a strong coupling between the components.

erlang avoids this by simplifying error handling to the most basic operations: fail hard or fail over.

in this design, most classes of errors result in the offending process being killed. interested parties can sign up for error propagation by manually linking to processes. a process failing results in the propagation of the nature and source of the error over all defined links to related processes.

this basically allows reacting to the problem in three ways:
- retry: respawn the offending process
- fail over: respawn a replacement process
- fail: don't retry, die self
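A toy Python rendition of those three reactions (this only illustrates the policy choices, not Erlang's actual link mechanics; all names are made up):

```python
# A supervisor observes a "process" (here just a function) dying and
# reacts: retry, fail over to a replacement, or fail itself.
def supervise(task, policy, replacement=None, max_retries=3):
    for _ in range(max_retries):
        try:
            return task()
        except Exception:
            if policy == "retry":
                continue                               # respawn the same process
            if policy == "failover" and replacement is not None:
                task, replacement = replacement, None  # respawn a replacement
                continue
            raise                                      # fail: die ourselves
    raise RuntimeError("gave up after %d attempts" % max_retries)

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient failure")
    return "done"

print(supervise(flaky, "retry"))  # done (succeeds on the third attempt)
```

The point is that the reaction lives in the observer, not in the failed process, matching the link-based propagation described above.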

this scheme, by the way, is orthogonal to a synchronized single-process error handling scheme (e.g. exception handling). erlang supports both ways of error handling - but it does not try to impose both designs onto a single solution.

EDIT:

i would like to note an important aspect of error handling in distributed/asynchronous systems, which is fundamentally different from synchronous systems:

the chain of responsibility in a synchronous system is always explicit: the caller is responsible for the callee - exception handling mirrors this fact by simply propagating errors up the call stack.

this responsibility is never implicit in asynchronous systems. error distribution is bound to be part of the system design, and therefore has to be manually defined in some way.

Responsibility

Does Erlang support rendezvous, in which one process waits for one or more processes to complete? In that case, isn't the initiator responsible for the other processes that are part of the rendezvous?

When two parties initiate a session or protocol exchange, is there not a temporary coupling between them? What do you mean by "strong coupling" above?

Does Erlang support

Does Erlang support rendezvous, in which one process waits for one or more processes to complete? In that case, isn't the initiator responsible for the other processes that are part of the rendezvous?

erlang supports any process patterns based upon the previously described link primitive and message dispatch between processes.

it also allows capturing such designs in behavioural modules, enabling their reuse. see OTP Design Principles for examples shipped with the language.

rendezvous is just one simple form of distributed communication. erlang avoids mapping specific communication strategies into the language - rather, it provides for the implementation of any strategy by means of the language primitives.

When two parties initiate a session or protocol exchange, is there not a temporary coupling between them? What do you mean by "strong coupling" above?

i mean strong coupling as a negative quality in a distributed system. in a 1<->1 two-way session, this temporary coupling is valid, but this is only one trivial form of multi-party communication.

the erlang communication primitive, which is message dispatch, does not enforce a temporary coupling between the peers because in many scenarios this coupling is neither required nor desirable.

think of a broadcast session as a simple 1-way 1->N protocol example where this coupling is no longer valid.

Re: erlang supports any process patterns

Thank you for the link. Good to see that "server" is one of the shipping behaviours in Erlang/OTP. I did not get to the error handling part yet, but so far it makes sense. I am not sure why channels are called this?

i mean strong coupling as a negative quality in a distributed system. in a 1<->1 two-way session, this temporary coupling is valid, but this is only one trivial form of multi-party communication.
the erlang communication primitive, which is message dispatch, does not enforce a temporary coupling between the peers because in many scenarios this coupling is neither required nor desirable.
think of a broadcast session as a simple 1-way 1->N protocol example where this coupling is no longer valid.

I see. However from an implementation viewpoint, broadcast requires either the sender (or a separate broadcaster) to know all the recipients, or the recipients to poll a shared resource (bus?) which holds the message. I am not sure it is more than a specialised application of 1-to-1 communication or message polling.

How closely are continuations related to asynchrony?

the chain of responsibility in a synchronous systems is always explicit: the caller is responsible for the callee - exception handling mirrors this fact by simply propagating errors up in the call stack.

Does this imply that any continuation-passing scheme is automatically asynchronous?
Is CPS just a recipe for emulating synchronous calls using asynchronous?

cps and async

Andris Birkmanis: Does this imply that any continuation-passing scheme is automatically asynchronous?

What I like about CPS is how well a continuation is structured to handle async callbacks, so the flow of control can look almost synchronous to a programmer. This way, even though the process has not blocked on an async call, the continuation state to resume the other half of call processing has been preserved exactly the way it was, just as if the process had blocked at that point.
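A tiny Python sketch of that idea (all names invented): the second half of the caller is packaged as a continuation, so an async completion resumes it exactly where a blocking call would have returned.

```python
# CPS over async completions: no blocking, yet the caller's "other
# half" is preserved and resumed with the result.
pending = []   # stands in for the runtime's completion queue
results = []

def async_read(n, k):
    # instead of blocking, register the continuation for later
    pending.append(lambda: k("x" * n))

def caller():
    # first half: issue the request, then return without blocking
    async_read(3, after_read)

def after_read(data):
    # second half: runs when the completion fires, context intact
    results.append("got " + data)

caller()
for resume in pending:   # the event loop delivers completions
    resume()
print(results)           # ['got xxx']
```

Hand-writing `after_read` for every call is exactly the tedium that proper continuation support removes.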

Lack of decent continuations in C++-based networking apps inspires much code falling under Greenspun's tenth rule of programming:

"Any sufficiently complicated C or Fortran program contains an ad-hoc, informally-specified bug-ridden slow implementation of half of Common Lisp."

This is why I think sufficiently complex C++ systems should have a real high-level language embedded inside them, specifically to handle such things in a less ad hoc manner.

in the context of the

in the context of the current discussion and my experience with distributed systems, i would like to add the first rule of distributed programming:

Any sufficiently complicated distributed C, C++ or Java program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Erlang.

but that's just my two cents... :-)

ratify

(Rubbing my eye sheepishly.) Yes, that too, and combining the two rules consumes most of the degrees of freedom left in distributed systems. :-) I wish I could talk folks using C++ into using Erlang instead. It's easier to talk them into embedding something else in their systems, something less scary than the kludge embedded there at the moment. The kludge worries them, especially when I tell them it's hard to debug what it's actually doing, and that visibility is very important for getting evidence to test hypotheses.

Does this imply that any

Does this imply that any continuation-passing scheme is automatically asynchronous?

i don't think so. CPS allows you to capture world and process state - that's it. whether this leads to an asynchronous programming model depends heavily on the concrete application.

for example, CPS is often used to implement cooperative multitasking. this obviously leads to an asynchronous programming model. i wouldn't say that the same is true for a CPS-based parser implementation (though error passing might also be non-trivial in that case, but that is a different story).

Is CPS just a recipe for emulating synchronous calls using asynchronous?

wouldn't the reverse rather be true? still, i don't think CPS implies either of these...

Communication and synchronization mechanisms

I think that a key question is how the asynchronous entities in your language are going to synchronize and communicate (in general). Just spawning asynchronous calls isn't enough for most purposes.

I recently wrote a couple of programs using an asynchronous programming library for SML based on a design by Stephen Weeks.

Using the approach, asynchronous entities communicate through events. An entity typically communicates failure by triggering an event with information on the failure. For example, a function that spawns a failable process might have a spec of the form:

val failableProcess : param_type -> (Exn.t, result_type) Sum.t Event.t

IOW, the produced event is either an exception, indicating an error, or the result. The caller then registers interest to handle the event. For example:

when (failableProcess argument)
     (fn INL exn    => (* handle error *)
       | INR result => (* process result *))

Separation of context

I agree that error event handling is a useful feature. However that separates the error handling from the asynchronous call(s), which could be an issue in many cases. A solution may be to use continuations to maintain state between the call and the handler.

Event handling in Felix

Here's what Felix does. First, consider a primitive

  read(sock, n, buffer, &err);

This call blocks, as usual. But now consider:

  spawn_fthread { read(sock, n, buffer, &err); };

This creates a fibre which blocks; however, the main pre-emptive thread does not block, nor do any other fibres in that pthread. We also can't tell when the read is finished, so we'll fix that now:

  var chan = mk_schannel[int];
  spawn_fthread { 
    var err:int;
    read(sock, n, buffer, &err); 
    write (chan,err);
  };
  // do stuff
  var &err:int <- read chan;

Now the main program blocks after doing some work, and waits for the previously spawned fibre to complete by reading from the channel chan .. it also happens to read the error code.

Behind the scenes, the read function dispatches a request to a single pthread which uses an OS notification service such as epoll on Linux to perform the request using non-blocking socket operations.

Now the answer to the original question should become apparent. Note that the solution here doesn't send the error code at the time the next request is issued .. that seems to me to be a very bad idea.

Instead, Felix provides fibres and synchronous channels to control-invert your program .. the generated C++ code is actually event driven. Using this model you can program as if you were building a circuit, with inputs for asynchronous messages.
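As a rough analogy (not Felix semantics; all names invented), Python generators can play the role of fibres, with a tiny scheduler doing the rendezvous over a synchronous channel. This sketch assumes the reader parks on the channel before the writer writes:

```python
# Generators as fibres: "read" parks a fibre on a channel, "write"
# resumes the parked reader with the value at the rendezvous.
def reader(chan, out):
    err = yield ("read", chan)   # park this fibre until a write arrives
    out.append(err)

def writer(chan):
    yield ("write", chan, 0)     # deliver error code 0, then finish

def run(fibres):
    waiting = {}                 # channel -> fibre parked reading on it
    ready = [(f, None) for f in fibres]
    while ready:
        fibre, value = ready.pop(0)
        try:
            op = fibre.send(value)
        except StopIteration:
            continue
        if op[0] == "read":
            waiting[op[1]] = fibre
        else:                    # ("write", chan, v): rendezvous
            _, chan, v = op
            ready.append((waiting.pop(chan), v))  # resume reader with v
            ready.append((fibre, None))

out = []
chan = object()                  # any token can identify a channel
run([reader(chan, out), writer(chan)])
print(out)                       # [0]
```

The control inversion is the point: each fibre is written as straight-line code, while the scheduler is the event loop underneath.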

In other words

In other words, you essentially code up a very poor man's futures by hand.

[PS: Please close the CODE tag in your posting.]

Yes, what you see is very

Yes, what you see is very low level: spawning fibres, creating named synchronous channels, and reading and writing on them are more or less direct mappings of the system primitives.

The syntax is not very nice. For example, fibres cannot deadlock, because when they do the collector reaps them. Nice. However, whilst a channel is in scope of an active fibre, an otherwise deadlocked fibre blocked on it could still be unblocked, so that fibre is reachable.

So if you wish to ensure two cooperating fibres die after their work is done, you need to write something like:


  spawn_fthread {
    var chan = mk_ioschannel[int]();
    spawn_fthread { ...  read chan ... };
    spawn_fthread { ... write chan .. };
  };

so that the name of the channel isn't visible to the mainline. It would be nice to use something like:


  let chan = mkchannel[int] in spawn_fthread { .. }

to fix that (for example), but Felix prohibits side-effects in expressions: spawn_fthread is a higher-order library procedure and, unlike in Ocaml and ML, the type system prevents this abuse. It is also necessary to do so, since procedures yield to the scheduler, which can't be done when the machine stack is non-empty. Functional code uses the machine stack for continuation passing, for compatibility with the C object model (that is, it uses conventional subroutine calls for functional code, even if the variable frame is heap-allocated).

[BTW: I can't see an unmatched CODE tag in my previous post]

unclosed code in pre

If you view html source of this page, right now you can see your post contains:

<pre ><code >
  var chan = mk_schannel[int];
  spawn_fthread { 
    var err:int;
    read(sock, n, buffer, &err); 
    write (chan,err);
  };
  // do stuff
  var &err:int </pre><p >

If you edit that post, you can just remove the code tag start since the pre tag will also give the fragment a suitable monospace font.

The problem wasn't the code

The problem wasn't the code tag. I had to use the AMP LT SEMI sequence to stop it mistreating a less-than sign. Thanks. Isn't there a better way to quote code?