Avoiding Actor Deadlocks--JActor

The problem is that actors do not always process messages in the order received, but may limit processing to selected messages based on their current state. This technique is often used when waiting for a response from another actor. Clearly this can lead to something like deadlock where several actors freeze up, even though no locks are used.

JActor takes a different tack. Messages are processed in the order received, but messages are mostly 2-way and are first class objects with each message only being used once. This means that while processing a message, it can be a place to save intermediate state. So if processing involves sending messages to other actors, there is no need to update the actor's state until message processing is complete.

Another difference is that when a message is returned as a response, it is dispatched differently from other messages. When a message is first sent, it serves as a request and is assigned a callback that will be used when the message is returned with a response value. But note that actors process messages one at a time as a means of thread safety, and this applies as well to the invocation of the callback when processing a response.

This is covered in a lot more detail in the JActor project README.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

erlang & out of order processing

yeah, Ulf W. had a talk where he really, really stressed the importance of out of order processing of messages when developing with Erlang.

Messages in transit are immutable because they must be

Messages in transit are immutable because they must be. Messages in transit as electromagnetic waves are not updated although they are checked for corruption in transit.

immutable has the wrong connotations

Messages in transit as electromagnetic waves are highly mutable. Messaging in that domain is more a matter of recreating the original message, with the probability of success depending on the level of the noise.

Digital messages like email may have content that is not modified, but augmented as it is passed from server to server.

The message content in JActor is not modified. Rather it is final. But the message as a whole does serve as a return envelope for the resulting value.

Messages are immutable, Request Actors are not immutable

Messages are immutable, Request Actors are not immutable.

Messages are received in the order that they are received :-)

Messages are received in the order that they are received :-) After a message has been received, it can be processed, not processed, or partially processed whenever. But there are some particularly bad ways to process messages out of order than can lead to deadlock, livelock, unprocessed messages, lost messages, timing races, slow response, etc.

message order preserved between pairs of actors

In many environments the order of messages exchanged between pairs of actors is preserved. This is achieved at virtually no cost within a single process on a single host. It can be achieve, though at greater cost, with UDP, over reconnecting TCP channels and even host failure.

Allowing application logic to determine processing order, in contrast, is extraordinarily expensive in terms of code analysis over the lifetime of the application, as it is not a local matter and does not lend itself to solutions like locking order.

An Actor can process messages in any order it chooses

In a fully distributed system, an Actor can process messages that it receives in any order it chooses as a purely local matter.

seems insufficient

¿It seems to me that while sure it is true an actor can do whatever it wants, that is ignoring the bigger question of e.g. does doing that all too easily lead to deadlock in the distributed system?

Preserving message order imposes significant overhead

In many-core architectures, always preserving message order in the messages that one Actor sends to another imposes significant overhead.

concurrent map vs concurrent queue

A concurrent set would have less contention than a concurrent queue. And I am much less concerned about preserving order than in minimizing the use of state-determined processing order.

An interesting insight.

find Ulf W.

I wish I could find the Ulf Wiger talk (qcon maybe) where he explained the train wrecks he has seen of people trying to not do out-of-order selective receives. Iirc the issue is that then the programmer is encoding multiple state machines on top of each other to try to remember "ok i was doing X and now i got message Y which is about something else, now i have to handle Y right now instead of finishing up X first and then getting back to Y, so now i have to handle the combinatorial explosion of all possible orderings..."

I think this is it:
Death by Accidental Complexity.

I'm not saying I understand all the words going on here, but my pattern matching brain seems to think this is relevant? :-)

The complexities

Wow! Death by Accidental Complexity is a super talk describing why you should not do pure event processing if you can avoid it. Selective message processing was introduced as one means of addressing the complexity, providing you could meet the timing requirements. I.E. No hard real-time.

I found the most interesting remark in the talk being that the problem was that you loose context when you went back to the event loop. And that is also key to the JActor proposal of letting requests send requests to other actors and manage their responses rather than having responses handled by the main event loop.

So we have now two approaches to handle complexity, though the first (selective message processing) introduces the potential for deadlocks that can not be discovered locally.

Mind, I am not saying that all applications can be handled by the new approach introduced by JActor. Indeed, JActor also provides isolation reactors which process each request to completion before handling another request, though it does allow for the immediate processing of 1-way messages.

Isolation reactors then work a lot like the usual kind of actor, but with two major differences:

1. Requests are distinguished from responses by the framework rather than the application code. (Responses are still handled by callbacks rather than being dispatched like requests.) And more importantly,

2. Isolation reactors can NOT exchange request/response with other isolation reactors, not even indirectly. (Requests sent by an isolation reactor are marked as such, as are requests sent as a result of receiving a request with such a mark.) This restriction on isolation reactor interaction then precludes the possibility of deadlock, though it also limits their utility. But it does mean that if you need to do something like single-threaded transaction processing, you have the option.

clueless take on it

so the way i read it, the problem is that 1 actor gets to be bad news when it is supposed to handle more than one kind of interaction, because the messaging for them can overlap. so it seems that if you could split the actor into more atomic ones that had the happy property of only supporting one use and thus never had to worry about overlapping messages, it would be a solution? which sounds sorta like what you are saying about the requests can manage their responses, rather than the 'main' (per actor anyway) event loop.

???

loosing context can be critical to your health

Almost there Raoul. :-)

Overlapping interactions do sometimes need to be addressed, particularly when dealing with errors. So an actor as a whole may need to be able to track all the ongoing activities when the domain is a particularly difficult one.

Conversely, when handling one kind of interaction it may be important to track the overall situation of all types of interaction. Hard problems often do not have easy solutions.

What I am really saying then in terms of the solution space is that putting some smarts in the request can help a great deal, as the request has access to both its own state and the actor's state. As well as providing a nice way to organize things. The flip side being that the loss of context greatly limits what you can do, so processing responses to requests sent to other actors within the context of the request that sent them can be a huge win.

And speaking of context, traditional actors suffer a loss of context when error handling is done by a supervisor rather than by the actor which sent the request that, directly or indirectly, resulted in the error/exception.

In JActor, uncaught runtime exceptions cause an actor to fail, with a reactor closed exception being sent in response to all incomplete requests and also with all outstanding requests sent by the actor being canceled. But other uncaught exceptions are passed up to the originator of the request. This later then works a lot like traditional OO. The expectation here is that the context of the request which sent the request that resulted in the exception can be helpful in determining how the exception should be handled. (And this is a recursive process, so any unhandled non-runtime exceptions will simply rise to the appropriate level for handling.)

Note that the exception handling for JActor does differ from the usual approach taken in OO for runtime exceptions. The thought here is that a runtime exception may well have corrupted the state of an actor and, since many other actors may be interacting with that actor, they all need to be informed that the actor is no longer functional.

Jefferson's Virtual Time is

Jefferson's Virtual Time is a whole system dedicated to processing messages out of order; it doesn't even bother to order messages until they conflict.

Global virtual time does not apply to fully distributed systems

Global virtual time does not apply to fully distributed systems.

a global time keeper is often practical

Imposing a global time keeper on an architecture may be a bit of a cheat, but scales up well enough for many an application. And many applications can benefit from having a global time, at least for some of its activities.

Request Actors for messages (e.g. JActor) is inefficient :-(

Requiring creation of request Actors for messages (e.g. JActor) is inefficient :-(

Inefficiency is a matter of perspective

It is interesting to think of a JActor request as an actor. But it is not. A request is processed by the fiber it targets, i.e. by the same light-weight thread. But I need a bit of terminology before I can say what a request is.

Actors/fibers in JActor are called reactors, which are composable. The components of a reactor are called blades. Blades are objects with both state and request factories, either in the form of inner classes or functions returning instances of anonymous inner classes.

A request then can be thought of as a blade that is a part of a reactor in that it has intermediate state that can be applied to a blade's state on completion of the request.

Is this inefficient? Message objects are not especially expensive to create or to collect. Deadlock avoidance by design could be considered a cost savings. But encouraging the use of 2-way messaging can be an enormous savings in terms of developer productivity and system reliability.

Case in point. Have you ever seen a benchmark written by someone who was not a master? These are easily identified by comments like "benchmark fails with out-of-memory when more than 100K messages are sent." Ah. Flow control. And flow control is generally part of protocol design and is considerably more difficult than "regular" programming. With method calls (and 2-way messaging), protocol design is generally not needed. And too many developers are not qualified to do it.

Efficiency of message passing is crucial

Because Actor systems rely only on message passing for communication, efficiency of message passing is crucial.

Not possible to avoid deadlock in fully-distributed systems

It is not possible to avoid deadlock (waiting for a response that depends on a resource held by the requester) when calling out in fully-distributed systems with nodes that are not mutually trusting.

But we can scope exchanges to actionable time frames.

It is not possible to avoid deadlock...

We cannot avoid remote behavior causing deadlock, if we allow deadlock. But we can avoid deadlock by not waiting forever, ending incomplete exchanges with an error. A combination of sequence numbers, request windows, and nonce values can be used to validate whether a reply received still makes sense (in case one arrives after we give up).

Robust local behavior requires handling messages in any order of arrival, but also making progress anyway when messages don't arrive in expected time frames. You can always negotiate really long timeouts, if needed and someone pays for it. But it's better to contract for limited scopes that can be canceled once invalid.

(I see crazy ordering of signals in my work environment caused by mixtures of: processes starting and stopping; TCP connections coming and going; async services spinning up and down; UDP exchanges arranging service lifetimes. All you need is a definition of reasonable reactions to failures.)

Deadlocks can be detected and broken but not prevented

In a fully distributed system, deadlocks can be detected and broken but not prevented.

Use application details to prevent dadlocks

A typical example of a deadlock involves the transfer of funds between two accounts. But such deadlocks are avoided in actual practice by placing funds on hold. Both accounts then do not need to be locked when transferring funds.

So while it may be true that an unconstrained fully distributed system will suffer from deadlocks, This is not true for all fully distributed systems.

Is it fair to ask developers to operate under such constraints? Better that, methinks, than asking them to change to an event-based system and have to deal with things like flow control. Yes, from a developer's perspective it is a choice of evils.

But we face a real problem in the industry. Too often a company can not keep up with the throughput requirements of their customer base no matter how much hardware they buy. Conversion to a fully distributed system with good load balancing and no single-points-of-failure is vital, but they just do not see their way to that with the software stack they have come to depend on.

I believe we need a real alternative to threads-n-locks. And actors in their present form do hold part of the answer. But when state-based message selection is rampant, problems tend not to be localized--especially when an application needs to be maintained over an extended period of time, turnover is as it is, and maintainers lack the intimate knowledge of the original developers.