Massive Numbers of Actors vs. Massive Numbers of Objects vs. ????

I'm trying to get a handle on core technology for an application that's going to involve massive numbers of entities - where the entities want to have characteristics that draw from both the object and actor models.

Think of something like massive numbers of stored email messages, where each message is addressable, can respond to events, and in some cases can initiate events - for example, on receiving an email message, a reader can fill in a form, and have that update every copy of the message spread across dozens (or hundreds, or thousands) of mailboxes/folders distributed across the Internet. Or where an email message, stored in some folder, can wake up and send a reminder.

One sort of wants to blend characteristics of:
- messages (small, static, easy to store huge numbers, easy to move around)
- objects (data and methods bound together, inheritance, ...)
- actors (massive concurrency, active)

The topic has come up before, in discussions of active objects, reactive objects, concurrent objects, etc. - I'm wondering what the current state of the art and practice look like.

I'm thinking that Erlang might be a nice operating environment for such a beast, but wonder at what point one hits limits in the numbers of actors floating around. I'm also wondering what other environments might blend these characteristics.


Objects and Actors

If you're looking to blend characteristics, I'd suggest looking at the E language's vat model and similar systems like Constantine's `AsyncScala`, or William la Forge's `JActor`. If you want objects to react to a changing environment, you might also look into Tom Van Cutsem's AmbientTalk or Duncan Cragg's Object Network.

Your solution will probably need to involve databases and orthogonal persistence if you want it to scale. That's something to keep in mind. And while actors support a high level of concurrency, they don't actually scale very well - all the same concurrency problems exist, just with coarser granularity than shared-memory concurrency.

Most state-of-the-art in these directions is unsuitable if you want to upgrade software after you have a lot of objects floating around, or if you have a federated system (no central authority), or if you need heterogeneous but consistent `views` of objects and models.

Failures

It is not clear to me what exactly is wanted, but it looks like it would be some big system. In addition to the upgrade problem, there is a failure problem.

I suggest considering the following facts and defining acceptable failure and upgrade modes:

  1. Hardware breaks and sometimes dies in an unrecoverable state.
  2. Software contains bugs that can lead to inconsistent state that needs to be fixed with partial human intervention.
  3. Software evolves.
  4. Different pieces of a software system evolve at different speeds.
  5. Software evolution often leads to structural implementation changes.
  6. If there are different clients, they are also upgraded at different times, so the system should be multi-version aware.

So for long-lived big systems, orthogonal distribution (like RMI) and orthogonal persistence (like classic OODBs) generally do not work in the long run, particularly for heterogeneous systems. Both of them should be intentional for the software to survive. And this is a lot of hard work, where actors could help only so far. And the biggest challenge for such systems lies in the relationships between nodes, rather than in their internal implementation.

Problems of OO and RMI

Problems of OO and RMI should not be attributed to orthogonal distribution or persistence `generally`.

actors could help only so far

It isn't clear to me that actors are helping at all. Actors make it hard to reason about consistency after partial failure or during software evolution even locally.

the biggest challenge for such systems lies in the relationships between nodes, rather than in their internal implementation

I agree. I've designed a general purpose paradigm around the notion of declaratively orchestrating and maintaining relationships between nodes. Consistency and resilience - before, during, and after failure - are major concerns in the paradigm's design.

relationships between actors, not nodes

I'm kind of thinking that a more appropriate view is "relationships between actors" or "relationships between replicated copies of objects" - make the environment secondary.

Actor, Object, Agent, Node

I'm not picky. A point is a point is a point, and I'm interested in programming (declaratively orchestrating and maintaining) the relationships between points, the edges.

I do model the environment in terms of context objects and the like.

Intentional vs Orthogonal

Problems of OO and RMI should not be attributed to orthogonal distribution or persistence `generally`.

The only situation where orthogonal distribution has been cited to work in an enterprise context is a homogeneous application cluster where nodes communicate with each other synchronously (if we have asynchronous communication channels like JMS, this stops working as well). In all other cases, I have seen some very nice and easy-to-use samples, but in a real application orthogonal distribution breaks on the next upgrade.

On the other hand, intentional distribution works relatively well: CORBA (before value types were added), WebServices, HTTP, etc. But the problem is possibly more apparent for persistence, since sending a message can be seen as the sender storing it in some medium and the receiver retrieving it from that medium.

With orthogonal persistence we save the state of the application into persistent storage. With intentional persistence we project the state of the storage into the program. So the difference is the dependency direction.

In orthogonal persistence, the storage system depends on the application. Changes in the application imply changes in the storage. The problem is that in the normal case, the application evolves faster than the stored information. Even more, orthogonal persistence relies on internal implementation details that are not even visible to clients. If these implementation details change, we lose the ability to access the stored information. And the final nail here is that in the enterprise context, the application is regularly rewritten using other approaches and other programming languages. So orthogonal persistence works only when the lifetime of the data is less than the lifetime of the application version. The data can become inaccessible on the next application release, or even on the next JVM upgrade. There are some situations where these conditions hold. For example, cache systems like Oracle Coherence use orthogonal persistence. But these systems assume that the objects stored in them can die at any moment.

In intentional persistence, the application depends on the storage. Changes in the storage require changes in the application. This allows the application to evolve independently of the storage. The storage evolves at its own speed, or is even radically transformed, but the enterprise system is able to survive that, since there is an explicit, amendable contract between storage and applications.
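
To make the dependency direction concrete, here is a tiny sketch (in Erlang, for continuity with the rest of the thread; the module and record names are invented) of an intentional-persistence contract: the stored form is explicit, versioned data, and the application maps it to whatever its current internal representation happens to be.

```erlang
%% Hypothetical sketch of an explicit, versioned storage contract.
-module(msg_store_contract).
-export([decode/1, encode/1]).

%% Current in-memory representation; free to change between releases.
-record(message, {id, subject, body}).

%% The stored form is plain, versioned data. Old versions stay readable
%% even after the application's internal layout changes.
decode({msg_v1, Id, Subject})       -> #message{id = Id, subject = Subject, body = <<>>};
decode({msg_v2, Id, Subject, Body}) -> #message{id = Id, subject = Subject, body = Body}.

%% New writes always use the current version of the contract.
encode(#message{id = Id, subject = S, body = B}) -> {msg_v2, Id, S, B}.
```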

I suspect that OODBs and other orthogonal persistence technologies are not popular in enterprises because each radical restructuring of IT infrastructure simply wiped out all deployments of such technologies, so they could never get a foothold in the enterprise. Also note that the currently trendy BigData technologies are intentional persistence as well.

It isn't clear to me that actors are helping at all. Actors make it hard to reason about consistency after partial failure or during software evolution even locally.

Actors are helpful for organizing concurrent interactions inside the nodes. With the right model (like in AsyncScala), handling failures is not too different from normal sequential programming. IMHO, interactions with the outside and the related partial failures are a subject of intentional distribution anyway.

Actors in Databases?

That does sort of beg the question of what's a reasonable framework for storing large numbers of long-lived, but hibernating actors. Object-oriented database w/ triggers? Something else?

Web application servers.

Web application servers.

[PS. Begging the question - you're using it wrong.]

some kind of database

mfidelman: ... storing large numbers of long-lived, but hibernating actors.

In your first post you said you were trying to "get a handle on core technology for an application that's going to involve massive numbers of entities." Can you amplify? What's your budget? Is money no object? Or is this a shoestring personal project? It sounds fun as an arm-chair exercise, but painful as a going concern.

Reliability is very expensive once you add more nines, unless you can fail now and then. Will users mind loss? (Most of them do. :-) An easy implementation is very reliable up until total failure.

If you're serious, try to estimate resources needed. Take someone's advice -- dmbarbour's is good. The app might work if you don't need more resources in a period of time than are available, and if you're willing to pay the tax of operational costs to maintain safety in a storage medium for recovery after failure.

amplification

This is a project that's the subject of several proposals for R&D funding - we've done some work on previous funding, looking toward some future steps - so I'm somewhat doing technology assessment and trying to take a system-level view of things. The intent is to build a system that will scale up from a small beginning.

There are a few pieces of this that I'd rather not talk about at this stage, but my original post pretty much sums things up: Think email. The original email architecture consisted of formatted messages, MTAs (Mail Transfer Agents), and MUAs (Mail User Agents). It's all scaled rather amazingly over time. So did NNTP and USENET. Robustness comes mostly from replication (NNTP and USENET are still incredible models for a robust, scalable, distributed messaging system).

The question I pose is: What if email messages are active - addressable, containing executable code (say JavaScript), and capable of sending/receiving/responding to messages. As an application example, think about emailing a collection of linked spreadsheets implemented in JavaScript and communicating via network protocols (rather than, say, emailing Excel spreadsheets - which proves to be very brittle in practice).

What's a conceptual model and processing/storage environment for building enhanced MTAs and MUAs?

Conceptually, I keep coming back to messages that behave like mobile agents that move around via email - with characteristics that are a cross of object-like, actor-like, and message-like.

For implementation, on the other hand, it's clear that storing messages as plain files won't work. For small-scale implementations, I can see treating each message as an actor (or perhaps several synchronized actors). I could easily see doing this in Erlang. But... scaling seems like it could be a problem, and I'm looking for models and technologies.

messages that behave like

messages that behave like mobile agents that move around via email - with characteristics that are a cross of object-like, actor-like, and message-like.

This isn't right. (How would it work with mailing lists, for example?) Think more in terms of e-mailing hyperlinks, sometimes protected by a randomized URL.

Perhaps your goal is to `PUT` an object on a remote host, then `POST` some other mailbox (possibly on the same host) about that object, and also enable messages to be received by the created object. I think this better fits HTTP vocabulary than whatever you're envisioning.
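
Concretely, that flow might look something like the following sketch (hedged: the URLs and payloads are invented, and I'm using Erlang's stock `httpc` client just to keep one language across this thread):

```erlang
%% Hypothetical PUT-then-POST flow: publish an object, then notify a mailbox
%% about it so messages can later be routed to the created object.
-module(put_then_post).
-export([publish/2]).

publish(ObjectUrl, MailboxUrl) ->
    {ok, _} = application:ensure_all_started(inets),
    Doc  = "{\"kind\":\"document\",\"body\":\"...\"}",
    {ok, _} = httpc:request(put,
        {ObjectUrl, [], "application/json", Doc}, [], []),
    Note = "{\"event\":\"created\",\"href\":\"" ++ ObjectUrl ++ "\"}",
    {ok, _} = httpc:request(post,
        {MailboxUrl, [], "application/json", Note}, [], []),
    ok.
```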

(I was quite serious when I suggested web app servers.)

Nope

That's not the model we're working with. Our model is that messages are first-class entities. Think mobile agents. Or for that matter think process migration.

Re. mailing lists: One way of looking at list distribution is simply spawning lots of copies of the original message (each with its own UID, by the way).

Re. web app servers: I understand your reasoning, but there are lots of different varieties thereof, supporting lots of different design patterns. An app server hosting Erlang processes is very different from, say, Tomcat.

Meh.

I believe you work with an inconsistent model. Your proposed uniqueness properties on mailing list distribution, for example, seem to be inconsistent with first-class references between message entities.

For web app servers, it's the API that matters, not the implementation.

unique messages

I don't quite follow your argument. When mail is distributed, directly or via a list, the original message has a unique message-id assigned by SMTP. Each delivered copy is also uniquely identified - by the message-id inserted by the final SMTP daemon along the path. Each delivered copy essentially has a class-id (the originating message-id) and an instance id (the message-id assigned at the destination).

With appropriate communications infrastructure, that makes it possible to address both individual instances, and all instances. Sounds like first-class entities to me. (Note that it's become common practice for HTML email messages to contain 1-pixel images as a tracking mechanism - as soon as the message is opened, and retrieves that pixel, it has essentially reported its existence and location - it's easy to envision a directory mechanism based on this approach.)

In a sense an email list provides that kind of infrastructure. A message containing an In-Reply-To: header is addressing a previously distributed message.

Threaded mail readers, and mail archives, do all kinds of neat things by processing message ids. I'm thinking of an environment where messages do the processing themselves. I.e., a message containing an In-Reply-To: header is (conceptually) passed to the message identified by that header, for processing.

And it still comes back to both a conceptual and implementation question:

1. What's a good conceptual model for a collection of potentially millions of accumulated email messages - each of which is addressable, can respond to incoming messages, and can potentially wake up periodically and do things? From a user point of view, folders full of messages might be sufficient. From a system point of view, not so much - the two standard models don't work very well (each folder as a single mailbox file; or the mh model, where each folder is a directory and each message is a file). A cloud of actors, addressable by message-id, doesn't quite "feel" right. A message database, with triggers, starts to come closer. What really seems called for is something like an "actor store" - a persistent repository where one can "park" huge numbers of hibernating actors, then wake them up when messages arrive and/or based on a heartbeat - with added characteristics of a document database (for storing/organizing/searching/retrieving things from a message-like point of view). (A rough sketch of what such an actor store might look like follows this list.)

2. On the implementation side, one is faced with designing both communications and storage infrastructure.
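
Here's the rough shape I have in mind for the "actor store" - purely a sketch, with invented names, using DETS only as a stand-in for whatever persistent store actually gets used: parked actors live as plain state on disk, and a delivery function revives one on demand.

```erlang
%% Hypothetical "actor store": hibernating message-actors persist as plain
%% state, keyed by message-id, and are revived only when an event arrives.
-module(actor_store).
-export([start/0, park/2, deliver/2]).

start() ->
    {ok, _} = dets:open_file(parked, [{file, "parked_actors.dets"}]),
    ok.

%% Park an actor: persist its state and let the process go away.
park(MsgId, State) ->
    ok = dets:insert(parked, {MsgId, State}).

%% Deliver an event: revive the actor from stored state, run its handler,
%% then park the updated state again.
deliver(MsgId, Event) ->
    case dets:lookup(parked, MsgId) of
        [{MsgId, State}] ->
            spawn(fun() ->
                      NewState = handle(Event, State),
                      park(MsgId, NewState)
                  end),
            ok;
        [] ->
            {error, unknown_message}
    end.

%% Placeholder for the behaviour a revived message-actor would carry.
handle(_Event, State) -> State.
```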

As far as I've gotten:
- HTML+JavaScript is pretty much a given for "active messages"
- standard email is pretty much a given for initial distribution (augmented by list managers, and possibly NNTP)
- NNTP (USENET) is a reasonable model for massive replication of messages (but has some nits re. message-ids and transience of messages)
- NNTP's group/message model provides a high-level organizational framework (fyi: at one point, Netscape had a really nice collaboration server based on private newsgroups)
- message-id's and threading provide the basis for addressing entities
- in essence, the question I'm focusing on now is what would a "smart news store" look like - with long-term persistence, where the stored messages contain executable code, are addressable and can respond to incoming messages in an event-driven way

Both raw Erlang and CouchDB seem reasonable starting points - Erlang as an actor platform, CouchDB as a document-oriented database that can contain executable code - but neither really scales to large numbers of nodes (though I understand there's some EU-funded work going on to scale Erlang - http://www.release-project.eu/).

From an R&D point-of-view, it's a good thing to have challenges :-)

What I'm wondering is if there are other starting points to consider in terms of conceptual models (particularly as instantiated in programming languages), and/or run-time platforms/environments. So far, the only thing that comes close is some of the work on platforms for mobile agents (e.g., IBM's old Aglet work), but a lot of that seems to have died off.

Leaky Conceptual Models

each delivered copy essentially has a class-id (the originating message-id) and an instance id (the message-id assigned at the destination). With appropriate communications infrastructure, that makes it possible to address both individual instances, and all instances. Sounds like first-class entities to me.

The `with appropriate communications infrastructure` sounds to me like `then a miracle occurs` by another name.

Today's e-mail messages are not `addressable` by their identifiers. They are not first-class entities. They are immutable values. There is no inconsistency at the moment in that regard.

Creating and sharing large numbers of objects and actors is fine. But those aren't what you're using. You'll need to start by hammering out the logic and semantics of your particular approach if you are to achieve a consistent model.

Or you can forego consistency, cobble something together, track down the worst leaks one at a time and handle them as special cases, and buoy the rest with hot air. It's an industrial approach, well proven by Perl, PHP, and C++. But you won't find a `good` conceptual model for it. Judging by what you say of your implementation, this seems to be your current path.

umm...

.... the whole point of the project is to DESIGN AND DEVELOP "appropriate communications infrastructure" - or more to the point, leverage and extend existing communications infrastructure.

I'm pretty sure that message delivery is already well handled by existing infrastructure. My focus is on enhancing the message stores at endpoints. And yes, of course, it's necessary to flesh out logic and semantics.

I keep coming back to the conclusion that the core entities (in this case email messages containing executable code) are more than messages, and something that mixes the behavior of objects and actors. People have certainly played with computational models that blend these characteristics ("active objects," "concurrent objects," and autonomous mobile agents come to mind).

I'm looking for some commentary as to the applicability and implementations of such models to the case of managing very large numbers of "entities" - comparable to the volumes of email messages that are commonly stored.

I'm also looking for pointers to technologies that might support persistent storage of very large numbers of hibernating actors (as one possible implementation approach), object databases with triggers (as another), and/or approaches I haven't quite thought of.

Develop and Design

Are you planning to design before you develop, or after?

can we cut the snark, please?

When building large systems, technology/platform selection IS a key aspect of design. If you don't have any comments to offer, how about just zipping it?

Technology and platform are

Technology and platform are important; we should design to implement easily upon the systems that exist. But I suggest you control the dependencies on any specific technology or platform.

Start thinking more carefully about:

  • security and denial-of-service risks
  • forwarding and sharing patterns, and formally modeling them
  • space constraints; destruction or garbage collection

Abandon your uselessly vague concept of "messages that behave like mobile agents that move around via email - with characteristics that are a cross of object-like, actor-like, and message-like."

And remember: LtU is not for design questions (Policy 4a). Take those to more appropriate venues.

And though you've since extended your earlier `umm...` response past its first sentence, you really aren't in a position to complain to me about snark and useful comments.

missed 4a

Guess I missed that policy - and fair enough for reminding me.

Re. technology and platform: Sort of hard to avoid for any large system - you have to start somewhere, particularly when deploying into long-established environments (e.g., the Internet). Richard Gabriel's presentation on "Architectures of extraordinarily large, self-sustaining systems" (const - thanks for the pointer) is quite interesting in this regard.

Re. "Abandon your uselessly vague concept of "messages that behave like mobile agents that move around via email - with characteristics that are a cross of object-like, actor-like, and message-like." We play with such things every day - every time we receive an open an email with embedded or attached HTML and JavaScript - they're more than messages, contain object-oriented code, and behave a lot like actors when interacting with external servers. For that matter, an awful lot of viruses and worms match the same description.

It certainly seems fair to pose the following two questions to the collected knowledge congregated here:

1. Is there a less vague way to characterize an email message containing HTML and executable code - particularly when it's actually executing? How would YOU characterize such an entity?

2. What languages and run-time environments are particularly suited to working with such entities (beyond simply treating them as text, and concatenating them into mbox files). Object-oriented languages beget object-oriented databases, for long-term persistence of large numbers of objects. It seems like there's a parallel requirement for persisting actors in massively concurrent environments - has anybody worked on or seen something like an "actor-oriented database"?

misc impressions

mfidelman: What if email messages are active - addressable, containing executable code (say JavaScript), and capable of sending/receiving/responding to messages.

That's quite interesting, but before anything else, note discussion topics here need a programming language focus to fit in this venue. You can get away with general design questions if it serves the purpose of asking how different languages might fare at one aspect or another. But somehow things must be phrased so folks can discuss how their pet language applies to individual parts. If the game appears slanted to pre-ordained choices, it's unclear what language issues can be discussed.

You might ask folks what language ought to be used as embedded code in messages, or how semantics are influenced by given choices. Just asking for comments on suitability of JavaScript here is helpful. (But perhaps people get tired of hearing about JavaScript; I don't know.) Questions that sound like, "How should I build this app?" don't quite engage the right gears. App implementation is a low priority focus.

The intersection of what you want to know and what folks here want to discuss might consist of questions about semantics of binding state and behavior, and managing identity from both a storage system and programming language perspective.

I could ask several dozen questions about semantics you want when various things happen, that all need answers before the definition of what can happen seems rational and well-formed. Many of the questions have the form, "What happens when X happens 0, 1, 2 or N times?" What happens if a message is sent but not delivered? Or dispatched exactly once? Or dispatched over and over when just once was intended? I can ask a few more questions along such lines, but I think many folks hate this sort of thing.

What happens if you send a message to a stored entity, and the handler for that sends two messages in response, causing four more messages, then doubling again and again until available resources are exhausted? Pick any kind of exponential growth you want. Then consider cases where activity simply never dampens back down to nothing again after initial stimulus. What budget in resources can be consumed by dispatching a message? Who pays for it? When too many resources are used, how does a limit get imposed?

If you replicate entities in cool sounding ways (well I think NNTP sounds like a plausible model), and any given entity appears in more than one place, then how many should receive a message targeted at one entity? Some? All? At least one? How would accounting be done if you wanted to audit what happened? How do you establish whether an entity is present in a replicated store? Via index? With good access time?

How much state is mutable and how much is immutable? If dispatching a message generates new state to be stored somewhere, where does it go? Can you update the original entity? If the original entity is immutable, do changes go somewhere else to apply as updates? If so, how is this done efficiently? If someone wanted to audit what happened, how would they find out? Assume the system acts crazy now and then, or at least you see partial evidence in this direction. How is diagnosis done?

If executable code stored in an entity is JavaScript, what kind of state can this code reference? Where does that state come from? Can it see more than the state of the message it is inside? Is there some kind of sandbox constraining max CPU consumed or max cache lines touched?

Can you write nefarious applications with these mechanisms, like a vendor sending messages to their own spam to see what else can be learned from the target of their spam? (I can see salivating vendors now.)

If you want a large system that scales, does it also scale down? Can a client with few resources productively interact with a very big store of entities? Can you function with whatever resources you have? Run faster when you happen to have more resources?

Parts of this model remind me of HyperCard. It's worth thinking about how it behaved, especially when it also had some poorly defined timing issues. (After an initial event, what causes processing in response to ever end, if one event always causes another to be sent to a new object?) If you wanted to make a new kind of HyperCard, that would be really cool by the way.

I should wrap this up, but let's look at some practical storage issues first. You should avoid making your storage depend on the code. Don't let runtime versions invalidate stored entities. You likely need to avoid anything that sounds like an object database for this reason. You should make the storage means abstract, so different sorts of actual implementation can be used, as long as they conform to a minimum abstract API. The system should not care whether all entities go in one file or database, or one per file, etc. You should run multiple versions at one time, and compare their behavior for equivalence. (It's hard to decide if one system runs correctly by itself without comparing it to another that would have different bugs, if any.)

Reliability through replication sounds like a good idea. But do multiple copies of things change the semantics, the meaning of identity, or the cost of dispatching messages to actors?

You sound like you know what you're doing. But it also sounds like you prefer an aggressive reach beyond your current grasp too, despite having a good grasp. I highly recommend a target goal that sounds simple in summary, because it will actually be much harder in detail. If it sounds hard in description ... that usually doesn't get finished.

mfidelman: What if email

That's quite interesting, but before anything else, note discussion topics here need a programming language focus to fit in this venue. You can get away with general design questions if it serves the purpose of asking how different languages might fare at one aspect or another. But somehow things must be phrased so folks can discuss how their pet language applies to individual parts. If the game appears slanted to pre-ordained choices, it's unclear what language issues can be discussed.

Fair enough. I'm predisposed to Erlang, or a mix of JavaScript and Erlang - Erlang for its unique and mature capabilities for handling massive concurrency, JavaScript because that's the lingua franca of HTML email and web pages.

The basic challenge I'm addressing is building a repository of email messages, where messages contain JavaScript, and exhibit actor-like behavior when opened in a browser, but lose those behaviors when stored in traditional ways (mbox files, mh-style mail directories, SQL databases).

A conventional approach would be to store the messages in a database, and write an engine to mediate retrieving and "reviving" messages in response to incoming events (along with an addressing scheme, etc.).

An approach that appeals to me, conceptually, is to simply wrap each message with a tiny bit of Erlang code, and instantiate each one as a process, linked to something like node.js for JavaScript execution. Each message becomes a first-class entity, behaves as an actor, and I can take advantage of all of Erlang's machinery for handling massive concurrency and message passing. The issues become ones of long-term persistence of large numbers of actors that are spending most of their time hibernating - hence I have a strong interest in what people have thought about and/or built along these lines. (I expect that what's really called for is extending the Erlang VM, to increase the UUID space, and to provide for a more extended and persistent approach to hibernation - otherwise, one would have to persist state in a database of some sort, maintain an additional directory mechanism, and wrap/re-create a lot of Erlang's machinery for managing actors and for inter-actor messaging.)
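
In case it helps make the "message as process" idea concrete, here's roughly what the per-message wrapper might look like - just a sketch with invented names, using `erlang:hibernate/3` to keep an idle message cheap between events:

```erlang
%% Hypothetical per-message actor: each stored message becomes a small
%% process that hibernates whenever its mailbox is empty.
-module(message_actor).
-export([spawn_message/1, loop/1]).

spawn_message(Msg) ->
    spawn(?MODULE, loop, [Msg]).

loop(Msg) ->
    receive
        {event, From, Event} ->
            NewMsg = apply_event(Event, Msg),   %% message-specific behaviour
            From ! {updated, self()},
            loop(NewMsg);
        _Other ->
            %% ignore unexpected messages
            loop(Msg)
    after 0 ->
        %% Mailbox empty: drop the stack, shrink the heap, wait cheaply.
        erlang:hibernate(?MODULE, loop, [Msg])
    end.

%% Placeholder: a real message would carry its own update rules.
apply_event(_Event, Msg) -> Msg.
```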

The short-term solution is probably going to be based on CouchDB - a document-oriented database that handles embedded JavaScript, also handles embedded Erlang as a scripting option, and is itself built on the Erlang/OTP platform.

The intersection of what you want to know and what folks here want to discuss might consist of questions about semantics of binding state and behavior, and managing identity from both a storage system and programming language perspective.

Also something I'm very interested in examining. I'm not aware of a conceptual model (or design pattern, or paradigm, or what have you) for thinking about and working with "active" email messages (i.e., (X)HTML+JavaScript). Different paradigms and semantics apply, depending on what you're doing, how you're doing it, and in what language. They can be "messages", files, database entries, objects, actors (particularly if executing headless, say on a server-side node.js engine). "Mobile agent" is about the closest model I've seen, but I haven't seen anything in the way of mobile agent languages/systems that approaches the clarity or maturity of actors running in the Erlang/OTP environment. If someone has some knowledge of such a paradigm and/or language/environment, please speak up!

Beyond that, you've raised a lot of more detailed questions - some of which we've been thinking about, others that we need to - and I thank you for that!

Parts of this model remind me of HyperCard. It's worth thinking about how it behaved, especially when it also had some poorly defined timing issues. (After an initial event, what causes processing in response to ever end, if one event always causes another to be sent to a new object?) If you wanted to make a new kind of HyperCard, that would be really cool by the way.

Funny you should mention that. I've always thought that HyperCard was the most powerful and accessible environment ever built for end-user coding. I've seen some incredible things built in it, mostly by educators (including some language-teaching apps written by my ex-wife). Nothing since has ever come close in power, and it sure would be nice to recreate something like it. (Tilestack sort of tried to do that, in a browser, but all tied to server-side code. A HyperCard-like environment, overlaid on JavaScript, all running locally, would be an incredibly nice tool.) Then again, there's a sociological discussion to be had about why so many C and Java coders seem to discount the basic possibility of end-user programming :-)

I should wrap this up, but let's look at some practical storage issues first. You should avoid making your storage depend on the code. Don't let runtime versions invalidate stored entities. You likely need to avoid anything that sounds like an object database for this reason. You should make the storage means abstract, so different sorts of actual implementation can be used, as long as they conform to a minimum abstract API. The system should not care whether all entities go in one file or database, or one per file, etc. You should run multiple versions at one time, and compare their behavior for equivalence. (It's hard to decide if one system runs correctly by itself without comparing it to another that would have different bugs, if any.)

Is this really possible though? A lot of my background is in networking and protocols - so I very much agree with the notion of implementation independence. But... there's an awful lot of leverage to be gained from building on something like CouchDB - which brings with it some serious implementation dependencies.

Reliability through replication sounds like a good idea. But do multiple copies of things change the semantics, the meaning of identity, or the cost of dispatching messages to actors?

I'm not too worried about the costs. The primary application target is keeping multiple copies of documents synchronized (at least loosely) - the number of copies of any specific document, and the frequency of updates will be fairly low.

I'm converging on a model that borrows from NNTP and distributed software version control (particularly DARCS) -- groups, threads, and patches: the replicated message model of NNTP, with a document as the first message in a thread, and subsequent messages acting like patches. A primary reason for treating a document as an actor is to allow for embedding document-specific rules for applying patches and resolving conflicts. In essence, after it's distributed, a document subscribes to the message stream, updates itself as messages come in, and optionally generates alerts to tell users that a change has occurred. (Think of a book, sitting on a shelf, subscribed to a stream of errata, updates, reader discussion, etc. - capable of updating itself, and flashing an LED to tell you that there's something worthy of attention. Or a document, attached to an email, buried 5 layers deep in a folder, doing something similar.)
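
A bare-bones sketch of that document-as-actor idea (invented names again; the interesting part is that the merge rule travels with the document rather than living in the store):

```erlang
%% Hypothetical document actor: the first message in a thread becomes a
%% process that folds later messages (patches) into its body using its own
%% embedded merge rule.
-module(doc_actor).
-export([start/2, loop/1]).

-record(doc, {id, body = <<>>, merge}).   %% merge :: fun((Patch, Body) -> Body)

start(Id, MergeFun) ->
    spawn(?MODULE, loop, [#doc{id = Id, merge = MergeFun}]).

loop(Doc = #doc{body = Body, merge = Merge}) ->
    receive
        {patch, _FromMsgId, Patch} ->
            loop(Doc#doc{body = Merge(Patch, Body)});
        {read, From} ->
            From ! {doc, Doc#doc.id, Body},
            loop(Doc)
    end.
```

A trivial merge rule would just append, e.g. `start("doc-1", fun(P, B) -> <<B/binary, P/binary>> end)`; a real document would carry conflict-resolution rules instead.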

You sound like you know what you're doing. But it also sounds like you prefer an aggressive reach beyond your current grasp too, despite having a good grasp. I highly recommend a target goal that sounds simple in summary, because it will actually be much harder in detail. If it sounds hard in description ... that usually doesn't get finished.

Aww shucks... and I hear you. The good news is that part of what we're doing is funded R&D, so having challenging problems is a good thing :-) The bad news is that we're also trying to release real products and services, with limited resources, that need to be deployed into real-world environments (i.e., old browsers, users who are not going to be installing Erlang on their desktops, etc....). Short term is almost certainly going to be Couch-based (unless I hear a good alternative very shortly), running mostly server-side - with hooks for moving to a more distributed/P2P approach over time. Trying to keep the longer-term directions in mind so we don't go down design ratholes that screw us down the road. (I.e., most of this discussion is academic in the short term, but might help avoid problems as we move forward.)

Thanks!

CouchDB

FWIW, the creator of CouchDB says:

The CouchDB product has a lot of features, but is too slow, unable to keep up with high loads and inability scale-out on it's own.

[Oh, and they're switching from Erlang to C.]

Architectures of extraordinarily large, self-sustaining systems

BTW this presentation might be interesting for you. At least it will give you some good questions.

Slides?

I don't suppose your slides are available?

This is a very nice

This is a very nice presentation, but it is not mine. You could try asking infoq or the author for them.

that's what I get for....

... responding to things late at night. Ooops.

Really interesting presentation. Thanks (I bit the bullet and spent an hour watching the video).

develop actor-oriented database

Yes, I believe you need an actor-oriented database - as an extension of an object-oriented one. I wonder why such a beast doesn't exist yet. It is not that hard to develop. For durability, the transaction log should just record changes in an actor's state together with changes in the token queues, including the queues that connect the database with the outside world. As a starting point, take some open-source Java NoSQL storage.

elaboration?

I'm sort of wondering that too. I would think that any large actor-oriented system needs some kind of persistent storage for large numbers of hibernating actors.

"java noSQL" is at the same time pretty specific, and pretty generic (eXist and neo4j are very different beasts, for example) - might you elaborate on your thinking as to:

i) Java (as opposed to, say, an Erlang NoSQL database), and,

ii) what flavor of NoSQL?

(I've been leaning toward starting with a document-oriented database - particularly Couch or eXist - both of which seem to have some functionality relevant to this problem space.)

It is scientific workflow with transactions and provenance

Java vs Erlang: Java has a much larger codebase, as well as a larger number of programmers. And there's nothing in your task that Java cannot cover.

NoSQL: I meant some key-value store to keep your e-mails. However, the key point is that you have to keep the state of the whole process, so that after a restart no action is repeated twice and no action is forgotten. Moreover, you need to save the whole process history in order to answer questions like "why did/didn't you send that email". The state of the process consists of the actors' state and the token queues that connect the actors. A transaction includes retrieving the input token(s) and pushing the output token(s) after executing an actor's handler. Talking of actors and tokens, I mean a dataflow execution model.
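
Sketched in Erlang for continuity with the rest of the thread (the suggestion above is Java, and the names here are invented): each step consumes an input token, runs the actor's handler, and appends the consumed token, the new actor state, and the emitted tokens to a log - which is what makes replay and provenance questions answerable after a restart.

```erlang
%% Hypothetical append-only journal for the dataflow/transaction idea above.
%% disk_log ships with OTP; the handler and record layout are placeholders.
-module(dataflow_journal).
-export([open/0, step/3]).

open() ->
    {ok, _} = disk_log:open([{name, journal}, {file, "journal.LOG"}]),
    ok.

step(ActorId, State, {in, Token}) ->
    {NewState, OutTokens} = handle(ActorId, Token, State),
    %% The log entry is the unit of recovery: replaying it after a restart
    %% tells us which tokens were consumed and which outputs were emitted.
    ok = disk_log:log(journal, {ActorId, Token, NewState, OutTokens}),
    {NewState, OutTokens}.

%% Placeholder handler.
handle(_ActorId, _Token, State) -> {State, []}.
```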

Look at whether existing scientific workflow engines support transactional persistence. If not, start with any transactional database and build a working prototype.

Io

If you're interested in something low-level that allows for efficient storage of objects with some actor-like properties, you might consider Io (http://iolanguage.com). Having said that, over most of its history Io has been fairly buggy ... but having said THAT, I haven't tried it for a couple of years and very likely it's better now.

1. Is there a less vague way

1. Is there a less vague way to characterize an email message containing HTML and executable code - particularly when it's actually executing? How would YOU characterize such an entity?

Make logical distinctions between the message, the interpreter, and the object created by the interpreter upon receipt of the message.

Object-oriented languages beget object-oriented databases, for long-term persistence of large numbers of objects. It seems like there's a parallel requirement for persisting actors in massively concurrent environments - has anybody worked on or seen something like an "actor-oriented database"?

I think this isn't a useful line of inquiry. Databases are already concurrent, right? The main issue is that you'll need to store a script representing actor state, which means you'll need a suitable language - one that can serialize closures.
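
For what it's worth, here's a hedged illustration of that last point in Erlang (names invented): `term_to_binary/1` will happily serialize a closure, but the stored fun is tied to the exact version of the module that defined it, so storing plain data plus a named callback tends to survive upgrades better.

```erlang
%% Hypothetical snapshot of actor state: data plus a {Module, Function}
%% behaviour reference, rather than a serialized closure.
-module(actor_snapshot).
-export([save/1, restore/1]).

%% A snapshot is {CallbackModule, CallbackFunction, StateData}.
save(Snapshot = {_Mod, _Fun, _State}) ->
    file:write_file("actor.snapshot", term_to_binary(Snapshot)).

restore(Event) ->
    {ok, Bin} = file:read_file("actor.snapshot"),
    {Mod, Fun, State} = binary_to_term(Bin),
    apply(Mod, Fun, [Event, State]).   %% revive by running the named handler
```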

2. What languages and run-time environments are particularly suited to working with such entities (beyond simply treating them as text, and concatenating them into mbox files).

In addition to persistence issues, language-based security and resource control (to limit denial of service attacks) are very desirable features.

Content-centric networking; Edge-side processing; Social network

PARC's content-centric networking (CCN) is some of the finest available thinking regarding how to build large-scale networked systems. Here are my bookmarks on the topic (the older ones are more general and introductory).

Your post is somewhat vague - in general my feeling is that doing lots of active stuff in the center of the network is problematic. I would try to move activity to the edges, and make the network storage dumb and static (static file serving and joins of sorted files at "planetary scale" is a solved problem - cf. Google).

Every one of your clients probably has multiple, over-provisioned computing devices, some of them always-on (cell phones will soon have 50Mbit/s uplinks) - try to let them do the work.

[One more idea: one of my slogans is to "make the social network the content distribution network". A lot of data has social characteristics. Subscribers/followers (in the Twitter sense) of a data source have an incentive to locally cache the upstream data: makes it easier to build local views, indexes, and aggregations. Now: have subscribers of a data source build a swarm, like BitTorrent peers, for distributing that data further. Probably requires something like Bitcoin for fairness -- OK I'll stop now :)]

cell phones will soon have

cell phones will soon have 50Mbit/s uplinks

I'd really like to see work done towards optimizing for energy and power consumption.

Smart phones today typically have 1500-2500 mAh batteries, and advertise circa 10 days idle, 7 hours speech, but they burn far faster than that if performing heavy computation or data transfer. People reasonably complain when their advertised 7 hours expires after just 1.5 hours on Youtube.

Putting more of the computation `in the center` - or in a cloud - seems a useful way to shift energy costs from where they are expensive to where they are cheap. This must, of course, be balanced against the energy costs for communication. One might also shift server functions to the device, and vice versa, to save on communication costs. But to perform this optimization - without creating many versions of the same application - will require a good language for code distribution (which must also address security concerns).

Pushing things "into the

Pushing things "into the cloud" has all kinds of negative ramifications for who controls what, privacy, and resiliency. It also ties your operations to having reliable connectivity. None of which are desirable characteristics in military, public safety, crisis response, and similar applications (where I tend to work). For that matter, not really useful if you're a revolutionary in Syria either.

Into the Cloud

There are "ramifications", sure, but I think you pessimistic to dismiss them as "negative". We'll need to explicitly or formally address some challenges that have historically been implicit or informal, but that isn't a bad thing.

Imagine the devices as `part of` the cloud, not distinct from it. That is, the cloud has access to a device's CPU, bandwidth, energy, and storage via a service - though it might need to pay the device's owner for the resources it uses. When a client pays for an application, part of that payment is returned to the client when the cloud places the application code on the device. Naturally, the payment is asymmetric - there is profit to be made. And developers will be naturally encouraged to create *efficient* apps to increase the profit margin. This is a good thing.

Code and data distribution to devices then obeys the same rules as distribution to any other part of the cloud. If the cloud programming language ensures graceful degradation and resilience in case of partitioning, then we achieve implicit support for offline and persistent use.

In military, crisis response, and similar applications the cloud is a powerful basis for resilience, storage, and information processing. A cloud can minimize loss of information due to loss of a device. It can fuse information from thousands of sensors and sources, and summarize a cohesive operational picture. It can compute functions and maintain indexes that would otherwise be replicated thousands of times.

Of course, it is valuable to replicate portions of that operational picture onto the user's device in case of partitioning. But that can be understood as a feature of the cloud itself and the `zero-tier` programming language for the cloud.

The cloud is the future. Applications run in the cloud. Devices are participants in the cloud. Some application code runs on devices because that's the best place to run them, but shards of the same application might be distributed across hundreds of devices and machines. Code and data will flow easily - constrained only by economy, security policy, partitioning semantics. Immobile capabilities (like access to a touchscreen) can be expressed by the same mechanism as a security policy.

The cloud will be federated - no single business or country to hold it hostage. Cloud service providers will be forced to compete very directly due to cross-cloud-service compilers and optimizers - they'll be banking on reputation, physical security, and efficiency. As idle desktop computers join and start auctioning some of their extra CPU, memory, and disk - though suitable primarily for low-sensitivity computations - prices will plummet.

I think it inevitable. But there are things holding us back. We need suitable paradigms and languages. We also need support for an efficient fine-grained economy, with payments measured in tenths of pennies or less, and secure but mostly-anonymous transfers.

People fear for ownership and privacy and such in the cloud. But a well designed, federated cloud will better secure your assets than anything you're currently using - and, at worst, will degenerate to using a fixed set of devices due to stringent security policies.

Fine, but not where I'm going

But a well designed, federated cloud will better secure your assets than anything you're currently using - and, at worst, will degenerate to using a fixed set of devices due to stringent security policies.

Reasonable men may disagree. My experience, in the environments I build systems for, tells me to push everything you can to the edge, and expect anything in the middle to be unreliable, if it's available at all. Basic design premise: smarts and replicated storage at the edges, intermittent and opportunistic IP connectivity in the middle, of varying bandwidth - and in some cases, data gets through by sneakernet.

Having said that, one can envision a "cloud" where all the computing is in the edge nodes, but that leaves open all the implementation details - and that's certainly not where the mass of work on cloud computing is going (all the federated cloud architectures I know of assume that the computers involved are fairly powerful and linked by reliable, high-bandwidth networks). Now if you know of anybody working on a cloud architecture, where the individual nodes are designed to run on smartphones, with intermittent peer-to-peer connectivity - please let me know.

I still come back to the underlying question: what are good ways to model the things running in that cloud? In this regard, I find the WebOS approach interesting, along with the current work on "boot to Gecko" - first class entities, running locally, linked across the net - very thin o/s. Working through what it takes to build systems out of such entities is an interesting question.

smarts and replicated

smarts and replicated storage at the edges, intermittent and opportunistic IP connectivity in the middle, of varying bandwidth - and in some cases, data gets through by sneakernet.

If that is what you design for, that is undoubtedly what you'll end up with - inefficient, inconvenient, and likely inconsistent.

one can envision a "cloud" where all the computing is in the edge nodes

Once you put computing on nodes of the `far` edges (in the geographic sense) there is little point to excluding computation in the middle. We could make Amazon servers look just like edge nodes as far as anyone is concerned. Such would be closer to my own vision.

all the federated cloud architectures I know of assume that the computers involved are fairly powerful and linked by reliable, high-bandwidth networks

If you research the keywords `survivable networking` and `disruption tolerant networking`, you'll find research towards systems that do not make these assumptions. My postgrad studies were deep in these areas, and even my dayjob work touches them on occasion - e.g. regarding command and control of UUVs, in UGV tunnel-mapping projects, safety concerns for teleop.

Anyhow, it is the assumptions our software makes that matters most. If disruption tolerant and resilient software happens to be hosted on a reliable backbone architecture, that's hardly a bad thing.

if you know of anybody working on a cloud architecture, where the individual nodes are designed to run on smartphones, with intermittent peer-to-peer connectivity - please let me know.

My work on Reactive Demand Programming (RDP) is well suited for the role. RDP supports graceful degradation and resilience in case of disruption, and RDP externalizes state and disfavors event processing, which simplifies several challenges related to maintaining and upgrading code distributed to remote systems without losing work.

Gilad Bracha's vision of Full Service Computing and Serviced Objects is also inspiring, though he never did get far on the whole maintenance problem.

Tom van Cutsem's vision of Ambient Oriented Programming is also applicable, as are the RESTful communication patterns he uses between objects. I mentioned AmbientTalk, an implementation of AMOP, in my first post in this thread.

I'm biased, of course, but RDP - with a few design patterns developed way before RDP - is the most promising architecture I know. And I have my own patterns for ambient systems programming.

I still come back to the underlying question: what are good ways to model the things running in that cloud?

The idea of entities or agents interacting in a system is alright. But you need to ensure the system stays maintainable, no matter what. This suggests keeping to a relatively small number of agents or entities. An idiom I developed years ago was `agent per interest`, where agents and the interests they represent are directly tied to and maintained by humans or human organizations. This implies:

  • Agents should be unable to `spawn` or `create` new agents. Agents are the unit of maintenance, and we don't want to create more to maintain. An alternative to creation is discovery and composition.
  • We must always be able to `kill` an agent, almost immediately halting actions and recovering associated resources.
  • We must be able to upgrade an agent at runtime without losing work - i.e. without losing important state. A RESTful design helps here - put all important state in an external database or service.
  • We must be able to administrate every agent, and view relevant state and code. I would encourage that access to state be tied directly to some graphical representation or location in code, and I would discourage creation of `internal` objects that might clutter the administrative view.
  • The number of agents, globally, is roughly bounded by the number of active human relationships - perhaps a few hundred agents per human, more for human organizations. Usefully, this tight coupling helps ensure that computers serve humans even as we scale.
  • Since we don't have too many agents, we must extend the intelligence and reach of individual agents - the ability to gain necessary data for a decision and to effect the decision.

Security, stability, extensibility, and predictability are also essential. We need to understand what our agents are doing, to locally reason about the global effect and the risks or vulnerabilities we accept.

RDP was designed with this in mind - nothing new, external state, declarative orchestration with reactive semantics for reach and intelligence, killable behaviors (via duration coupling), and three-fold security (object capability model, implicit revocation membranes based on reactive sharing, and a stable contagion/weakening model to support and constrain automatic code distribution).

But RDP isn't the only place you can find a similar design. You can see a lot of the above properties if you look into blackboard metaphor, RESTful web app servers, concurrent constraint programming, and similar.

Anyhow, from my perspective, you're simply working on the wrong problem. Scaling up the number of actors or objects has the nasty side-effect of scaling up the maintenance burden and does little to actually improve human intelligence or reach.

If you research the keywords

If you research the keywords `survivable networking` and `disruption tolerant networking`, you'll find research towards systems that do not make these assumptions.

Well, here you're talking about MY field of expertise - I spent a lot of my career at BBN, working on packet network architectures, management, and security. It's all about routing around congestion, failures, and destruction. The goal is to maintain connectivity. The research does NOT focus on how you design computing systems around networks that ARE disrupted, or faced with long delays, low bandwidth, or cases where you don't have (or want) connectivity (say, tactical environments where you want to minimize observability), or just working on an airplane with your wifi turned off. In these environments, the issue becomes how to design information and processing architectures around eventual consistency (or around an expectation of inconsistency - in this regard, google Hewitt and 'inconsistency robustness').

My work on Reactive Demand Programming (RDP) is well suited for the role.

I have to say that I can't seem to find a concise definition of the concept. From what I can find, it looks like RDP is aimed more at real-time control systems, than at information management systems with less rapidly changing data, and longer time horizons.

Don't see the relevance of Full Service Computing or Serviced Objects to what I'm working on. AmbientTalk is more interesting - but highly experimental, and seems to essentially combine actors with capabilities a la E (I'd sure like to see some of E's concepts integrated into Erlang, or just more generally adopted and supported).

Security, stability, extensibility, and predictability are also essential. We need to understand what our agents are doing, to locally reason about the global effect and the risks or vulnerabilities we accept.

If anything, the real lesson of large systems is that they have to be self-maintaining and self-healing, and tolerant of very wide ranges of unpredictable behavior. Packet networks are a case in point - you can't track or manage individual packets, or even flows - you build in timeouts and retransmits, provide for dynamic routing algorithms, and do your best to keep capacity growing ahead of load (and fall back to congestion control when you can't.)

In my particular case, I'm looking to extend the capabilities of demonstrably robust messaging systems (notably email and NNTP), by adding a level of "smarts" to messages that goes a step or two beyond what we're already doing by allowing HTML and JavaScript in messages. My sense is that key is a combination of getting the entity model right and designing/using the right protocols for their interaction.

Survivable Networking

It's all about routing around congestion, failures, and destruction. The goal is to maintain connectivity. The research does NOT focus on how you design computing systems around networks that ARE disrupted, or faced with long delays, low bandwidth, or [...]

You and I must have studied different subsets of that field, then.

While regaining connectivity was a big part of what I studied in my Survivable Networking courses, so was designing software itself to gracefully degrade, perform useful work, and resiliently recover under these harsh conditions.

it looks like RDP is aimed more at real-time control systems, than at information management systems with less rapidly changing data, and longer time horizons

RDP is designed to support both slow-changing and fast-changing data in the same system, in real-time. Though, RDP doesn't offer any miracles. Developers do need to make reasonable estimates about what is likely to change fast or slow, and must structure their programs accordingly if they want an efficient program.

A well structured RDP application will keep the slow-changing bits in hibernation easily enough, and potentially even specialize for slow-changing data (i.e. by partial-evaluation or tracing JIT).

the real lesson of large systems is that they have to be self-maintaining and self-healing, and tolerant of very wide ranges of unpredictable behavior

Resilience (self-healing) is important, but tolerance is too vague - there are many ways to go about achieving it, not all of them good. Rather than limping along, for example, I prefer approaches that fail in consistent ways at predictable boundaries (the software equivalent of fuse boxes) where I may specify an alternative.

In RDP, for example, I represent runtime failure as network disruption, which can be detected at special network proxy behaviors. RDP's design is such that I can make the disruption logically atomic, which enables clean transition to a fallback behavior for graceful degradation. Then RDP's reactive semantics allow me to quickly re-establish connectivity when the cause of failure is alleviated, providing resilience.

I'm looking to extend the capabilities of demonstrably robust messaging systems (notably email and NNTP), by adding a level of "smarts" to messages

We do what we must, because we can?

at the edge

Re. content centric networking: I respect Van Jacobson a lot. I'm actually surprised that we haven't seen more coming out of his work on CCN. A few papers, but not a lot of "product" to beat on yet.

Re. at the edge: That's actually the model we're following - NNTP-style message distribution (highly resilient, no central points of failure), document storage at the edges, with multiple copies kept in sync via message passing. (Actors linked by a P2P network, in essence.) The question I'm focusing on is what the actor environment at the edges looks like.

"Green" actors; MapReduce; Serializable continuations

I'm thinking that Erlang might be a nice operating environment for such a beast, but wonder at what point one hits limits in the numbers of actors floating around.

I would not recommend mapping your actors 1:1 to an underlying process abstraction, be it Erlang or OS processes. If you did that, you'd essentially limit yourself to that substrate's interface and implementation.

Processing quite large datasets with MapReduce has become widely understood, and offers interesting human fault-tolerance - maybe you can map and reduce your problem to (repeated runs of) MapReduce? (pun intended)
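
As a toy illustration of what "mapping the problem onto MapReduce" could mean here (entirely hypothetical, and written in Erlang like the other sketches in this thread): one batch pass maps each stored message to a {ThreadId, MsgId} pair and reduces by thread to rebuild a thread index. The same shape could run as repeated jobs on an actual MapReduce framework.

```erlang
%% Toy map/reduce pass over a message store: group message ids by thread.
-module(thread_index).
-export([build/1]).

build(Messages) ->
    Pairs = lists:map(fun(#{thread := T, id := Id}) -> {T, Id} end, Messages),
    lists:foldl(fun({T, Id}, Acc) ->
                        maps:update_with(T, fun(Ids) -> [Id | Ids] end, [Id], Acc)
                end, #{}, Pairs).
```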

A completely different, maybe opposite, approach would be that of serializable continuations (the most extreme example being Kali). But that has never been used on a large scale.

but that's the point

Granted, this isn't practical today, for the application I'm working on, but... It's fairly common to map things 1-1 onto objects. Some problems lend themselves better to a model where each object is "active" (i.e., everything is an actor). Seems like a reasonable research focus.

Kali certainly sounds like an interesting approach to what I was originally asking (an entity model that is somewhat message-like, object-like, and actor-like at the same time). Thanks for the pointer! (Sort of a shame that it seems like a dead project.)