Eventually consistent distributed STM

An implementation based on a paper I posted on LtU a couple years ago. It turned out to offer a great programming model, which looks like REST but dynamic, like Google Docs vs. a static Web site. The eventual consistency aspect allows server scalability and offline modification of data for clients. Versions merge later when connectivity is back, like distributed source control. link

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Types for Eventual Consistency

From the blog:

Updates are accepted immediately, and conflicts resolved later between versions in a deterministic way on all machines. This resolution is currently arbitrary, the system just picks a version.

For partitioning tolerance, you might interested in pursuing cloud types for eventual consistency rather than whatever ad-hoc mechanism you're currently using. Similarly, Logoot: a P2P collaborative editing system describes a commutative replicated data type aimed at text editing. (These types would mostly support update arbitration, orthogonal to to the transactions model.)

Full partitioning transparency is a bad idea, as there are often timing-sensitive operations in real-time systems. How do you make partitioning observable to interested subprograms of a web-app or server?


Existing eventual consistency systems do not seem to fit well with transaction-based updates. A transaction can modify an arbitrary number of objects, and the system cannot reorder only part of it without breaking the semantic. It has to order the whole transaction against others, and cloud types work at the object granularity. I have not figured out the exact pro and cons of each approach, but one benefit I see is that if you modify a large number of objects in a transaction, you still only create one timestamp, so overhead might be lower.

Updates inside a transaction are actually extremely fast, in the hundreds of millions per second. The overhead comes later, but only once, when the transaction is committed, gets the vector clock info and is serialized.

For partitions, the system is currently very basic, servers send all writes to each other and clients connect to a random host. I am targeting small clusters, like the master/slave setup used for databases, with a simple usable system. Please let me known if you have comments on things that need to be done before you feel you could use it in an actual project.

Merging Transactions

Merging transactions are hardly transactions at all, IMO. Yet if you're performing conflict resolution after the transaction has committed, that's basically what you're doing. Above, you basically said you're using a very simplistic merge function (arbitrary but deterministic winner). Where EC types can help is formalizing the notion of `merge` at the object level, offering determinism without ad-hoc loss of information.


For merging I rely on the STM, which works at the object field granularity and only supports commutative types. When a transaction commits, it merges its write set by updating e.g. a specific map key or object field. If all transactions impact different keys and fields no information is lost.

If two transactions update the same key or field, information has to be lost. They are applied one at a time atomically, so the value that ends up being chosen to resolve the conflict depends on which order transactions are applied. This is where the user should be involved to make a decision, but for now an arbitrary order is picked.

Write skew

Is it true that all STM implementations suffer from some kind of write skew?

Not only STM but also the Oracle RDBMS or any other relational database that is based on Multiversion Concurrency Control (MVCC)?

Write skew

I don't know exactly for databases, but STMs have various trade-offs for isolation. OF can commit a lower number of tx/s than other implementations, but a higher number of ops/s if you batch them in large tx and there is low contention. It offers view isolation, which means transactions seem to run in consistent and stable snapshots of the whole memory and have no write skew.

Constraints between point objects

Let's assume constraint x1 == y2 between two point objects (x1,y1) (x2,y2). Now let's say we have a transaction T1 that alters x1, and a transaction T2 that alters y2, with x1 != y2.
There is no data conflict when merging T1 and T2, but the constraint does get violated. Does your system have a way to efficiently meet such constraints or at least express them?

Constraints between point objects

No, there is no way to express a constraint. It would be great by the way. Consistency is only in the sense that a transaction can commit if the values it read have not been modified by another since it started.

Implicit object constraints

I believe that there are always implicit constraints between objects in a complicated object graph. So applying STM naively on such graphs would probably violate their constraints. I think there is a way out by making constraints between objects explicit.

However, I've no idea how to enforce constraints efficiently in a STM setting. I think that is an interesting research topic.

Virtual Time

This reminds me of David Jefferson's work on
Virtual Time and the Time Warp operating system... I've only recently discovered these, but perhaps there are some useful concepts there to work into ObjectFabric, particularly anti-messages.

Can you explain further?

Please see "The Dangers of Replication and a Solution", by Jim Gray, Pat Helland, Patrick O'Neil, and Dennis Shasha. The paper compares eager and lazy replication (x-axis) against group and master schemes (y-axis) and describes the trade-offs as laws of nature. They then propose an innovation, two-tiered replication, bifurcating transactions into base transactions and tentative transactions.

Can you please come again, and explain how you solve the scalability issues?


OF does not do anything new compared to a NoSQL store, it scales by not providing strong consistency. The innovation is that the eventual consistency information is embedded in the client resource representation instead of being valid only within a system like a Cassandra cluster, so it can be used over the Web to enable offline clients etc.

Is that a response to my question?

The paper I linked, although written before the Web became mainstream, deals with the same kinds of questions the Web deals with now, and is remarkably well-written. Offline clients are simply mobile clients with indefinite, unbounded downtime.

Although the paper is written in the context of ACID and not BASE, the advice given is seamless.

More info

Yes I meant to reply. I just posted an update with more info about the internals and vector clocks on the wiki.