"Relational Model Outgrown" CACM May 2013

Summary:
Relational databases have been an outstanding success and have become dominant in the commercial world. However, computing has changed dramatically in the decades since the Relational Model was developed. Consequently, the Relational Model and SQL have become obsolete because of the limitations listed above, and innovations like the Actor Model and ActorScript [1] are required to address current and future needs.

References:
1. Hewitt, C. ActorScript. arXiv:1008.2748, Cornell University Library, May 2013. http://arxiv.org/abs/1008.2748

URL:
https://docs.google.com/file/d/0B79uetkQ_hCKb20xSFUzUkg3NzQ/edit?usp=sharing


If a systems researcher

If a systems researcher wrote a reply with this title, they wouldn't be talking about actors; rather, they would be talking about key-value stores, NoSQL, MapReduce, and so on.

As someone who teeters between systems and PL, I have a very hard time making sense of this.

Whither databases?

Key-value stores, NoSQL, MapReduce, and so on have no inferential semantics.

But pure logic programming is insufficiently performant and semantically insufficient. So what to do?

Right. In systems it's all

Right. In systems it's all about scaling, performance, and perhaps consistency, so the popular restricted programming models are aimed at these qualities. MapReduce is extremely simple semantically, and so implementations are able to pipeline IO like crazy. More general-purpose models, like say logic programming or even actor models, don't really provide similar opportunities.
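
To make "simple semantically" concrete, here is a toy sketch of the MapReduce contract in Python (my own illustration, not Google's implementation): because map and reduce are pure functions over key-value pairs, the runtime is free to partition, sort, and stream the IO however it likes.

    from collections import defaultdict

    def map_fn(doc):
        # Emit (word, 1) pairs; pure, so documents can be processed
        # in any order, on any machine.
        for word in doc.split():
            yield word, 1

    def reduce_fn(word, counts):
        # Associative, so partial sums can be combined as they stream in.
        return word, sum(counts)

    def map_reduce(docs):
        groups = defaultdict(list)
        for doc in docs:                       # map phase
            for key, value in map_fn(doc):
                groups[key].append(value)      # shuffle: group by key
        return dict(reduce_fn(k, v) for k, v in groups.items())

    print(map_reduce(["to be or not to be"]))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}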

Now flip through the most recent OSDI proceedings; there are systems like Spanner, which is only semi-relational but geo-replicated. Again, restricted semantics in exchange for scaling seems to be the big trend in pragmatic systems.

Requirements to supplant Relational Model

Of course supplanting the relational model requires:
* Mapping, Reducing, and being able to pipeline like crazy
* Geo-replication

Why do you think that restricting semantics helps increase scalability?

Why do you think that

Why do you think that restricting semantics helps increase scalability?

Restricting semantics helps you optimize the heck out of IO, which is all that matters these days. Distributed shared-memory systems are accordingly basically dead, since they are too flexible to scale. I'm not sure about actors, but I haven't heard much about how a distributed actor model allows one to optimize IO in a cluster; it doesn't seem to be something that people talk about (yet?).

Can you give me an example?

From what I can tell of most NoSQL research, they don't have a semantic model in the sense Carl Hewitt or Ted Codd would use the phrase. Ergo, they don't have restricted semantics, since you need to have something in the first place in order to restrict it.

For Functional Query Languages, see Ryan Wisnesky's work for the state of the art.

And "all that matters these days" is that researchers continue to get more and more arrogant about what I do or don't need, with little regard to how to solve technical contradictions through real innovation rather than re-arranging buzzwords to form a new programming model. How many different ways can computer scientists cost me 4 hours on the phone with some company because my bill got screwed up?

the center of the tootsie roll pop

How many different ways can computer scientists cost me 4 hours on the phone with some company because my bill got screwed up?

The famous "software crisis" that was a hot topic in the 1970s ended in practice when the capitalists realized there was money to be made on the margins of infrastructure that was maintained indefinitely in a state of imminent failure (thus needing constant but constantly barely adequate reinvestment). It's similar to the seductions of easily-obtained but fiscally-entrapping unsecured consumer credit or too-easy home mortgages. It's kind of a pyramid scheme. Take your money and leave the resulting infrastructure for two or more future guys to clean up.

"computing has changed dramatically" ??

You mean most commercial computing is no longer concerned with payroll, sales ordering, manufacturing, accounting, airline booking, banking, utilities billing?
Perhaps on your planet (wherever that is).

Two points. First, not all

Two points.

First, not all of your listed applications require relational; there are other points in the design space that can accommodate them. It just so happens that relational won for a long time.

Second, a lot of new and (dare I say) hot application areas are simply not appropriate for relational. They require scaling and performance that relational can't really provide. The SQL databases aren't going away, but Oracle isn't selling more and more of them every year; the market has completely matured and is possibly beginning to shrink somewhat.

I beg to differ.

Not all applications require relational databases, but most do! Other data structures might be able to handle large amounts of data, but not as easily and not with the maturity of relational. I remember like it was yesterday when "mainframers" (1990) said that relational was "mickey mouse" and would never be able to handle large amounts of data. Who was right?

"Scaling and performance" can't be provided by relational? Have you seen the clustering and other options for relational from Oracle? They might not be selling substantially more of the databases BUT they charge a yearly fee so their revenue definitely isn't going down. Are you aware that Larry Ellison (majority owner of Oracle) is one of the richest men in the world and that that has happened in only the last 15 years on the back of relational databases?

FWIW, I am trying to create an alternative to relational databases and regardless of my above comment, I am not one of their fans.

Post SQL

Put it this way: it's like how we talk about "post PC." The PC is not going away by any means; it's just that the market is no longer growing and is beginning to contract a little bit.

It was only a few years ago that relational was the only realistic option for managing data, but today we have more realistic options. This gets into some weird cases where relational is the answer but some guy at a startup picks another technology because it was hot. But the opposite also occurs in the enterprise.

More like: Startups are not the answer

Looking at startups for LOGICAL design is almost always a disaster. Good, Fast, Cheap. Pick Two. Startups almost always are forced to pick Fast and Cheap.

Open source relational databases like MySQL, SQLite, Ingres, and PostgreSQL are all weak compared to Oracle and SQL Server. And Oracle and SQL Server are both weak compared to what they could be, and hopefully will get the chance to be 10 years from now. But what is really sad is that nobody is willing to build a new relational database from the ground up, unencumbered by SQL-92 and various ANSI idiosyncrasies. I guess Greg Morrisett's group at Harvard is an exception, since he and his merry men are looking to build formally verified relational database management systems.

Let's be clear: there are

Let's be clear: there are new problems out there to go along with the new solutions. It's not like old solutions have been deprecated for old problems. It just happens that the systems research area seems to be focusing on these new problems. The PL area talks like they are trying to solve the same problems, yet the solutions are old, so who knows?

As for companies, many startups are not solving old problems either. To judge these companies fairly, you have to look at what they are doing.

Is relational the problem?

MySQL started as good, fast and free, and then became (by far) the database used by more systems worldwide than any other. This wasn't because of more features or the fact that it was relational. It probably would have had the same result if it hadn't been relational, as long as it had the same easy PHP interface and was still free. Didn't this lightweight program get sold to Sun for a billion dollars?

There are many developers out there trying to make a better product than the current versions of Oracle and SQL Server but most of them think "relational" isn't as useful as other ideas (NoSQL Databases). What makes you think that strictly adhering to "relational" ideas will make for a better DBMS? What if less adherence to the relational model makes the system more useful?

A valid null hypothesis

I am all for the scientific method and creating a hypothesis and a null hypothesis.

Null Hypothesis: Quality data management has nothing to do with "relational" ideas and there is no benefit to using first-order predicate logic for querying data.

Alternate Hypothesis: Quality data management requires many tools on top of the relational model to be successful. For example:
(1) efficient extract-transform-load;
(2) automatic presentation of database information;
(3) schema refactoring tools to quickly adjust to changes in business requirements;
(4) tools to assist in Record Linkage (RL) [1] and data de-duplication;
(5) master data management to reason about geographic disbursement and sharding of data, related back to system performance;
(6) tools like SQLNexus to aid in automatic discovery of whole-data-stack hardware/software tuning issues;
(7) query generation via ORM or some other technique;
(8) the ability to "project" business rules from the data model to other tiers of the application;
(9) not simply a transactional DDL, but integrated schema evolution, possibly via guaranteed robust updates and rollbacks via reversible computations, or the lesser concept of mere incremental computation (independence and quiescence for upgrades).

Actually, a lot of these ideas are slowly converging. For example, techniques for data de-duplication are being widely applied to SANs now. It is just that you don't see this aspect presented to you.

I outlined on LtU a while back the ways in which we would do these tasks at my previous job (to explain why languages don't always or shouldn't always solve problems), and also outlined in a separate post when languages should be used to solve problems, linking a half dozen to a dozen GOLDEN MAN BABY quality research papers demonstrating timeless language architecture.

Caveat lector: MySQL most certainly did not start as good, fast and cheap. I repeat: good, fast, cheap, pick ONLY TWO. MySQL was fast and cheap. But it's not good, and never has been. That's why my friend EricB makes money hand over fist consulting for companies running into MySQL scaling issues; he made his living working as the right-hand man to the MySQL king at Yahoo for years, and then, when he saw the huge market for his rare knowledge, he started his own company.

Are we seriously forgetting that MySQL did not have any real foreign key / referential integrity until version 4.0, with InnoDB tables replacing MyISAM tables? Yes, FOUR versions. Oh, and what about correctly optimizing simple update statements to use outer loop joins? Honestly, I can't even remember all the things I hated about MySQL in college. You whippersnappers (think you) have it so good (but you are so wrong).

And I don't really understand you bringing money into this. It hurts your argument. SQL Server and Oracle are bloody expensive. Want partitioned indices? $25,000, please. Oh, that's per-socket licensing. You have two sockets. Wow, $50,000! Oh, you need to scale out? We have Data Center Edition; just do a handstand and we'll shake your pants to empty out whatever change you have left. Do you think if SQL Server Data Center Edition were free we would really see people using so many NoSQL instances? If so, why? Is it less to learn? Isn't it defeatist for an engineering profession to drop educational requirements altogether?

Just look at my list of 8 things above carefully. It is just scratching the surface of what I really want. I've actually got some decent experience here. [Edit: I will add a 9th thing. Query optimizers for the best SQL DBs are black boxes. So say I was HYPOTHETICALLY writing my own query generator to target SQL Server. How do I decide when to hoist or sink a common sub-expression in a query I am creating? Can a SAT solver figure it out easily, or would you use a different technique, and why/how? Now suppose I want to make the query generator a generic query compiler infrastructure and target different backends. How do I do that so that I can guarantee queries will always execute on all my targets and yield relatively explainable and consistent results? Don't bother attempting such problems with NoSQL right now, since the answer is brute-force typing and has nothing to do with math or computer science. All NoSQL research seems to be leaning towards FQL (Facebook Query Language), since that is where the research money seems to be. -- And it is only in the past YEAR that SQL Server has simplified its query API for handling common scenarios like pagination. You don't want to see/know how ORMs do pagination differently across DB2, MySQL, Ingres, PostgreSQL, Oracle, and various versions of Sybase/Microsoft SQL Server, or the gory details of lessons learned/won writing gradually less crappy database drivers with every iteration.]

[1] http://en.wikipedia.org/wiki/Record_linkage

MySQL, and so on

I concur on the quality of MySQL - not only has it never been good, it's not good now. Well, it *might* have suddenly got good, but I doubt it - from memory, the last time I ran my pathological cases [1] against it, in 2010 or so (against 5.1.<mumbledy> and 5.5.3), I still had 3 repeatable data-corruption issues and a couple of runaway queries that would eventually take the server down.

MySQL got its following not from being good, but from being free, and fast enough to run Slashdot. But you don't really want to let it loose on data you actually care about.

Getting back on track, I think we're more or less on the same page. Sure, there are lots of problems with "RDBMSs", but the problem isn't, as I see it, the relational model.

[1] Ironically enough, now lost to a disk failure. Data corruption eats tools for exposing data corruption.

Computing has evolved but the old applications have not

Enterprises are not happy with the limited capabilities of their information systems based on the relational model, e.g., SAP, IBM, Oracle, etc. But currently they are stuck with them.

This is not completely true:

This is not completely true: many enterprises are adopting non-relational systems, if not for their old applications then for their new ones. Especially as they move into the cloud, start dealing with bigger data, and so on. I have the feeling we are talking about different things, though.

At what scale should your model be implemented?

I would be interested to know what you think of the message passing in Erlang. Erlang and functional programming make function calls that return by using a stack.

It seems to me that your "Actor Model" was designed to be implemented at the CPU level, where messages would replace subroutine calls altogether. I recently watched a video in which you stated that "goto" wasn't the problem but "assignment" was. This sounds quite reminiscent of some of the ideas of functional programming.

With almost every company that uses a computer using a relational database to look after and manipulate data in a standardized manner, how does the "Actor Model" meet this obvious data need better than relational? For most companies, the scalability of relational is currently not a problem. Small-scale concurrency is also not a problem, as relational can effectively use up to 16 CPUs now. Computers with many more CPUs than that are special-purpose "parallel computers" rather than "concurrent" computers, and the current shared-memory model breaks down at about 16 CPUs. This makes it unlikely that current multi-CPU architectures will get more CPUs without a breakthrough in memory organization and bandwidth.

It is possible today to scale to more than 16 CPUs and to look after data by using distributed computing, but relational databases can already implement arbitrarily large distributed systems.

Unstructured assignment considered harmful

Dijkstra blamed the use of "goto" for spaghetti code. The unstructured use of assignment is a bigger problem in creating obscure dependencies and side effects.

Efficiency of Actor Model

In order for an Actor programming language to be competitive, it must perform message passing as fast as conventional languages do function calling and looping.
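
A crude way to see the gap in question, sketched in Python (the numbers are only illustrative, and a real actor runtime would not pay for a locking queue on every send):

    import timeit
    from queue import Queue

    def work(x):
        return x + 1

    q = Queue()

    def via_call():
        work(1)                 # conventional function call

    def via_message():
        q.put(1)                # one "message send" modeled as
        q.get()                 # an enqueue plus a dequeue

    print("call:   ", timeit.timeit(via_call, number=100_000))
    print("message:", timeit.timeit(via_message, number=100_000))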

Erlang versus Actor Model

The relationship of Erlang to the Actor Model is described in the proceedings of Inconsistency Robustness 2011 at:
http://robust11.org

It's obvious since forever that shared-memory doesn't scale.

We've known forever that there's a limit to the shared-memory model. The administrative overhead for threading, mutexes, etc. grows faster than compute power is added, and always has. And this says nothing of the intriguing and mysterious bugs which that model sets in the path of programmers and system security.

Moreover, shared-memory, despite several intriguing attempts to do it, has always scaled even worse across arbitrary-sized networks than it does on one machine. So I'm kind of "down" on putting shared-memory semantics directly into a language. I consider it to be something that is best introduced explicitly by a program transformation or parallelizing tool that doesn't make the careless mistakes or optimistic assumptions that flesh is heir to, and even then only if it can be shown to have a performance advantage over not using it.

For a good scalable, parallelizable computing system, I think communication between processes limited strictly to queues or ports/sockets, while slower at the sub-sixteen-or-so-cpu level, scales better and has a definite speed advantage beyond the range to which the shared-memory model scales.

Oh, wait, I'm talking about optimizing the heck out of I/O queues, aren't I? And that's why relational scales so well, isn't it? A relational back-end is essentially a bunch of processes that talk to each other via optimized I/O queues. And I think there are a lot of ways to transform nearly arbitrary programs into that paradigm.
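
A minimal sketch of that discipline in Python, assuming nothing beyond the standard library: the two processes share no memory at all and communicate strictly through queues.

    from multiprocessing import Process, Queue

    def worker(inbox, outbox):
        # No shared state: the only way in or out is a queue.
        for item in iter(inbox.get, None):   # None is the stop signal
            outbox.put(item * item)
        outbox.put(None)

    if __name__ == "__main__":
        inbox, outbox = Queue(), Queue()
        Process(target=worker, args=(inbox, outbox)).start()
        for i in range(5):
            inbox.put(i)
        inbox.put(None)
        print(list(iter(outbox.get, None)))  # [0, 1, 4, 9, 16]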

RDBMS do scale quite well.

The bigger problem with multi-cores using a single memory is the memory bandwidth. I agree that explicitly programming concurrency on a multi-core box is a problem but not if the details are looked after by the language.

I have never heard of "sharing memory" across "arbitrary-sized" networks. I have only used the term "shared memory" to mean CPUs sharing a single memory! With the only speed increase (for the foreseeable future) being more cores, concurrency isn't optional, and I agree it can't be left to the programmer. The only conclusion is to put it automatically in the language.

I implement queued messages between "green threads" that are as separate as possible, to increase performance and consistency, but that must be implemented on a shared-memory model, as that is what the operating system provides (if every thread isn't to be a full operating system process). I would think that systems that directly use/share the same memory and use locks would be substantially slower than my system, which has local memory for most tasks. I have done many tests and this is what they show me.

And I think there are a lot of ways to transform nearly arbitrary programs into that paradigm.

I couldn't agree more with your last sentence!

Coherent memory scales well on a chip

Coherent memory scales well on a chip. And so far it is extremely competitive on a circuit board.

What to do with 100+ cores per chip?

We are rapidly moving towards having 100+ general purpose cores per chip on our phones, tablets, notebooks, and servers. Of course, the number of chips varies :-)

But current software was not designed for this.

How are actors going to help

How are actors going to help there? Even the more flexible Intel Xeon Phi requires some amount of memory pipelining along with the SIMD to be used effectively. Actors solve one problem: dividing a computation into multiple concurrent tasks, but it turns out there are many more problems to solve after this.

Coprocessors becoming obsolete; streaming OK

Coprocessors like Intel Knights Corner that don't participate in coherent memory are becoming obsolete. However, streaming is still a good idea and Actors do it well.

It's not clear whether

It's not clear whether transparently coherent memory is going to win out yet or not. On the one hand, GPUs were able to be really fast without it; on the other, the memory hierarchy is making its way back into new hardware as we figure out what's useful and what's not.

I would like to see/read more work in this area. I get that actors are totally great at managing concurrent computations, but their role in high-performance computing is less clear: when, why, and how will they be the answer to our multi-core/distributed computing problems? To say "multi-core so...actors!" is a vacuous statement on its own. Why are actors being ignored at the high end of systems computing as a solution? Is there just a point where actors can't scale efficiently, or is there just more work to be done? Or are we just talking about different problems that actors are not designed to solve?

Coherent memory will win

Coherent memory will win because otherwise programming is difficult.

Actors will win because they facilitate performant programming. But it takes some time to bake the software :-(

Not clear

Coherent yes, but not necessarily those with strong consistency, because power costs money.

The only thing keeping Intel processors afloat in big data centers is the fact that concurrent C/C++ code written for Intel doesn't port to ARM worth a darn. The issue is the difference in the memory consistency models, and given the absence of semantics in C, it can't be solved in the compiler.

But ARM is working hard on a low-overhead solution, and other languages don't have this problem in the way that C/C++ do. It seems unlikely that the strongly consistent memory model will prevail for too much longer - there are just too many efforts underway that might eliminate the requirement, and weakly consistent hardware is significantly cheaper to operate at large data center scales.

Arm has no power advantage over Intel

Arm has admitted that their processor has no power advantage over Intel. Instead they claim that they have an infrastructure and consequently a cost advantage.

Coherent memory uses less power because fewer instructions are required in an application.

Relational models scale.

The relational model, and even more expressive logic programming, can scale well enough. We can address partitioning of data, distances/latencies, security, and the potential for pipelines or streaming. Cf. the Bloom language, which is based on Dedalus, a temporal variant of Datalog.

SQL was not designed for modular, distributed systems. It is incorrect to attribute the limits of SQL or traditional RDBMSs to the relational model.

The usual difficulty with large-scale relational systems is indexing the views that cross multiple data sources. Fortunately, there are ways to tackle this problem, e.g. by using subscriptions or reactive models to maintain indexes, and by using common storage (whether DHTs or a trusted third party) to track the indexes (goal: to avoid replication without compromising security). I discussed such reactive indexing in an article a while back.
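
A minimal sketch of the subscription idea in Python (hypothetical names, not the design from that article): each data source publishes its changes, and the cross-source index is a subscriber that updates incrementally instead of being rebuilt.

    class Source:
        def __init__(self):
            self.subscribers = []

        def publish(self, key, value):
            for notify in self.subscribers:
                notify(key, value)

    class ReactiveIndex:
        # An index over several sources, maintained by subscription.
        def __init__(self, sources):
            self.entries = {}
            for source in sources:
                source.subscribers.append(self.on_change)

        def on_change(self, key, value):
            # Incremental: only the changed entry is touched.
            self.entries[key] = value

    a, b = Source(), Source()
    index = ReactiveIndex([a, b])
    a.publish("alice", {"dept": "eng"})
    b.publish("bob", {"dept": "sales"})
    print(index.entries)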

These days I'm interested in stable reactive indexes, i.e. the sort achievable by dynamic decentralized anytime hierarchical clustering.

I think that it comes down to compilers.

The relational paradigm is efficient and scales well to near-arbitrary sizes. This is good.

Other paradigms are more expressive. This is also good.

But, for most problems that can be solved in a general-purpose programming language, there is also a program/schema for a relational paradigm that solves the problem.

Our pain, IMO, is because we don't know how to find it.

If we had some kind of magic compiler capable of arbitrary transformation and optimization (which is an AI-complete problem with superexponential complexity, so obviously we can't have it), then we could take a program written in whatever paradigm and express it in a way that allows a relationally structured back-end to run it distributed across zillions of CPUs.

We don't have the "magic compiler" that obviates all language performance questions, but we do have an interesting design space on the lower slopes of that unclimbable mountain. How can we present the programmers with a maximally expressive language or interface while presenting the back-end with a tractably scalable expression of the programmer's intent? I don't know of any work in compilers (except for a few proprietary SQL extensions that don't have corresponding papers published) that has been done on the code transformation problem involved here.

Anyway, if you're going to implement a language for this niche, you have to solve some compiler problems that there isn't a decent body of theory for solving yet.

It all comes down to

It all comes down to optimizing data flow graphs and execution plans; e.g. check out DryadLINQ or FlumeJava. The benefits are obtained either (1) through a restricted high-level declarative language (which, ironically, often looks like SQL), or (2) through staged computation.
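
A toy version of the staging idea in Python (nothing like DryadLINQ's actual machinery): the program first builds a plan, an optimizer fuses adjacent map stages so the data is traversed once, and only then does execution touch the data.

    def optimize(plan):
        # Fuse consecutive map stages into a single pass.
        fused = []
        for op, fn in plan:
            if op == "map" and fused and fused[-1][0] == "map":
                g = fused.pop()[1]
                fused.append(("map", lambda x, f=fn, g=g: f(g(x))))
            else:
                fused.append((op, fn))
        return fused

    def execute(plan, data):
        for op, fn in optimize(plan):
            data = map(fn, data) if op == "map" else filter(fn, data)
        return list(data)

    plan = [("map", lambda x: x + 1),
            ("map", lambda x: x * 2),
            ("filter", lambda x: x > 4)]
    print(execute(plan, range(5)))  # [6, 8, 10]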

The systems field is very much into solving these problems, but they take very pragmatic approaches to doing so.

SQL DBMSs are not RDBMSs and SQL is not a relational language

I have gone through the comments here, and most seem to not understand that SQL DBMSs (such as Oracle and SQL Server and their ilk) are not RDBMSs. Neither is SQL a relational database language.

There are forums that are discussing relational theory, RDBMSs, and their implementations. There are still many areas within the relational database arena that have not yet been solved to everyone's satisfaction, view updating being one of those hot topics.

I have dealt with this stuff for 25 years or so, and people who state that relational theory and relational databases have been superseded all appear to assume that Oracle, IBM, Microsoft, etc. have actually created RDBMSs. When looking at the technologies proposed for these "new" systems, they all appear to be based on ideas and methodologies that were being taught 30-40 years ago.

If I recall my history correctly, BS12 from IBM was one of the first true RDBMSs. Another contender has been Dataphor, which unfortunately decided to include null for compatibility with the SQL DBMSs that it could interface with. There are a variety of other (non-commercial) implementations that implement relational theory.

As with everything, relational database theory has its place and uses. There are processes and situations where there are better techniques to use. But to say that it has been superseded by a different technology is demonstrating a lack of understanding of the range of tools that we have at our disposal.

To a more concrete example, you don't use a hammer to drive in a screw nor do you use a screwdriver to pound in a nail. Each tool has an appropriate use. As do screws and nails as means of fastening. There are other alternatives as well.

Hence in the database arena, different database technologies give different advantages in specific areas, but will be less efficient in others.

What I find is that many people wanting to solve database problems using alternative technologies to relational database theory don't have any real understanding of relational database theory. I have seen this in commercial software houses where they have developed their own database engine on top of an already existing DBMS. I have seen this in ERP systems and in data warehouses and in in-house developed database systems.

There are many brilliant people out there, but like a lot of brilliant people, they have their blind spots and are quite often very specialised, with no idea about other areas that are not their immediate speciality. Many of these people run rings around me in their specialities but are more ignorant than I am outside of their speciality.

Nulls and ordered multisets

Nulls and ordered multisets (tables) make SQL not strictly relational. But I don't believe they're the limiting factors on scalability. It's the other things - transactions, table namespaces, constraints on indexing, weak support for spatial-temporal joins (using order comparisons), etc. that limit SQL.

Scalability is a separate question to being relational

Whether or not something can be scaled from the small to the large is more related to the implementation that is used. The other matters are IMHO not related to scalability; they can affect it only through the implementation decisions made.

The major problem with SQL is that it is inconsistent. The language is poorly designed, doesn't follow relational algebra, and each company handles extensions (such as transitive closure) in different ways. The simple fact that all commercial implementations have limitations on the applicable key size, and by default do not make the entire row the key, means that duplicate records can (and do) occur in non-keyed tables. This leads to subtle bugs in queries.

Inconsistent treatment of nulls causes other subtle bugs. You as the programmer have to do extensive analysis to ensure that these kinds of things are not going to bite you when you least expect it.
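
Both traps are easy to reproduce. A small Python/sqlite3 session (any SQL engine behaves much the same) showing the duplicate-row problem and a null comparison silently dropping a row:

    import sqlite3

    cur = sqlite3.connect(":memory:").cursor()

    # No key declared, so SQL happily stores the same row twice.
    cur.execute("CREATE TABLE payments (who TEXT, amount INT)")
    cur.execute("INSERT INTO payments VALUES ('alice', 100)")
    cur.execute("INSERT INTO payments VALUES ('alice', 100)")
    print(cur.execute("SELECT COUNT(*) FROM payments").fetchone())  # (2,)

    # Three-valued logic: NULL = 100 and NULL <> 100 are both unknown,
    # so bob's row is invisible to a predicate that looks exhaustive.
    cur.execute("INSERT INTO payments VALUES ('bob', NULL)")
    print(cur.execute(
        "SELECT COUNT(*) FROM payments WHERE amount = 100 OR amount <> 100"
    ).fetchone())  # (2,) -- bob's row is missing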

There is so much wrong with SQL, but like COBOL it will remain with us till computers are no more.

As you can see I am no fan of SQL, even though I have been using it for a couple of decayeds. (Pun intended).

Interface, not Implementation

Implementation does determine our ability to scale a system merely by throwing more hardware at it. But interface determines both our theoretical ability to implement such a system, and our practical ability to transparently adjust an implementation to make it more scalable. Interface (or architecture), not implementation, ultimately determines scalability.

I acknowledge that SQL has a lot of problems, but those you're describing don't seem very related to the issues Carl Hewitt seems to be addressing. If we used a modified SQL' that has no NULLs and uses proper relations, how would this address any of Dr. Hewitt's concerns? (I think it wouldn't.)

Dr. Hewitt's arguments are against the architectures of (pseudo-)relational model systems in common use today, as exhibited in SQL - the instantaneous communication through transactions, the centralization of data and authority, etc. In this context, the matters you call "not related" are highly related. (Note: I'm not agreeing with his arguments.)

Different problem/solution domains

A quick read of his page suggests that he is looking at a problem that is not relational in basis.

His example: if Alice points her finger at Bob and charges, “I accuse you of harassing me,” and Bob retorts to Alice, “I deny it!” then the mutual co-reference of language and gesture must be expressible.

This is not a relation per se. It could well be put into a form that is relational, but it is a set of opinions/accusations and associated actions. This is not a relational database problem. His use of actors may well be appropriate, I've not the experience here to say either way.

From his document, he does not appear to be distinguishing between different problem spaces and the possible solution spaces that are most appropriate for the problem space.

To say that RDBMS (or even SQL DBMS) is obsolete because it doesn't fit this particular problem is short-sighted at least. I wouldn't use a RDBMS for this particular problem unless the problem was restated.

For this problem, if we are storing accusations and responses, then we would need to set up the appropriate relations, because we can now say we have true facts to store.

If the implementation bleeds through the interface, then we have a problem. But a good interface shouldn't do this and hence should not be the final arbiter on what theoretical ability we have to implement such a system. The interface should be simply this is the service I want without regard to any of the various means by which we may find that we can solve the specific problem.

If the language gives me matrix multiplication as a facility, I don't care how it is implemented as long as it comes back in a timely fashion. So if the implementer uses multiple cores, a single core or a graphics processor (or anything else that takes your fancy) to get me the results, as long as the results are correct, I shouldn't need to care how it is done.

My original comment was related to the statement of obsolescence of RDBMSs or SQL DBMSs. As I said earlier, his problem (in the form stated) is not relational; hence using an RDBMS or SQL DBMS is not necessarily appropriate as a solution. His problem/solution is the screws/hammer situation. Use the right tool. If you are using screws, a hammer is the wrong tool. However, the hammer is not obsolete; it is just not appropriate.

An aside: most people do not seem to get the idea that there is a great difference between the logical model used in the design of a relational database and the physical model that the RDBMS uses to implement that model. In a language context, one writes in the language given, but the compiler will translate this language into the necessary machine code for the machine to execute (whether that machine is an interpreter or some silicon-based monstrosity).

It is interesting that in the database realm we have had foisted onto us the necessity of having to think about and manage the physical store related to the databases we design, instead of leaving this up to the (R)DBMS to manage for us. We do it with languages and compilers, but we don't seem to have managed to migrate this knowledge to the database world.

If the implementation bleeds

If the implementation bleeds through the interface, then we have a problem. But a good interface shouldn't do this and hence should not be the final arbiter on what theoretical ability we have to implement such a system.

I believe you have that second sentence backwards.

A good implementation shouldn't bleed through the interface. Hence, the interface is a constraint or arbiter of theoretical ability to implement systems (without bleed through).

(I do agree that some problems don't easily fit into relational. I also agree that such expressiveness issues don't render relational obsolete, e.g. due to the rule of least power.)

Google would disagree!

http://en.wikipedia.org/wiki/Spanner_(database)

This link says that Google's Spanner is an RDBMS and that F1 is an SQL DBMS built on top of Spanner. Are you right, or is Google?

Your definitions might be technically correct, BUT to most other people the word "relational" is used in conjunction with the Oracle, Microsoft and MySQL implementations, which are mostly relational and contain SQL.

I have been working with databases (and creating them) on microcomputers for as long as there have been databases and microcomputers.

I am working on a language/database system that is definitely not relational but fully includes SQL. Many people (including me) believe that mathematical purity in software isn't a test of good design. What matters is that it works and it is useful. The point isn't to make the software fit the mathematical paradigm. Some people (outside of actual developers) are ignorant about what developers know or don't.

Spanner is only semi

Spanner is only semi-relational -- its rows must be named. From the OSDI paper:

Many applications at Google have chosen to use Megastore [5] because of its semi-relational data model and support for synchronous replication, despite its relatively poor write throughput. As a consequence, Spanner has evolved from a Bigtable-like versioned key-value store into a temporal multi-version database.

And:

Spanner’s data model is not purely relational, in that rows must have names. More precisely, every table is required to have an ordered set of one or more primary-key columns. This requirement is where Spanner still looks like a key-value store: the primary keys form the name for a row, and each table defines a mapping from the primary-key columns to the non-primary-key columns.

Relational model requires

The relational model requires every table to have a primary key; i.e., proper relations are sets. Therefore at least the full 'row' must form a primary key. SQL's failure to enforce this is one reason that SQL is considered only pseudo-relational.

With that in mind, I am confused that Spanner claims to be 'semi-relational' on that basis. At least with regards to this property, Spanner seems more relational than SQL.

Confusion appears to be at Google

Whoever wrote the article seems to not understand what relational means.

Rows have names.

You are probably right that Spanner isn't "purely relational" from a math point of view, BUT try to write an application where you can't uniquely identify a row! I add a unique field with a common name to every relational table I create, so that my meta code will work with any table.

The "un-ordered" aspect of a set has no useful purpose on computers. Any code that will work on un-ordered anything can always work on ordered data. Requiring order isn't what I am talking about.

Having unique access to any row guaranteed is routinely programmed into existing RDBMS systems for very practical reasons. Obviously the developers of this DBMS actually use databases to solve real problems.

Does not make them right.

Understanding how and why something works and is useful is a very important tool in any craftsman's toolbox. Not understanding but using any technique will only get you so far.

Understanding the different types of tools and their application allows one to determine a better solution for the problem at hand. Mathematical purity in software is not necessarily a test of good design, but it does help determine if a design has particular kinds of flaws that will come back and bite you. An example here is the different kinds of sorting available to us.

Mathematical purity can lead to insights that mean the tools we use are simpler, more consistent in their use and design. It can also lead to the development of much more complex systems that are understandable. Graphics, communications, database, encryption, etc, all depend on mathematics. Irrespective of whether or not the system is good, bad or ugly.

Too many people (developers) are ignorant about techniques and knowledge outside of their specific work area. The problem with them is that they know what they know and are not interested in learning more about the basics involved, whether this be the mathematics behind the techniques used or variations in techniques that come up with different programming styles/paradigms/languages.

In effect they are not able to think outside the box, which is a major problem in our industry. We all have blind spots, but we can mitigate these by continuing to extend our knowledge, whether this is in our field of study or outside it.

I suspect (from my many discussions over the years) that many are not interested in the history of programming. My personal belief here is that this is why many of "the next big things" are just a variational rehash of techniques that were developed, tested and dropped decades ago. In some cases, we have had to wait till now to have the hardware and infrastructure to allow us to put some of these things into practice. The ideas themselves are old.

My own library contains works on computer and communication networks, compilers, languages, OOP/OOD, simulation, AI, cryptography, graphics, digital control, digital and logical design, operating systems, engineering mathematics, statistics, engineering handbooks, machine design, calculus, combinatorics, geometry, numerical analysis, fluid dynamics, thermodynamics, chemistry, electrical machines, electrodynamics, solar energy, air conditioning and refrigeration, analog/digital circuits, hazardous materials, woodworking, agriculture, and more besides. The point here is that this has in part allowed me to solve problems that have come my way because I have been encouraged over the years to extend my frontiers.

There are others who make my library look small and they have demonstrated a much greater ability to think outside the box when solving problems. By thinking outside the box, we may not get it right but it can lead others to look at things in a new light and get their aha moment.

I don't think it's right to

I don't think it's right to accuse the systems community of not thinking outside of the box. Yes, they are much closer to real, actual problems, but that doesn't make their thinking any less creative. All of the authors I recognize on the Spanner papers have PhDs and are well versed in their maths. One has even been jokingly cast as the Chuck Norris of computer science (and is also a very strong PL researcher in the UW style).

Mathematical purity has little intrinsic value; sometimes it is useful in understanding a problem better, but so are many techniques we have in our toolbox. Out-of-the-box lateral thinking is not strongly associated with pure mathematics, although no doubt it is applied in math. Such thinking often involves applying existing techniques to new problems (e.g. MapReduce and big data).

Following the herd

I am not saying that they are not creative. I have seen much creativity over the years. But it has been within the confines of current thinking; it is in the training.

I have had colleagues that saw things in a different light and turned things upside down, inside out and achieved what was needed where others couldn't even get the stuff working.

I have worked with others for whom a different approach is just not on the cards (this has been the majority). Still very talented people.

In regards to the use of mathematics, I have also seen the same problems.

To say that mathematical purity has little intrinsic value is, I think, to undervalue what it allows. I am not saying that out-of-the-box lateral thinking is strongly associated with pure mathematics; what I am saying is that both are highly desirable.

And on the whole, over the years, I have found that thinking outside the box and/or using mathematics in your programming/design/development has been looked upon with suspicion and has generally been considered useless/not applicable to the problem at hand.

All of us suffer from "the rut" problem at various stages. We look at a problem and we can't see the forest for the trees.

All developers the same?

Understanding how and why something works and is useful is a very important tool in any craftsman's toolbox. Not understanding but using any technique will only get you so far.

Understanding how and why is not just important for a "craftsman" but for any designer or CS professional. Why assume that developers don't "understand" the code or algorithms they use? The set of all useful algorithms is much bigger than the set of mathematical ideas, but totally includes the latter as well.

Mathematical purity in software is not necessarily a test of good design, but it does help determine if a design has particular kinds of flaws that will come back and bite you. An example here is the different kinds of sorting available to us.

You imply that none Mathematical design is somehow inferior and my long experience says otherwise. Sorting is one of those things that is well defined (unlike most problems) and if you want clarification just reference Dr Knuth.

Mathematical purity can lead to insights that mean the tools we use are simpler, more consistent in their use and design. It can also lead to the development of much more complex systems that are understandable. Graphics, communications, database, encryption, etc, all depend on mathematics. Irrespective of whether or not the system is good, bad or ugly.

Math has no monopoly on "purity", "simpler", "consistent" or "understandable". All good attributes and where Math provides those ideas, why wouldn't anyone want to use them? Database depends on Math? Have you developed a database? Have you read all the Math in the source code for PostgreSQL and found any good ideas that didn't come from Mathematics?

Too many people (developers) are ignorant about techniques and knowledge outside of their specific work area. The problem with them is that they know what they know and are not interested in learning more about the basics involved, whether this be the mathematics behind the techniques used or variations in techniques that come up with different programming styles/paradigms/languages.

Too many "non developers" have no idea what developers know or don't! Most developers (including me) learn everyday and will throw out code, in a heart beat, that they have used for decades if something better comes along. Developers aren't a homogeneous group so why the drivel?

In effect they are not able to think outside the box, which is a major problem in our industry. We all have blind spots, but we can mitigate these by continuing to extend our knowledge, whether this is in our field of study or outside it.

So thinking "outside the box" means using Math ideas only?

My own library contains works on computer and communication networks, compilers, languages, OOP/OOD, simulation, AI, cryptography, ... The point here is that this has in part allowed me to solve problems that have come my way because I have been encouraged over the years to extend my frontiers.

I'm just an "old timer" and use the Internet!

There are others who make my library look small and they have demonstrated a much greater ability to think outside the box when solving problems. By thinking outside the box, we may not get it right but it can lead others to look at things in a new light and get their aha moment.

The set of good ideas is bigger than books or Math. Vast experience also helps!

Many tarred with the same brushes!!

Why assume that developers don't "understand" the code or algorithm they use?

From watching many who just copy and paste code they find, or use the simplest algorithm they know/find. And from discussing with them why they have chosen to solve a problem in the way they did.

You imply that none [sic] Mathematical design is somehow inferior and my long experience says otherwise.

I am assuming you mean "no one" here. No such implication was made. In the sorting example, if you have 20 items to sort, use a simple algorithm (the cheapest method); if you have 20 million items to sort, you had better choose the right algorithm or it will take a long time. If you have 20 billion, it is imperative to choose the appropriate algorithm or you will be looking at a space/time explosion. If the quantity of items varies between 20 and 20 billion, it might be advantageous to incorporate different algorithms in your code depending on the size of the sort. The user will still see only one interface, as sketched below.
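
A sketch of that single-interface, size-dispatched idea in Python (the cutoff is made up; real libraries tune it empirically):

    def insertion_sort(items):
        # Quadratic, but very cheap for tiny inputs.
        for i in range(1, len(items)):
            j, key = i, items[i]
            while j > 0 and items[j - 1] > key:
                items[j] = items[j - 1]
                j -= 1
            items[j] = key
        return items

    def smart_sort(items):
        # One interface; the algorithm is chosen by input size.
        items = list(items)
        if len(items) <= 32:         # hypothetical cutoff
            return insertion_sort(items)
        return sorted(items)         # O(n log n) for everything else

    print(smart_sort([3, 1, 2]))     # [1, 2, 3]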

Math has no monopoly on ... ..., why wouldn't anyone want to use them?

Agreed, mathematics doesn't have any such monopoly. But there are many who won't use such ideas for a very simple reason: they are afraid they won't understand them, and what they fear they don't use. This is my experience of people. I have spent about 25 years designing, developing and maintaining databases. I have not read the source code for PostgreSQL, so I cannot at this point in time say anything about what you can find in it, even though I use it as a tool.

Most developers (including me) learn everyday and will throw out code, in a heart beat, that they have used for decades if something better comes along. Developers aren't a homogeneous group so why the drivel?

I could wish that were true, but I think it is a much, much smaller group that does this than you think. You're right, they are not a homogeneous group. However, many that I have met, worked with, talked with, would be happier doing something more exciting.

So thinking "outside the box" means using Math ideas only?

As I said in a previous comment, they are two different things; both are very advantageous to use.

I'm just an "old timer" and use the Internet!

I have much material that is either not found on the internet or is extremely difficult to find. I also have much material that I have found on the internet. It is just one source of information; there are many others.

The set of good ideas is bigger than books or Math. Vast experience also helps!

The set of bad ideas is even bigger. None of us have lived long enough to have had vast experience, which is why we rely so much on the experiences of others, all those many lifetimes.

Inconsistency non-robustness considered harmful -- not!

Carl: congratulations on continuing to use CACM as 'free advertising' for your International Society. FYI, I won't be attending next year's symposium.

Your letter says "all capabilities of current relational databases must continue to be implemented."

Yes: a payroll run must transfer my month's salary to my bank account; Airline ticketing must both take money out of my account and issue me a ticket. And each pair of transactions must be recorded simultaneously, not sometime/eventually.

To be able to determine consistency of databases, we need sharply-delineated semantics. The Relational Model is based in predicate logic, which provides crisp semantics over structured data.

To be able to determine the semantics of the Bob-and-Alice accusations, we need structured data, including the ability to capture "mutual co-reference". By an astonishing coincidence, the Relational Model has a mechanism to support such relations between datums. Even SQL's foreign key mechanism (poor as it is for these purposes) uses the keyword REFERENCES.

It's like Codd saw you coming over 40 years ago.

And please tell me which banks are recording their transactions using an Actor model. I'll make sure _not_ to place my account with them.

Carl misses the point, but he's still right

Sean's comment that systems people wouldn't look at things this way is dead on. Since I happen to be a systems person, a language person, and a capabilities person it's hard not to chime in here. :-)

While Carl identifies some requirements that SQL/RDBMS systems aren't meeting, he misses other key limitations of the relational model in the context of modern applications, and his focus seems to be on language issues rather than end-to-end issues.

In an era of highly distributed database systems, language requirements have undeniably changed, and issues of language security become important. Carl is entirely right about that. Frankly, the best current language for this would be JavaScript, if only it had proper support for integers. Not because JavaScript is wonderful, but because the security improvements in ECMAScript 5 coupled with the capability-based computation work in Caja are so compelling for distributed secure computing. In fact, Caja coupled with Mark Miller's work on CapTP satisfies essentially all of the objectives that Carl cites.

But Carl is missing the same thing that the key/value "cool kids" are missing: there is a reason that SQL is a mostly declarative language. The cool kids, for the most part, aren't old enough to remember what was commonly known as the "CODASYL wars", and mostly don't know how to read the pre-web literature. Carl doesn't have that excuse. The reason for CODASYL losing to [pre-]SQL, in large measure, is that the database system inherently knows a lot more about how the data is organized than the program can. The proposition that the program should manage and be aware of data placement explicitly has been tested before, and the reasons it failed haven't changed.

As much as we need a secure distributed language in order to bring big data solutions to mid-sized companies, the most critical problem currently facing the database world is changing integrity requirements and a desperate need to evolve data models and schema models.

And that has implications for database languages. We clearly need to be able to specify general analytics, and that requires a more general language than SQL. We need those applications to run on distributed shared resources even when some of the queries are hostile (because shared computational resources imply the possibility of hostile users). The de facto standard for that is now Java, and it isn't meeting our needs. The actor model, and specifically Caja/CapTP are really good for those issues. That said, the data processing model needs to retain a declarative data query mechanism for performance and scaling reasons, and there are overall restrictions on processing flow that have to be preserved if the notion of transactions is to survive. Finally, the database content model needs to evolve out of the limits of an accounting worldview, if only because accounting is now a small minority of database applications.

What Carl doesn't say is that the actor isolation and concurrency properties make the actor model a very good foundation for the type of declarative-driven, widely distributed, shared resource processing that we need to be building.

Sometimes a good foundation is, well, a good foundation. :-)

There are several reasons

There are several reasons why key-value stores have become more popular:

  • They're simpler.
  • Traditional SQL databases impose large overhead on simple operations.
  • Existing SQL databases were much harder to scale to multiple machines.
  • For performance reasons you can't do general queries anyway.
  • The application controls how data is stored, so that certain queries become more efficient.
  • The key-value data model maps more easily to OO data models.

None of these are inherent in having a declarative language, so I'd expect more declarative databases to return, moving some of the work that is done in the application layer with key-value stores back into the database. However, I'd expect the focus to change from declarative ad hoc queries to declaratively specified indices and incrementally updated views on the data, so that certain queries become efficient, plus what you say (a general-purpose language for analytics).
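
A sketch of what "declaratively specified indices" might look like over a key-value store, in Python with invented names: the application declares what to index, and the store maintains the index incrementally on every write (deletes and overwrites are omitted for brevity).

    from collections import defaultdict

    class KVStore:
        def __init__(self):
            self.data = {}
            self.indices = {}    # name -> (key function, buckets)

        def declare_index(self, name, key_fn):
            # Declarative: the application says *what* to index, not how.
            self.indices[name] = (key_fn, defaultdict(set))

        def put(self, key, value):
            self.data[key] = value
            # Incremental maintenance: only the new entry is indexed.
            for key_fn, buckets in self.indices.values():
                buckets[key_fn(value)].add(key)

        def lookup(self, name, index_key):
            key_fn, buckets = self.indices[name]
            return {k: self.data[k] for k in buckets[index_key]}

    store = KVStore()
    store.declare_index("by_dept", lambda v: v["dept"])
    store.put("alice", {"dept": "eng"})
    store.put("bob", {"dept": "eng"})
    print(store.lookup("by_dept", "eng"))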

Actor model has no relationship to RDBMS.

... and his focus seems to be is on language issues rather than end-to-end issues.

Why are we confusing the "Actor Model" and message passing? If you look at Carl's comments on this thread, you will see that he sees his model being used at the CPU level: simple one-directional (no queue) messages instead of CPU return calls on the stack. This isn't the same as message passing in general or current implementations in PLs like Smalltalk, etc. If I have mischaracterized Mr Hewitt's model, I would humbly retract whatever is incorrect.

I agree that message passing (in general) is conducive to concurrent processing but I don't agree with many of the details and the very low level of the Actor Model for current computers.

I don't see how the Actor Model has anything to do with an RDBMS or SQL. Where is the macro-level organization, global security and implementation detail? The Actor Model (as far as I can see) is a very general but very low-level group of axioms that should replace the von Neumann view of computing that is embedded in our CPUs. A DBMS is an application in comparison, and could be implemented like any other application using the Actor Model. Mr Hewitt has said that assignment is a bad thing (like functional advocates do), which means having somebody else (an RDBMS) store the problem state for the functional language. If this is correct, then how could the Actor Model replace an RDBMS?

The reason for CODASYL losing to [pre-]SQL, in large measure, is that the database system inherently knows a lot more about how the data is organized than the program can.

My recollection of why RDBMS won the database wars:

  1. It was multi-user right out of the box.
  2. It was simple in that it used a flat table.
  3. The schema was standardized and available without programming.
  4. SQL could be used directly or by a program without custom programming.

Please note, I didn't say "relational" at all. Although I agree that the "declarative" nature of SQL is a good thing, you say it without saying why. I think SQL's performance on large numbers of small commands, which it has to parse, is horrible. In other cases the database can make a good plan/program that is efficient, but not in all cases (some queries would be better handled if they were programmed explicitly, at a high level). My problem is that current RDBMSs don't give you any good options when this declarative, translated-to-an-internal-program scheme doesn't work. I want it both ways, with security and consistency.

The biggest advantage that SQL has over all other query languages is that it is taught to every CS graduate and it is used everywhere. Neither of these attributes makes it better at manipulating data than the other options, but they are significant advantages.

The proposition that the program should manage and be aware of data placement explicitly has been tested before, and the reasons it failed haven't changed.

  • I think the programmer knows a lot more about his data than any program (including an RDBMS) can.
  • Letting the programmer do whatever with that knowledge isn't necessarily good, however.
  • Allowing a DSL like SQL to access data and also allowing a programmer to directly work with that same data hasn't been tried in a full language/database, to the best of my knowledge.
  • The fact that many programmers use stored procedures and triggers would suggest that programming and data aren't actually incompatible. The current use of both causes other problems, however, like having your code in at least 2 languages and normally on different computers.

Finally, the database content model needs to evolve out of the limits of an accounting worldview, if only because accounting is now a small minority of database applications.

Do you have data to back this up or is it just an opinion? I have only an opinion but I think business/accounting still pays most of the bills.

The fact that many

The fact that many programmers use stored procedures and triggers would suggest that programming and data aren't actually incompatible. The current use of both causes other problems, however, like having your code in at least 2 languages and normally on different computers.

Very astute, most people think the N+1 Schema Problem is solved solely through database design, but other "schemas" trickle into a data model via external applications writing their own data access layer and business rules.

I had actually used a database migration framework for a while that would automatically apply (correct) SQL database migrations given input in terms of relations and predicates (business rules). It was pretty cool. It mainly took advantage of the fact that the backend it targeted was PostgreSQL, which can handle insane amounts of business rules encoded in INSTEAD OF triggers due to how it implements rollbacks and writes. Rollbacks are very cheap in PostgreSQL, so "projecting" business rules into the middle tier (as is done in most large-scale applications for so-called "performance reasons") isn't needed for small to medium databases with a single data center. A very clever idea, and I remember thinking the person who designed it was a genius, and never understood why it didn't take off. But it basically completely overruled SQL DDL and required you to use a custom DDL, because that is how it generated correct triggers.

Bringing this back to Carl's article, though: Carl is neither right nor wrong, but taking a philosophical stance on human understanding and the theory of organizations. Carl says, in my own words, that it is impossible to avoid N+1 Schema problems, so it is best to prepare for them through a logic that can accommodate the ensuing madness. I have no problem with Carl's conclusions, if I understand him correctly. I object to his hidden premise. This is basically a GOTO debate writ large.

Why have schemas?

Why do data bases have schemas? Can a given piece of information have more than one schema? Can schemas be inconsistent?

Glaucon, is it true?

Glaucon, is it true? I just saw Adeimantus, and he said that by answering all of Socrates’ questions you single-handedly designed the Republic?

I hear this is a popular book with your colleagues at MIT.

Glaucon, has the world changed?

Socrates says the world has changed since Kent's screed first appeared way back in 1978. What do you think of newfangled things like Inconsistency Robustness?

"If you look at Carl's

"If you look at Carl's comments on this thread, you will see that he sees his model being used at the CPU level."

And what on earth does that have to do with "Relational Model Outgrown"?

Many-core is just one fundamental way the world has changed

Having many cores on a chip is just one fundamental way that the world has changed since the Relational Model was invented. Another fundamental change is multi-gigabit networking.

China starting to embrace

China starting to embrace capitalism is another fundamental way the world has changed since the relational model was invented. So what.

Did the relational model have "operating conditions" attached saying that it was only suitable for network speeds up to a few Kbit? Or did it not have any such "operating conditions"?

What Carl doesn't say is

What Carl doesn't say is that the actor isolation and concurrency properties make the actor model a very good foundation for the type of declarative-driven, widely distributed, shared resource processing that we need to be building.

This statement perplexes me, although I am sure you would say the same for stuff I write!

Bookkeeping was among the earliest motivations for the Actor Model. I actually wrote a long Programmers Stack Exchange answer about the evils of misusing direct assignment. I did not mention the Actor Model, but many of the same teaching points apply. You can read it here: What did Alan Kay mean by “assignment” in The Early History of Smalltalk? Unfortunately, I did not relate it to the Actor Model, including examples of how protecting the address space gives you capabilities for free, and the advantages of that for secure financial computation.

Carl may feel free to disagree, but I think one of the best examples of the Actor Model is how he and his students modeled serialization and guardians. His office printer network example in AI-505: Specifying and Proving Properties of Guardians for Distributed Systems is a particularly good one.

That said, many things revolve around bookkeeping.

Major Challenge: Integrating declaratives and imperatives

A major challenge is integrating declaratives and imperatives. Of course, both are needed. Actors provide a foundation for integrating declaratives and imperatives in large-scale systems. The Actor Model also provides abstractions so that neither programs nor programmers have to be aware of the physical details of how declaratives and imperatives are implemented. Of course, security, modularity, and performance are important concerns.

What are the prospects for

What are the prospects for developing a declarative query language (SQL minus SQL's warts) and a companion declarative query *plan* language?

It seems what we want isn't imperative, really; it's just that we occasionally want explicit control over the query plan, for cases where the optimizer needs help or where we want to explicitly weaken the semantics and use caches or probabilistic indexes, etc.

Obviously whether a user-supplied query plan matches a user-supplied query is undecidable, but we might be able to do pretty well automatically for practical cases, and handle more challenging cases with a theorem prover.

In this model, you would be able to write the query as you do in existing RDBMSs. If the optimizer does the right thing, you win. If the optimizer doesn't do the right thing, then instead of shoehorning "hints" into the query language (weak/brittle) or moving to a different tool, you take control of the query plan directly. The compiler does its best to verify that the plan you supplied matches the semantics (and security properties) of the query it was intended to encode, because you already supplied a separate declarative specification of that.
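To make the shape of the idea concrete, a sketch in entirely hypothetical syntax (the tables, the index, and the whole PLAN construct are invented; no SQL dialect I know of has this):

-- the declarative specification, exactly as today
SELECT c.name, sum(o.amount) AS total
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE c.region = 'EMEA'
GROUP BY c.name
-- hypothetical companion plan, checked by the compiler against the query above
PLAN (
  INDEX SCAN customers USING customers_region_idx,
  HASH JOIN orders ON (o.customer_id = c.id),
  HASH AGGREGATE BY (c.name)
);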

(With apologies in advance to whoever already built this 37 years ago; I plead ignorance.)

Obviously whether a

Obviously whether a user-supplied query plan matches a user-supplied query is undecidable, but we might be able to do pretty well automatically for practical cases, and handle more challenging cases with a theorem prover.

This, and your following paragraph, is actually a great idea I started exploring about two years ago.

As for security properties, that's wicked tricky, obviously, due to the View Update Problem. Easily one of the top 10 most annoying problems in computer science. Many things would be easier if this problem did not exist. Obviously, one approach is to side-step the problem by disallowing the construction of evil views.

Should updateable views even exist?

I have never bought into the case for even having updateable views / virtual relation variables. This is a long-standing religious war, though, I suspect.

Capabilities are probably the right answer to update security. But "transition constraints" along with the traditional model can get you a long way. Even security at the level of who may invoke check-this-check-that-emit-this-audit-record-and-then-insert-the-other-thing imperative update procedures is preferable, in my opinion, to updateable views, as it is easier to comprehend and ultimately to maintain.
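For what it's worth, a transition constraint is easy to sketch with today's machinery (a PostgreSQL-flavored sketch; the employees table and the only-increase rule are invented for illustration):

CREATE FUNCTION salary_may_only_increase() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
  -- a transition constraint relates the old state to the new one
  IF NEW.salary < OLD.salary THEN
    RAISE EXCEPTION 'salary transitions may only increase';
  END IF;
  RETURN NEW;
END;
$$;

CREATE TRIGGER enforce_salary_transition
BEFORE UPDATE ON employees
FOR EACH ROW EXECUTE FUNCTION salary_may_only_increase();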

I may be missing something, but I think updateable views are a misfeature, which leaves me relatively unconcerned with how to provide support for them alongside this idea, partly because I despair of being able to understand the combination.

General updateable views include

" check-this-check-that-emit-this-audit-record-and-then-insert-the-other-thing imperative update procedures",

so your comment reveals what you are missing.

Though, you are probably on to something by unifying writing with the capability to write something. If you do not give someone a "problematic" reader-writer pair, then there is no View Update problem. Creating a system of axioms and proving you can never create a "problematic" reader-writer pair is difficult, and making sure programmers obey those axioms is even harder (language design). One technique, used for example by REST, is to always pass the writer capability and address with the reader capability. That is the condition Roy Fielding calls Hypermedia As The Engine Of Application State (HATEOAS). But REST doesn't guarantee writers succeed, and it is cheating since it handles local arbitration of the update. Viewing the system globally, the trick breaks because somewhere you have some "intelligence" that handles the view mapping in a non-automatic way.

Good point

That's a good point.

There are two different issues, though. One is the issue of specifying how and when all the auxiliary data structures are to be updated. There's no way out of that.

What can be avoided is a narrower (and strange) idea of view updatability where we have some data structures D1, D2, D3. But for convenience we define a "view" V that displays information from those data structures in a form that we like, so call it V(D1, D2, D3). So far so good.

But some systems will now treat the result of V(D1, D2, D3) as an l-value. The problem is that the system needs an inverse of V(...) in order to figure out how to update the underlying data structures.

In general such an inverse doesn't exist. But some systems keep trying anyway, and define a bunch of rules that the system will use to choose one of the many pseudo-inverses of V and will use that.

In my opinion the convenience of this feature does not justify either the potential for semantic surprises or the cost of implementation (both in complexity and in compromising certain optimizations in some cases).

The view-update challenge

The view-update challenge can be summarized as: there is no general function `(a->b)->((a->a)->(b->b))`, and there is no general function `(a->b)->(b->a)`. I.e. there is no generic function from a view-transform to a view-update-transform, and there is no generic reversibility. The update-transform issue is true in both directions, of course, so the same challenge arises for reactive updates (where changes in the model cause change in the view) as for updateable views (where direct manipulation actions on the view cause changes in the model).

You seem to be attacking the view update problem by analysis, i.e. given some ad-hoc view, how can I update it? (much less deal with the auxiliary structures) I find constructive approaches more useful than analytic ones, so I might instead ask: how can I construct views that I will know how to update? Such questions might lead one to lenses, or views that carry metadata about how to influence the model. Cf. Boomerang language. If we wish to minimize intermediate state, we may need to constrain the views.

One could potentially carry information for view-updates in the form of scripted procedures, which may then be wrapped or composed. Though I find procedures awkward to work with, e.g. with regards to concurrency or diamond-pattern overlapping views.

Can you unpack "updateable views?"

I don't quite think I understand the issue here. And it comes down to me wanting to ask exactly what it is you mean by "updateable views".

I think of views as a fairly benign thing; you can filter the contents of a table, you can rearrange or omit columns, you can show correlations or relative information in a "fake" column that doesn't actually exist in the database, etc.

Updating views (case 0) isn't particularly hard. They can register to get notifications whenever one of the things they display updates, and redraw themselves without a problem.

Updating views (case 1) is straightforward. You can update the tables whose information views present by acting *through* them as though you were acting on a table. Some effort is required to keep the whole update transactional since it may span several tables, but no more than most transactions.

And updating views (case 2) is plainly impossible and I'm pretty sure everybody knows it. Are you talking about handling updates to "fake" columns that show things which aren't actually discrete entries in the database, and then propagating back to derive changed values for the discrete database cells? For example, if I have "fake" columns which show the solutions of a function given the constants and exponents found in other cells of the row, and then someone wants to update one of the "solutions" and we're trying to find consistent values for the other cells somehow?

That'll never actually work, IMO, and it's fairly obvious that it won't. In the first place, an equation of the form the table rows represent that has the user's preferred solution isn't guaranteed to exist. In the second place, if it does exist it isn't guaranteed to be unique, so there may be any number of ways to change the data to make the "fake" data true. And in the third place, it is frequently the case that a procedure we have for determining an equation's roots has no obvious or mechanically derivable inverse that we can use to find the other terms given a solution.

So when I parse "updateable views" I get three possibilities, one of which is trivial, one of which is straightforward and not problematic, and one obviously impossible in the general case. But you seem to be talking about something else, which sounds plausible but which has deep problems when examined closely.

Or did I just make three strikes without getting close to your actual meaning?

Update through View

Updateable view doesn't mean that everything in the view can be updated. What it does mean is that the view (or lens, view model, or some other intermediate construct) must carry information not only about what is presented, but also about how to influence that information (and also how to maintain it, in a reactive or cooperative work system).

Updateable views are naturally composable. And the client of such a view or lens doesn't need independent knowledge of the underlying model, which enables loose coupling and independent development. Without update (or, more generally, bidirectional reactive dataflow) I think views lose most of their potential value. The ability to influence what we view is a common requirement, so if views don't provide influence the clients instead become tightly coupled to the origin model to achieve influence. Without updateable views, abstraction of the data model, and integration of heterogeneous systems, and independent development of model and view, all become very difficult.

You mention security concerns, but those can be addressed: provide a view that is limited in both observation and influence by the authority of the client.
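In today's SQL terms that is roughly a view granted as a capability (a PostgreSQL-flavored sketch; the accounts table and client_role are invented). The client role can observe and influence only what the view exposes, because it holds no privileges on the base table:

CREATE VIEW my_accounts AS
  SELECT account_id, balance
  FROM accounts
  WHERE owner = current_user;

-- a simple single-table view like this is auto-updatable in PostgreSQL
GRANT SELECT, UPDATE ON my_accounts TO client_role;
-- note: no privileges at all are granted on the underlying accounts table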

Clarification

In response to both Ray Dillinger and dmbarbour:

The problem as I see it is that, as Ray notes, only some views are updatable. For some very simple cases, also as Ray notes, it is seemingly straightforward to invert the view and move the changes through the inverse. When the view involves projection or functions of attributes or summarization or certain joins then there is no unique inverse and things quickly become murkier before ending up at a point where it's seemingly impossible.

dmbarbour notes, on the other hand, that:

Without update (or, more generally, bidirectional reactive dataflow) I think views lose most of their potential value. The ability to influence what we view is a common requirement, so if views don't provide influence the clients instead become tightly coupled to the origin model to achieve influence. Without updateable views, abstraction of the data model, and integration of heterogeneous systems, and independent development of model and view, all become very difficult.

I think this is a fairly common view of the situation, that these sorts of updatable views are a useful abstraction and reduce coupling.

There's a good summary of the case for this in Date and Darwen's "Foundation for Object/Relational Databases: The Third Manifesto" in what they label "Relational Model Prescription 14". They say (contrary to the consensus we just established that these views are only sometimes invertible) that:

The point is therefore worth stressing that, contrary to what has traditionally been thought, such [relation variables] or "views" are in fact always updateable, barring integrity constraint violations [55, 79].

(emphasis in original) They go on to say that space doesn't permit a detailed explanation of the point, and give examples which don't really address the hard cases. I admit to not having read either of the self-cites with an eye toward analyzing this point, but apparently they are chapters of C. J. Date, Relational Database Writings 1991-1994, Reading, Mass.: Addison-Wesley (1995). However, it seems crystal clear that this claim is impossible, so I'm not sure what they are on about. I will try to make it to the library soon to see if I can find out.

Assuming their claim that such views are always updatable isn't true in a constructive way, here's my central point.

Since only some views are updateable, and even fewer are updateable in a way where the update semantics are immediately clear to the user, allowing such view updates isn't really much of an abstraction. Saying "clients don't need to know about the real relation variables" on the one hand but also expressing detailed conditions on which view expressions we will consider updateable is a strange claim for information hiding, because the user still needs to know the "hidden" implementation details to even know if the view will be updateable. Furthermore, future changes to the definition of the view implementation might change which updates if any will be accepted.

I agree with dmbarbour that lenses of this type might be desirable, but the SQL approach to this problem (attempting to guess the inverse side of the lens by analyzing the general query expressions that define views) is fundamentally broken: it exposes what it claims to hide, it is very confusing, and the result of the updateability analysis is not encoded in the "type" of the view.

I'm all for building a better facility for creating updateable views that are updateable by construction, so that they truly are "naturally composable". We also obviously need the ability to have arbitrary read-only views, even when those views are not updateable, for other use cases (these are just perfectly ordinary pure functions).

I think it makes a lot more sense to attack this second flavor of updateable views by creating a separate facility for defining them, rather than by attempting to guess the inverse associated with a query. Or continue with SQL's guessing approach, but expose the result of the updateability analysis in the type of the view, so that it is manifest which modifications to the definition may be made without affecting the interface.

Taking a step back, I'm not 100% convinced of dmbarbour's case for even the improved (over SQL) updateable views. I think (provisionally) that abstraction of "how do I update this" is more naturally expressed by defining a library of procedures by which someone may update the database. Such a procedural approach to update abstraction also aligns more naturally with the security policies of most human organizations too, I think, although I admittedly have no evidence for that claim.

There's a lot to unpack here and some terminological challenges because there are so many existing systems with so many existing approaches to these questions, but for the most part they all use the same names for very different concepts.

Concrete example

A brief example of what I am trying to say...

CREATE TABLE widget_sales
(
  date_sold date,
  salesperson employee_id,
  quantity int,
  unit_price currency
)

So far so good. Now we make a view quarterly_sales with the obvious definition.
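For concreteness, one plausible rendering (PostgreSQL-flavored; nothing in the argument depends on these details):

CREATE VIEW quarterly_sales AS
SELECT
  extract(year FROM date_sold)    AS year,
  extract(quarter FROM date_sold) AS quarter,
  sum(quantity * unit_price)      AS amount
FROM widget_sales
GROUP BY 1, 2;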

What are we to make of the request:

UPDATE quarterly_sales
SET amount = amount + $17,351
WHERE year = 1973 AND quarter = 2

I think it is clear to everyone that it's a silly request. But it isn't clear from the type of quarterly_sales in the existing systems I have used, and you would need a really expressive type system to make it manifest. Perhaps not so much for this example, where probably no updates at all are allowable, but there are intermediate cases where some updates might be desirable and others not. Also there is DELETE FROM quarterly_sales WHERE TRUE, which does have an interpretation that you could operationalize, so you need a principled answer as to whether that is allowed or disallowed.

So there's a nest of thorns. And while we have great abstraction for the reading side we have only the illusion of abstraction on the writing side.

Wouldn't it be better to abstract on the writing side by defining CREATE PROCEDURE record_sales_order { ... } AS ...? It seems to me to be more descriptive, easier to understand, easier to understand the security properties of what you've built, etc. The abstraction is better because you really don't need to know about the "underlying" relation variables in order to sell some widgets, whereas you do need to know (some things) about the definition of a view to know in what ways it might be updatable.
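A minimal sketch of what I mean (PostgreSQL-flavored; it assumes the employee_id and currency domains from the table above, while sales_audit_log and sales_role are invented for illustration):

CREATE PROCEDURE record_sales_order(
  p_date_sold   date,
  p_salesperson employee_id,
  p_quantity    int,
  p_unit_price  currency)
LANGUAGE sql
AS $$
  INSERT INTO widget_sales (date_sold, salesperson, quantity, unit_price)
  VALUES (p_date_sold, p_salesperson, p_quantity, p_unit_price);

  -- the check-this-emit-an-audit-record part lives here, not in a view
  INSERT INTO sales_audit_log (event, at)
  VALUES ('record_sales_order', now());
$$;

-- callers can sell widgets with no direct access to the relation variables
GRANT EXECUTE ON PROCEDURE record_sales_order TO sales_role;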

Huh. Well in that case....

So that really is what folk were talking about.

Well, in that case -- firstly, agreed that the view should make it clear which fields are updatable. That information should be shown in the UI and it should also be visible to the query language, so that queries can be made to detect whether something is updatable.

Secondly, in cases where direct update has problematic semantics (like the "update quarterly sales" example above, where you must either violate relational constraints (so that quarterly sales is no longer the sum of the sales made in that quarter) or choose among many candidate changes (which sale or sales, exactly, should be made $17,351 larger?)), the only way to do this cleanly is to explicitly define an update function using a programming language, maintaining relational constraints and resolving the ambiguity empirically.

Relational constraints can be re-checked before the transaction is committed, but there is no automated way to detect being "wrong" in our choice of how to achieve consistency.

Hi Ray. Here is a simpler example

The following code demonstrates why bi-directional views are useful. Imagine writing the following code 120 times:

create table [dbo].[example]
(
  [pk] int identity(1,1) not null primary key,
  [flags] int not null
);
go

create view [dbo].[v_example]
as
select
  -- Casting each column as a bit will allow each value to be edited on the UI via a checkbox.
  cast(1 & [t1].[flags] as bit) as [is_read_only],
  cast(2 & [t1].[flags] as bit) as [is_primary_key],
  cast(4 & [t1].[flags] as bit) as [is_posted],
  cast(8 & [t1].[flags] as bit) as [is_updatable],
  cast(16 & [t1].[flags] as bit) as [is_nullable]
from [dbo].[example] [t1];
go

-- This will fail, since the underlying view contains a derived value.
insert [dbo].[v_example]
select
  1 as [is_read_only],
  1 as [is_primary_key],
  1 as [is_posted],
  1 as [is_updatable],
  1 as [is_nullable]
go
-- Output on SQL Server 2012:
-- Msg 4406, Level 16, State 1, Line 44
-- Update or insert of view or function 'dbo.v_example' failed because it contains a derived or constant field.

create trigger [dbo].[tr_v_example] on [dbo].[v_example]
instead of insert
as
begin
  -- Reassemble the packed [flags] value from the bit columns in 'inserted'
  -- @i * POWER(2,@j) is equivalent to @i << @j bit-shift operator
  insert into [dbo].[example] ([flags])
  select
    ([is_read_only] * POWER(2, 0))
    | ([is_primary_key] * POWER(2, 1))
    | ([is_posted] * POWER(2,2))
    | ([is_updatable] * POWER(2,3))
    | ([is_nullable] * POWER(2,4))
  from inserted
end;
go

insert [dbo].[v_example]
select
  1 as [is_read_only],
  1 as [is_primary_key],
  1 as [is_posted],
  1 as [is_updatable],
  1 as [is_nullable]
go

select [t1].* from [dbo].[v_example] as [t1]
go

Something to think about for those who say to just ban updatable views.

Note, when I mentioned language design as the resolution to this problem, guaranteeing correctness by construction, something like Boomerang and especially Benjamin Pierce's newer stuff is what I had in mind.

Cheers,
Z-Bo

Packing flags into ints isn't so good either

I read this as an example of why packing flags into ints is a bad idea: a holdover from languages where it was the only way to achieve performance objectives. What this example is "really" looking for -- in my view -- is a database with a boolean type and the smarts not to pad each one to a byte in physical storage (a facility that already exists in at least some DBMS implementations, and that doesn't seem like too much to ask).
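For comparison, what the example seems to be reaching for, in a DBMS with a real boolean type (a PostgreSQL-flavored sketch; the column set mirrors the example above):

CREATE TABLE example (
  pk             serial  PRIMARY KEY,
  is_read_only   boolean NOT NULL DEFAULT false,
  is_primary_key boolean NOT NULL DEFAULT false,
  is_posted      boolean NOT NULL DEFAULT false,
  is_updatable   boolean NOT NULL DEFAULT false,
  is_nullable    boolean NOT NULL DEFAULT false
);

-- no view and no trigger needed: reads and writes hit the flags directly
INSERT INTO example (is_read_only, is_posted) VALUES (true, true);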

I'm having trouble reading past the int-packing to picture parallel scenarios where I would take the position that updatable views are the way to go. (I'm saying this honestly, to express interest in other examples, not rhetorically to imply that I don't believe they exist.)

It's actual production code.

1) The ANSI SQL bit data type is not a true boolean type.
2) Bitfields group together related but independent bits. I suppose you could argue that a relation can do the same thing, but when you are working with the data outside the database, especially in C#, this is a convenient representation, considering enums map to ints anyway.
3) The advantage of using one bit data type per column is that there is never any need to change the data type of any column, and removing a flag is as simple as dropping its column. Re-purposing storage is easier, since there is no ordinal bit position.

Anyway, it was an example I could think of where there is an updatable view with an unambiguous mapping. There are probably better examples in the academic literature, but I don't know any.

Indeed

Yeah, good points. Mostly related to SQL shortcomings though, which is preaching to the choir. How you can build one of the most widely used languages in existence without having a boolean type remains a mystery to me. The second point can be easily addressed by allowing simple user data type declarations, but most flavors of SQL make that extraordinarily difficult/fraught as well.

UI Integration

In general, if we have updateable views that are not uniformly updateable, we would want to realize this through the UI - i.e. such that a human can determine what is updateable in an exploratory manner, rather than by great intelligence or trial and error. As you mentioned earlier, this suggests the view must keep this information in its type or definition.

Naturally, there is very little we can do to update summary information like 'quarterly_sales', though we could address it by adding a special line item for a set of unknown transactions.

I'm not especially fond of guess-the-semantics, so for ambiguous cases like the one you present I would either require developers to disambiguate, or have an easy-to-explain default behavior. (In cooperative development, machine inference shouldn't be deeper than human reasoning, at least without presenting a clear explanation.)

Just don't do it

It's analogous to the problem of assigning to a procedure call. If we have this in a language:

x = 4;

and we have procedure calls f(x), then surely we would also want to have this assignment statement work:

f(x) = 4;

for example, if f(x) = 2*x, then after the above assignment we would have x = 2. I think it is clear that it is perfectly fine to have a language where this is disallowed, and also that allowing it is probably a bad idea, because inverting arbitrary functions doesn't really work well. For the same reason, updateable views: just don't do it.

An elegant way to avoid this problem altogether is to eliminate assignment. Instead of updating a table with an update statement, define the table as a fold over some input stream, somewhat as in FRP.
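In relational terms, that is an append-only event table plus a derived view, so the current state is literally a fold over the inserts (a PostgreSQL-flavored sketch; all names invented):

-- the input stream: rows are inserted, never updated
CREATE TABLE account_events (
  account_id int         NOT NULL,
  delta      numeric     NOT NULL,
  at         timestamptz NOT NULL DEFAULT now()
);

-- the "table" is a fold (here, a sum) over the stream; nothing is assigned
CREATE VIEW account_balances AS
  SELECT account_id, sum(delta) AS balance
  FROM account_events
  GROUP BY account_id;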

constraint-based programming

This is actually possible to express in Bling, which supports inversions and data binding:

f(x).Bind = 4

And I found it to actually be quite useful; consider:

a.Right.Bind = b.Left;

Where Right is simply defined as:

DoubleBl Right { get { return Left + Width; } }

However, Bling is a staged language where this is possible. I'm not sure how FRP would solve this problem; it really requires some amount of constraint solving. Bling can use inversions, but perhaps just using a physics engine would work better.

After following the citations

After picking up a used copy of Date's Relational Database Writings: 1991-1994 to follow the citations for the claim that:

The point is therefore worth stressing that, contrary to what has traditionally been thought, such [relation variables] or "views" are in fact always updateable, barring integrity constraint violations.

I remain basically unconvinced as to the utility of this feature. They achieve this result by enthusiastically embracing the "barring integrity constraint violations" bit.

So, for example, if a view is defined in terms of projection, an insert to that view is a constraint violation if there is no default value specified in the schema of the underlying relation variable for the missing field.
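Concretely (a PostgreSQL-flavored sketch; the parts table is invented, echoing their running example):

CREATE TABLE parts (
  pn     int  PRIMARY KEY,
  name   text NOT NULL,
  weight int  NOT NULL DEFAULT 0  -- the default that legitimizes view inserts
);

CREATE VIEW part_names AS
  SELECT pn, name FROM parts;     -- a projection of parts

-- accepted only because the missing weight attribute has a declared default
INSERT INTO part_names VALUES (1, 'bolt');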

If a view is defined in terms of extension (EXTEND parts ADD (weight * 454) AS grams_weight is their example, in Tutorial D syntax and with abbreviated identifiers expanded for clarity), then they reject updates or inserts where the tuple to be inserted doesn't specify the value for both the computed attribute and the argument attributes, or where the supplied value for the computed attribute doesn't match the value you would get if you computed "forwards" from the supplied values for the argument attributes.

I agree it is possible to play this game, and indeed that it is a better defined game than the state of play in SQL where the rules as I understand them are more permissive but in an ad hoc way. I maintain my fundamental disagreement that this is any "abstraction" worthy of the name. When using an abstraction requires such a deep (indeed, complete) knowledge of the details, then all the usual advantages vanish. Coupling in particular is not reduced at all.

My complaint that "the result of the updateability analysis is not encoded in the 'type' of the view" can only be squared with the definition of view updatability adopted in the referenced chapters if you identify the type of a view with its entire defining expression. And not even that is sufficient: the type would also need to include information about constraints and default values defined for the underlying relation variables. Surely only madness lies down this road. It certainly doesn't seem to me to be a promising path toward the "bidirectional reactive dataflow" that dmbarbour is (I think correctly) looking for.

The takeaway, I think, is that we need a complete query language and the ability to define "views" (sense 1) using its full power, where the results of evaluation are ordinary values. There may also be utility in a parallel facility where we define "views" (sense 2) using a different syntax (and a different type system?) that is guaranteed by construction to produce lenses through which we may either view or update.
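As a purely hypothetical strawman for that second facility (the CREATE LENS syntax is invented; nothing like it exists in SQL), one could imagine declaring both directions together, so that updatability is manifest by construction:

CREATE LENS part_names
  GET (SELECT pn, name FROM parts)
  PUT ON INSERT (INSERT INTO parts (pn, name) VALUES (new.pn, new.name));
-- attributes or operations without a PUT direction are read-only by construction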