The End of an Architectural Era (It’s Time for a Complete Rewrite)

The End of an Architectural Era (It’s Time for a Complete Rewrite). Michael Stonebraker, Samuel Madden, Daniel J. Abadi, Stavros Harizopoulos, Nabil Hachem, Pat Helland. VLDB 2007.

This paper about a new database architecture is not directly PL-related, but the authors offer some interesting and possibly controversial perspectives:

  • They split the application into per-core, single-threaded instances without any communication between them.
  • Instead of using SQL from an external (web app) process to communicate with the database, they envision embedding Ruby on Rails directly into the database.
  • They state that most OLTP workloads rely on pre-canned queries only, so there is no need for ad-hoc querying (see the sketch just after this list).
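A minimal sketch of that execution model, with invented names and Python threads standing in for H-Store's actual single-threaded per-core engines: data is partitioned, each partition is owned by exactly one worker, and clients may only invoke pre-canned transactions.

```python
import queue
import threading

def new_order(store, customer_id, item_id, qty):
    # A pre-canned transaction: it runs to completion on the owning worker,
    # so no locks or latches are needed on this partition's data.
    store.setdefault(customer_id, []).append((item_id, qty))
    return len(store[customer_id])

PROCEDURES = {"new_order": new_order}   # the only operations clients may call

def partition_worker(inbox):
    store = {}                          # this partition's private in-memory data
    while True:
        proc_name, args, reply = inbox.get()
        reply.put(PROCEDURES[proc_name](store, *args))

def start_partitions(n):
    inboxes = [queue.Queue() for _ in range(n)]
    for inbox in inboxes:               # one single-threaded worker per partition
        threading.Thread(target=partition_worker, args=(inbox,), daemon=True).start()
    return inboxes

def run(inboxes, proc_name, partition_key, *args):
    reply = queue.Queue()               # route the call to the partition's owner
    inboxes[hash(partition_key) % len(inboxes)].put((proc_name, args, reply))
    return reply.get()

inboxes = start_partitions(4)
print(run(inboxes, "new_order", "customer-17", "customer-17", "item-3", 2))  # -> 1
```

Because the owning worker is the only code that ever touches its partition, a transaction runs to completion with no locking, latching, or inter-thread coordination, which is one of the paper's main sources of claimed speedup.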

The somewhat performance-focused abstract:

In two previous papers some of us predicted the end of "one size fits all" as a commercial relational DBMS paradigm. These papers presented reasons and experimental evidence that showed that the major RDBMS vendors can be outperformed by 1-2 orders of magnitude by specialized engines in the data warehouse, stream processing, text, and scientific data base markets.

Assuming that specialized engines dominate these markets over time, the current relational DBMS code lines will be left with the business data processing (OLTP) market and hybrid markets where more than one kind of capability is required. In this paper we show that current RDBMSs can be beaten by nearly two orders of magnitude in the OLTP market as well. The experimental evidence comes from comparing a new OLTP prototype, H-Store, which we have built at M.I.T. to one of the popular RDBMSs on the standard transactional benchmark, TPC-C.

We conclude that the current RDBMS code lines, while attempting to be a "one size fits all" solution, in fact, excel at nothing. Hence, they are 25-year-old legacy code lines that should be retired in favor of a collection of "from scratch" specialized engines. The DBMS vendors (and the research community) should start with a clean sheet of paper and design systems for tomorrow's requirements, not continue to push code lines and architectures designed for yesterday's requirements.

A critical comment by Amazon's CTO, Werner Vogels.


databases and PLT

For several years I have seen data volume go through the roof in the financial industry. One of the reasons I ended up at LtU was that it kept coming up in Google searches for items related to implementing databases.

First, the relational model has a pretty direct relation to the kind of things often discussed here: a simple set of operations combined to express an idea which can be executed on a computer.

Even advances made in the database industry are being adopted by the academic language community: transactions, list comprehensions with extensions, etc.

Regarding 'column' oriented databases, The Dodo Query Flattening System seems relevant, although I don't understand it very well.

There is another line of research regular LtUers will be familiar with: Wadler's work on Monads and data manipulation up to various recent 'functional' approaches to data management.

In the last couple of years, there has been a great deal of work on integrating XQuery and relational databases.

Obviously this is not a complete list of references, just the ones I have seen.

MetaKit anyone? K?

We've had column-oriented databases for almost a decade. MetaKit comes to mind. KDB. Heck, nearly all of the databases implemented in Forth are column-oriented, because the language offers no built-in support for structures, and Forth programmers have been raving about the performance they get for several decades.

Why all the buzz now?

It just doesn't make any sense. I must be missing something fundamental.

Finally, I would refuse to use any product that embedded RoR in it. What a waste; while the IDEA is sound, the execution is every bit as limiting as using SQL in the first place. It's better to capitalize on design patterns instead -- implementing transactions via the Command pattern, for example, like object prevalence does. In fact, I strongly recommend researching object prevalence. They claim ridiculous numbers too, but the technique is sound.
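For readers who haven't seen object prevalence, here is a minimal sketch of the Command-pattern idea (class and file names are invented for illustration; this is not Prevayler's actual API): each transaction is a serializable command object, the command is journaled before being applied to the in-memory object model, and recovery simply replays the journal.

```python
import pickle

class Deposit:
    """A command: a serializable object encapsulating one transaction."""
    def __init__(self, account, amount):
        self.account, self.amount = account, amount
    def execute(self, model):
        model[self.account] = model.get(self.account, 0) + self.amount

class Prevalence:
    def __init__(self, journal_path):
        self.model, self.journal_path = {}, journal_path
        try:  # recovery: replay every logged command against a fresh model
            with open(journal_path, "rb") as journal:
                while True:
                    pickle.load(journal).execute(self.model)
        except (FileNotFoundError, EOFError):
            pass

    def apply(self, command):
        # Write-ahead: persist the command before mutating the in-memory model.
        with open(self.journal_path, "ab") as journal:
            pickle.dump(command, journal)
        command.execute(self.model)

bank = Prevalence("bank.journal")
bank.apply(Deposit("alice", 100))
print(bank.model)   # the whole "database" lives in ordinary objects in RAM
```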

Forth?

Can you provide some links for the column-oriented databases implemented in Forth?

I'm not sure I understand how implementing transactions via the command pattern is more interesting than the usual commit/rollback of SQL.

I looked at Prevayler a few years ago and from what I recall, its ability to query data was limited.

I suppose an easy way to persist and archive objects is better than the RDBMS way, if RDBMSs are thought of as just data stores.

Authoritative data

I suppose an easy way to persist and archive objects is better than the RDBMS way, if RDBMSs are thought of as just data stores.

I think you need to distinguish a little more finely in order to be able to evaluate this solution for a given problem.

If your application/environment is designed so that the authoritative data store is your object model, then something like Prevayler is better, since it is just a glorified backup system.

But with enterprise applications, it is common that there are independent reasons to think of your DB as the authoritative data source. For example, multiple applications might use the same data, general purpose reporting tools for reporting needs orthogonal to the operation of the application can be used against it, etc.

It is good if there are alternatives to generic RDBMSs for problems that are not well matched to them, but for those problems that are (and a lot of enterprise applications are), you really can't beat them easily.

persistent data vs persistent relational data

CouchDB is an interesting recent project for persisting JSON (JavaScript) objects using map/reduce, with effectively global address lookups and a very nice potential for defining views. Doing Flapjax made me realize such a system is integral for scalable web apps in dynamic languages (we implemented a less scalable but more web-domain-specific version which I want to rewrite over CouchDB). The call to arms in CouchDB is that web application data is generally *not* very relational and is in fact tree-like and disjoint in how you want to use it.
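A rough sketch of the view idea for readers who haven't seen CouchDB (the function names are invented, and CouchDB itself takes JavaScript map/reduce functions over JSON documents, so this is only an analogy in Python): a view is a map function emitting key/value pairs over every document, optionally followed by a reduce.

```python
docs = [
    {"type": "post", "author": "alice", "tags": ["db", "web"]},
    {"type": "post", "author": "bob",   "tags": ["web"]},
    {"type": "comment", "author": "alice"},
]

def posts_by_tag(doc):
    # map: emit one (key, value) pair per tag of every post document
    if doc.get("type") == "post":
        for tag in doc.get("tags", []):
            yield tag, doc["author"]

def build_view(documents, map_fn, reduce_fn=None):
    rows = [pair for doc in documents for pair in map_fn(doc)]
    rows.sort(key=lambda kv: kv[0])          # views are kept sorted by key
    if reduce_fn is None:
        return rows
    keys = sorted({k for k, _ in rows})
    return [(k, reduce_fn([v for key, v in rows if key == k])) for k in keys]

print(build_view(docs, posts_by_tag))        # index: tag -> authors
print(build_view(docs, posts_by_tag, len))   # reduce: number of posts per tag
```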

I went to a talk last week on HAppS, a Haskell project influenced by Prevayler which can be interpreted as the typed equivalent of CouchDB (except you get to control sharding and optimistic vs pessimistic concurrency, rather than using whatever CouchDB gives you).

Even if data is king, it is still reflected in these systems: type migration is a feature (challenge?) of HAppS as part of their multimaster ideology. I'm data mining CouchDB this semester so that you can still get shape/type feedback on your data (so a RoR-style ORM could still be automated).

There are some other interesting approaches coming up. The PADS group are interested in tool support for ad-hoc data, and in my experience, this describes program data structures pretty well (that happen to persist). There has also been some neat pushing by others of data model driven approaches, where views are just hooked together.

I'm finding persistent data to be one of those long misaddressed blemishes of modern PLs for even small personal development (eg, revisiting a project a year later), and the scalability issues of web applications even more so. The premise of Happs was that, in the LAMP world, the software may be free, but you need specialists to tune every single letter, so the existing languages are the 'easy' part and miss the rest of the boat by abstracting it away.

Sharding -- A PL Issue?

Went to a talk last week on HAppS, a Haskell project influenced by Prevayler which can be interpreted as the typed equivalent of CouchDB (except you get to control sharding and optimistic vs pessimistic concurrency, rather than using whatever CouchDB gives you).

HAppS is not that flexible -- there's just optimistic concurrency, and sharding has not been implemented. Early in the development of HAppS, it was felt that raw performance would be sufficient to mitigate scalability issues -- by stripping out the web server, database server and so forth, you could fit enough users onto one box that you wouldn't need to grow unless you were eBay.

I'm finding persistent data to be one of those long misaddressed blemishes of modern PLs for even small personal development (eg, revisiting a project a year later), and the scalability issues of web applications even more so. The premise of Happs was that, in the LAMP world, the software may be free, but you need specialists to tune every single letter, so the existing languages are the 'easy' part and miss the rest of the boat by abstracting it away.

Why is persistent, distributed storage a PL responsibility? Seems like a filesystem thing to me. If there were good distributed "RAM filesystems" (with journaling) out there, it'd be legitimate, I think, to just dump databases and rely on mem-mapped 'files' and smart APIs for querying them.

I'm finding persistent data

I'm finding persistent data to be one of those long misaddressed blemishes of modern PLs for even small personal development

Because persistence runs into all sorts of very complicated issues, very quickly. For instance, schema/version upgrade. It is better to let the developer handle it himself than to provide either a naive default that will end up wrong or too inflexible for most people, or a very complex way to seamlessly handle upgrades that no one understands.

perhaps

1. In a live programming system, we don't ever really need to leave the program to deal with persistence, and thus should do migration within the system (we could write out, edit externally, and read in, but that would seem to go against the model).

2. Isn't one of the arguments for using parser generators to maintain consistency between data input and generated code (including downstream applications)? Ocamllex/yacc is great because extending data formats can catch type errors (incomplete patterns, etc.).

Versioning is one of those things that types would be an excellent match for (though I can think of arguments for richer and coarser types, and even at the level of structural vs nominal). The act of persisting, as opposed to what is being persisted, however, seems harder to introduce into the PL, in terms of semantics and just finding a useful abstraction level.

In a live programming

In a live programming system, we don't ever really need to leave the program to deal with persistence, and thus should do migration within the system

I've only read cursory intros to live programming, so I'm not sure how they decide when and how it's safe to migrate code to newer versions.

I'm not sure transparent persistence and upgrade is possible. At the moment, I sit firmly with E's position here: manual persistence is essential.

I would implement this via a serialization combinator library, so that the developer could decide whether immediate upgrade is vital and seamless (backwards compatible, etc.), or whether existing code relies on the old version (bugs and all), and only new instances should use the new code; this is the hard part after all. Absent some observational/behavioural equality expressed in the types, which I believe would be quite complicated to achieve, I don't see how types can help with deciding when an upgrade should be applied besides trivial interface incompatibilities.
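A tiny sketch of what that developer-controlled choice could look like (not a full combinator library, and all names are hypothetical): records carry an explicit version tag, migrations between adjacent versions are registered by hand, and the caller decides whether to upgrade on load or keep the old shape, bugs and all.

```python
import json

MIGRATIONS = {
    # version n -> function taking a version-n record to a version-(n+1) record
    1: lambda rec: {**rec, "email": None},                  # v1 -> v2: add a field
    2: lambda rec: {**rec, "name": rec["name"].title()},    # v2 -> v3: normalize
}
CURRENT_VERSION = 3

def dump(record):
    return json.dumps({"version": CURRENT_VERSION, "data": record})

def load(blob, upgrade=True):
    envelope = json.loads(blob)
    version, record = envelope["version"], envelope["data"]
    if not upgrade:
        return version, record           # caller keeps the old shape untouched
    while version < CURRENT_VERSION:     # seamless path: migrate step by step
        record = MIGRATIONS[version](record)
        version += 1
    return version, record

old_blob = '{"version": 1, "data": {"name": "ada lovelace"}}'
print(load(old_blob))                    # upgraded to the current schema
print(load(old_blob, upgrade=False))     # left as the old version
print(load(dump({"name": "Grace Hopper", "email": None})))  # already current
```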

Why all the buzz now?

I read on /. that these solutions (MetaKit, etc.) work only with the database in RAM. Stonebraker's solution can work with disks.

No disks!

H-Store is an entirely in-memory database as well (and I'm surprised everyone misses this "detail" - they probably could have been more up-front about it). This raises the question: how much of the performance gain can be attributed to eliminating disk access, as opposed to the architectural changes? They have teased apart the numbers, and a non-negligible performance cost comes from places that aren't disk IO.

(The re-architecting would still make sense in any case, though - most of the complexity in databases is due to the guarantees they make about persistence.)

Column-oriented databases?

You're probably thinking of C-Store, Stonebraker's last project. C-Store does not claim innovation in its representation; if you read the paper, the second paragraph lists several column-oriented databases including KDB, Sybase IQ, and Addamark.

C-Store introduces a number of novel techniques including a hybrid architecture for dealing with writes, redundant projections, compression and coding for bandwidth, high availability using K-safety, and more. You can read more about C-Store here.

The focus of H-Store, in contrast to all this, is OLTP workloads. Most of the techniques and discussion from C-Store do not apply here, as (for starters) this is an entirely in-memory database (though some would, such as K-safety). The assumptions about the target application are also very different.

A sales brochure

This paper strikes me as sales disguised as science. This looks like yet another in a grand tradition of let's-blow-up-the-universe broadsides in computer science--while such broadsides frequently are based on good ideas and sound research, they are too often conflated with grandiose claims of the superiority of some approach, unabashed marketing and posturing, and a fair bit of chest-pounding.

Good science is descriptive, not prescriptive.

The ideas discussed in this paper may well have merit. But statements like "existing RDBMSs should be retired" (and replaced with what, exactly?) are inappropriate in a technical paper. With a possible exception for things like global warming (where much is at stake), scientists are most credible when they present their data without excessive finger-wagging.

Ouch

The tone of the paper works for me. I see the piece as fairly early stage science: they are formally formulating their initial hypothesis after preliminary investigations. Testing the hypothesis means coordinating a lot of researchers to pursue the approach of a "complete rewrite". This paper is, pretty much, the pinnacle document in a short series that, collectively, lays out the research charter (and its rationale) for the "Stonebraker school".

But, I'll post something separate below because, while I like the tone of this paper, I'm confident that they're about to make a huge mistake in their technology choices -- exactly on the question of programming language design.

-t

ouch indeed

Given he's been at this for 3 or 4 decades, "early stage science" isn't exactly praise.

performance art

I know he's been at this for decades. This is his turn to hector. I mean that, in reading this paper, we are seeing the beginning of what I'm pretty sure is going to be a beautiful opera presenting results, and the opening act begins with this announcement. It is, formally, the concluding statement of a rather broad hypothesis (that they already know is pretty much true).

Really, but for the Ruby idiocy (which is completely understandable, historically), I've nothing but respect for these folks. I've done some innovation in DB design for scientific applications in genomics and got deep into the design space. When I was turned on to this informal series of papers, they just struck me as an amazingly good articulation of what I was beginning to see. That confirmed a long-standing second-hand sense I had that the "Stonebraker school" pretty much knows what they're doing.

-t

as long as i'm slinging out large quantities of praise

I think the Stonebraker school (so to speak, not a proper name; other people deserve ample credit, I'm sure) grows somewhat out of the tone Patterson et al. set for Berkeley CS. They are, more than most places I see, really focused on the foundational economics of hardware and then, from there, what trajectories make sense for software. That subculture has a neat and distinctive way of looking at things.

-t

I wish I had a name like "Stonebraker"

This paper was actually under the Industrial section of the conference, and it is indeed very much a position paper - I think conferences need such controversy to "encourage new thinking" and keep things exciting.

But aside from the hype, the paper does include in-depth comparative evaluations of the H-Store prototype with existing systems, and a major contribution is a full top-to-bottom profiling of a "modern traditional" (state-of-the-art relational) database. Such profiling has been done before, but it should be redone every so often (say, every decade).

Why Rewrite?

They've made a good case for boiling down RDBMSes for performance in OLTP. It's not clear, however, why the stuff they've done can't be accommodated by a strict subset of SQL, with RDBMSes that only handle stored queries and ignore certain safeguards implemented by "the Elephants". Why isn't an incremental approach -- in this case, a good pruning -- acceptable?

brilliant but for programming language choice

I don't know how familiar average PLT readers are with database research. Basically, the authors of this paper are brilliant on the topics of storage hierarchy performance characteristics, computing system economics, physical data structure design, transaction / query processing, and how all of these things relate to practical software architectures.

The disaster, though, is that they're planning to use Ruby on Rails for stored procedures -- application code that runs very close to the physical data store. They correctly observe that disk and network bandwidth considerations, CPU economics, and the pay-offs of domain-specialized storage/query/transaction mgt all mean that some tiny language needs to run in the back end. But Ruby (with Rails)? That's a bad choice for many reasons:

Ruby has no rigorously defined semantics -- it is defined by reference implementation.

Ruby adds a lot of new types to the environment: its own notions of numbers and strings, its types like anonymous procedures, etc. It has types that require fully general garbage collection. Thus, it creates a huge impedance mismatch between data in RAM and data on wires and disks, and imposes this mismatch at exactly the pessimal place in the system.

Consider a storage "unit" comprised of a physical database that contains stored procedures, an initial content of that DB (a specific choice of what procedures to store), and an execution engine that runs the procedures to respond to client-application requests. This combination of parts defines an API -- what clients see. With this in mind, consider the question:

What is the semantics of the resulting API? With Ruby in there, running the stored procedures, the APIs for *storage* are suddenly going to include-by-reference the semantics of a general purpose imperative programming language. It will be very difficult to say much of anything precise and useful about the semantics of the resulting APIs except in cases where the Ruby component is barely used at all. A functional language, with very careful handling of sequencing and side effects, would have been a much better choice.

Jason Dusek, above, suggested strict subsets of SQL. That's better because it's functional, but it still has problems. It still has the wrong data types -- the XDM would make far more sense. It has the wrong storage ontology (homogeneous, statically typed tables of rows with columns of non-standard atomic values). A better fully general model for storage is XDM types stored in unstructured collections, each collection with a customized indexing plan. Then, you can strictly subset *that* so that, for example, one specialized storage engine might support any XDM type you like so long as it is an integer.
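A toy sketch of that storage model (invented names, with plain Python values standing in for XDM items): each collection holds arbitrary values and carries its own indexing plan, and a specialized engine is just a collection whose plan, or value domain, is restricted.

```python
class Collection:
    def __init__(self, **index_fns):
        self.items = []
        self.index_fns = index_fns               # the per-collection indexing plan
        self.indexes = {name: {} for name in index_fns}

    def insert(self, value):
        self.items.append(value)
        for name, fn in self.index_fns.items():
            key = fn(value)
            if key is not None:                  # some values may not be indexable
                self.indexes[name].setdefault(key, []).append(value)

    def lookup(self, index_name, key):
        return self.indexes[index_name].get(key, [])

# One collection holds heterogeneous values but indexes documents by year...
papers = Collection(by_year=lambda v: v.get("year") if isinstance(v, dict) else None)
papers.insert({"title": "C-Store", "year": 2005})
papers.insert({"title": "H-Store", "year": 2007})
papers.insert("an unstructured note")            # non-document values are fine
print(papers.lookup("by_year", 2007))

# ...while a "strict subset" engine might only ever index plain integers.
counters = Collection(identity=lambda v: v if isinstance(v, int) else None)
counters.insert(42)
print(counters.lookup("identity", 42))
```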

So, what the DB guys are lacking here is a language as friendly and teachable as Ruby, but that operates over XDM types, is functional, uses a monadic execution model to express sequencing and side effects, and that is comparatively easy to implement.

They want something like XQVM (disclaimer: I wrote XQVM so this is a self-promotional link in that sense).

-t

compilation to something narrower?

By the way, I tend to like your writing most of the time, giving me a positive bias. But I also have a negative bias toward XML when it introduces open-ended issues that can't be easily grounded, so the two biases cancel out some. I can't read far through your XQVM material without bailing from XML surfeit, but I'm interested in your remarks about storage and languages.

Thomas Lord: I don't know how familiar average PLT readers are with database research.

I've done more personal work in languages, more professional work in storage (but focusing more on non-standard ad hoc systems). When I saw this paper a few weeks ago I was glad to see this direction pursued by the last generation's database gods, since it made some of my interests seem less eccentric.

Thomas Lord: The disaster, though, is that they're planning to use Ruby on Rails for stored procedures ...

That struck me as okay. And I think section 6.2 of the paper glossed over some of their thinking -- just barely noting RoR compiles to db code in another form, implying they intend to compile RoR code to something else rather than planning to run any Ruby at all at runtime. I don't think they spelled out their intention to lock down capabilities via compilation. (That would undermine some PR appeal of saying they wanted to use RoR.)

In the current crop of hyper popular languages, Ruby seems closest to Lisp and Smalltalk -- pretty close to Smalltalk -- so it's a lot like saying they plan to use Smalltalk (compiled to something else) in their approach. And this is a lot like saying the oo db Smalltalk vendors (say in the 90's) had a tack they now consider useful in column oriented stores, etc.

Thomas Lord: With Ruby in there, running the stored procedures, the APIs for *storage* are suddenly going include-by-reference the semantics of a general purpose imperative programming language.

I think they wrote themselves an escape clause by implying they mean to compile from RoR to something native in the db engine.

Their summary message predicts performance gains from unconventional new approaches: 1) avoiding one-size-fits-all, 2) snubbing relational approaches, and 3) rethinking both code and data mixes in specialized engines. The advent of "little languages" is approved of, and Ruby is just their preferred flavor at the moment. I'd guess their efforts are practical, realistic, and on track for adoption or evolution into adoptable tech.

ruby vs. storage

(Thanks, and...)

Ruby will make a fine little toy in the back end to give researchers a platform for rapid prototyping of some simple ideas. It's ample for getting papers. A better yet similar choice would be something like Xerces plus XQilla (a DOM implementation and its XQuery engine), because the experiments would follow a saner language discipline (and thus be more likely to smoothly transition into something sensible to deploy). No pain, no gain.

I fear that they are headed towards a gracelessly ad hoc solution that will recapitulate similar errors in the evolution of SQL.

As for your anti-XML bias: I suggest getting over it (I mean that in a friendly way!). It's a surprisingly lovely little standard once you dig in. It's restricted to a domain of externalizable types. Within that domain, it picks out excellent abstract syntax and semantics for difficult types like human text. Then, as the key structuring element, instead of Lisp's CONS pairs there are "elements with attributes and arbitrary lists of child nodes, except that adjacent text nodes are implicitly combined". There are questionable type restrictions on element node names, attribute names, and attribute values but, if those restrictions are in fact bogus, it is easy to remove them in an upwards compatible way. The surface syntax does manage to make Cobol look concise but, then again, just bring on the structure editors, please.

Some aspects of language design are really profound for their contributions to expressiveness (e.g., higher-order functions). Other elements are profound for their "harmony" and their "thoroughness" -- maybe there's some kind of feng shui of PL design.... More language designers should be thinking about XDM as their starting point for data type design, for its profundity in that second sense.

-t