Simplicial Databases

Simplicial Databases, David I. Spivak.

In this paper, we define a category DB, called the category of simplicial databases, whose objects are databases and whose morphisms are data-preserving maps. Along the way we give a precise formulation of the category of relational databases, and prove that it is a full subcategory of DB. We also prove that limits and colimits always exist in DB and that they correspond to queries such as select, join, union, etc. One feature of our construction is that the schema of a simplicial database has a natural geometric structure: an underlying simplicial set. The geometry of a schema is a way of keeping track of relationships between distinct tables, and can be thought of as a system of foreign keys. The shape of a schema is generally intuitive (e.g. the schema for round-trip flights is a circle consisting of an edge from $A$ to $B$ and an edge from $B$ to $A$), and as such, may be useful for analyzing data. We give several applications of our approach, as well as possible advantages it has over the relational model. We also indicate some directions for further research.

This is what happens when you try to take the existence of ORDER BY and COUNT in SQL seriously. :-)

If you're puzzled by how a geometric idea like simplexes could show up here, remember that the algebraic view of simplicial sets is as presheaves on the category of finite total orders and order-preserving maps. Every finite sequence gives rise to a total order on its set of positions, and tables have rows and columns as sequences!

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

One puzzle

What is your interest in this paper and how did you stumble upon it? ;-)

This approach may lead to
improvements in query optimization because one can adjust the \shape" of the
schema to t with the purposes of the queries to be taken. The ability to visualize
data should also prove useful, because these visualizations seem to \make sense"
in practice.

We will
also present the \global table" construction, which roughly takes a database and
joins everything together to make one big (unnormalized!) table.

These are some very interesting ideas, and I like their ideas on visualization. I've been hitting around a lot of thoughts this paper seems to be discussing. There is a lot of boilerplate code I want to remove, but haven't yet figured out how. They also cover schema migrations and duplicate rows.

One nit:

This is what happens when you try to take the existence of ORDER BY and COUNT in SQL seriously. :-)

I am not sure what you mean by that.

"Global Table" = Universal Relation

The One Big Table of everything has historically been called the "Universal Relation." A lot of stuff was published on this idea in the late 1970s and early 1980s. Jeffry Ullman's name comes up a lot. As Alberto Mendelzon says, "The idea seems to be rediscovered every few years."

Yes...

Sorry, the one lookup table was not what I actually found "interesting". Instead, I'm looking for complete mathematical models.

My interest here is in model management.

For example, for schema migration and change data capture, I've been reading some papers from MS Research and other places. But reading all this can be quite tedious when what I really want is a complete uniform model for the problems I want to disappear. Currently, each paper has its own model for these things and they're studying an acute problem. A category theoretic approach might help 'simplify' things. if only giving me a mathematical 'framework' to guide my thinking.

Hope that helps.

You might find this amusing

From Ullman's teaching slides from his DB course:

The idea keeps getting reinvented about 3 times a year somewhere in the world.
- So let's see it once and for all, and then if you ever need it you can just implement it rather than reinventing it.

I've never known so many people reinvent this wheel. Neat.

For those who are unfamiliar with database systems' theory, please see: Consequences of assuming a universal relation by William Kent (acm portals account required).

By the way, I was curious why Ullman's theory about reinvention might be true. Apparently, the phrase "universal relation assumption" was coined in 1979 by William Gerwitz at Bell Laboratories, but only one citation of his memo was ever made. Likewise, very few have cited other descriptions of the problem. I guess in Malcolm Gladwell's terminology, the problem never reached a "tipping point" where it became widely known. The "who cited this" feature is really cool!

no longer says

Unfortunately, Alberto died a few years ago.

Mikhail M. Gilula's "Set Model"?

Is this related at all to Gilula's order-aware Set Model for Database and Information Systems?

Not convinced

"query equivalences are trivially verified when one phrases them in categorical language"

Well a half page of the paper dedicated to the subject failed to convince me. First, there are several subtleties, for example projection and cross product (or natural join) don't commute: one has to play with projection attributes to make it appear to be so. Then, there are rewrite rules for renaming operation
http://en.wikipedia.org/wiki/Relational_algebra#Basic_rename_properties

Updated with some links

Edward Yang reviewed David's tech talk Databases are categories (vimeo).