An interview with Chris Date

In this interview, Date discusses his new book, Database in Depth: Relational Theory for Practitioners. The book does look interesting and is on my list to acquire at some point. The Third Manifesto crowd can be a bit bellicose at times (at least for my tastes), but I can't say that I disagree with many of their assessments.

The underlying model (relational in this case) is of interest, but more important is the manner in which it manifests itself within the language chosen for communication. SQL has some advantages as a language that I think extend beyond the realm of being ubiquitous. It primarily serves as a bridge between the programming languages that we choose and the databases that we are stuck with. One could naturally ask why have a separate language in the first place (Object Databases tend to go this route)? And why do we need either SQL or D, when we have much more universal programming languages at our disposal? If one surmises that a database language is needed but that it should somehow be crippled in expressiveness (e.g. not Turing complete, not imperative, not...), then one is forced to answer the question of where to draw the line.

That said, what I am currently wondering is whether there is an implementation of the D language besides Dataphor. I guess I'm spoiled by the myriad of unencumbered programming languages that are being actively developed, and I sometimes think that there is a cultural divide between those who implement programming languages and those who work with languages in the database market. Perhaps what we need is for an unassuming brilliance to appear out of seemingly nowhere and surprise us with an implementation of the language in Haskell or Scheme. Any more Autrijuses out there? :-)

Implementation of D

My google-fu is quite weak. :-)

Thanks. This gives me a place to jump in when I'm ready. I'm toying with the idea of using the D grammar when I start in earnest on TAPL - inspired in part by Autrijus' implementation of Pugs during his reading. Unlike many here on LtU, I don't really have an ambition to create my own language, but I do use databases "a lot" in my line of work.

On a purely superficial basis, I can't say that I'm overly impressed with the elegance of D, but I'm hoping that this first impression turns out to be unwarranted. I've worked with too many embedded languages that are frustrating.

HaskellDB

HaskellDB isn't a D, but it is an embedded relational query language. It compiles to (a relatively clean subset of) SQL, but could perhaps be retargeted to a different language. Worth a look while you're looking around.

Because the relational model is implementation-independent, it follows that any query language that uses only terms that are meaningful within that model will not be able to express implementation-specific details, in particular performance tuning details. This leaves the problem of how and where to tune the actual behaviour of the engine that fetches and arranges physical data. To what degree can it be "self-tuning"? Is the problem simply an artifact of poor implementations, or is it more fundamental than that?

Date and his fellow debunkers have generally referred questions of this kind to the proprietary, patent-encumbered and yet-to-be-unveiled TransRelational Model (TM), which they assert provides a model of physical storage behaviour that renders such concerns effectively moot. Prior to this apocalypse, questions about how the gap between model and implementation is to be effectively bridged still need answering - and I don't think D has the answers.

TransRelational Model(TM) - Is It Real?

I regularly visit Date's site, but he charges for anything of significant interest. The pithy content is served up in PDFs only after you pay with PayPal et al. And IMO it usually isn't worth the cost.

AFAICT Date is not nearly as sharp as Codd, nor as clear a writer. He is, however, admirably consistent and persistent in defending the principles of relational databases. Unfortunately he appears of late to have become much more financially opportunistic. Consider the earlier publication of the (very thin) What, Not How which, although offered on Amazon, was apparently primarily distributed as a free promotional flyer by business rules vendors to hype their products.

Date appears to have abandoned academics and is instead concentrating on making money. He refers to the fact that Codd didn't make any/much money from his relational model. Date may be concerned he won't make any money either. He's like the Chinese kung-fu master: pretend you know something no one else does and extract a price for your expertise, but never tell everything.

The "TransRelational Model(TM)" is just that, a trademark. The technical details are undoubtedly a rehash of existing ideas. Date claims that he can't reveal information because of pending patent applications. They can probably get a patent but that doesn't mean much these days. How many academics would handle significant new knowledge this way? Instead they would race to be the first to publish. There is nothing wrong with making money, but I don't buy into the secret technology aspect.

And why not Prolog as an alternative to D?

Can't really comment about his aptitude.

Do know that he knows a lot more about databases than I'll ever hope to know (which strictly speaking is not saying much). :-)

Earlier I alluded to a cultural divide between the PL people and the Database people. And I just think the stuff you talk about is indicative of the kind of gulf that exists. Database designers tend to hold their knowledge and works closer to the vest, being more protective. PL designers tend to openly share their works.

In the PL world, there's not a lot of money to be had by trying to charge for a programming language as a product. You might be able to sell support tools and environments, but making a language definition proprietary is a death sentence. You have to get enough people to use and write libraries for a language before it can gain traction. Given the plethora of openly available languages, the supply of languages easily exceeds the demand. Smart people like Alan Kay and Guido van Rossum make a pretty good living, but they also have to job-hop periodically, as their research money is always subject to belt-tightening measures.

In the database world, the economy is very much different. Consumers have far fewer choices available. PostgreSQL has made a valiant effort, but it has been a plodding path they have chosen. In this environment of oligopoly, the language is used as leverage for lock-in. SQL may pretend to be standard across the vendors, but the incentives for coming to a more universal standard, much less implementing the standards that have already been agreed upon, are minimal. Yes, the vendors have to make some pretense of compatibility - but just enough to get you to buy into their product.

Anyhow, I think it's important not to judge Date on the basis of what we are spoiled by in our world, where information is so freely exchanged. What I personally would rather see is the language issues taken away from the database vendors, letting them fight it out on the grounds of proprietary algorithms and implementation details.

Prolog

And why not Prolog as an alternative to D?

Prolog offers an enlightening comparison, I think, being a declarative language that suffers in similar ways from the inherent difficulty of finding a generic strategy for determining the optimal "how" for the "what" of a programmer's specification. Prolog offers a concrete example of how a change in implementation can improve performance in a particular range of cases, even though the general case is in principle unoptimizable-for. It may be that something like the TransRelational Model (DMCA) offers an improvement of this order on existing implementations. Really we'll have to wait and see; personally, I'm not holding my breath, although I'm willing to have it taken away when the clouds part and the finished product appears, shining in the heavens.

Peter van Roy could tell you all about this, having found concrete ways of improving on the shortcomings of the Warren Abstract Machine and producing better Prolog implementations (see Can Logic Programming Execute as Fast as Imperative Programming? - which I admit I've only skimmed myself).

Not again

So here goes Mr. Date again with more rhetoric to show that he is right no matter what anyone else in the industry says. Now he's confusing mathematics, science, and engineering. Mathematical elegance doesn't necessarily result in a true scientific theory nor in practical engineering.

Yes, again

He is right, no matter what anyone else in the industry says. This has nothing to do with mathematical "elegance"; it has to do with the proven properties of the relational model. The industry pays lip service to the benefits of these properties without providing education, tool support, or insight into how to apply the relational model. Thus it is up to cantankerous academics to constantly point this out.

I think I understand his combativeness

But the weaknesses of SQL and the current batch of databases are not of as much interest as seeing how an alternative language plays out. I haven't met many people who think that SQL is not one kludge after another. In spite of its warts, though, it has proven its usefulness. Reminds me of the discussions that point out the obvious flaws in the "popular" programming languages like C++ and Java. Making a case against SQL is tantamount to shooting fish in a barrel.

What I'm more interested in is issues such as whether a database language should be a DSL; whether it should have different layers like SQL (a data definition language, a query language, a procedure language, etc.) or whether it should be a more general-purpose language that blends all these things together. After all, should the act of asking a database a question require an act of "programming"?

And more important than all that is how a database language interfaces with the other application languages that we write major amounts of code in. SQL does have the quality of being hard to interface with almost all current programming languages, which has the odd benefit that it is agnostic when it comes to the choice of programming language.

It is these types of questions that interest me more than one of purity and the observed hideousness of SQL.

Code is a liability

And more important than all that is how a database language interfaces with the other application languages that we write major amounts of code in. SQL does have the quality of being hard to interface with almost all current programming languages, which has the odd benefit that it is agnostic when it comes to the choice of programming language.

My point of view is that if your application uses an RDBMS nontrivially, there should be very little code outside of this RDBMS that is specific to the data and application. "Business objects" modeled in e.g. Java are nearly always inappropriate to data-centric applications. There shouldn't be "major amounts of code" outside the database that needs to interface with the database.

The downside of this is yes, current RDBMSes are particularly annoying to program. The best RDBMS I have found from a programming standpoint is PostgreSQL, with its support for common high level languages like Perl, Python, and Java as stored procedure languages.

relational language

I originally came to LTU because I was trying to figure out how the relational model and programming languages can be combined (I later learned that I know almost nothing about PLT or the relational model).

It looks like list comprehension is a very good first step towards making 'relational' a first class object in languages (to me, list comprehension = selection, and in some cases LC = joins). Having some sort of extensible records would add projection. I expect group bys, etc. could be added as well in a very clean way.
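
To make that concrete, here is a minimal Haskell sketch of "comprehension = selection/join" (the Employee/Dept types and field names are invented for illustration, not taken from any particular system):

   -- Selection and join expressed as plain list comprehensions.
   data Employee = Employee { empName :: String, empDept :: Int, salary :: Int }
   data Dept     = Dept     { deptId :: Int, deptName :: String }

   -- Selection: roughly SELECT * FROM employees WHERE salary > 50000
   highPaid :: [Employee] -> [Employee]
   highPaid emps = [ e | e <- emps, salary e > 50000 ]

   -- Join plus projection: employees paired with their department names
   empDeptNames :: [Employee] -> [Dept] -> [(String, String)]
   empDeptNames emps depts =
     [ (empName e, deptName d) | e <- emps, d <- depts, empDept e == deptId d ]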

'Order by' isn't considered a valid relational operator (because it involves assuming an ordered relation), but if it were, powerful analysis could be done on things like financial/network time series (also, I believe, genetics). Dennis Shasha extends SQL beautifully in this paper. While he is extending SQL, obviously a language with built-in relational operators could be extended as well.

Date et al. write a lot about DBMSs, but step away from databases for a moment and imagine a GUI which is aware of the relational model (in fact, just imagine an html table which understands the RM): the GUI could be 'linked' to a data structure (model-view-controller style)... it would have access to the data, its types, as well as its 'name' (since a 'record' in a database is supposed to have a 'column' name and data type defined). The GUI (or our extended html table) could have the same basic functionality as what's provided in most DBMSs, such as sort, group by, aggregate, etc. All of a sudden, a huge market for expensive OLAP/Data Mining/DSS/etc. is rendered obsolete because much of their basic-level functionality is automatically shipped with each browser!

A large part of DB2Web development (I did that for the first two years of my career) could just be a matter of a DBA populating his database and a couple of lines of code to link the html page to the database...perhaps a designer to make the page look nice. If such a system existed, I would not have had a job ;)

Are you familiar with Kleisli?

Kleisli is a database query language based on list comprehensions.

You've got a great...

...intuition!

The best resource on how to marry DBs and PLs that I know of is Torsten Grust's PhD thesis, Comprehending Queries. It's quite heavy on the category theory, though you can still get a lot by skimming over that.

In it, he shows how queries can be interpreted using comprehension syntax, and then uses this to interpret what queries look like in the underlying monadic semantics. This is really interesting, because it turns out that the relational calculus is a kind of list monad with some additional equational structure (to turn lists into sets), and that you can use the fusion laws for list combinators like map and fold to optimize the queries. Furthermore, it lets you see how we can let programmers write their own group functions (such as SUM) in a sound way.
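
As a rough illustration of that monadic reading (a sketch of the idea only, not Grust's actual formalism), a comprehension desugars into the list monad, and the fusion laws then act as equational rewrites:

   -- [ f x | x <- xs, p x ]  desugared into the list monad:
   query :: (a -> b) -> (a -> Bool) -> [a] -> [b]
   query f p xs = xs >>= \x -> if p x then [f x] else []

   -- ...which is equivalent to  map f (filter p xs).  Fusion laws such as
   --   map f . map g = map (f . g)
   -- let an optimizer collapse pipelines of such combinators; adding
   -- equations like duplicate elimination on top of the same structure is
   -- what turns the list reading into the set (relational) reading.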

Furthermore, it makes it clearer to me how we can extend databases to store data that can't be compared for equality in a principled way, such as functions (for which judging equality is undecidable) -- we can move from treating data sources as sets to treating them as multisets. Likewise for time series data (which are lists rather than multisets). This principled-ness can also let us figure out how to make all of these different data sources play together nicely.

comprehending queries

I have a PDF copy of this thesis on my computer... but it is not an easy read! I'll read it again, keeping in mind the summary you provided. Thanks!

Great introduction

I guess this thesis is going to lose me at around the point where they start seriously hitting the Category Theory, but the prelude is fantastic. Really good writing - lights up the mental switchboard in all sorts of ways and makes me want to understand the rest.

equality

I'm not sure how this is related, but higher order unification allows storage and querying of functions (or predicates/relations) in a principled way. F-Logic restricts higher order unification such that it is decidable and Lambda Prolog just lets the algorithm go out to lunch on some unifications.

I myself haven't found the fact that higher order unification is undecidable in general to be a big problem, much in the same way that it isn't a terrible problem that most compilers let me write recursively defined functions that don't terminate. Your mileage might vary.

----------------------------------------------------
If people never did silly things nothing intelligent would ever get done.
--Ludwig Wittgenstein

Sloppy definitions

I just looked through the first part of that thesis. He manages to botch something as basic as the definitions of product and coproduct (he doesn't require uniqueness of the "connecting" morphism), so I suggest category theory newbies read this side by side with a more carefully written introduction to category theory.

Cantankerousness is optional

He is right, no matter what anyone else in the industry says. [...] it has to do with the proven properties of the relational model

E.g., he says that the prescriptions are not negotiable. Let's see.

A choice of set (as opposed to multiset, list, or other "monadic shape") as the base for the definition of "relation" looks pretty arbitrary. By that I don't mean that a relation is not a set, just that relations, if limited to sets, should not be the only option.

I do understand the value of idempotence, commutativity, and associativity of sets for efficient implementations, but then again, Date said their overriding concern in TTM was with an abstract model, not with matters of implementation (and who said the users cannot use sets when they really want them and get the same great performance, even if the more general model is available?).

Read the basics...

"A Relational Model of Data for Large Shared Data Banks" explains:

The term relation is used here in its accepted mathematical sense. Given sets S1, S2, ... , Sn (not necessarily distinct), R is a relation on these n sets if it is a set of n-tuples each of which has its first element from S1, its second element from S2, and so on.
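
In symbols, this is just the standard textbook definition: R is a relation on S1, ..., Sn iff R ⊆ S1 × S2 × ... × Sn.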

Mathworld also reads: "A relation is any subset of a Cartesian product." So the definition of "relation" from Date's material is anything but arbitrary.

The fact that the relational model deals in sets is critical to data integrity. The possibility of anomalies in database updates is only eliminated when relations are in domain-key normal form.

It makes me sad that LtU readers, of all people, think that this stuff is just "arbitrary." Why do you care if duplicates are allowed, anyway? Saying something twice doesn't make it more true.

Sorry if I was too offensive

Read the basics...

If you mean the definition of relation, then I've read it (as, I believe, has anyone here) in school. It was not in English, but mathematical lexicons in different languages tend to be almost isomorphic to each other.

I might have chosen the wrong wording, but I tried to convey that I do not question the legitimacy of the accepted definition of relation in general, only the choice of this concept as the only blessed way to model data.

Are sets more scientific because they have been studied longer than, say, categories? Is set theory the only possible way to axiomatize (part of) mathematics? Is there such a thing as the set theory? I am not even asking social and economic questions.

As some people already mentioned, the database community tends to hide its research. That might have its justifications, but I belong to a different culture, and unless I see proofs (not necessarily formal), I dismiss pompous claims as not convincing.

Why do you care if duplicates are allowed, anyway? Saying something twice doesn't make it more true.

Sure, so I will refrain from repeating the links already given in this thread to papers describing unified queries across sets, multisets, and lists, and use cases for them.

[on edit: from my experience, I wish (R)DBMSes supported the temporal aspect in a direct way - think standardizing log files and making them accessible via SQL (or D). I would be grateful if anybody pointed me to a publicly available paper describing a temporal relational model approved by the RM gurus. I know I can model a history of changes on top of the RM, but I strongly believe that for many applications it should be done at the system level.]

It's not about queries

It's about updates.

Sometimes less is more

Chris asks, "One could naturally ask the question of why have a seperate language in the first place (Object Databases tend to go this route)? And why do we need either SQL or D, when we have much more universal programming languages available at our disposal? If one surmizes that a database language is needed but that it should somehow be crippled in expressiveness (e.g. not turing complete, not imperative, not....), then one is forced to answer the question of where to draw the line." The fact that Codd's relational algebra, and SQL, lack Turing completeness makes it possible for the RDBMS's optimizer to do a thorough job.

Determining what to leave out...

...is actually more important than what to include. I fall on the side of not making the query language a full-fledged programming language. At its heart, SQL is not a PL, but rather a Command or Shell Language. You start out with a simple syntax of:

   CMD opts

From there you go to the Big 4 commands. The structure of the options for these 4 is similar but inconsistent.

   SELECT opts
   INSERT opts
   UPDATE opts
   DELETE opts

And to make these commands workable you have to break the options lists into clauses for the various options. The structure of the options for the subclauses is not necessarily consistent within a given command.

   SELECT opts FROM opts WHERE opts GROUP BY opts HAVING opts ORDER BY opts

Well, now that we have a way to manipulate data, you also need a series of commands to establish the storage and the relations:

   CREATE TABLE opts
   ALTER TABLE opts
   DROP TABLE opts

Now that we're in the domain of database administration, the number of actual and possible commands starts to grow in magnitude. As the level of complexity increases, we start to decide that we'd better package these things up into some sort of modules:

   CREATE PROCEDURE userproc(params) body
   CREATE VIEW userview([params]) body

Well, now that we have modules, one begins to wonder whether these modules can have local variables and state. Then we decide that conditional execution of commands within the module would be a really useful thing. And why not add branches and loops, since we're already 99% there. And before you know it, we've stumbled into an ad hoc programming language that was not thought out particularly well (PL/SQL and T-SQL are the PLs that I have the most antipathy towards).

What starts out as a simple language for declaring tuples and their relations ends up being extended in ways that result in major contortions. Now, Date and company might point out that some of this distortion is the result of not abiding by the relational model. But I hardly think all of SQL's grotesqueness can be blamed on this source. At its heart, it's a language design and modularity issue: how do we design a language for constructing relations, querying the data, modularizing the components, and interfacing seamlessly with other programming languages?

And back to your original statement. SQL means different things to different people. Some see it purely in terms of the query aspect (disliking it because it does not provide for stateful loops). Some dislike it for its inconsistent command and clause layout. Some dislike it for the languages used to define stored procedures (which, above all else, is the most inconsistent implementation detail between the vendors). Finally, some dislike it because it has NULLs and permits non-normalized tuples.

For myself, I think that database experts need to learn how to design languages that are a bit more elegant. Define exactly what the language should and should not do. Define the purposes of the various commands, segregating them along the lines of Definition, Query and Procedures - but keep them consistent in syntax. And if you're aiming to fix what's broken with the current slate of databases, go ahead and fix the problems with the underlying languages.

Oh well, enuf rant for now. Perhaps the exposure to the Debunking spell is wearing off. :-)

It gets worse

How do we design a language for constructing relations, querying the data, modularizing the components, and interfacing seamlessly with other programming languages?

Now imagine you're working with a True Relational DataBase Management System, which supports arbitrary ADTs as per the Third Manifesto. You have to be able to express the representations of arbitrary types, together with all of the operations that can be performed with values of those types. A complete implementation of Scheme should do the trick...

Premature optimization is the root of all SQL

By definition, a stronger adherence to the relational model requires a language that is uniform in expression across the model, independent of the particulars of a given data set.

However, when it comes to performance and implementation, two different data sets conforming to the same model can be radically different.

Since the biggest demand on SQL implementers is that their product perform well on the particular data sets that occur frequently in large business databases, any sane SQL implementer limits, hacks or bends the relational model to deliver what their users want: a reliably performing repository for lots of data that specializes in commonly occurring patterns of data.

Though middle-level optimizers do exist, creating optimizers that work acceptably for all data sets is probably NP-complete. (Maybe even impossible in principle.)

So unless performance on THEIR data sets suddenly becomes irrelevant for database users, it is unlikely that D or any other "pure" RM language is going to suddenly take over the DB world, pace Date and his fellow grumps.

(Also, if D or Dataphor was so great, someone would have used it to make a fortune at a place like Google by now.)

(Damn, got sucked in again... ;-) )

The Universal Optimizer (patent pending)

Though middle-level optimizers do exist, creating optimizers that work acceptably for all data sets is probably NP-complete. (Maybe even impossible in principle.)

I wonder how that might be demonstrated, and what the consequences might be of having demonstrated it.

I mean, I suspect it to be true, and have an instinctively incredulous response towards the claims made for the TransRelational (TM) model based on those suspicions - but it would be interesting to see a more rigorous treatment of the problem. In particular, a clear statement of the difficulties involved might enable one to design a better language for describing optimisations (or providing declarative "hints" to an optimising engine).

Where would one start? Are there any papers online that deal with this topic?

Optimizing the optimizing optimizer for optimization

I wonder how that might be demonstrated, and what the consequences might be of having demonstrated it.

Well, the latter quandary is easily answered: Date et al. would have to shut the heck up, and we wouldn't need to discuss them seriously here anymore. ;-)

I think the former problem has two levels: the definition and parameters of optimization, and the performance bounds of a given optimizer. Again, the second of these is more straightforward: for any given definition of "optimized" and an algorithm (or family of algorithms) to perform that optimization, construct a relationally kosher example that would have exponential bounds.

I think it would be harder to define exactly what "optimized" means. What most DB users mean by optimized is: whatever "reasonable" thing I want to do with my "reasonable" data, performance should be "acceptable".

That's a pretty fuzzy, all-encompassing notion, and coming up with a formalization that lived up to the informal notion while unambiguously delimiting the problem would be quite a challenge, I think.

Are there any papers online that deal with this topic?

The DB optimization literature I've seen has, unsurprisingly, been quite focused on particulars. A quick search of CiteSeer was even less promising than I had hoped. I think the fundamental problem outlined above, of defining a satisfactory, comprehensive notion of "optimized", makes it unlikely anyone has a satisfactory unified theory of DB optimization.

(I too would be interested to see one if I'm wrong, though.)

The Love of...

So unless performance on THEIR data sets suddenly becomes irrelevant for database users, it is unlikely that D or any other "pure" RM language is going to suddenly take over the DB world, pace Date and his fellow grumps.

Of course, speed matters a lot in my usage. But more important than just raw speed for an individual query are the problems associated with shared-state concurrency in the transactional model. Any query, no matter how fast it needs to be completed, has the possibility of locking out other users while it performs its duty. It gets to be a hassle when people running complex reports, which really don't need to be optimized, start interfering with real-time transactions.

(Also, if D or Dataphor was so great, someone would have used it to make a fortune at a place like Google by now.)

Well, perhaps they are covertly using it and just not telling anyone. Just like SRFI-49. :-)

(Damn, got sucked in again... ;-) )

Reeling in. Troll successful. And I didn't even have to use the bait that says that all good relational languages should have static typing. :-)

Purpose of query languages

Query languages like SQL seem to serve two purposes. The first is to let you express your query in a DSL that might be better suited than your application language (e.g. C). The second is to optimize communication into batch scripts with minimal IPC over a network.

But I've done all my database work with embedded databases in nice programming languages. The data model is hashtables and trees and the queries are written as normal application code. Have I side-stepped the problems that make SQL attractive to other people?
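
For what it's worth, here is a minimal Haskell sketch of that style (the names and types are invented for illustration): the "table" is an ordinary map and the "query" is ordinary application code.

   import qualified Data.Map as Map
   import           Data.Map (Map)

   type CustomerId = Int
   data Customer = Customer { custName :: String, custCountry :: String }

   type CustomerTable = Map CustomerId Customer

   -- Roughly: SELECT * FROM customers WHERE country = 'SE'
   swedishCustomers :: CustomerTable -> [Customer]
   swedishCustomers = filter ((== "SE") . custCountry) . Map.elems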

P.S. I was recently visiting a company who were having to modify the entrance to their building in order to fit their new Oracle database into the server room. Talk about hardware requirements!

Mnesia

Date would presumably say that Mnesia and its ilk were a regression to the failed database programming models of the past. Of course, if it works for you then one perhaps needs to ask, "failure on what terms?". It may be that the relational model was an improvement on other models done badly (after all, this is the excuse Date uses for the inadequacies of all actually existing implementations of the relational model), and that Erlang/Mnesia represents one way of doing them well.

Mnesia is essentially a relat

Mnesia essentially follows the relational model, except that in addition to integers, strings, etc., you can store what are essentially Lisp S-expressions, which is a hugely richer view of data.
I haven't read much of Date and Codd. I don't know that they would like S-Exps, but I feel that it's a huge win.

Mnesia is also lacking in some features. I'd like to see indexes allowing for secondary and tertiary columns. You can merge two columns together into a tuple stored in a single column... but you shouldn't have to change the schema for performance reasons.

A two-column index usually has a minimal performance disadvantage vs. a single-column index, but speeds up many queries that refer to both columns. You could merge two columns into a single column holding a tuple to get a similar effect... but you shouldn't ever have to change the schema to tweak the performance.

I haven't looked at the Mnesia source code yet.

Mnesia not relational

As far as the essentials go, Mnesia is not an RDBMS. In particular, it is not an extended RDBMS: there is nothing in the relational model as such that says that only simple scalar values are allowed.

If you stored S-Exps in an RDBMS, you'd need to have some operators in the query language to do things with them - in SQL, say:

   SELECT foo FROM bar WHERE CAR(CDR(foo))='baz'
or
   UPDATE baz SET foo = CONS(foo, 'baz') WHERE CAR(foo)='bing'

The trouble is that pretty soon you'd start needing to construct more powerful expressions, and before you knew it your well-behaved declarative query language (or SQL, as may be) would have expanded to include a full implementation of Scheme.

SchemeQL

before you knew it your well-behaved declarative query language (or SQL, as may be) would have expanded to include a full implementation of Scheme.

Horror of horrors! :-)

I disagree...

You don't need a full implementation of Scheme in your database system to manipulate S-exps. After all, there are *many* numerical things you cannot do on the server side with standard SQL... many sophisticated statistical analyses, for example.

In these cases, you use the database to retrieve the minimum that you need, analyze it on the client side, and then transfer it back. All that should be done on the server-side is stuff that's generally useful for reducing this flow of data without placing undue burden on the server's resources.

The whole point of database systems is that they are not Turing-complete. Sacrificing Turing-completeness grants greater ability to analyze and optimize.

Mnesia

Mnesia fully supports indexing extra columns. See e.g. mnesia:add_table_index/2. I agree this is an excellent feature. Today I needed to implement a prioritized-queue view of a table -- I added an indexed {Priority,Timestamp} column and I was done!

Ulf Wiger has written a relational querying library on top of Mnesia called rdbms. I don't think anybody uses it though, nor the Mnemosyne query language based on list comprehensions. We stick to the direct interface instead because it's simple and predictable.

I would like to see examples of relational data structures doing sexy things that would be hard on trees and hashtables. Relation Machine Lisp anyone?

P.S. Erlang's best feature is that there is an order defined over all terms -- compare a string to a tuple and you will get a predictable result. Hard to relate how Right this is -- much like NIL in Lisp.

Well, I was going to thank y

Well, I was going to thank you for informing me of highly useful undocumented functionality of Mnesia... but re-reading your post I'm not sure we are talking about the same thing. (The capital letters in your tuple bother me.)

I like Erlang, and I very nearly had a job in it. (Corporate mergers said otherwise.) I've toyed with Mnesia a bit.

I understand that a table can have multiple indexes. However, most queries are limited in their use of more than one index per table. Say you have a 'postalCode' column and a separate 'streetAddress' column. How do you quickly find a particular row given its street address and postal code?

For separate indexes on each column, start by finding how many rows you would hypothetically examine by looking up the postal code and searching for the street address. Compare that to looking up the street address and searching by postal code. Pick the smaller number. Probabilistically speaking, this is a big help on many data sets, but not all.

If there is a single index whose primary column is 'postalCode' and whose secondary column is 'streetAddress', then the database system can take you straight to the row you want, no probabilistic efficiencies involved. Moreover, the marginal cost of a two-column index is usually appreciably lower than the marginal cost of a second, separate index. The downside is that a single two-column index helps fewer kinds of queries than a pair of one-column indices.
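
A rough sketch of the same point in Haskell terms, treating Data.Map as the "index" (the types and names are invented): two single-column indexes each return a candidate list that still has to be scanned, while one composite index lands directly on the rows you want.

   import qualified Data.Map as Map
   import           Data.Map (Map)

   type PostalCode    = String
   type StreetAddress = String
   data Row = Row { postal :: PostalCode, street :: StreetAddress, payload :: String }

   -- Two single-column indexes: each lookup yields candidates to filter.
   type ByPostal = Map PostalCode    [Row]
   type ByStreet = Map StreetAddress [Row]

   -- One composite index: a single lookup finds exactly the matching rows.
   type ByBoth = Map (PostalCode, StreetAddress) [Row]

   findExact :: ByBoth -> PostalCode -> StreetAddress -> [Row]
   findExact ix pc sa = Map.findWithDefault [] (pc, sa) ix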

You say few people really use Mnemosyne. Interesting.

Indexing

I get you now. The nearest thing that occurs to me is using an {Address,PostCode} tuple for a key. This is efficient to look up by both values at once:
   mnesia:lookup({mytab, {MyAddr,MyPostCode}}).
and if the table is ordered (i.e. the index is a tree) it is also fast to look up by address only:
   mnesia:match_object({mytab, {MyAddr, '_'}}).
because Mnesia can use the index provided that the left-hand-side of the key to match is bound (not a wild card). But this would not help to quickly match all records by postal code.

Ulf Wiger has also done some experimenting with letting you compute table indexes using custom code.

P.S. The capital letters in my tuple are borrowing Erlang's variable syntax to say that {Foo,Bar} is a 2-tuple of a Foo followed by a Bar, both of which refer to informal datatypes.

You say few people really use Mnemosyne. Interesting.

I may be wrong of course. The Erlang world is a big place.

Somewhat speculative philosophy...

What do RDBMSs do that isn't so easy with hash tables and trees? What about Geographic Information Systems? PostgreSQL implemented spatial R-Tree indexes and a fair number of geometric algorithms years ago. MySQL 5.0 is following suit. Third-party add-ons to do GIS with Oracle have been available since the 80s.

Hash tables and trees are the bread and butter of standard RDBMSs. Perhaps comparing a query language based on the relational calculus to Mnesia proper is a bit like comparing pattern matching to selectors. They both get you to your destination, using similar routes. Pattern matching is a more abstract, and arguably more maintainable, path.

However, I *hate* the common SQL interface, where you create your own query string, send it off, and get back some ugly structure. In my opinion, the advantages that the relational calculus has over Mnesia are lost to such interfaces.

Using the relational calculus doesn't need to be this way. I've had jobs programming in so-called "fourth generation languages". They make interfacing with a database pleasant, even if they otherwise make programming painful. Who needs obsolete concepts such as procedures and function calls... that's sooo third generation.

More reading...

If you want to know what practical big-iron database folks are up to, there are lots of articles on the web by Ralph Kimball and Bill Inmon about data warehousing:

A Dimensional Modeling Manifesto
http://www.dbmsmag.com/9708d15.html

More articles by Kimball:
http://www.decisionone.co.uk/resources/KimballArticles.htm

Bill Inmon:
http://www.dmreview.com/authors/author_sub.cfm?AuthorID=30038

The world the data warehouse consultants live in sounds rather scary; they're the ones helping companies like Walmart extract trends from their mountains of data, which I'm not at all sure is a good thing. But on a practical level, they're definitely focused on what real customers are faced with when they have terabytes of data to query.
I don't see how Date's theorizing intersects with their world at all.

Apples and oranges

According to Kimball's own DM manifesto:

Dimensional modeling (DM) is the name of a logical design technique often used for data warehouses. It is different from, and contrasts with, entity-relation modeling (ER). This article points out the many differences between the two techniques and draws a line in the sand. DM is the only viable technique for databases that are designed to support end-user queries in a data warehouse. ER is very useful for the transaction capture and the data administration phases of constructing a data warehouse, but it should be avoided for end-user delivery. [emph. added]

In other words, according to a leading proponent of data warehousing, data warehouses don't have anything to do with data management and data integrity. Which is precisely what Chris Date, Fabian Pascal, and Hugh Darwen are saying as well.

Yes, but...

Yes, they're apples and oranges, but no, they're not saying the same thing. For example, Date says in the interview:

"I think the right way to view things like OLAP and data mining is as special-purpose applications that run on top of the DBMS."

According to the data warehouse folks, this is a very foolish thing to do. A transactional database just isn't designed for OLAP queries, for a variety of reasons. The data needs to be extracted, cleaned, and loaded into a separate database for data warehouse queries. Also, the data lives on in the data warehouse after the data has been modified or deleted in the transactional database, so the data warehouse becomes the data repository for historical data - it's not just an application.

Date doesn't say you shouldn'

Date doesn't say you shouldn't cache the data in some other place for the purposes of querying; he just implies that the relational DBMS should be used to establish the data of record.

The important point with data warehousing is that there's no database management going on; hence DW features do not make a piece of software a better database management system. Data warehouse people are far too excited about glorified greps on text files. That kind of querying is useful for a lot of things, but it's not database management -- it's fast searching.

That's rather dismissive

You're essentially defining "database management" to mean only transaction processing, which is a rather narrow definition. A read-mostly database is still a database, and a large part of working with databases is running queries quickly. I suppose in a way all search is "glorified grep" but a lot of database folks (and people at places like Google) would disagree with you.

Take it as a compliment

You're essentially defining "database management" to mean only transaction processing, which is a rather narrow definition.

What I mean by database management is ensuring data integrity (which sometimes entails encapsulating updates in transactions). Data warehouses are not designed for that; thus they are not a form of database management.

Don't get me wrong, querying and optimizing data sets for fast search is a great way to get more results -- literally -- but it's not a replacement for database management. Some domains deal in facts, rather than hints and allegations, and the relational model is the most effective way to ensure that these facts remain consistent within a software system.

truth versus consistency

While consistency-checking is important, I think it's a major error to casually assume that a database that enforces consistency more thoroughly deals in "facts" while one that doesn't deals in "hints and allegations".

A simple example might be a system that makes a phone number a required field in a certain format. If the customers don't have a good reason for providing an accurate phone number, many people will make something up, resulting in more inaccurate data than if they were allowed to leave the field blank. And people whose phone number doesn't fit (such as if it requires an extension) will have to do something nonstandard to get it in, such as putting part of the phone number in a comment field.

Another example might be a system that downloads pages from the Internet and requires each page to have well-formed HTML. Either rejecting bad pages or restructuring them to conform will distort the original data.

A database can be considered as a way of recording observations. Consistency-checking means that some observations are rejected or distorted to fit into the system. Many errors are caught this way, especially when data is entered manually, so often the trade-off is worth it. However, it's important to remember that the real world often doesn't conform to your model, and that a more consistent database isn't necessarily a more truthful one; in many cases, a database with fewer consistency rules that stores data closer to its original format can be more accurate than one that's more highly normalized and consistent. (Though it'll be noisier and harder to analyze.)

On the other hand, sometimes a database's role is to enforce an organization's policy. For example, a user registration system can force people to pick unique ID's and pick good passwords. But that's because the database is acting as an enforcer, not just a passive record of observations.

C2 Wiki

Consistency-checking means that some observations are rejected or distorted to fit into the system.

Similar ideas were discussed on C2:
AutoKeysVersusDomainKeys and linked pages.

Other implementations

That said, what I am currently wondering is whether there is an implementation of the D language besides Dataphor.

C2 cites at least four other implementations in TutorialDee, though I didn't check whether they are still alive.

We are not restricted to pure

We are not restricted to pure speculation on what would happen if better open source language designs for DBMSes were available; we have some past history to go on. The reason PostgreSQL has that "QL" at the end is that it's the result of taking the original Postgres, ripping out its query language POSTQUEL, and replacing it with the objectively less powerful and harder-to-use SQL. The folks who did that, and the current designers, saw and see no reason to maintain POSTQUEL at all.

Recently, as well, in the closed-source world, even Ingres has seen fit to remove QUEL from their commercial product. (They started discouraging its use in favour of SQL many years ago.)

The reasons behind this are open to discussion, but that other, better languages have been dropped in favour of SQL, both in the closed- and open-source world, is an established fact.

I personally don't find this surprising, and I think the trend will turn around in the next couple of decades. After all, it took more than twenty years for the general programming community to accept something as simple as garbage collection. This stuff moves slowly.

I predict that eventually the relational model will embed itself into all programming languages, replacing the essentially network model that we use for internal data structures these days. You'll write your own routines to navigate pointers, hashes, arrays and linked lists about as often as you currently write your own sort routine from scratch.

Separation of Interface and Implementation

One thing that modern OO languages have done a pretty good job at is the separation of interface and implementation for simple things. Collections are a good example: I'll happily write up collection-using code without even worrying about how the underlying stuff is implemented, and throw in whatever implementation is handy. If efficiency concerns come up, I'll replace that implementation with something better, either something I find in a library or, should the situation warrant it, something I write myself.

I think one of the great failures of current DBMSes is that they give you very limited means of doing this. I really ought to be able to describe in detail the physical storage and access methods for data when the default implementations are going to be slow; I ought not have to change my logical model in order to specify a different physical model. This is where I find the advocates of "denormalizing for performance" so annoying; they agree that the denormalization is not good, but seem unable to see that there could exist a world where one could specify that things be stored physically the way they wish, yet from the point of view of the logical model it all remains normalized.

No such thing in most OO languages

One thing that modern OO languages have done a pretty good job at is the separation of interface and implementation for simple things.

Mh, I must either have a different understanding of interface/implementation separation, or I don't know those modern OO languages you're referring to, but in my book mainstream ones like C++, Java, and the like do a rather terrible job at separating interface from implementation. In fact, they don't do any job at all: there is no way to describe the "interface" of a class without giving its definition (= implementation). I.e., in Java method definitions have to go into the class definition, and C++ at least requires you to declare private attributes and methods there.

(You can work around this by indirection through abstract classes (or Java's so-called interfaces), but that is really just a work-around, and a tedious one. But alleged experts obviously feel the need to write whole papers and books about it and give it pompous names like "Dependency Inversion Principle".)

Your collection example suggests that you meant something like encapsulation (i.e. hiding of implementation). But from my understanding that is a different modularity feature. So I'm a bit confused about your post.

I have no problem with the wo

I have no problem with the word "encapsulation" if that's the one you like to use. From your comment, it appears that you understand what I'm trying to get at, here.

Agreed.

Somehow I missed this entire discussion, so as a relative latecomer, I agree with you almost exactly, and I've felt the same way for years.

Denormalization is a necessary thing for performance in the real world of data storage, but it's terrible for maintainability. Changing your schema requires changing all of your queries... and if you decide that those denormalizations were unwarranted after you've made some further changes, you have to back-port your changes to the old code, or change the new code again.

Schemas need to be separated into logical schemas and physical schemas. The logical schema would be created by somebody with both database and domain-specific expertise, and would be in Boyce-Codd normal form, or at least third normal form. The database would be responsible for translating a query on the logical schema into a query on the physical schema. You could let any hack go at the physical schema, because the physical schema is only capable of harming *performance*.

Admittedly, I'm not a database guru, but I am broadly familiar with the capabilities of current database products, and I'm not aware that any have anything like this. At best, you can (with considerably more work) approach this concept using PostgreSQL's views, rules, and other advanced features.

The entire point of RDBMSs is that you give up Turing-completeness so that you can perform more analyses and optimizations... and gain flexible and elegant access to structured data. But RDBMSs don't live up to this very well. Separating physical and logical schemas would be a huge step towards this goal.

Databases also need to have a richer view of data, and make it easy to plug in your own operations and indexes. I get frustrated when I have a particular operation I want to do efficiently, and I know that a particular kind of index would do this, but that index isn't available. Another example is the current craze over GIS applications... the R-Tree is not in most traditional databases, but it's becoming a lot more common.

Database access should also be pretty transparent in a programming language. Having to create your own SQL query strings is *tedious* and error prone. I did have a job one summer in college working on an Enterprise Resource Planning application in Progress, which is similar to PL/SQL. Database queries were nicely integrated into the language, but I can't say anything else positive. The language was crippled by lacking procedures and functions... under the theory that this was a "Fourth Generation Language" that had no need for these "Third Generation" concepts.

I didn't go back to that company, not only because of the language. It was a well-run software house compared to some places I've seen... they had a nicely set up development environment, used version control, and weren't constantly stumbling over themselves. However, I decided that their development costs grew quadratically with the number of customers. They make a dozen people a modest living, but they will never be anything more than that.

The relational algebra is more baroque than logic programming

In Date's description of the relational algebra we are left with a rather baroque programming language. Even simple recursive queries become excessively complicated to express due to the use of the natural join.

One of the simplest examples is that of the parent/ancestor relations.

In Prolog we can think of a table as the set of all valid substitutions. Ancestor can easily be written (assuming the parent predicate has been defined) as

ancestor(X,Y) :- parent(X,Y);parent(X,M),ancestor(M,Y).

Try this with tutorial A and you will find yourself with some rather difficult to read renames and projections that don't occur if we use by-order arguments rather than joining based on the name.

----------------------------------------------------
If people never did silly things nothing intelligent would ever get done.
--Ludwig Wittgenstein

TC is inexpressible in RA

You are confused -- it's impossible to express the parent/ancestor relation (transitive closure) in relational algebra, or in any first-order language.

Tutorial D, however, extends RA by introducing, among other things, a TC operator.
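
For the curious, here is a small Haskell sketch of what such a TC operator has to compute - iterate "compose and union" until a fixpoint is reached (plain Haskell over pairs, not Tutorial D):

   import Data.List (nub)

   type Rel a = [(a, a)]

   -- Relational composition of two binary relations.
   compose :: Eq a => Rel a -> Rel a -> Rel a
   compose r s = nub [ (x, z) | (x, y) <- r, (y', z) <- s, y == y' ]

   -- Transitive closure: keep adding composed pairs until nothing new appears.
   transitiveClosure :: Eq a => Rel a -> Rel a
   transitiveClosure r = go (nub r)
     where
       go acc
         | next == acc = acc              -- fixpoint reached
         | otherwise   = go next
         where next = nub (acc ++ compose acc r)

   -- e.g. transitiveClosure [(1,2),(2,3)] == [(1,2),(2,3),(1,3)]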