Relational database implementation paper?

I'm a developer on a large project, and I regularly have to deal with database performance and design issues. To solve them, our DBAs have to explain many internals of our database system to us, but they tend to do it from the database vendor's point of view (Oracle), as much of their knowledge comes from the books published by the vendor.

As a consequence, I often can't tell whether the limitations/constraints they explain to us are intrinsic to relational engines or specific to Oracle, and this makes technical discussions difficult and frustrating.

I've read Codd's papers and many of the free papers on dbdebunk.com to learn about relational theory and vendors' positions, and now I'm looking for a detailed paper or book that explains how to design/write a non-trivial relational database with the usual features (ACID, persistence ...) without being polluted by a vendor's point of view.

Until now, I have failed to find anything that matches this description: I have only found beginner-level documents, or open-source docs that only address specific technical problems.

I hope it's not too implementation-related for LtU and that someone can help me.


"Database System Implementation"

I'm no database implementation specialist, but I do have this book:

http://www.amazon.com/Database-System-Implementation-Hector-Garcia-Molina/dp/0130402648

It's fairly overview-level, but it does have some specifics on implementation techniques for transaction management, optimizations on the relational algebra, and optimized disk access.

Datalog

This reminds me of my first comment ever posted to LtU. I do recommend the book "Principles of Database and Knowledge-Base Systems" by Jeffrey Ullman, although I'm not the best-versed person in the literature of this field.

It has been a while since I've thought about database systems in depth, but I would still very much like to see a database system where you can specify a fully-normalized "logical" schema that defines the interface of a database, together with a mechanism for flexibly denormalizing the physical database in a way that preserves that logical interface and doesn't change the overall API.

I haven't tried very hard to develop this idea yet, though it's something that's been in the back of my mind for years. One possible pitfall is that denormalizations can change the semantics of the database: e.g. your logical database might, in principle, support many-to-one or many-to-many relationships, while a denormalized database might not.
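
To make that concrete, here is a toy sketch in plain SQL (all names hypothetical): the denormalized wide table is the physical storage, and the normalized logical schema is exposed as views over it. It also exhibits the pitfall: nothing in the physical table enforces that each customer_id has a single customer_name, so the many-to-one relationship the logical schema implies is no longer guaranteed.

    -- Physical storage: one wide, denormalized table.
    CREATE TABLE order_lines_wide (
        order_id      integer,
        customer_id   integer,
        customer_name text,      -- duplicated per order line
        product_id    integer,
        quantity      integer
    );

    -- Logical interface: the normalized schema, exposed as views.
    CREATE VIEW customers AS
        SELECT DISTINCT customer_id, customer_name
        FROM order_lines_wide;

    CREATE VIEW order_lines AS
        SELECT order_id, customer_id, product_id, quantity
        FROM order_lines_wide;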

I also think Happstack's approach to ACID semantics is interesting. Alex Jacobson has nice things to say about the value of persistent data structures and Haskell's Software Transactional Memory in his Bay Area FP presentation. Incidentally, Simon Peyton Jones himself has acknowledged the inspiration STM draws from the database community.

First-order Performance Issues

The first-order issue with database performance is what kind of indexes you have, and what kind of operations they support. The "big idea" behind GIS, for example, is spatial indexes based around R*-trees and friends.
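
For instance, here is a minimal sketch of a spatial index, assuming PostgreSQL with the PostGIS extension (table and column names are made up):

    CREATE TABLE places (
        id   serial PRIMARY KEY,
        name text,
        geom geometry(Point, 4326)
    );

    -- The GiST index (an R-tree variant) is what makes bounding-box
    -- queries fast.
    CREATE INDEX places_geom_idx ON places USING gist (geom);

    -- With the index, only candidate rectangles are visited; without
    -- it, every row would be scanned.
    SELECT name
    FROM places
    WHERE geom && ST_MakeEnvelope(-122.5, 37.7, -122.3, 37.8, 4326);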

I've seen plenty of horror stories when it comes to databases, usually rooted in the unshakable faith that a DBMS somehow magically makes every query fast. I've said it many times: programmers who understand approximately how their compilers work reap immense benefits, and that's doubly true for database systems.

Based on what you've said, and the fact that you brought up ACID semantics, I suspect that your team has moved beyond these first-order concerns, but it could be worth double-checking.

In one case involving a search engine for local networks, the database developer did all his searches using regular expressions. The problem is that regular expressions usually don't (and in many cases cannot) make good use of indexes. Yet he was convinced he couldn't possibly normalize his database by splitting up his tables, because joins on tables of a few hundred megabytes are "inefficient", ignoring the fact that the performance of a join that makes proper use of indexes depends far more on the number of rows returned than on the size of the tables.
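
To illustrate the contrast, a sketch with hypothetical tables (PostgreSQL-flavored syntax):

    CREATE INDEX documents_host_id_idx ON documents (host_id);

    -- Index-friendly: the engine probes the index once per matching
    -- host, so the cost tracks the rows returned, not the table size.
    SELECT d.url
    FROM hosts h
    JOIN documents d ON d.host_id = h.id
    WHERE h.name = 'intranet.example.com';

    -- Not index-friendly: a regular-expression match generally forces
    -- a scan of the whole table, however large it grows.
    SELECT url FROM documents WHERE url ~ 'intranet.*report';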

Another example is strikingly similar: a company specializing in scheduling, attendance tracking, and other software for K-12 schools did most of its searches using wildcards at the beginning of a LIKE condition. Since they couldn't "obviously" improve the speed of their software, they compensated for brute-force scans of their entire database by requiring their customers to purchase immense quantities of hardware, which was barely a solution anyway. As a taxpayer, and as somebody who cares deeply about public education, I found that philosophy extremely offensive.

Indeed, most online SQL tutorials don't touch on such concepts, and some top-ranked Google results even use wildcards at the beginning of a LIKE condition without the slightest mention of what an index is, why indexes are important, or why this particular usage isn't suitable for general, repeated use.
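
The difference is easy to show with a sketch (hypothetical "students" table):

    CREATE INDEX students_last_name_idx ON students (last_name);

    -- Index-friendly: a fixed prefix lets the engine seek into the
    -- B-tree and read only the matching range.
    SELECT * FROM students WHERE last_name LIKE 'Smi%';

    -- Not index-friendly: with a leading wildcard any row could
    -- match, so the engine falls back to scanning the whole table.
    SELECT * FROM students WHERE last_name LIKE '%son';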

Because of my familiarity with data structures, I'm aware of many other indexing schemes that are possible in principle but aren't usually provided by RDBMSes, as they aren't terribly useful in most cases. For example, LIKE conditions with wildcards at the front but a sufficient number of "hard" characters at the back can be efficiently supported by an ordinary string index stored in reverse-lexicographic order.

I suspect there is an indexing scheme for arbitrary substring searches, but I doubt there is one supporting arbitrary regular-expression searches. The first questions to ask are: what operations am I using? Do my databases have appropriate indexes in place to support them? And does my system actually make proper use of those indexes in these scenarios?
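
The last question can usually be answered by the system itself. A sketch of the check using PostgreSQL's EXPLAIN (Oracle and MySQL have their own EXPLAIN PLAN / EXPLAIN variants), against the hypothetical table above:

    EXPLAIN SELECT * FROM students WHERE last_name LIKE 'Smi%';
    -- Hope for something like: Index Scan using students_last_name_idx ...

    EXPLAIN SELECT * FROM students WHERE last_name LIKE '%son';
    -- Will typically show: Seq Scan on students ...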

All the performance issues you could need

You are right: our concerns lie further along. We currently use hints to tweak query plans and partitioning for the biggest tables; we had to learn about histograms for concurrency issues related to latches; and, last but not least, we use the reverse index you talk about.
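
For concreteness, an Oracle hint is a specially formatted comment that nudges the optimizer toward a particular plan. A small sketch with hypothetical table and index names:

    -- Ask the optimizer to use a specific index rather than whatever
    -- plan it would pick on its own.
    SELECT /*+ INDEX(t trades_counterparty_idx) */ *
    FROM trades t
    WHERE t.counterparty = :cp;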

I'm happy we don't need two-phase commit for the moment!

So I'm trying to build a clean model of a database engine, to understand how all of these things work together, instead of accumulating a set of tips and tricks.

RevLex Orderings

Heh, I did a bit of googling and saw that Oracle has them. I don't think MySQL or PostgreSQL does, though. (I prefer PostgreSQL.)

In some ways this isn't too surprising... while I am skeptical that reverse-lex orderings' support for special cases of LIKE conditions is terribly useful, graded reverse-lexicographic orderings are a rather important enhancement for Buchberger's algorithm. It looks like this idea is used more for formal purposes analogous to Buchberger's setting, and not so much for supporting LIKE.

Good reverse-lex ordering use-case

Reverse lexicographic indexes are perfect when writing a GUI frontend for manipulating certain data types: those that share the same long prefix but can easily be discriminated by their last characters. A lot of financial data has this shape.

So in the GUI, users can type "*XYZ" to retrieve all the data whose names end with XYZ, and the resulting query is efficient.

(Side-note hack: when you don't have reverse lexical indexes, you can use a straight index on a column containing the reversed value and write "select .. from .. where reversed_value like reverse('%XYZ')".)
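
Spelled out, a sketch of the hack (assuming PostgreSQL-flavored syntax and an engine that provides a reverse() string function; names are hypothetical):

    ALTER TABLE instruments ADD COLUMN reversed_name text;
    UPDATE instruments SET reversed_name = reverse(name);
    -- (In practice a trigger or the application keeps the column in sync.)

    CREATE INDEX instruments_reversed_name_idx
        ON instruments (reversed_name);

    -- "*XYZ" from the GUI becomes a prefix search on the reversed
    -- column, which an ordinary B-tree index handles well.
    SELECT * FROM instruments
    WHERE reversed_name LIKE reverse('%XYZ');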

Hmm...

RevLex orderings do indeed appear to be somewhat more commonly available than I thought. PostgreSQL supports them through "functional indexes", a more general mechanism, but MySQL apparently does not. I was somewhat inarticulate, in that I wasn't trying to dispute that they are useful sometimes, just not commonly so. Even so, it's nice to hear about cases where they are. :-)
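
For the record, a sketch of the PostgreSQL expression-index version (hypothetical names; the text_pattern_ops operator class makes LIKE prefix matching work under non-C collations):

    -- No extra column needed: the index stores reverse(name) directly.
    CREATE INDEX instruments_revname_idx
        ON instruments (reverse(name) text_pattern_ops);

    -- The WHERE clause must match the indexed expression for the
    -- planner to use it.
    SELECT * FROM instruments
    WHERE reverse(name) LIKE reverse('%XYZ');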

As for your hack, that might speed up string comparison, but how could it make good use of the index? My intuition is that you'd still be searching an entire table or join if that was the first condition selected for... or are you simply storing the value itself backwards? [edit: n/m my silliness, I understand now]

I seem to recall, vaguely, reading about an extensible RDBMS where people could, with relative ease, add their own indexes, operations, etc. Oracle hated the idea, and spread a lot of FUD about it, saying this would be a disaster for stability. I think, IIRC, that the company got bought out by Unisys, which still makes their software available, but I'm having trouble locating the name.

[edit: I meant this as a reply to Archiloque. oops.]

Some references

Hi, I don't know any one place to look, but there are a couple of things that are good to look at.

1. Jim Gray and Andreas Reuter's book "Transaction Processing". This is an extremely good book about everything to do with the implementation of transactional systems. Very systems-y in flavor.

2. Torsten Grust's PhD thesis, "Comprehending Queries". This thesis basically reconstructs many classical query optimization techniques in one unified framework based on equational reasoning. Very PL-ish in flavor.

3. SQLite is a relatively small open-source database engine. It's no Oracle, but the source code is relatively compact and readable.