Relational Model Considered Obsolete

See Letter to Editor of CACM January 2013 at http://bit.ly/WWXsDK

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Proofs vs. Tests

Is there an actor-based database that was shown to outperform modern relational systems on representative workloads?

Why actors?

SQL lacks performance because it has parallelism but no concurrency abstraction. Needed therefore are languages based on the Actor Model

I don't follow.

I also don't see how actors model is any better with respect to "conveniently expressing negation or disjunction" or avoiding "information loss". And it is unclear what you mean by 'inconsistency' in the relational model. (Do you mean inconsistent with a relational schema?)

I think relations will be valuable for a long time - perhaps even become more so, as fine-grained and first-class values.

Concurrency is expontentially faster than SQL

Actor oncurrency is exponentially faster than SQL parallelism.

hidden costs

You're assuming simple models -- there may be hidden costs from achieving them (e.g., it may be too expressive). Are you making unsubstantiated claims, or has someone tested them on representative workloads?

The new model is not yet commercialized :-(

The new model is not yet commercialized :-(

Actor concurrency may

Actor concurrency may achieve effective parallelism for some problems. It is not clear that it can do so for a data management context.

Also, relational is not SQL or transactional. Streaming relational can achieve much parallelism.

Relational Model includes SQL and transactions

Relational Model is defined to included SQL (based on Relational Algebra) and transactions so that SQL can run over a snapshot.

No, it doesn't and isn't.

No, it doesn't and isn't. SQL isn't even proper relational model (SQL tables are not formally relations, and are often a problematic approximation). A DBMS is not essential to the relational model.

Tables are treated as relations in the Relational Model

Tables are treated as relations in the Relational Model. For example, a table named Greek could have one column with two people (Socrates and Plato) representing the following relationships:

Greek[Socrates]
Greek[Plato]

Tables are an implemetation detail not part of the model

What you have described is not a relation. The relation is

<<Name>> Is a Greek

Just because someone treats tables as relations, doesn't mean that they are in fact relations in the relational model. They may be the implementation means to representing the logical relation. I have seen too many SQL databases over too many years that just store stuff and are in no way logically defined relations under the relational model.

If you want to start blasting away at the relational model you need to talk about the relational model and relational algebra and not conflate the rubbish implementations that are commercially labelled as relational but are only SQL based.

Relational Algebra is inexpressive

Relational Algebra has no convenient way to express negation or disjunction. For example, it can store Greek[Socrates] but cannot conveniently store the following:
* ¬Greek[Napoleon]
* Athenian[Socrates]∨Melian[Socrates].

I did not ask how relational

I did not ask how relational is deficient. I asked how actors model is better.

Inconsistency Robust Reasoning

See http://arxiv.org/abs/0812.4852

Paraconsistent logics are

Paraconsistent logics are useful for reasoning in open systems, certainly. They enable modular reasoning and control over propagation of inconsistency. Yet I do not see how actors model is essential, or even closely related, to their expression or implementation. The book-keeping and inference semantics of your Direct Logic seem to cross-cut an actors system. I.e. you cannot simply add an actor to your system to introduce a new fact.

So this question remains without a satisfactory answer: why actors?

Actors are the way to implement Inconsistency Robust Reasoning

Actors are the way to implement Inconsistency Robust Reasoning. See http://arxiv.org/abs/1008.1459

They are one way.

They are one way.

Considered Obsolete

Who cares about obsolesence? Anyone older than three in todays world who is not obsolete is just not trying very hard! I consider myself to be obsolete in at least a dozen major areas!

Obsolence of Relational Model is important

Obsolesce of Relational Model is important because it is holding back progress. We need a new model.

A good compromise

I like the new move to NoSQL that being said I think relational databases continue to be the best default choice around. Relational Databases IMHO represent a good compromise of performance vs. flexibility vs. consistency.

Things in this paper like "inconsistency non-robustness" are generally features not bugs. You want your data stores clean. Relational systems allow for crummy data where it is desired and clean data where it is desired.

As for "information loss" Oracle has preservation and building in preservation is easy.

As for disjunction. I might be misunderstanding the claim here but things like
NOT EXISTS <subquery> or <scalar-expression> NOT IN <subquery> seem like disjunction.

NoSQL is just key-value storage, i.e., no semantics

NoSQL is just key-value storage, i.e., it's just a graph and there is no semantics.

Inconsistency is the Norm

You may want all the medical information to be consistent, but it is clearly impossible. Physician notes are currently stored as uninterpreted strings in the Relational Model.

Can you walk me through how you would store physician notes?

e.g., in the Actor Model?

Parse physician notes using a specialized parser

The physician notes need to be parsed using a specialized parser, e.g, MI is Myocardial Infarction. Then use inconsistency-robust reasoning :-)

Well, you're not wrong.

Hi Carl,

I worked for a hospital billing and analysis start-up where we developed many cutting edge techniques for charge capture. We had by far the most extensive approach in the industry, where most competitors simply used naive Bayesian inference techniques. Part of that work included parsing op reports and so on. For example, looking for specific surgical utensils can detect inconsistencies between the bill of materials for the surgery and the actual work performed.

But, what exactly should such a tool look like, and who would use it? One group of users is hospital financial analysts, who specialize in charge capture. So, for example, they might search for "scalpel" but also want various stem words automatically included in their search. It is true that you don't get this with a blob of text, and that today's SQL DBMSes don't even support such parsing well or at all. Most in-database text search services use simple graphemes and don't allow the programmer to do inferences such as "MI"="Myocardial Infarction". So, the typical workaround is to use something like Apache SOLR and Lucene. And if you look at the design of such systems, it is true that the Actor Model better approximates the way programmers find solutions than simply classifying everything as a Relational Modeling problem.

That said, I ask that you pause and reconsider your design slightly. A specialized parser is at best an incomplete solution to structuring the data, and interactivity is the more important component. For example, if I am a charge capture analyst searching for ideas related to scalpels used in operations, I want the system to show me related searches other users have already made and what they learned doing those investigations. For example, if another analyst looked at a specific range of procedure codes in conjunction with scalpels, and what the net result of that investigation was.

Also, most areas of inconsistency I have seen are strictly avoidable. For example, it is not uncommon for a hospital to specialize their Charge Master in a way the billing system vendor did not intend. If they are not avoidable, they are detectable by simple machine learning or even simpler statistical techniques. For example, we found with relative ease an entire pharmacy in a hospital where the machine used to record drug dispensation was unplugged - they were simply the only unit in the hospital with no typical revenue that comes from such machines.

In the case of using Actor Model and inconsistency robust reasoning, I challenge you to explain to me the benefits of encoding such discrete inference rules vs. just using machine learning techniques like support vector machines to develop classifiers. Would love to learn something new. Further your MI=Myocardial Infarction example, please.

Just some constructive feedback.

Cheers,
Z-Bo

The problems in medical information are deep.

Hi Z-Bo,

Thanks for your thoughtful comment.

By a specialized parser, I mean one that understands the specialized language used in physician notes as well more typical natural language. Obviously, creating such a parser is a major undertaking, but considerable progress is being made.

Undoubtedly you are well aware that inconsistency is pervasive in medical information between doctors clinics, hospitals, insurance companies, medicare, mobile apps, etc. Contradictions are processed when they are found and inconsistencies cannot be eliminated. Triage dictates how and when it is desirable to seek out contradictions in medical information.

You might be interested in my recent Stanford colloquium. See http://HIST.carlhewitt.info

physician notes

Even if I were to ignore the contents of physician notes:

a) versioning and diffs, particular if notes are revised is critical.
b) time stamping for sure possibly also location and device stamping
c) all sorts of relationships exist between doctor, visit, treatment... those notes are associated with billable events and future billable events

____

More importantly I don't think it is a good idea to ignore the content of the physician notes. That's why modern EHR systems don't just treat the note as a blob of text but decompose them using standard MVC architectures to capture the notes in pieces as usable data.

SQL updates lose information.

SQL updates lose information.

Removed

Removed

No convenient way to store disjunction or quantificaton

The point is that there is no convenient way to store the following using the Relational Model:
* Athenian[Socrates]∨Melian[Socrates], i.e. Socrates is Athenian or Socrates is Melian
* Socrates∈(Athenians∪Melians), i.e. Socrates is a member of the set of Athenians and Melians
* ∀x∈Men → Moratal[x], i.e. All men are mortal.

Note: I revised the last 2 examples in light of the comment below

No convenient way to store disjunction or quantificaton?

R1 := <<Name>> is Athenian
R2 := <<Name>> is Melian

R3 := <<Name>> is (R1 UNION R2)

And have either Athenian[<<Name>>] or Melian[<<Name>>] or both containing Socrates as long as the Domains for <<Name>> in both relations are the same domain.

The last example is simply a relation containing all men and a secondary relation that expresses the mortality based on the first relation.

Remember that if a statement can be expressed in relational algebra it is automatically expressed in the relational model as the relational model is based on relational algebra.

However, it may not be easily expressible (or for that matter may not be expressible) in SQL or any system based on SQL.

No convenient way to store disjunction or quantification?

∀x∈Men → Moratal[x], i.e. All men are mortal.

is a foreign key constraint and can be easily expressed even in SQL.

CREATE TABLE
Mortal(Name varchar(100) primary key)

CREATE TABLE
Man(Name varchar(100) primary key references Mortal(Name))

Inconsistency in second expression

Hello Carl,

I think there is an inconsistency in your second example

Socrates∈(Athenians∪Melians), i.e. Socrates is a member of the set of Athenians and Melians

Your expression would say that Socrates is a member of either the set of Athenians or Melians.

different denotations from different domains != inconsistency

the mathematical union is also termed 'or' in set math, but is not the same thing as the natural language 'or', which tends to be xor. or maybe not. in other words: natural language sucks, news at 11.

Please look again

Socrates∈(Athenians∪Melians), i.e. Socrates is a member of the set of Athenians and Melians

The logical expression would mean Socrates is a member of either the set of Athenians or Melians (or both), not what is stated in the explanation - which would only be true if Socrates is a member of both sets.

Therefore there is an inconsistency between the expression and the explanation.

I think it's a stretch to

I think it's a stretch to read "the set of Athenians and Melians" as set intersection. What about "the set of 1, 2, and 3" or "the set of guns and roses"? In this context, the word "and" serves to say that the set contains Athenians and that it contains Melians.

I really don't understand what the problem is


CREATE TABLE Citizen_of_City
(
City varchar(100) NOT NULL,
Name varchar(100) NOT NULL,
Primary Key (City, Name)
)

--Disjunction: Socrates is a citizen of either Athens or Milos (or both)

SELECT Name from Citizen_of_City
WHERE City = 'Athens'
AND Name = 'Socrates'
UNION
SELECT Name from Citizen_of_City
WHERE City = 'Milos'
AND Name = 'Socrates'

--Conjunction: Socrates is a citizen of both Athens and Milos

SELECT Name from Citizen_of_City
WHERE City = 'Athens'
AND Name = 'Socrates'
INTERSECT
SELECT Name from Citizen_of_City
WHERE City = 'Milos'
AND Name = 'Socrates'

Both return the empty set if the result is false.

Universal quantification: all men are mortal example already covered as foreign key constraint above.

Perhaps, Carl, you need to explain a little more clearly.

Insert not select

Will --

I think what's he's saying is there is no way to store these sorts of statements, not that there is no way to represent them in queries. That is you can't

INSERT INTO Citizen_of_City (City, Name) 
VALUES (Athens or Milos, Socrates) 

So you change the query to a proposition

which is of course supported in the relational model.

In SQL:


CREATE ASSERTION "Socrates is Athenian or Melian"
CHECK(
EXISTS (
SELECT Name from Citizen_of_City
WHERE City = 'Athens'
AND Name = 'Socrates'
UNION
SELECT Name from Citizen_of_City
WHERE City = 'Milos'
AND Name = 'Socrates'
))

So there can be no state of the database in which Socrates is neither a citizen of Athens nor of Milos.

I remain very puzzled, as so far everything that Carl Hewitt has claimed to be very difficult in the relational model has proved (to my mind at least) to actually be very easy.

I think it would help my understanding if Carl were to demonstrate how these examples are expressed in his proposed alternative model.

Data vs. Knowledge stores

The knowledge "Socrates is either a citizen of Athens or of Milos" doesn't require you to know which. For example, you can use this knowledge to answer "Is Socrates a citizen of Greece?" Constraints don't capture this.

So it's much easier than I thought

That Socrates is a citizen of either Athens or Milos, but you haven't verified which yet, can be represented in the relational model with no trouble at all, it's just a question of designing the database correctly and making the interpretation clear to the users. If you think about it a little you will see what I mean.

I have seen nothing here that persuades me of the validity of the hypothesis that the relational model is obsolete. As far as I can see I have been able to address every proposed weakness brought forward by Carl Hewitt.

If the kind of knowledge base you describe is what Carl is proposing then it would seem to place the intolerable burden on the user of writing expressions of arbitrary complexity themselves. As such it seems of little practical use. If the users were intellectually equipped to achieve this, they would also be intellectually equipped to design their own database.

SQL DBMS are NOT Relational DBMS

Good morning Carl,

You appear to be making the common mistake of assuming that SQL based databases and database management system are relational. They are not. SQL DBMS have been a bane to work with for many years because they do not follow the Relational Model.

As you should be aware, there is a major difference between the logical Relational Model and any management system that implements it. Unfortunately, the vast majority of available systems allow the implementation to intrude into the logical. This is no different from programming languages and their implementations.

In relation to your simple examples given above, here are two simple convenient ways to store them

Relation 1: <<Name>> Is Not Greek [Napoleon]

where the candidate key is [<<Name>>]

Relation 2: <<Name>> Is a Member Of <<Group>> [[Socrates, Athenian], [Socrates, Melian]]

where candidate key is [<<Name>>, <<Group>>]

To be more precise, a relation is a statement of something true, anything that is not a member of the relation is false, so if you are wanting to get the negation of any relation, you may well be asking for some infinite set depending on the domains of the individual fields. So irrespective of what kind of system you are proposing, the negation in specific examples will be problematic.

If you want to look at the current state of play in relation to relational theory, have a look at the various discussions taken place on the The Third Manifesto listgroup.

Again, don't make the mistake of thinking that any SQL DBMS is relational, they are far from it.

third manifesto

I've seen the 3rd manifesto stuff. I can agree on some of their complaints but I would never go as far as to say RDBMS as they exist today with SQL aren't Relational. They have a few properties that justifiably could be objected to.

For example Anonymous columns and "FROM clause restricted to named tables" were fixed in SQL92.

An example where I think they go overboard is duplicate rows. Everyone I think everyone agrees that "Duplicate rows" are bad. That's just a pure tradeoff in eliminating them in terms of either space or time, \ that Rel/D would face as soon as they went mainstream. And in databases where this is not acceptable lots of indexing particular mixed with a synthetic key (SUID) for each row solves the problem.

Then finally there is weird issues with NULL and everyone agrees that is tricky and is poorly thought out. We all need a new standard here and there is no disagreement on the SQL side.

I don't think insert syntax on column orders is a violation when the column order is specified in the query. I think that kind of ordering is fine since it just syntactic sugar. So this is an area where I just disagree.

So I think good ideas, good criticism, worthwhile but nowhere close to "SQL is not relational".

SQL supports Relational

It takes some discipline to use SQL in a manner compatible with relational model, primarily with respect to nulls, duplicate rows, and avoiding operations that depend on ordering of rows. But one must admit, it takes a lot less discipline to use SQL to model relational systems than, say, C language.

Beyond SQL, there are many approaches to using the relational model - e.g. those supporting streaming, subscriptions, or temporal data. Even if we think that SQL supports relational, it doesn't effectively or efficiently support the whole of it.

streaming, subscriptions, or temporal data.

I don't know what streaming or subscription is. You might have to give me a bit more context. But in terms of temporal data I think there is three meanings

a) Time and datas in which case I think SQL does a great job handling that data. Time stamp and time stamp conversions are really good.

b) changing data. In which case something like:
primary key = non temporal key + create date + delete date + effective date + end date
Some of this is moving towards becoming the default in many RDBMS

c) Temporal view of the datastore. I.E. X has access to Y during hours A to B. I think SQL is moving towards handling that. A lot of the issues are handled by forking off VPD (virtual private databases) from the main datastore. This was originally meant for more complex group permissions to prevent leakage but it works on time or timezone based security as well.

____

SQL to my mind isn't perfect, but given the huge amounts of money being spent by database vendors I don't see failures of implementation of the relational model to be a substantial concern. Arguably SQL is the best funded language in the world. Oracle, which sort of sets the traditions, really does care about relational and thinks about it deeply.

That to my mind is as good as it is ever going to get for a programming language.

SQL can represent bitemporal

SQL can represent bitemporal data, but does not provide effective support for temporal queries (e.g. to find the 'most recent' row data, or to join temporal data), nor for temporal schema.

Streaming relational involves long-running queries where new entries can flow in a stream for processing. This allows a lot of pipeline parallelism, and requires some constraints on the query language and algebra. Subscriptions (also called 'continuous queries') are more about efficiently tracking updates to old data for long-lived queries. IIRC, Oracle has experimented with both of these, and even has components to support continuous queries. IIRC, KDB focuses on streaming queries. The Bloom language (from the Berkeley Orders of Magnitude (BOOM) project) also focuses on streaming relational/logic.

There are many languages with a lot of money and time invested in them. I'm not sure whether Java, C, or SQL has greater investment today (C gets a lot of work from Intel). But I think that investment has more to do with incumbency and popularity than quality.

[dup]

[dup]

Duplicate Rows - more than just Bad

The allowance of duplicate rows, nulls in fields and the inconsistency of how nulls are treated make using SQL a nightmare. I have come across (and had to fix) too many instances of people using these (what's a nice word) features of SQL databases and DBMS's to think anything other than SQL is not relational.

I have a copy of an example from many years ago of some code (written in D4) that demonstrates the statement of "SQL is not relational." What I found interesting at the time was that the "SQL is relational" proponents would not accept the challenge of writing the corresponding SQL code. After looking at it at the time, I concluded (for myself) that the SQL would be many times larger and may not even be able to be proved correct. If anyone is interested, I'll try an track down the example, or if the original author lurks, he may wish to supply it.

I suppose that the major property of SQL and SQL DBMS's is the difficulty in proving that the results returned are correct. Particularly in light of the ability to have duplicate rows and nulls.

In relation to duplication in relational databases, it is the implementation that handles this, not the user. In some aspects, it is similar to handling memory management in programming languages, in some languages, it is up to the programmer to handle memory whereas in other languages, the implementation handles this behind the scenes.

Lastly, SQL as specified does not implement relational algebra, hence it cannot be said that "SQL is relational."

Primary key

Bruce if you don't want duplicate rows or nulls just declare the whole table to be a primary key and that forces rows to be unique and all fields to be non-null. That isn't earth shattering complex, it is part of one line of code. Make it an IOT and the whole thing would run rather quickly as well, as long as you are mainly querying and doing small manipulations.

In relation to duplication in relational databases, it is the implementation that handles this, not the user.

You may have mean the opposite, that it was the user not the implementation. Oracle has pretty much worked out the tradeoffs on these issues about 20 years ago and everyone has followed their example. The implementation has features to allow the user to make an informed choice, and the user makes an informed choice. We have an entire profession of people called DataBase Administrators who are available to developers who have to work with large datasets and specialize in the relative benefits of these sorts of choices.

And there are real tradeoffs particularly in the area of performance. Relational is a tolerable system only in so far as it just a few times slower than a network database for flat-file. When the differences become larger than that, or those differences are seen as too large there becomes pushes away from relational.

For small datasets there are easy solutions and without having to worry about performance issues they all work perfectly.

___

Anyway. SQL is the #1 most popular data lookup and manipulation language integrated into essentially every other programming language. I'd say the fact that something like 98+% of programmers who have to store data use SQL of Flat-file systems to store and manipulate it shows that is obviously is not an unusable nightmare.

A primary key isn't really

A primary key isn't really sufficient. Many of the issues with duplicate rows arise on joins or column-eliminations for those temporary tables internal to queries.

You raise performance as an issue, but I think it would not be difficult for an implementation of relational to decide where filtering duplicates is most relevant. It is the ability to observe duplicate rows (e.g. in sums or counts), not merely their existence, that takes SQL a step from relational.

SQL seems to mix a lot of presentation logic (column ordering, row ordering, duplication and alignment) with the relational query language, which makes SQL much less compositional than it should be.

select distinct

For internal where you want to avoid duplicates SELECT DISTINCT.
In terms of aggregation it is the same thing: "SELECT COUNT(DISTINCT column_name) FROM table_name" or if it is over multiple columns something like: "SELECT COUNT(DISTINCT col1||col2||col3) FROM table" .

And I think it is expensive which is why it has to be manual:

SELECT is O(n)
SELECT DISTINCT is O(n*log n)
unless you have a prebuilt index to run the query against.

Do this at every step and the cost can be much larger. This IMHO is a short term concession to performance.

___

SQL seems to mix a lot of presentation logic (column ordering, row ordering, duplication and alignment) with the relational query language, which makes SQL much less compositional than it should be.

I agree there on composition. The natural language syntax of SQL can be very messy. I don't generally like languages with anything like "natural syntax". I'd be happier with a language with less natural syntax and more directed at composition. I think they do pretty well on composition given that huge initial disadvantage. They do a pretty good job of allowing you to ignore column ordering during composition. Row ordering I think they handle as best as possible: inner functions it is a optimizer hint for the engine, outer functions it is part of ORDER BY clause. I'm not sure how that could be done any better.

select distinct - a feature that should not be there

Since logically, a relation can be a functional composition of other relations by the use of joins, restrictions and transitive closures (as well as other functions), and a relation has no duplicates (by definition) there should not be any requirement for the DISTINCT keyword.

Unfortunately, SQL databases are not relational, hence the requirement for the DISTINCT keyword. Not being relational gives rise to all the problems that are associated with duplicates in the results.

Aggregation should be consistent and not depend on having to first cull duplicates from the source tables to get the required results.

It has long puzzled me why anyone would want to have a complete duplicate in a table, where it is much more efficient to just keep a count and which is not error prone over joins of any description.

performance

It has long puzzled me why anyone would want to have a complete duplicate in a table, where it is much more efficient to just keep a count and which is not error prone over joins of any description.

What I said above, time. When you start working with tables with tens of millions to tens of billions of rows, you don't want the engine doing anything extra automatically. You want maximum performance. The whole reason Oracle is popular is because it gives DBA's and developers such excellent ability to tune the database, including the execution strategy so that you can get performance of say 1/3 of C when you need it and relational abstractions the other 95% of the time. Take away that performance and relational just becomes unusable entirely for large datasets.

If you argument is that the meaning should be reversed so that
select acts like today's SELECT DISTINCT and
select fast acts like todays SELECT I'd agree with you.
But I don't think having the opposite default makes SQL non relational.

I think before you run around talking about how the RDBMS community doesn't understand basics you should read Oracle Database Concepts and Oracle Performance Tuning. At the very least understand how these engines work and why they do what they do. That's the level of detail at which the professional (DBA) community is dealing with queries. You need to start addressing the problems of databases that aren't 2 dozen tables with 1000 rows.

Performance means not getting wrong results

A significant part of performance is getting the answer right. Otherwise, it is of no use getting the wrong answer ever faster. When someone is looking at the results from a query, they have an expectation that they will get the correct result. If (for whatever reason) a table does not have a specifically defined primary key, and the user is unaware of the fact, it is (from my experience) an unexpected problem if the table contains duplicates. The business relying on the information stored then makes incorrect decisions.

One of my functions over the years has been to make redesign suggestions to allow the businesses (I have worked for) to get better information out of their databases. There has been times when the suggestions were ignored and later on the consequences of not redesigning have bitten them.

Irrespective of how much ORACLE is popular and irrespective of how much it gives the DBA's and developers the means to tune the database, it is the users and their information expectations that matter. When I first did my ORACLE V6 training (many long years ago), it was specified by ORACLE Training that specific kinds of queries would be identical in practice (as well as semantically the same)

In practice, I obtained a difference of 8 minutes using the ORACLE DBMS for one query and less than 0.1 second for submitting two queries (processing the results of the first to generate the second query using of all things MS-ACCESS) to get the identical results. This was on tables with 100 million records. As a courtesy, I even reported the problem to ORACLE Australia at the time.

select f1, f2, f3 from tb1 where f4 in (select fa from tb2 where cnd1)

time to get results > 8 minutes

as compared to

select fa from tb2 where cnd1

select f1, f2, f3 from tb1 where f4 in ("results of first query")

fa was a 1 or 2 character string value

time to get results < 0.1 seconds

I found the same problem occurred in V7 and V8, I have not tested in later versions. I have also seen the various techniques that ORACLE advises DBA's to undertake for optimisation of queries. Most of these are (as far as I can tell) fully able to be automated by the DBMS itself.

Personally, I don't believe any of the major database companies have a proper understanding of relational theory. This is based on the kinds of literature that is put out by them. I have worked in the industry for 20 odd years and when bringing up the relational theory side of things with DBA's and developers, I usually find that their eyes glass over and quite often get the feedback that relational theory is not necessary for building databases because it's not practical. YMMV.

I am not a DBA, and personally I have found that a database designed on relational theory tends to run better anyway. I know from personal experience that SQL query creation (particularly if you use IDE tools) can be fraught with efficiency problems, particularly with the various joins. I have procedures that I use which can be/should be automated when using IDE's to create queries that can in a lot of case increase the efficiency by up to a couple orders of magnitude. This should be a function of the DBMS and not me having to manually rewrite the query in the more efficient form.

I understand the problems that arise with trying to index on blobs of any kind (including large text fields). However, after 30 years I would have expected that the available DBMS's from the various major database companies would handle a lot of this kind of thing. We have compilers that do it everyday without the intervention of someone, surely it is no different for DBMS's. ORACLE now has hardware that has incredible performance (I'd love to get my hands on some of these boxes), surely part of this performance can be used for additional usability.

I have also found that most of the DBA's (while very, very good at their job) quite often cannot think outside the box to suggest database design changes that will greatly enhance the database efficiency. They are not allowed to, so there is no incentive to do so. They are caught in unenviable place of having to make something work without the ability to make design suggestions and improvements. This is a common problem in may areas (not just in database management).

Primary key is an added requirement

By default a relation has all of its attributes as part of the candidate key (in your terms primary index). This means that in a relational database, no relation will ever have duplicates. To have a relation that has a smaller candidate key than the total number of attributes requires user intervention to make this choice.

In SQL databases, the tables by default have no primary key, hence duplication of records is an automatic "feature." This requires intervention by the user to create a primary key to ensure no duplication.

So again, in relation to duplication in relational databases, it is the implementation that handles this not the user. The user can confidently assume that any relation that is created will NOT have duplicates under any circumstances.

Just because ORACLE has worked on this for many years and everyone has followed suite doesn't actually mean they have got it right.

What you must understand is that the logical model in use by the database designer is quite different to the physical model (Implementation) that is used by the DBMS. Any and all optimisations that the DBMS does in translating the logical model to a physical model should happen behind the scenes and not come back and impact the logical model except where it can highlight an inefficient logical model.

This is effect no different to the highlighting an inefficient algorithm used in some high-level language and finding a better algorithm to be written in the high-level language.

For all its popularity, the number of incorrect models that I have had to fix or at least highlight that they are wrong (including ye olde data warehouse solutions) has left me feeling that because the relational model and algebra has not been implemented properly, we have dug a miry pit that keeps many people in jobs just fixing the problems instead of developing more usable systems.

yes! but.

So say I agree with you (which I think I do :-). What are the extant practical alternatives? Is it just always making all columns part of the primary key? Is it hoping Tutorial D will actually ever be production ready?

Some types are problematic

One alternative is to make all columns part of the key, but then you do have to deal with blobs, etc and how to index them. What could be a simple solution is to in the general case do so over all fields but in the problematic cases, highlight the problem and even make the suggestion that these attributes be removed to a logically associated table.

I can think of a variety of different measures to be taken that could be presented to the designer for a choice to be made. What it would do in the problematic cases is force the database designer to consider the problem at the design stage.

As far as Turorial D and its ilk is concerned, you do have Dataphor (even with its mutation to the dark side with being able to now have nulls etc) as an example that can be worked on. I consider its biggest problem in being a DOTNet product (but that is only my opinion and in this case counts for naught). We could look at something like Rel (with it being based on Java) or any of the other implementations at TTM Projects

This is an area that could be a new field of study - who knows?

logical model

All models have bleed, they have to. Absolutely you want your logical model most of the time to be implementation free. But there isn't an optimal configuration that is known to the engine at a high level. So if it going to have performance then there have to be ties to implementation it is not just something you can waive over and ignore.

Take something simple like what percentage of the block should the system aim to fill with records. Assuming you are on HDD hardware: If I have a read only or mostly read only database I want my block size large and that percentage high to reduce block overhead. If I have a heavily transactional system I want free space to avoid row migration. Those tradeoffs change if I switch from HDD to SSD where row migration is less of a threat but and large blocks less of an advantage.

How can none of that bleed out to the end user? The software doesn't know what you intend to do with the system when you build it. You have to tell it.

___

As for Oracle getting it right: their main competition competes on price and ease of administration. That's a pretty good indication, competitors are usually good at pointing out a system's shortcomings. They've been battle tested, if they had gotten it wrong people would know. And beside the excellent features and management they are battle tested.

That's not to say that for education you might want something different. Oracle is a professional product, and professional products aren't the best for pedagogy. But stark language like "Oracle got it wrong" demands some pretty steep evidence.

Logical model

All models have bleed, they have to.

Why? We have many programming languages that you cannot get to the implementation, you are stuck with what the compiler gives you, unless you want to change the compiler itself.

In relation the implementation of a logical model, this is something that the DBMS can monitor and make appropriate changes or at least suggestions for review by DBA's. There is no requirement for the implementation model to bleed back to the logical model.

There are aspects to consider here:

1). Problems in the logical model itself and how to make this better

2). What the logical model gets transformed to by the DBMS for machine efficiency.

These are quite different and in reality quite separate. Irrespective to what the logical model is, there will be a regular set of transformations to a physical model that can be optimised. This is a standard process all compilers undertake - no different and known technology and theory.

Fixing the logical model is a completely different ball of wax, there are generally only a small number of transformations that can suggested to modify the logical model by the DBMS. This is where the database designer needs to be on his/her toes. This is no different to choosing the most appropriate algorithm when writing programs. This is the art and the skill.

Take something simple like what percentage of the block should the system aim to fill with records.

Like any design process there are specific rules that you design by. These are just as easily programmable in the DBMS as for you to do it. In fact, the DBMS can take the real time results and apply dynamic changes that can increase the performance. This can all be based on the type of hardware you are running on or that may in the future be incorporated into your infrastructure. From a programming runtime system this could be likened to JIT and other techniques used.

The software doesn't know what you intend to do with the system when you build it. You have to tell it.

Very true, but it can monitor the usage and there will be (if nothing else) various rules and heuristics that can be used to make implementation changes for efficiency. There are many things that software cannot do, it requires a man's mind to see the relevant patterns, but for specific kinds of patterns (once known), software can be used to transform to a better solution.

Like anything, if the market is moving in a particular way, it will continue to move it that way due to investment. If a competitor comes up with a better way that will disrupt the main market, you will generally see that competitor destroyed or bought out and products killed off. This is business.

If I remember it correctly, ORACLE came out with a product in competition to IBM using the same SQL interface. They did the DBMS a bit better than their competitor, so they made headway. But they used SQL which has its shortcoming in relation to relation theory. Once those shortcomings are mainstream, it is very difficult to fix, particularly if standards become involved.

We have lots of industries that have "got it wrong", so saying that about the commercial database industry is not a big deal. The name of the game is making money and not about getting it right. If what you do is a commercial success and you need to adjust your advertising to say you are in accordance with some idea or concept then you'll do it. If you have to change the original definition of the idea or concept to match what you do, then you'll do it. That is a simple fact of business.

So if to be involved in the new buzzword (RDBMS) you have to do a bit of marketing then a business will do it. They will gloss over the parts that aren't or make out that that facility is coming in the future or that particular things are unimportant. This is business. Again it is about making money and not about being technically correct.

Relational Model - Programming Language Design

I know there is much discussion on the TTM listgroup group in relation to Tutorial D as a language and the various considerations in regrads to TTM.

I liked to do a partial redirect here in relation to language design and the impact it would have on the Relation Model including Relational Algebra. Any thoughts on Types (whether strict, soft or dynamic) and the model, Thoughts on functional/imperative styles and the model. etc. Including higher functional application, etc.

I would like to see any points raised from non-database oriented practitioners that could be further discussed over on TTM.