Why Normalization Failed to Become the Ultimate Guide for Database Designers?

While trying to find marshall's claim that Alberto Mendelzon says the universal relation is an idea re-invented once every 3 years (and later finding a quote by Jeffrey Ullman that the universal relation is re-invented 3 times a year), I stumbled across a very provocative rant by a researcher/practitioner: Why Normalization Failed to Become the Ultimate Guide for Database Designers? by Martin Fotache. It shares a wealth of experience and knowledge about logical design. The author is obviously well-read and, unlike the usual debates I've seen on this topic, presents the argument thoroughly and comprehensively.

The abstract is:

With an impressive theoretical foundation, normalization was supposed to bring rigor and relevance into such a slippery domain as database design is. Almost every database textbook treats normalization to a certain extent, usually suggesting that the topic is so clear and consolidated that it does not deserve deeper discussion. But the reality is completely different. After more than three decades, normalization not only has lost much of its interest in the research papers, but is also still looking for practitioners to apply it effectively. Despite the vast amount of database literature, comprehensive books illustrating the application of normalization to effective real-world applications are still awaited. This paper reflects the point of view of an Information Systems academic who incidentally has been for almost twenty years a practitioner in developing database applications. It outlines the main weaknesses of normalization and offers some explanations about the failure of a generous framework to become the so much needed universal guide for database designers. Practitioners might be interested in finding out (or confirming) some of the normalization misformulations, misinterpretations, inconsistencies and fallacies. Theorists could find useful the presentation of some issues where the normalization theory was proved to be inadequate, not relevant, or a source of confusion.

The body of the paper presents an explanation for why practitioners have rejected normalization. The author also shares his opinion on potentially underexplored ideas, drawing from an obviously well-researched depth of knowledge. In recent years, some researchers, such as Microsoft's Pat Helland, have even said Normalization is for sissies (a position followed up in later formal publications, such as Building on Quicksand). Yet the PLT community is pushing for the exact opposite. Language theory is firmly rooted in formal grammars and proven-correct 'tricks' for manipulating and using those formal grammars; it does no good to define a language if it does not have mathematical properties ensuring reliability and repeatability of results. This defines a real tension between systems theory and PLT.

I realize this paper focuses on methodologies for creating model primitives, comparing mathematical frameworks to frameworks guided by intuition and then mapped to mathematical notions (relations in the relational model), and some may not see it as PLT. Others, such as Date, closely relate understanding of primitives to PLT: Date claims the SQL language is to blame and has gone to the length of creating a teaching language, Tutorial D, to teach relational theory. In my experience, nothing seems to affect the lines of code in an enterprise system more than schema design, both in the data layer and the logic layer, and often an inverse relationship exists between the two; hence the use of object-relational mapping layers to consolidate the inevitable problems where there will be The Many Forms of a Single Fact (Kent, 1988). Mapping stabilizes the problem domain by labeling correspondences between all the possible unique structures. I refer to this among friends and coworkers as the N+1 Schema Problem, as there is generally 1 schema thought to be canonical, either extensionally or intensionally, and N other versions of that schema.
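To make the N+1 Schema Problem concrete, here is a minimal sketch in Python (all names are hypothetical, not from Kent's paper): the same fact lives in a canonical normalized shape and in one of the N other shapes, and a mapping function labels the correspondence between the two structures.

```python
from dataclasses import dataclass

# Canonical, normalized shape of the fact "employee 42 works in Sales".
@dataclass
class Employee:
    emp_id: int
    dept_id: int

@dataclass
class Department:
    dept_id: int
    name: str

# One of the N other shapes: a flattened row used by a reporting layer.
@dataclass
class EmployeeReportRow:
    emp_id: int
    dept_name: str

def to_report_row(e: Employee, depts: dict) -> EmployeeReportRow:
    # The mapping layer makes the correspondence between the two
    # structures explicit instead of leaving it implicit in ad hoc code.
    return EmployeeReportRow(emp_id=e.emp_id, dept_name=depts[e.dept_id].name)

depts = {7: Department(7, "Sales")}
print(to_report_row(Employee(42, 7), depts))
# EmployeeReportRow(emp_id=42, dept_name='Sales')
```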

Question: Should interactive programming languages aid practitioners in reasoning about their bad data models, (hand waving) perhaps by modeling each unique structure and explaining how they relate? I could see several reasons why that would be a bad idea, but as the above paper suggests, math is not always the best indicator of what practitioners will adopt. In many ways this seems to be the spirit of the idea behind such work as Stephen Kell's interest in approaching modularity by supporting evolutionary compatibility between APIs (source texts) and ABIs (binaries), as covered in his Onward! paper, The Mythical Matched Modules: Overcoming the Tyranny of Inflexible Software Construction. Similar ideas have been in middleware systems for years and are known as wrapper architectures (e.g., Don’t Scrap It, Wrap It!), but they haven't seen much PLT interest that I'm aware of; "middleware" might as well be a synonym for Kell's "integration domains" concept.


live programming

This is sort of related to one of my principles for live programming: the program should always run in a reasonable state, even if it has errors in it. That there is code, and that something was specified, should always be apparent in the program, even if what the code does is undefined because of its erroneous state. Likewise, defaults should be reasonable so that we can see things; e.g., if you create a rectangle and forget to set its size, it should not be invisibly small (like in WPF...), but rather something you can see and remember... oops, I forgot to set the size. Likewise, NaN shouldn't mean the object flies off the screen into imaginary space; perhaps it could just start shaking or something. The point is to provide visible feedback so the programmer can more quickly understand what's wrong.
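A toy sketch in Python of what I mean (the Rectangle class, its default size, and the NaN handling are all made up for illustration):

```python
import math

class Rectangle:
    # A deliberately visible default: a forgotten set_size() shows up as
    # "oops, I forgot to set the size" rather than as an invisible shape.
    DEFAULT_SIZE = (50, 50)

    def __init__(self):
        self.size = Rectangle.DEFAULT_SIZE
        self.pos = (0.0, 0.0)

    def move_to(self, x: float, y: float) -> None:
        # NaN coordinates keep the last good position instead of flinging
        # the shape off-screen; a real UI might shake it to draw attention.
        if math.isnan(x) or math.isnan(y):
            return
        self.pos = (x, y)
```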

Likewise, why are systems so brittle? In PL, we expect that a program has one rigid, unambiguous meaning, which means that any bug/mistake will cause the system to explode rather than degrade gracefully. So let's say you fail to read a file because it doesn't exist... why not just log the error and return some random file anyway? Sometimes, it won't even matter. Martin Rinard's work on run-time software patching comes to mind here; e.g., the Living in the comfort zone paper from Onward! 2007.
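In sketch form (read_config and its fallback are made-up names, not Rinard's actual mechanism): log the failure and degrade instead of exploding.

```python
import logging

def read_config(path: str, fallback: str = "") -> str:
    # Failure-oblivious flavor: log the error, hand back something
    # plausible, and let the program keep running.
    try:
        with open(path) as f:
            return f.read()
    except OSError as e:
        logging.warning("could not read %s (%s); using fallback", path, e)
        return fallback  # sometimes it won't even matter
```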

This is not mainstream PLT, but maybe it should be. At any rate, the systems community is pragmatic enough that it is exploring this area fairly well.

Normalization Failed?

Seems to me like academic twaddle.

In practice (as opposed to the world of Date) a good understanding of 3rd normal form is the essential starting point for any database designer. That is analysis, not design.

And then, as every serious analysis and design methodology has explained since 1980, you denormalise to support the required processes - as need be.

I agree

3NF (actually, BCNF) is extremely helpful, especially given the recent changes in hardware (solid-state disks) and database research (Adam Marcus's MIT master's thesis on heap file structures suited to "navigable" relational databases; see BlendDB: Blending Table Layouts to Support Efficient Browsing of Relational Databases). In my book, if the hardware guys can solve the "write problem" with solid state (which I don't believe they have), then you will see a dramatic reshaping of scaling practices. Solid state is simply a game-changer; it removes the "denormalize for performance" advice from the equation, because with constant-time disk access, redundant data actually slows clusters of disks down!
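A small illustration of why redundancy hurts writes (made-up tables, not from Marcus's thesis): with the functional dependency dept_id -> dept_name, a denormalized table repeats the department name per employee, so one logical change fans out into many physical writes; the BCNF decomposition makes it a single write.

```python
# Denormalized: violates BCNF, because dept_id -> dept_name holds
# yet dept_id is not a key of this table.
denormalized = [
    (1, 7, "Sales"),   # (emp_id, dept_id, dept_name)
    (2, 7, "Sales"),
    (3, 7, "Sales"),
]

# BCNF decomposition: each fact is stored exactly once.
employee = [(1, 7), (2, 7), (3, 7)]   # (emp_id, dept_id)
department = {7: "Sales"}             # dept_id -> dept_name

# Renaming the department is now one write instead of three.
department[7] = "Field Sales"
```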

I am not just pitching this topic out there; I am fairly well-read in relational database theory. You can't just call it academic twaddle. There is real tension between systems theory, database systems theory, and PLT views on how best to solve problems. See The Great MapReduce Debate and the follow-up, Mike Stonebraker's counterarguments to MapReduce's popularity. Obviously, head technical folks at Google were very much in disagreement with Stonebraker, calling his comparison a "category error" and saying Stonebraker is no longer on the cutting edge (mind you, Stonebraker has the best track record for start-up ventures using cutting-edge research of anybody in IT history; this was like saying Brett Favre should just retire). Outside of Google, others criticized Stonebraker as well. To me, this seems like a modularity problem with database systems, and it opens the gateway for using MapReduce-like techniques to help build SELF-*-based systems.

as every serious analysis and design methodology has explained since 1980, you denormalise to support the required processes - as need be.

Understanding behavioral requirements (processes) is non-trivial, especially in the face of mergers and acquisitions. This is why model checking tools like Alloy exist (and are based on relational logic). Where I work, we try to avoid enterprise-style integration wherever possible. For clients that don't need it, it is simply more costly and just a development hassle. I agree with Stonebraker here; there is just too much middleware:

I think my pet peeve is one of the things I talked about this morning in my invited talk at SIGMOD 2002: there is just too much middleware. The average corporation has bought a portal system, has bought an enterprise application integration system, has bought an ETL (Extraction, Transformation, and Loading) system, has bought an application server, maybe has bought a federated data system. All of these are big pieces of system infrastructure that run in the middle tier; they have high overlap in functionality, and are complicated, and require system administrators. The average enterprise has more than one of all of these things, and so they have this spaghetti environment of middleware, big pieces of moving parts that are expensive to maintain and expensive to use.

Everyone seems to recognize this problem, and the conventional commercial wisdom is to expand the role of an application server so it does what all of these packages do. WebSphere, for example, from IBM, is becoming a very, very rich package which does a lot of middleware functionality.

I think a federated database system is a much better base on which to build a middleware platform than is an application server. And the reason is that application servers only manage code, and then the data is relegated to the bottom tier. If an application needs some data, it runs in the middle tier and requests data from the bottom tier. You end up moving data to the code. If you had a federated data system, so that the data manager was running at the middle tier and at the bottom tier (and object-relational engines are perfectly happy to store and activate functions), then code and data could be co-mingled on a platform. And you could then do physical database design in such a way that you put the data near the code that needed it, and you wouldn’t end up shipping the data to the code all the time. I think that’s a much more long-term, robust way to build sophisticated middleware. So I’d work on trying to prove that that was a good idea if I had some more cycles at work, but I don’t.

The Art of Normalization

Fotache's paper reads more like a proggit rant than a paper.

I think the mistake many people make is to believe that normalization is a strictly mathematical process that can be followed mechanically and everything will work out.

If we rephrase the principle of normalization to be "group like with like and use relations to capture other relationships", it becomes clearer that much depends on your data analysis and design: what things are alike and what are the main relationships?

Analysis and design are both as much art as science, and there is also an art to bridging academic theory with industrial practice. There is no reason to expect that this should be a simple, mechanical process either.

Well...

I have never seen reddit produce any good rants, but I have not read it since my junior or senior year in college (3 years ago?).

I am not used to Internet rants having bibliographies. Maybe Paul Graham has footnotes from personal conversations, but that is about it. Fotache presents some real facts. He tells you what is in most books on modeling. Even for Model-Driven Architecture, Leon Starr says in Executable UML: How to Build Class Models that normal forms (for knowledge responsibilities and behavior responsibilities) should be used to double-check results rather than as a starting process. I've read a lot of books on modeling, including the ones Fotache lists (except for the vonFlemming one, which was published when I was 5 years old), and his analysis of what practitioners are taught seems about right to me.

If we rephrase the principle of normalization to be "group like with like and use relations to capture other relationships", [...] There is no reason to expect that this should be a simple, mechanical process either.

Grouping like with like is difficult; this is why model checkers like Alloy exist, to check your atom definitions against their behavioral uses. Behavior is the best test we seem to have for understanding, in production systems, what 'like with like' even means. Do you know of other approaches?

The Silver-haired Bullet

Behavior is the best test we seem to have for understanding, in production systems, what 'like with like' even means. Do you know of other approaches?

I'm all for tools and theoretical approaches, but my experience building enterprise (i.e. database) apps for over a decade and a half is that there is no substitute for thinking really hard about your problem space, selecting the simplest, most flexible analysis and design for your solution, and being willing to make possibly radical changes to these when you start to realize you've gone wrong.

Having knowledge of many different theoretical tools, such as normalization and mathematical modeling, can definitely help with analysis and design, but I'm always mystified by those who seem to think that there can be some kind of technological shortcut to reliable results that bypasses human judgment and the wisdom gained through experience.

No argument there....

I don't think what you are portraying was at all the essence of what Martin said... that's all ;-)

It really is an extremely well-written, well-researched rant based on obvious experience. That is quite rare, considering reports show the average career in high tech is 7 years. You don't usually get good longitudinal experience reports like this.

Thanks and follow-ups

Thanks for a thought-provoking post, and the many very interesting links. I've been meaning for ages to catch up on the databases literature, and particularly like the Kent paper (it being coincidentally the only one I've had time to read in full so far).

Predictably, I agree with the sentiments in your final paragraph... there is huge scope for tools which embrace both the ongoing and inevitable imperfections in what the programmer writes (and indeed in the languages in which it is/was expressed), and the many alternative ways of expressing the same or similar things. This doesn't seem to fit the precise mathematical quest which much research sees programming as... although I'm not sure why it couldn't, since most of this stuff seems very amenable to formalisation. (That said, I think that designing a language is really a job for a cognitive psychologist more than for a mathematician... sadly I'm neither of those.)

(An aside: at the other end of the spectrum, treating programmers as random processes -- whose code can be speculatively and randomly mashed and spliced until it works -- feels dirty but can sometimes yield nice results. I remember a couple of papers from last ICSE, one by Steve Reiss, another by Westley Weimer et al., which both did effectively this. Arguably there's scope for involving the user some more in both these approaches, although there's value in the "automatic" part too.)

I'd have to add that what I call "integration domains" is probably not middleware per se (but there's an overloaded word)... by "domains" I really meant languages or notations in a tool. So my paper was suggesting that maybe we should treat wrappers as the "normal" case, allowing us to throw away any "import" construct in our programming languages. Coding against an imaginary, idealised outside world (which special tools can then help us join up with reality...) might let us write code that is less complex, less coupled and more flexible. (Hey, it's just an idea. :-)
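As a throwaway sketch in Python (every name here is made up; this is the flavour of the idea, not the paper's mechanism): application code targets an idealised interface, and a small wrapper, which a tool could in principle generate, joins it up with whatever really exists.

```python
from typing import Protocol

# The imaginary, idealised outside world the application codes against.
class IdealStore(Protocol):
    def get(self, key: str) -> str: ...
    def put(self, key: str, value: str) -> None: ...

# Application logic depends only on the ideal interface; there is no
# "import" of a concrete implementation anywhere in it.
def rename(store: IdealStore, old: str, new: str) -> None:
    store.put(new, store.get(old))

# The wrapper is the "normal" case: it adapts a real artifact (here a
# plain dict) to the ideal interface after the fact.
class DictAdapter:
    def __init__(self, d: dict):
        self._d = d
    def get(self, key: str) -> str:
        return self._d[key]
    def put(self, key: str, value: str) -> None:
        self._d[key] = value

d = {"a": "1"}
rename(DictAdapter(d), "a", "b")
print(d)  # {'a': '1', 'b': '1'}
```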

Middleware

There are two senses of the word middleware. I use it in the sense Stonebraker does: the idea that we should be building SELF-* systems (self-tuning, self-healing, self-configuring, self-maintaining, etc.). This is sort of what the whole SOAP manifesto was about, but it expected a lot of configuration effort and none of it was well thought out. So the SOAP sense is the second sense of middleware (huge application servers, etc.). Distributed transaction processing mystical guru Eric Newcomer had his company, Iona Technologies, bet big on this version of SELF-* systems in the early 2000s, and it promptly died. He's told me it was a bad bet, but he's recovered with other successes such as OSGi.

The Garlic paper was nothing new at the time it was presented at VLDB: I can dig up "database programming language" papers from the '80s advocating the same ideas, if you pressed me. I picked Garlic as an example just because the title of the paper was fancy and spirited.

Even more middleware

Let's make that "at least three": distributed systems people use "middleware" to mean a distributed runtime system which hides distribution from the programmer. This seems to be different from both of the meanings you mention (in that it needn't run on huge application servers).

Those self-* things are all nice goals. I completely agree with Stonebraker's "too much middleware" sentiments (that's type 2, right?), though I'm not really qualified to say anything deeper.

I still feel compelled to mention that my paper wasn't about any of these. :-) There is perhaps some overlap with the "self-configuring" idea, i.e. that realising compositions of code could be much less effort than it is at present. But it doesn't need to be completely automatic (and I'd argue that it probably can't ever be). The main idea of the paper, "interface hiding", is a direct increment on Parnas's information hiding and doesn't depend on (or equate to) any self-* ideas. It's also strictly stronger than the ideas in Garlic, because it argues that "legacy" versus "non-legacy" is the wrong way to look at things. And of course in programming, it's not just schemas and data encodings... there are control structures to worry about.

ORM, Halpin, just do it.

Sigh! In practice, in most cases I have seen where people have given up on normalization, it was either...

  • They didn't "get" normalization (one fact one place)
  • They didn't understand their problem domain anyway, so couldn't say what the constraints were!

The author gets as far as identifying Halpin as claiming ORM pretty much solves the problem... and doesn't bother to analyze that claim.

Personally I strongly favour Halpin's ORM approach for a number of reasons, not least of which is that it is dead easy to explain an ORM diagram to a customer/domain expert... but good luck to you trying to explain an ER diagram!

It's not always failure to understand

Giving up on normalization for tables can also be due to technical or budgetary constraints. Right now disk space is the cheapest component in the database server. If you've got "too many clients / too high latency / too many rows / no budget for more DB CPU", denormalizing the data can give enough of a performance boost to keep the organization running through the lean months. Big data sets with low-latency constraints don't like thousands of concurrent multi-table join requests.
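A rough sketch of the trade-off (made-up tables): the denormalized row answers the hot query with a single lookup and no join, at the price of redundant storage and fan-out on writes.

```python
# Normalized: the hot query needs a join (two lookups here).
orders = {1001: {"customer_id": 7}}
customers = {7: {"name": "Acme", "region": "EU"}}

def order_region_normalized(order_id: int) -> str:
    return customers[orders[order_id]["customer_id"]]["region"]

# Denormalized: the region is copied into the order row. One lookup,
# no join; but every customer update must fan out to all their orders.
orders_denorm = {1001: {"customer_id": 7, "customer_region": "EU"}}

def order_region_denormalized(order_id: int) -> str:
    return orders_denorm[order_id]["customer_region"]
```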

universal relation paper link

About the universal relation, with a link to a PDF of the original paper, I guess.

original source

Like so much of early database systems theory, it began in corporate R&D facilities, and the original ideas were distributed as memos.

I'm afraid that is not the original paper, and that the original paper is not available on the Internet.

no support for cyclic dependencies?

I skimmed the paper and, although it briefly discusses cycles, it appears to ignore recursive dependencies, such as those in parent-child or message-is-reply-to relations, which I would expect to be the biggest obstacle to defining a universal relation.