Formal methods for O/R mapping

Not sure whether this is OT for the new LtU, but in the past we used to discuss O/R mapping, and impedance mismatch (probably from a PLT perspective).

I am currently designing several tools that should work in an environment similar to Hibernate, and I stumbled upon an utter lack of even a semi-formal model of what is going on. In half an hour I came up with a simple-minded model of both the O and R sides, both using explicit state and mutable references; I then considered using Featherweight Java for the O side, but decided it was too heavyweight.

I am sure people are working on things like that (defining formal models of both object and relational, um, models with mutable state, and of how they map to each other) - after all, it's the heart of almost any J2EE (and probably .NET) application. So the question is - does anybody have pointers to this research?

Thanks.


To give you an idea of what I am talking about - the object side of my model is:
  1. State as a finite map from References to Values
  2. Values as either Atoms or Records
  3. Records as finite maps from Names to References
  4. Atoms are Null, Integer, String; Names are Strings.
The relational side is more involved, but models state in a similar way. I am still pondering over the mapping between them, and whether to bring type info into the equation. At this stage I decided I could borrow some brain cycles.
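
For concreteness, here is a minimal Java transcription of the object-side model sketched above (class names are mine, purely illustrative, not part of the model itself):

import java.util.HashMap;
import java.util.Map;

final class Reference {}                       // pure identity, no content

interface Value {}                             // a Value is an Atom or a Record

final class Atom implements Value {            // Null, Integer or String
    final Object payload;                      // null, an Integer, or a String
    Atom(Object payload) { this.payload = payload; }
}

final class Record implements Value {          // finite map from Names to References
    final Map<String, Reference> fields = new HashMap<String, Reference>();
}

final class State {                            // finite map from References to Values
    final Map<Reference, Value> heap = new HashMap<Reference, Value>();
}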

An alternative question (or excuse :) ) - do you think the PLT community does not pay enough attention to mutable state? After all, it's a staple in most of the commercially used PLs.

There is no "new LtU".

There is no "new LtU".

Third Manifesto?

I thought this was exactly what the Third Manifesto is supposed to be about.

Have you looked at Siddhartha

Have you looked at Siddhartha Rai and William Cook's paper on Safe Query Objects?

Their Latest

Looks neat and is apparently getting real.

Very promising

This technique addresses one of the main problems inherent in current ORM tools - remote query execution whilst retaining the semantics of the application language. (Interestingly, this has been a solved problem in logic languages since Draxler's 1993 paper, but that is perhaps an easier exercise given the similar semantics of the relational model and Prolog; and of course Prolog lacks strong typing.)
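
To give a flavour of the approach (my own hedged sketch, not the paper's actual API): a query is an ordinary class whose filter method is type-checked by the host-language compiler and then translated to SQL for remote execution, instead of being a string assembled at runtime:

class Post {
    private final int authorId;
    Post(int authorId) { this.authorId = authorId; }
    int getAuthorId() { return authorId; }
}

// The filter body is plain Java, so javac type-checks it; a translator
// turns it into a WHERE clause and ships it to the database.
abstract class SafeQuery<T> {
    abstract boolean filter(T candidate);
}

class PostsByAuthor extends SafeQuery<Post> {
    final int authorId;
    PostsByAuthor(int authorId) { this.authorId = authorId; }
    boolean filter(Post p) { return p.getAuthorId() == authorId; }
    // would translate to roughly: SELECT * FROM posts WHERE author_id = ?
}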

I wonder how applicable this is to the good old transitive query problem..

I'm not sure it addresses the original poster's concerns about mutable state.

Can you give the whole name o

Can you give the whole name of Draxler's 1993 paper -- Google is not helping me much here.

Christoph Draxler's Home Page

Christoph Draxler's Home Page where you can fetch the 1993 paper "Prolog to SQL compiler" [along with the Prolog source of the compiler] and his 1991 PhD dissertation "Accessing Relational and Higher Databases through Database Set Predicates in Logic Programming Language".

What problem do you have in m

What problem do you have in mind?

"I wonder how applicable this is to the good old transitive query problem.."

The n+1 problem

What problem do you have in mind?

"I wonder how applicable this is to the good old transitive query problem.."

I assume he's referring to a generalization of what people sometimes call the n+1 problem.

Consider the following simple setup. There are authors and there are posts. Authors have names and posts have titles. There is a one-to-many relationship between authors and posts: given a post there is a unique associated author, and given an author there are zero or more associated posts.

Now suppose you have an author and want to do some kind of processing involving the titles of all posts authored by him. In pseudo-code,

for post_id in get_posts_by_author(author_id):
    post = Post(post_id)
    process(post.title)



At the SQL level, get_posts_by_author would perform the query "SELECT posts.post_id FROM posts WHERE posts.author_id = author_id".

The instantiation on the first line of the loop would then populate the 'post' object by performing the query "SELECT * FROM posts WHERE posts.post_id = post_id".

Thus, assuming an author has authored n posts, a total of n+1 queries would be made. But it could have been done in a single query by instead using "SELECT * FROM posts WHERE posts.author_id = author_id".

This is the very simplest kind of illustration of the problem. In general, reducing the number of queries in more complicated scenarios can involve a significant number of relational joins. The problem isn't constructing these queries but figuring out when they should be used, i.e. the fetching strategy.

Common solutions involve letting the programmer provide manual hints to functions such as get_posts_by_author that tell the object-relational mapper to "fetch ahead". Thus by the time the post is instantiated inside the loop, the data for the author's posts has already been retrieved and cached locally.
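
For illustration, here is roughly what such a hint looks like in Hibernate's HQL (entity and property names invented; session and authorId are assumed context). The join fetch clause pulls the author and all of his posts back in a single round trip:

// One query instead of n+1: the posts collection is populated eagerly.
Author author = (Author) session.createQuery(
        "from Author a left join fetch a.posts where a.id = :id")
    .setParameter("id", authorId)
    .uniqueResult();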

Why Not Use A Database View?

As for the "n+1 problem", why not use a database view that joins the tables? Something like:
"SELECT * FROM 
AUTHORS LEFT OUTER JOIN POSTS  
ON AUTHORS.AUTHOR_ID = POSTS.AUTHOR_ID
WHERE AUTHORS.AUTHOR_ID = 234"

Using a view is efficient and, if update is necessary, views are often updateable in modern databases.

But I think that

"I wonder how applicable this is to the good old transitive query problem.."
is a reference to the "transitive closure problem" (and in particular, a reference to handling tree-structured data in SQL). Transitive closure is not possible in traditional SQL. The URL also discusses existing proprietary solutions (IBM and Oracle) and proposed extensions to SQL to address the transitive closure problem. For an amusing perspective on the transitive closure problem, see Henry Baker's ACM letter "Relational Databases".

Current standard SQL solutions include stored procedures and performing multiple SQL requests to the database.
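
For concreteness, a hedged sketch of the recursive-query route through plain JDBC, assuming a hypothetical employees table with a manager_id column:

// (imports: java.sql.*, java.util.*)
// Walk an employee's management chain with one SQL:1999 recursive query.
static List<Integer> managementChain(Connection conn, int employeeId)
        throws SQLException {
    PreparedStatement ps = conn.prepareStatement(
        "WITH RECURSIVE chain(id, manager_id) AS ("
        + "  SELECT id, manager_id FROM employees WHERE id = ?"
        + "  UNION ALL"
        + "  SELECT e.id, e.manager_id FROM employees e"
        + "  JOIN chain c ON e.id = c.manager_id)"
        + " SELECT id FROM chain");
    ps.setInt(1, employeeId);
    ResultSet rs = ps.executeQuery();
    List<Integer> ids = new ArrayList<Integer>();
    while (rs.next()) ids.add(rs.getInt("id"));
    return ids;
}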

abstract interpretation

Why Not Use A Database View?
Of course a programmer can perform this optimization. But Per, I believe, is describing the issue with an ORM tool.
Here is the way I understand it:

The tool knows how the Author object is mapped to the AUTHORS table, and how Post is mapped to POSTS. But it has no idea that you intend to use them together. When you retrieve data for an Author, your next action may just as well be getting an Address, so from the ORM point of view it is impossible to decide which tables to join 'ahead of time'.

Now, speaking about ahead of time, and trying to be more LtU on-topic, I have been wondering for quite some time now whether the following is possible:

Before doing a DB query, we take the function that we are going to apply to the query results. We perform its abstract interpretation (or the compiler does type inference) and end up with enough information to build a proper query statement covering only the tables and fields we actually refer to in the function. Is there a language already that can do this? Has this been explored before?
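
I don't know of a language that does this out of the box, but here is a hedged sketch of a cheap runtime approximation in Java (all names are mine): instead of abstract interpretation proper, run the consumer function once against a recording dynamic proxy, note which getters it calls, and select only those columns. Since it observes a single concrete run, it under-approximates when the function branches on the data; a real abstract interpretation would cover all paths.

// (imports: java.lang.reflect.*, java.util.*)
final class ColumnProbe implements InvocationHandler {
    final Set<String> columns = new LinkedHashSet<String>();

    public Object invoke(Object proxy, Method m, Object[] args) {
        if (m.getName().startsWith("get")) {
            columns.add(m.getName().substring(3).toLowerCase());
        }
        // Return harmless defaults so primitive-returning getters don't NPE.
        Class<?> rt = m.getReturnType();
        if (rt == int.class) return 0;
        if (rt == long.class) return 0L;
        if (rt == boolean.class) return false;
        return null;
    }

    interface RowConsumer<T> { void accept(T row); }

    // Dry-run the consumer against a recorder to learn the columns it needs.
    static <T> Set<String> columnsUsedBy(Class<T> rowType, RowConsumer<T> consumer) {
        ColumnProbe probe = new ColumnProbe();
        T recorder = rowType.cast(Proxy.newProxyInstance(
            rowType.getClassLoader(), new Class<?>[] { rowType }, probe));
        consumer.accept(recorder);
        return probe.columns;
    }
}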

Database view

You wrote:

"As for the "n+1 problem", why not use a database view that joins the tables?"

Exactly. The "n+1 problem" sounds contrived.

You wrote:

"might be a reference to the "transitive closure problem" "

I believe that in modern RDBMSs the transitive closure can be handled quite nicely, either with a recursive query or by some other means.

transitive closure

Yes, my original comment was indeed about the "transitive closure problem" (I had not heard of the n+1 problem referred to as such before, but I believe that this is exactly the problem the Cook & Rai paper addresses).

The SQL99 spec includes transitive closures, and they are implemented by some databases (DB2; I believe Oracle's transitive query syntax is non-standard) - unfortunately not PostgreSQL yet.

The Cook & Rai paper does not address whether or not their query formalism allows transitive queries, or whether there would be any major difficulties in translating such queries to SQL99.

TC

I took a brief look at the article, but did not find any reference to the transitive closure.

As I mentioned before, TC is not a problem in modern RDBs because they either implement the SQL'99 recursive query or you can use other methods to deal with hierarchies (node path encoding, nested sets, etc.).

Or, why not use OO...

...since the thread is about O/R mapping?

Why would

author posts do: [:p | p process]

need to suffer from the n+1 problem? I'd expect an O/R layer that understood the O side to generate 1 query for this, not n.

If

author 

did another 1 query then that'd be 2 queries, right?

Abstraction reversed

I think problems like these exist because in traditional O/R mapping you reverse the typical "abstractions upon abstractions" approach. In O/R mapping you're basically treating SQL as a target language, effectively compiling down to it. But SQL is at a much higher level of abstraction and offers far more power than most of its host languages. When you try to squeeze all that power into the comparably anemic host languages, you're bound for trouble.

In functional languages map and fold are powerful constructs that in most cases can replace the use of lower level concepts such as explicit iteration. Imagine if someone wrote a compiler which compiled Java to some dialect of ML, and then proceeded to write Java code rather than use the underlying power of ML. Would you consider that to be a good idea?

OO may be a good way to structure code, I don't know. But compared to the relational model it isn't a good way to structure data. I see O/R mapping tools as a problem, a symptom, rather than as a solution.

You wrote: "But compared t

You wrote:

"But compared to the relational model it isn't a good way to structure data. I see O/R mapping tools as a problem, a symptom, rather than as a solution.
"

I tend to agree with the above in the absence of a convincing argument to the contrary.

I tried to ask what exactly the object data model might be in a hypothetical RO mapping, but have not gotten any response yet.

Controlled mutation

Maybe it would be a good idea to start with a very limited model of mutability - say, single-assignment cells which are either non-assigned or assigned, or cells that are mutable only within the scope in which they were created, or containment of stateful effects via the type system (e.g. via a state monad) - and see what the trade-offs are between restriction of mutability and ease of mapping.
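
As a minimal sketch of the first option (assuming Java as the O side), a single-assignment cell is easy to state precisely; the interesting question is then what a mapper may assume about it:

// A cell that is either unassigned or assigned exactly once, then immutable.
final class Cell<T> {
    private T value;
    private boolean assigned = false;

    synchronized void assign(T v) {
        if (assigned) throw new IllegalStateException("already assigned");
        value = v;
        assigned = true;
    }

    synchronized boolean isAssigned() { return assigned; }

    synchronized T get() {
        if (!assigned) throw new IllegalStateException("not yet assigned");
        return value;
    }
}

A mapper can treat an assigned cell as a plain immutable column value; only the single unassigned-to-assigned transition ever needs to be synchronised with the store.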

One of the hardest things about O/R mapping is keeping track of state changes: keeping "local" copies of data (inner fields of objects) synchronised with the central store (tuples asserted in relations). A lot of hard problems just disappear if all anybody ever wants to do is create new immutable instances of objects and query for existing instances. So maybe it's a good idea to treat mutability with suspicion - start with none at all, then see what you have to do if you allow just a little and build up from there.

Also, consider what Oleg's ZipperFS (or some other solution based on crafty manipulation of control effects) might enable you to do in this area...

Postgres

Interestingly enough, Postgres does not actually update tuples at all! SQL UPDATE is implemented as an atomic DELETE and INSERT.

"do you think that PLT commun

"do you think that PLT community does not pay enough attention to mutable state? After all, it's a staple in most of the commercially used PLs."

A quote from the abstract of Peter Van Roy and Seif Haridi's CTM book:
"Because a wide variety of languages and programming paradigms can be modeled by a small set of closely related kernel languages, this approach allows programmer and student to grasp the underlying unity of programming."
The unity they describe seems to include imperative programming.

object data model

"To give you an idea what I am talking about - the object side of my model is:
"

You are talking about implementation details, but I wonder what your suggested object data model is.

Regarding the relational side, I imagine you mean the standard RM as defined by Codd, Date et al.

Something that bothers me about OR mappings

There's something that's always bothered me about OR mappings, and I hope someone here can provide an answer. As far as I understand, OOP is an organisational tool, one of the basic tenets of which is encapsulation. You usually get encapsulation by having a well-defined interface, which is basically its type. The types of most things in most OO languages are complex, involving functions, not just values. The naive RDBMS mapping, however, would (it seems to me) expose the internal data structure/private representation of the type, and thus break encapsulation. You can recover by forcing application code to only access things via the Object side of the mapping, so only after you instantiate an instance from the database, but then it looks like you would miss the point of having an RDB, which is surely the complex queries? It feels then that you've done nothing except get a persistence engine, which, whilst useful, does not solve the same class of problems and is in fact orthogonal. I have yet to find a decent answer to this issue from the Google-brain or from my Rails/Hibernate friends, who seem to be very happy with their tools nevertheless.

My personal take is what I understood from veiled references to the 3rd Manifesto (I can't actually read the original, since it would cost money -- I'm too poor to justify the spending, and I haven't tried my new local library yet). It seems that Date & Co. felt that Objects, as seen from the perspective of the DB, should be opaque types, defined solely by their external interface. These types are then the fields/attributes of the tables/relations. You do queries on them by using their methods and functions. Postgres seems to follow this route, by allowing extensions to the type system: functions, operators, indices and algebraic reduction rules. Unfortunately, it requires the use of C. I like C as much as the next person, but no more, and I'm not sure how safe it is to allow unsandboxed binary modules into my database.

Perfection is down the hall

The naive RDBMS mapping, however, would (it seems to me) expose the internal data structure/private representation of the type, and thus break encapsulation. You can recover by forcing application code to only access things via the Object side of the mapping, so only after you instantiate an instance from the database, but then it looks like you would miss the point of having an RDB, which is surely the complex queries?

A good O/R mapping system should have its own query language. Hibernate does, for example.

However, I usually tend to consider an O/R mapping to be "just" a view of a database. In enterprise systems, you often can't get away with a single language and approach for accessing the database - you've got multiple systems and departments and development teams, etc. But if the O/R mapping is done properly, and the database has properly designed and enforced referential integrity, it's perfectly possible to have both an O/R-mapped application coexist alongside other applications which access and update the database.

Obviously, in this scenario it's possible for the non-OO applications to put data in the database in a way that wouldn't be allowed if the object side were used. How you deal with that is highly dependent on the applications in question, but in practice it's not a particularly difficult issue to deal with. For example, the object application might be the official creator of certain kinds of data, and other applications might be forbidden from performing certain kinds of operations on the affected tables. This can usually be enforced via database security, triggers etc.

The bottom line is that using O/R mapping tools with a relational database can be a best of both worlds scenario, which might explain why your Rails/Hibernate friends "seem to be very happy with their tools nevertheless".

The question of how things should be is an entirely separate one. OO languages on their own are limited by not having a good story for persistence, and RDBMSes are limited by not having a good story for complex application development. The best of both worlds I mentioned is the best of two somewhat incomplete worlds. It's still rather Frankensteinian — O/R mapping just makes the bolts in the monster's neck less noticeable. More principled solutions are going to require some radical changes to the status quo, as usual.

Proper mapping

Please elaborate on what exactly you mean by this:

"But if the O/R mapping is done properly,
"

In other words, what is a proper O/R mapping?

Re: Proper mapping

By "proper O/R mapping", I'm just referring to the sort of mappings between objects and database entities (tables and relationships) that the manual of a good mapping tool should describe. For example, in Hibernate's case, the sort of thing described in Mapping Associations. As that document points out, you have to be concerned with properties of relationships such as directionality (uni- or bi-directional) and multiplicity (e.g. one to many, many to many), and the semantics of the OO collections you choose to model those relationships.

It's not difficult to get these right, but if you get them wrong or take shortcuts, while you might get away with it for a particular application, it tends to have a bad effect on the generality of the model. If your goal is to be able to have applications successfully deal with the database directly, as well as through the O/R mapping, you want the two to be as semantically close to each other as possible. A proper mapping achieves that.

[Edit: Per Vognsen is right to point out that I'm talking about domain models. Proper O/R mapping could be described as ensuring that your mapped object model and database express the same domain model as closely as possible.]

Proper mapping

Thank you for the reference.

In the relational model, there is no notion of pointer or directionality. The relational model represents facts as mathematical (ignoring minor differences) relations. So how exactly do you suggest the relations should be mapped to object-world things? If you translate the original relations into some kind of graph-like structure, you necessarily expose a pointer-based physical implementation with all the related deficiencies thereof (I think this kind of [network] representation has been discussed quite a lot, so I won't spend much time on it).

My question is: did you have some sort of network model in mind, or something entirely different? What advantage would mapping the database relations into a hypothetical object data model have? I am genuinely curious, as I have not seen a better alternative to what the relational model already has to offer (not that I am saying there is not one).

Do not adjust your set

In the relational model, there is no notion of pointer and directionality.

Foreign keys refer to specific tuples in another relation. Logically, this is a pointer, and in an OO language, it is modeled by an object reference. Directionality is more of a view issue, which is why I started out by saying that I see O/R mapping as being more akin to a view of a database. Nothing stops you from having multiple views of the same database, possibly in different applications. (Also, you can map every relationship as bidirectional if you like, but in practice application requirements usually don't call for that.)

So how exactly do you suggest the relations should be mapped to object-world things?

See the Ambler reference in my other post. But it's simple enough: think of relations as OO classes, and foreign keys as object references, and you're a big chunk of the way there.

What advantage would mapping the database relations into a hypothetical object data model have?

Object models encode the semantics of applications, not just of their data. You can think of it as a way of defining and organizing applications that's analogous (and quite closely related) to the way in which the relational model organizes data.

I apologize for sudden lack

I apologize for the sudden lack of formatting in my replies.

You wrote: Foreign keys refer to specific tuples in another relation. Logically, this is a pointer,

I am afraid you are confused about the relational model. The foreign key is a constraint [on the range of values a column is allowed to assume] in a relational database, just like any other constraint. There is no explicit pointer at all. The foreign key in no way restricts relational database queries, as an explicit pointer in, say, the network model would. All this stuff is pretty well known and I hope I'd be excused for pointing out the obvious.

You wrote: think of relations as OO classes, and foreign keys as object references, and you're a big chunk of the way there.

Yes, that's a traditional and flawed approach, as it converts the relational representation, independent of its physical implementation, into a network model with its well-known defects.

Object models encode the semantics of applications, not just of their data.

Assuming it's an important consideration, how do you propose to encode the behaviour (application) in the relational database when you map from the object world to the relational one?

CODASYL rules OK!

I am afraid you are confused about the relational model. The foreign key is a constraint [on the range of values a column is allowed to assume] in a relational database, just like any other constraint. There is no explicit pointer at all.

The fact that the relational model doesn't mention the word "pointer" doesn't affect the fact that foreign keys model the behavior of pointers, when viewed from certain perspectives. You appear to be insisting on viewing things from "inside" the relational model, and so you become a human victim of Gödel's Incompleteness Theorem, i.e. the view from inside a given model is always restricted. We're discussing a mapping between models, so you're going to have to shift your perspective to gain any kind of real understanding.

Yes, that's a traditional and flawed approach, as it converts the relational representation, independent of its physical implementation, into a network model with its well-known defects.

In support of my point above, notice how when you're talking about the relational model, you're very precise about terminology and unwilling to acknowledge an obvious conceptual connection, whereas you're perfectly willing to treat an OO class as though it were a feature of the network data model, presumably described somewhere deep inside the CODASYL data model specification.

Marc Hamann has provided a good response to the substance of your point above, elsewhere in the thread.

Assuming it's an important consideration, how do you propose to encode the behaviour (application) in the relational database when you map from the object world to the relational one?

That's precisely the point — the relational model isn't designed to encode such behavior. O/R mapping deals with the encoding of the data aspect of an object-oriented program onto a relational database. The requirement for this arises because general-purpose programming languages have evolved models for dealing with data which are inspired by a different set of requirements and constraints than those that underlie the relational model.

Domain model

I think the advantages you describe tend to come from enforcing a well-specified domain model in a layer of the middle tier, which all other business logic has to go through in order to access the database. In other words, the business logic sees only the domain model. Object-relational mapping helps in implementing this domain model but the mapper itself is not the source of the benefits you list; the domain model is.

Domain model

Please explain what you mean by the domain model and what specific advantages it has in comparison to the relational model.

Thank you.

Domain model

Do you know what a "domain" is? Do you know what a "model" is?

As I did not present it as competitive with the relational model (in fact, if you'd bother to look up the meaning of "domain model" you'd see that it would be like comparing apples and oranges), I am unable to comprehend why you think I need to explain "specific advantages it has in comparison to the relational model".

Thank you.

"Do you know what a "domain"

"Do you know what a "domain" is? Do you know what a "model" is?"

Your questions do not make sense without a context, but I'll try answering them anyway.

In the relational database world, "domain" is a synonym for "data type" (integer, character, user-defined type, etc.). Now, what do you mean by "domain" exactly?

As to the word "model", I'll venture to suggest, as an example, the notion of relational model which represents the facts of interest as mathematical relations. What's your understanding of the word "model" ?

I did bother to look up the "domain model" definition. I found many slightly different ones. The closest I could relate to was the old network data model. Is that what you had in mind, or did you mean some other definition of the domain model? Please clarify.

Regarding " you'd see that it would be like comparing apples and oranges". Are you saying that mapping the relational model to the domain model is akin to mapping apples to oranges ? If so, why did you refer to the domain model in the context of mapping relational data to some object representation ? Please clarify.

Regarding "I am unable to comprehend why you think I need to explain "specific advantages it has in comparison to the relational model".

Please correct me if I am wrong, but my impression was that you'd suggested the domain model (whatever it is) as the target for mapping relational data to. If so, asking whether such a [domain] model possesses some advantages in comparison to the relational model is quite legitimate.

Tengo una maleta blanca ("I have a white suitcase")

"Do you know what a "domain" is? Do you know what a "model" is?"

Your questions do not make sense without a context, but I'll try answering them anyway.

My intent with those silly questions was actually to make you look up the terms. I had assumed since you're posting to a programming language site, you'd be familiar with the term "domain" in a couple of the contexts commonly discussed here. The one I had in mind is the same kind of "domain" used in the term "domain-specific language". The very first Google hit I got for "domain model" (a page on Martin Fowler's wiki) uses the term in this sense.

Regarding " you'd see that it would be like comparing apples and oranges". Are you saying that mapping the relational model to the domain model is akin to mapping apples to oranges ? If so, why did you refer to the domain model in the context of mapping relational data to some object representation ? Please clarify.

I think my use of the apples/oranges comparison was appropriate. When someone speaks of "the" relational model, they are referring to a general theory put forward by Codd. On the other hand, one doesn't speak of "the" domain model; a domain model is something specific to an application. Roughly speaking, it encompasses something like an entity-relationship model as well as accompanying behavior.

Domains and such

I had assumed since you're posting to a programming language site, you'd be familiar with the term "domain" in a couple of the contexts commonly discussed here. The one I had in mind is the same kind of "domain" used in the term "domain-specific language".

You've assumed right. However, the naked "domain" word is so generic as to be devoid of any meaning.

Now, since we are talking about mapping relational data to the object world, one naturally assumes that 'domain' is used in the sense of a set of possible values [with a set of operations defined over such a set of values]. How do you define the word "domain" in the domain model without resorting to analogies like "the same kind of 'domain' used in the term 'domain-specific language'"?

it encompasses something like an entity-relation model as well as accompanying behavior

If the "domain model" includes "behavior", in addition to data, how useful is it for R<->O mapping if the relational model does not include such behavior ?

More to the point, forgetting about the nebulous "domain model" which is apparently irrelevant ("apples" and "oranges") for ORM, what exactly do you suggest we want to map relational data to? Lists? Graphs? Or something else?

Tedious

More to the point, forgetting about the nebulous "domain model" which is apparently irrelevant ("apples" and "oranges") for ORM

What I said was that it does not make sense to compare "domain model" and "relational model" as if they were competing approaches, at least if you use "relational model" in the way you did in previous posts.

what exactly do you suggest we want to map relational data to? Lists? Graphs? Or something else?

I did not purport to be privy to a better way of doing object-relational mapping than what is already in common use, did I? Please stop projecting these behaviors and opinions onto others, it is getting tedious.

I think we've wandered sufficiently off-topic. I don't have a problem backing up claims I actually made but repeatedly trying to make me responsible for claims I never made is offensive and a waste of time.

References

An intro to these kinds of topics is out of scope here. If you're interested in this subject, look at the work of people like Martin Fowler, such as the book Patterns of Enterprise Application Architecture. You could also try this overview of domain engineering (follow the domain analysis, design, and implementation links from that page). For discussion of O/R mapping, Scott Ambler's work should be helpful, such as Mapping Objects to Relational Databases.

Regarding comparing relational models to domain models, domain models go well beyond just data, covering the definition and behavior of applications. The relational model is not a competitor in that space, but rather an important tool that deals with just one part of the picture.

1. What I've been hoping to f

1. What I've been hoping to find out by my questions was the following:

How exactly do you (or anybody else who cares to answer) propose to map a relational model representation to some object one? What specific object model is suggested as a mapping target? Is it the old and unsatisfactory network model, or something else? Please give a straightforward answer without referring to other publications or saying that such intro topics are "out of scope here" and so on.

2. I did read some references including those you've provided.

Domain engineering is a process for creating a competence in application engineering for a family of similar systems. Domain engineering covers all the activities for building software core assets.

The above sounds like a random collection of words devoid of any technical content, more appropriate for a marketing expose than for a technical article.

I am familiar with Scott Ambler's writing. My impression is that his understanding of what data modeling and the relational model are is very weak, to say the least. For example, he often confuses the conceptual, logical and physical models, and even rejects the notion of logical modeling:

What I’ve seen happen is that people, often the data modelers doing the work, quickly realize that logical persistence modeling is simply a waste of effort that can be cut out of the process

Network model

Your adversarial tone and attitude is pretty amusing.

Is it the old and unsatisfactory network model

Assuming you are referring to the same kind of network model you mentioned in a previous post, then yes. At the end of the day we have to program against the data model somehow, whether relational or otherwise, and most people have found this a pretty convenient way to do so in many cases. No-one is claiming it is ideal.

I apologize if my way of putt

I apologize if my way of putting things offended you.

I honestly do not understand why one would want to regress to the inferior (network) data model (dependency on pointers, query language problems, optimization problems, etc.). If I misunderstand the idea [of why one would want to map the relational data to something else], please explain.

I'll try a simple explanation

I'll try a simple explanation of what "domain models" are, along with my opinion on them.

Let's say we have a Customers table, an Orders table and a Products table. And a use case: a Customer orders a Product, creating an Order; some money is deducted from the Customer's account.

Throw in some funky rules like: if the Customer is an Old Timer, the Order gets a 10% discount. If the Customer has made more than 5 orders this month, he/she receives a small Teapot along with the Order. And so on.

The natural approach of an object-minded person is to factor all those rules - the behaviour of the data - into objects/classes. But we need some objects and classes to begin with! OK, says the object-minded person, we'll create a domain model. It'll have classes for Customer, Order and Product. We'll put the behaviour there, and life will be good.

After this decision has been made, life certainly doesn't become good by any stretch of imagination. Witness EJB and EJB-QL, witness Fowler's book "Patterns of Enterprise Application Architecture" (already mentioned by Anton Van Straaten), et cetera.
A large part of the Java industry's current occupation is trying to map "business domain data" onto objects.

You are right in implying that O/R mappings don't work. However, funny as it may seem, they're the standard approach now, because they appear to work in simple cases, and because people need to put the behaviour somewhere. For example, the most popular Ruby framework for webapps, Rails, has an O/R mapping.

I hope something better comes along soon. Maybe that "something better" is "rules engines", or generic functions with predicate dispatch on the data; but I'm not quite sure yet.

After this decision has bee


After this decision has been made, life certainly doesn't become good by any stretch of imagination.

That's my experience as well.


A large part of the Java industry's current occupation is trying to map "business domain data" onto objects

Quite misguided it appears.


You are right in implying that O/R mappings don't work.

A natural question arises: why make the same mistake over and over again? Why not just *take* the required data (using the relational query facilities) into the application and use it there? Why try to shoehorn the relational representation into an inferior network representation?

A shadowy head rises from the surface of the loch...

Why try to shoehorn the relational representation into an inferior network representation?

Your premise here is faulty: a network representation is not inherently inferior in all ways, it is merely inferior with respect to a few features of a relational representation: uniformity and, from that uniformity, certain mathematical generalities.

But there is a cost to the generality and uniformity: you must traverse a lot of relations to assemble all the data that you need for any given purpose. Once you have taken your data out of persistence, you want it in some convenient form that models what you need it for.

In such a case, a specifically-designed network model of the data "normalized" for that specific use rather than for uniformity or generality is in fact the superior representation.

So, to make this pithy: if you want to build a general-purpose DBMS, use a relational representation, but once you have the data in a particular application you are much better off with a purpose-built network model.

Hence all the effort put into ORM.

The well-known network data mo

The well-known network data model deficiencies are:

1. Access path dependence
2. Impoverished navigational query language
3. No "network" algebra similar to the relational algebra
4. No simple mechanism for data integrity (constraints) enforcement.

(3) sounds too abstract, but has important implications for the query language implementation and query optimization.

The deficiencies are absolute, not relative to the relational model.

You wrote:
you must traverse a lot of relations to assemble all the data that you need for any given purpose.

That's a strawman. Firstly, joining relations, although widespread, is not needed in "any" (every) case; secondly, modern relational databases execute joins very efficiently.

A more practical example would be the following.

An accounting system was implemented using a commercial ORM tool where objects were mapped to relational tables.

At the testing stage it was discovered that performance was far from satisfactory. It turned out that complex objects consisting of a multitude of other contained objects had to be assembled from the respective tables, which resulted in a corresponding number of simplistic SQL select statements autogenerated by the mapping layer. E.g. one screen record resulted in a dozen or more database round trips.

If a join were used instead, the record could have been obtained at once with much better performance (the actual experiment showed a ten-fold performance gain).

Unfortunately, the system was too far into its life cycle for the architecture to be changed. The solution was buying more hardware to achieve mediocre, barely acceptable performance.

Joins

If a join were used instead, the record could have been obtained at once with much better performance (the actual experiment showed a ten-fold performance gain).

Practically all ORMs provide ways to let you annotate classes with hints to the mapper such that this joining will occur. Using an ORM does not mean that you don't profile and optimize queries, or that you don't understand the performance trade-offs involved in joins vs database round-trips. Now, I do agree that using an ORM can fool you into believing that you don't have to care about these things. But a bad programmer is a bad programmer.

Joins

You wrote:

Practically all ORMs provide ways to let you annotate classes with hints to the mapper such that this joining will occur.

That is partially correct. The mapper was still unable to produce a rather simple multi-table join in the case I mentioned. It did allow one to create a "custom class" where one could specify any kind of query.

As soon as one does so [the custom class], a natural question arises: why bother at all, and why not use the real thing, the relational query, instead? We could not, due to the issues I mentioned earlier.

You wrote:
Using an ORM does not mean that you don't profile and optimize queries, or that you don't understand the performance trade-offs involved in joins vs database round-trips.

If so, then what good does ORM do for me, as an application programmer? I could just as well *not* use it and improve performance by avoiding the unnecessary mapping layers and just writing my custom classes; after all, I'd need to know SQL and how to tune it anyway.

Additional problems, not only with this project but also with some others where a decision was made to use an ORM, were almost insurmountable concurrency control issues, as well as a lack of data integrity enforcement about which the data object "architects" had no clue (the "design" proceeded from the objects/applications towards the database, UML being the "model").

Joins

(By the way, a reasonably easy way to mark the text you're replying to is to use <i></i> tags. Yeah, I know the forum software sucks.)

If so, then what good does ORM do for me, as an application programmer? I could just as well *not* use it and improve performance by avoiding the unnecessary mapping layers and just writing my custom classes; after all, I'd need to know SQL and how to tune it anyway.

Cool, I think this gets at the heart of the matter. My answer is that you don't need to carefully tune performance in most places by writing custom SQL queries. Thus, if you go ahead and write all your custom classes from scratch, you would likely end up repeating yourself a lot and find yourself with a substantial amount of boilerplate code. Besides the initial expense of writing all this code, it also constitutes an additional maintenance burden. I think attempts at eliminating this repetition and boilerplate invariably lead towards something that at least smells like an ORM.

A symptom?

As andnaess has opined above, the fact that you "naturally" get something that smells like an ORM may be a symptom rather than a solution. It might point to deeper problems in how one is thinking about and structuring the solution.

An axiom to grind

The deficiencies are absolute, not relative to the relational model.

They are only absolute if you take those particular criteria as axiomatic for judging a data representation.

The whole point is that there are other equally compelling properties for which a non-relational model can demonstrate superiority.

That's a strawman. Firstly, joining relations, although widespread, is not needed in "any" (every) case; secondly, modern relational databases execute joins very efficiently.

The devil is in the details: in some cases assembling the data with a query has superior performance, and in others not.

That aside, performance isn't necessarily the criterion under consideration. Logical simplicity of your application model may be the most important criterion, or some domain-specific data requirement.

If there were simply the One True Data Representation that was best for everything, we would just use it. Almost nothing related to PLs or computation works that way, though.

The point again, in more detail

An object-minded person naturally thinks about stuff in terms of opaque objects with behaviour.

1) If we go the "naturally relational" way - just take the data out of the DB, and then apply the "behaviour" in sequence - we end up with a lot long functions, each having lots of nested if statements and such. This kind of code "smells" to the object-minded person; a bell goes off in his/her head that suggests, "this is bad code, refactor this". And the very first refactoring looks very much like an O/R mapping.

2) "Low-level" vs "high-level" have a different meaning in the OO world. Any kind of data seems "too low-level" to the object-minded person - hence the huge numbers of "getter and setter methods". "Abstraction" in the OO world is abstraction away from naked data, and towards opaque behaviour; so, SQL is perceived as being too close to the data, hence "too low-level".

3) Java's database API (JDBC) is very clunky to use, compared to e.g. Perl's DBI. Basically, to have a decent database API, you'd want varargs (binding parameters to queries), dynamic typing OR really expressive parameterized types (to avoid the need to cast the results), and a convenient iteration protocol (to step through the result set). Pre-1.5 Java has none of the above, and Java 1.5 isn't yet understood even by experts. (See the sketch after this list for the resulting shape.)

4) It's 2x cheaper to teach a person ONE language (Java) than to teach him/her TWO languages (Java and SQL). Quick training is one of Java's major selling points.

And so on...
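
To make point 3 above concrete, here is the classic JDBC shape of even a trivial query (conn, authorId and process are assumed context):

// Positional parameters, manual iteration, explicit cleanup - per query.
PreparedStatement ps = conn.prepareStatement(
    "SELECT title FROM posts WHERE author_id = ?");
ps.setInt(1, authorId);            // no varargs: one setXxx call per "?"
ResultSet rs = ps.executeQuery();
while (rs.next()) {                // no for-each style iteration protocol
    process(rs.getString("title"));
}
rs.close();
ps.close();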

Strawman

I think there's a pretty substantial strawman lurking in your argument: that someone who wants to use any form of O/R mapping has to be an OO purist, one who is in love with "huge numbers of getter and setter methods" and has kneejerk reactions to any kind of non-opaque interfaces and the like.

All I want is a well-defined abstraction boundary that (1) makes it convenient to program against the data; and (2) enforces invariants and behaviors that cannot or should not be expressed at the RDBMS level. Wanting to have the application go through such a boundary is not due to OO purism but due to practical concerns for consistency and modularity. O/R mapping helps directly with (1) and, as a side benefit, provides a locus on which to hang code for (2). It is not much of a solution in itself but it can be part of a solution.

2

It's funny. (2) seems objective enough to me; but (1) sounds awfully OO-minded.

O/R mapping certainly makes it convenient to program against the data - in the OO definition of "convenient". Ditto, "consistency" and "modularity"; in monad-speak, those are OO consistency and OO modularity.

At the risk of sounding controversial, I'll try to explain a bit more...

Let's start with the "practical concern for modularity". An interesting thing about OO: it doesn't like long, free-standing functions (grouping behaviour by task). All functions must belong to classes instead, and classes shouldn't be overly long. This is modularity in the OO sense, and OO languages actually go out of their way to force this notion on the programmer; for example, Java discourages large classes by making every class reside in one source file.

The procedural notion of modularity is quite a bit different. Unlike OO methods, procedural-style functions typically have task coherency. This means they are longer on average, but have a simpler structure of arguments (no opaque objects with methods - just strings, numbers, arrays and flags). This style of modularity fits very well with databases, because function length is less of a smell - you don't pay a "modularity penalty" for database interactions.

In other words, if some pesky concern (like database interaction) is inflating the size of your methods, why not factor it out instead? - the OO person thinks, and writes an O/R mapping. Pure ideology; the problem didn't ask for factoring, in the first place.

(The procedural notion of modularity has other things going for it, too. Let's say you're writing Java, and need to recursively delete files from a directory. OK, the method isn't tricky to implement; but where do we put it? We can't put it in File, because that's a system class. We can't put it in a public static method in com.whatever.Util, because this is "poor style". You may say, well, filesystems are "like that". Databases are "like that", too; a poor fit for OO.)
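
For reference, the method in question is trivial to write; the complaint above is only that Java offers it no natural home:

// Recursively delete a directory tree using only java.io.File.
static void deleteRecursively(java.io.File f) {
    java.io.File[] children = f.listFiles();   // null unless f is a directory
    if (children != null) {
        for (java.io.File child : children) deleteRecursively(child);
    }
    f.delete();
}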

"Consistency" is also a loaded word in OO. What could be more consistent than SQL? However, relations don't provide a collection-traversal interface, or getter/setter methods. (Funnily, those trivial things are perceived by many as THE reason to use O/R mappings.) SQL isn't "consistent" with the object-minded view of the world; no more, no less.

As for convenience of programming, let's just say convenience is subjective unless measured. Maybe we should compare and contrast: raw JDBC vs Hibernate (or EJB, for laughs) in Java, DBI vs ActiveRecord in Ruby. One of the "convenience smells" is caching/laziness: JDBC and DBI don't ever make users think about caching. (For Hibernate, the "n+1 problem" is a Frequently Asked Question.)

In summary, yes, we don't have a natural approach to working with databases from OO languages. The problem, however, lies with OO - in the OO understanding of "factoring", "modularity", "abstraction levels" et cetera. Nobody wanted an O/R mapping in pre-OO PHP; nobody wants it in Perl; nobody wants it in Haskell.

3

I won't comment much on your points as they seem to rail against an OO purism that was nowhere to be found in my original post. Suffice it to say that I agree with many of your criticisms of OO purism but fail to see how they counter my points.

Unlike OO methods, procedural-style functions typically have task coherency. This means they are longer on average, but have a simpler structure of arguments (no opaque objects with methods - just strings, numbers, arrays and flags).

You seem to be confusing data abstraction with OO. If you look at the code in, say, SICP you'll see gobs and gobs of data abstraction but no OOP in sight. Data abstraction does not imply "opaque objects with methods". Note that I would never claim that almost every function has to operate on an object, method-style, or be associated with a class--that's the kind of blatant absurdity that some OO purists like to peddle.

As for convenience of programming, let's just say convenience is subjective unless measured. Maybe we should compare and contrast: raw JDBC vs Hibernate (or EJB, for laughs) in Java, DBI vs ActiveRecord in Ruby. One of the "convenience smells" is caching/laziness: JDBC and DBI don't ever make users think about caching. (For Hibernate, the "n+1 problem" is a Frequently Asked Question.)

Perhaps the problem (if there is one) is that many ORMs try a little too hard to be persistence frameworks built on top of RDBMS's.

N/R mapping

However, relations don't provide a collection-traversal interface, or getter/setter methods. (Funnily, those trivial things are perceived by many as THE reason to use O/R mappings.)

A big reason for those kinds of perceptions is that it's very convenient and economical to be able to write an expression like "a.b.c" to express a particular query over three relations involving two joins and four or more key columns. When you scale that convenience and economy over a large application or set of applications, the improvement can be substantial.

It's even more useful that the expression "a.b.c" is not an expression in some domain-specific language, it's a native expression operating on native variables and data values in the host language. This is important enough to be considered a "killer feature" of the approach. This has little to do with object-orientation per se - one could imagine a similar capability in e.g. Haskell or ML being equally useful. HaskellDB seems quite nice as a query language, but doesn't seem to address this kind of mapping.
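
For instance, with hypothetically mapped classes (all names invented), a plain navigation expression stands in for a multi-table join:

// Ordinary Java, but a mapper can satisfy it with a single three-table join:
String regionName = order.getCustomer().getRegion().getName();
// roughly: SELECT r.name
//          FROM orders o
//          JOIN customers c ON o.customer_id = c.id
//          JOIN regions r ON c.region_id = r.id
//          WHERE o.id = ?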

Your objections regarding things like caching merely reflect that O/R mapping is middleware, which suffers from not being implemented inside the database it's mapping. Database engines also perform caching, but it tends to be more transparent, if only because it doesn't have to be implemented by a separately-installed caching engine.

Synonyms

N/R mapping seems to be a genuinely useful thing. I wonder if it can be done without O/R mapping...

In one of my apps, I've stumbled upon a "join synonyms" approach. It's a bastardization of some ideas from HaskellDB, but unlike HaskellDB, it doesn't require Haskell :-)

For a somewhat contrived example, "select * from order, customer where {order [belongs_to] customer} and customer.id = 1111". Conceptually, "belongs_to" is something that takes two names ("order" and "customer") as arguments, and returns an appropriate join condition (a "synonym") in SQL.

Several such "synonyms" can be defined, simplifying your "a.b.c" problem a bit. A fun point about those "synonyms" is that they're quite easy to define and combine - they're just functions in our programming language, which receive some table names (and maybe some literal values) as arguments, and return SQL code. The code may contain nested subqueries (e.g. "exists") and other helpful stuff, making the approach quite powerful.

I didn't care about strong typing. Syntax problems are detected on the first run, when the database parses our SQL :-)

So, going back to your example, you only have to define the "a.b.c" join condition once, and you can do it in terms of "a.b" and "b.c" (and SQL's "exists"). This doesn't even require objects - I've done this in PHP. In fact, it could just as easily be done in C.
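
To show how little machinery the idea needs, here is the same "synonym" trick as a hedged Java sketch (all names invented):

// A synonym is just a function from table names to a SQL join condition.
static String belongsTo(String child, String parent) {
    return child + "." + parent + "_id = " + parent + ".id";
}

// Usage:
// "SELECT * FROM order, customer WHERE "
//     + belongsTo("order", "customer") + " AND customer.id = ?"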

I have yet to try this approach for larger-scale apps...

Behaviour

A simple point to make re. "behaviour" aka "application".

One would have imagined that the importance of data independence from a specific application was realized about 30 years ago and the issue has been settled once and for all since.

The idea of attaching "behaviour" to data *in the data model* is quite bizarre. What's more important for an enterprise, what has a longer life, the collection of data representing/modelling the enterprise or a volatile set of applications such that new applications may appear and the old ones go away?

If you have an entity, say, person(name, age, weight, etc.), what "natural" behaviour would you attach to it in the data model? The behaviour/application entirely depends on the context, while the data part remains unchanged.

I can easily imagine an application requiring a data store for nothing much but to serialize its objects. However, trying to apply this kind of thinking to a vast majority of real life applications where the database plays a central role would be counterproductive at best.

N.B. "you" in the above is an impersonal pronoun ;)

Data ain't so hot

One would have imagined that the importance of data independence from a specific application was realized about 30 years ago and the issue has been settled once and for all since.

O/R mapping addresses this precise issue, by persisting data to a relational database. Again, this is a "best of two imperfect worlds" scenario: an application using O/R mapping gets the benefits of being able to use a very specific view of data, customized to the application in question and abstracting away many low-level details of the underlying relational model; while the data still ends up in relational form, where it is accessible to other applications which may have a different perspective, whether OO or otherwise.

The idea of attaching "behaviour" to data *in the data model* is quite bizarre.

No-one attaches behavior to data in the data model. They attach behavior to data in an object model.

You seem to be confusing levels, as though O/R mapping or object models were intended as a replacement for the relational model. If that were the case, we wouldn't be discussing how to map data to the relational model, would we?

What's more important for an enterprise, what has a longer life, the collection of data representing/modelling the enterprise or a volatile set of applications such that new applications may appear and the old ones go away?

If you think of an object model as something specific to a particular application or set of applications, your objections are moot.

Besides, there's no simple one-sided answer to the question "what has a longer life". Some kinds of data have a longer life than some kinds of behavior, and vice versa. And sometimes, data and behavior aren't that easy to distinguish from each other ("code is data"). However, such concerns are outside the scope of the relational model.

Persisting

"O/R mapping addresses this precise issue, by persisting data to a relational database. "

"Persisting data" without relational modelling results in a bit bucket useless without the attached application unless "persisting" implies that relational modelling had been done beforehand.

Again, this is a "best of two imperfect worlds" scenario: an application using O/R mapping gets the benefits of being able to use a very specific view of data, customized to the application in question and abstracting away many low-level details of the underlying relational model;

There are several objections to this point of view.

o The application *loses* the ability to issue efficient declarative queries and instead has to resort to low-level object-by-object graph traversal (see my earlier replies).

o The relational model is necessarily of a higher level than its poor object cousin. As I mentioned before, one formulates queries using a high-level, application-independent declarative language operating with *sets* instead of individual objects. Nothing of the kind is possible with the network data model.

o In practice, using ORM is the *worst* of both worlds, due to the limitations I briefly mentioned above and such often-ignored things as concurrency control and data integrity.

You seem to be confusing levels, as though O/R mapping or object models were intended as a replacement for the relational model.

I hope I am not. My imaginary picture is: relations <-> object network [with attached behaviour as member functions].

If that were the case, we wouldn't be discussing how to map data to the relational model, would we?

I am trying to understand what the state of the art in the area is since what I've seen so far makes ORM layers appear rather pointless.

Re: "Data ain't so hot".

It's a serious misconception, at least for enterprise applications. Data is very hot as can easily be seen by observing a lot of projects migrating presumably 'old' data from the mainframe elsewhere. It's much less costly to rewrite an application than to lose vital data.

Titles here shouldn't be taken too seriously

"Persisting data" without relational modelling results in a bit bucket useless without the attached application unless "persisting" implies that relational modelling had been done beforehand.

Who said anything about not doing relational modelling? Obviously, if one is going to persist data to a relational database, one had better develop a good relational model to persist to.

o The application *loses* ability to issue efficient declarative queries and instead has to resort to a low-level object-by-object graph traversal (see my earlier replies).

Tools like Hibernate's query language refute this point.

o The relational model is necessarily of a higher level than its poor object cousin. As I mentioned before, one formulates queries using a high-level, application independent declarative language operating with *sets* instead of individual objects. Nothing of the kind is possible with the network data model.

(Aside: injecting pejoratives like "poor" next to things you don't like doesn't actually achieve anything, except possibly to emphasize your lack of a better argument than what amounts to name-calling.)

The reality is that what is high-level and low-level is relative to the problem being solved. If one defines low-level as having to deal with details not directly relevant to the problem being solved, then in many situations, direct use of relational databases via relational queries is low-level.

Object query languages support "formulating queries using a high-level, application independent declarative language operating with *sets* instead of individual objects," so your claim that "Nothing of the kind is possible with the network data model" is incorrect.

In practice, using ORM is the *worst* of the worlds due to the limitations I briefly mentioned above and such, often ignored, things as concurrency control and data integrity.

How do you know this, "in practice", since by your comments you've made it clear that you have no direct experience with ORM in practice?

In practice, I've found it very effective, as have many other people. I'm not a blind advocate of any of these approaches. I'm coming from the approach of what has worked and solved problems for me and my clients in real applications, working with the tools that are available to us. Often, the problems we're dealing with have a lot to do with the limitations of tools: both SQL databases and object-oriented languages have all sorts of limitations and warts. In this context, the ability to combine two systems and exploit the strengths and avoid the weaknesses of both is very valuable.

I am trying to understand what the state of the art in the area is since what I've seen so far makes ORM layers appear rather pointless.

The point is incredibly simple. It is to allow programmers working in certain programming languages to access persistent data using the native data model of their language. Your point really amounts to saying that you don't like the data model of those languages. My suggestion to you then is, don't use those languages, in which case the issue of O/R mapping will not arise for you.

HQL

Tools like Hibernate's query language refute this point.

Tools like HQL are navigational tools in sheep's clothing. E.g. an HQL query like this:

select p.id, d.name from Person p inner join p.department as d

looks attractive, almost like real SQL, but in order to execute it, one needs to define an association between Person and Department in an XML file using a couple dozen lines. The mapping XML file, in fact, describes a network of classes and their associations with relational tables and imposes an application dependent structure upon the original relational data, so that one *cannot* just say, as one would in SQL:

select p.id, d.name from Person p inner join Department d on arbitrary_predicate

The possible access paths are defined in the mapping XML file and determine how the ORM will generate SQL statements operating with relational tables. So I am not sure how exactly HQL refutes my point about a graph traversal if the above query explicitly describes a navigation from a Person object to a Department object. The 'join' word is quite redundant really; the 'p.department' string is sufficient to describe what's going on.

Other issues with HQL are the usual ones:

(a) lack of closure (in other words, there is no "network algebra" similar to the RA);
(b) it's impossible to use "derived networks of objects" (see relational derived tables aka subqueries);
(c) it's impossible to use usual optimization approaches because of (a).
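
To illustrate point (b) (a hypothetical SQL query; table and column names as in the Cat mapping discussed below): in SQL the result of a query is itself a relation that can be queried again, with no counterpart at the object level.

-- average litter size, computed from a derived table
select avg(n)
from (select mother_id, count(*) as n
      from cats
      group by mother_id) litter_sizes;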

When we looked at Hibernate a while ago, we discovered the following:

o H. sometimes generates unnecessary select statements;

o H. exhibits a significant slowdown when dealing with more than 10 columns;

o H. has various insert problems (batching);

How do you know this, "in practice", since by your comments you've made it clear that you have no direct experience with ORM in practice

If you'd bothered to read carefully what I wrote before, you'd have discovered that I briefly described just that kind of experience.

My suggestion to you then is, don't use those languages, in which case the issue of O/R mapping will not arise for you.

I do not see much point in using a mapping layer that converts relations to a network of objects, that's true; but fortunately, one does not need such drastic measures as giving up on object-oriented languages altogether. Not using a 'relations <-> network' mapping layer would be quite sufficient.

Semantic gap filler

I agree with some of your objections, related to the point that an object mapping of a relational database doesn't provide a completely general solution to any possible data access pattern to that database (although in practice, a complete mapping of a relational schema does tend to satisfy most of the requirements of many kinds of application). However, expecting a fully general query capability misses the point of O/R mapping.

The point is to provide a capability to a particular application or set of applications, which have a particular perspective on the data and particular requirements for accessing it. You mention that it "imposes an application dependent structure upon the original relational data", as though that were a bad thing; but in the context of the application which uses that structure, that is precisely the benefit it offers.

Part of the point is also to make sure that the data resides in a relational database, so that it can support other kinds of access patterns. If you find that O/R mapping is not suitable for a particular kind of application, or even a particular part of an application, then don't use it in that context!

But in the contexts where O/R mapping does make sense, the XML mapping file you've objected to is a huge saver of time and code — if you take that away, you're left dealing with the infamous semantic gap with hand-written code, much of which is essentially boilerplate. To get rid of the boilerplate, you need some kind of layer. Even tools like HaskellDB and SchemeQL don't completely deal with this issue — there remains a real need to integrate relational data at the level of values in a language (see my comment entitled "N/R Mapping"). You haven't made it clear whether you have a proposed alternative solution to these kinds of issues.
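
To make the boilerplate concrete (a hypothetical fragment, not from any particular tool; Person, Department and loadDepartment are made up): without a mapping layer, every persistent class needs hand-written code of roughly this shape, kept in sync with the schema by hand.

// One of these per class/table -- the "semantic gap" in code form.
Person loadPerson(Connection conn, long id) throws SQLException {
    PreparedStatement ps = conn.prepareStatement(
            "select name, dept_id from person where id = ?");
    ps.setLong(1, id);
    ResultSet rs = ps.executeQuery();
    if (!rs.next()) return null;
    Person p = new Person();
    p.setId(id);
    p.setName(rs.getString("name"));
    p.setDepartment(loadDepartment(conn, rs.getLong("dept_id")));
    return p;
}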

Regarding your critique of Hibernate, if you were serious about using it, a report of those problems in the appropriate forum would likely have resulted in fixes where it made sense. But if those are the worst problems you could find, it sounds like you're really reaching for reasons not to use O/R mapping. "Unnecessary select statements" sounds like every critique of every high-level computing technique from the first high-level languages ("unnecessary machine instructions"), to garbage collection, to databases themselves ("unnecessary disk reads"). As for performance, your reported result doesn't match my experience, or that of my clients. I'd be interested to see your benchmark if you have it.

In any case, such objections to a specific package are beside the point. You originally claimed to be "genuinely curious" about the advantages of mapping relational data into an object model. Various people have made honest attempts to explain those advantages, and put them in context. If you were indeed genuinely curious, I think you would agree that some light has been shed on those advantages. However, your posts give the strong impression of someone mainly interested in finding someone to argue with about a pet peeve.

Let me feed this back to you

Let me feed this back to you, to check that I've understood the idea. The reasons to use RDB-backed data storage are threefold:

One, you get persistence and transactional semantics.

Two, you get alternative data access patterns that the designers of the object layer did not forsee.

Three, you get the database to enforce some (all?) data consistency checks.

Is that it, or have I missed something vital?

Where the data is

Another pretty vital reason is that many companies essentially have to use RDBs, for all the usual reasons, most commonly that they're using them already. Some people rob banks because that's where the money is; OO applications use RDB-backed storage because that's where the data is.

That relates to your point two, which is much stronger than you make it sound. It's not just the question of not foreseeing data access patterns, it's also things like needing to share an enterprise database of which any given application may only use a small part.

I do think that a point which seems to go whooshing over the heads of those who don't get it, is that this is all happening in a context over which individual projects or companies often don't have much control. RDBs and OO languages are both dominant in their respective areas. If you have to use both in the same project, O/R mapping is just a tool to make the impedance mismatch more manageable. In my experience, it's pretty effective at that. That doesn't mean it solves all problems, or that you can pretend you're using an OODB all the time, or that you can develop every aspect of your systems using programmers who don't know SQL. It just means that the benefits outweigh the disadvantages in many cases, and that it competes very effectively with the realistically available alternatives.

RM raison d'etre

At the risk of alienating some nice folks even further, I'll venture to comment on this:

Another pretty vital reason is that many companies essentially have to use RDBs, for all the usual reasons, most commonly that they're using them already. Some people rob banks because that's where the money is;

Unless I am mistaken, the implication is that there is a better alternative to the RM and the companies use RDBs just due to legacy reasons.

The actual reason is that there is simply no better alternative as yet. Surely, the network model is not one, as was discovered some 30 years ago ( http://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf ), for well-known reasons like application-dependent access paths, the lack of a query algebra and the associated problems, etc.

LINQ?

What do you think of LINQ?

Mapping

But in the contexts where O/R mapping does make sense, the XML mapping file you've objected to is a huge saver of time and code

OK, in order to execute this:

select mate 
from eg.Cat as cat 
    inner join cat.mate as mate

one needs to:

o define a class for each participating table (I'll skip the example to save space)
o define an association:

<?xml version="1.0"?>
<!DOCTYPE hibernate-mapping PUBLIC
        "-//Hibernate/Hibernate Mapping DTD 2.0//EN"
        "http://hibernate.sourceforge.net/hibernate-mapping-2.0.dtd">

<hibernate-mapping package="eg">

        <class name="Cat" table="CATS" discriminator-value="C">
                <id name="id" column="uid" type="long">
                        <generator class="hilo"/>
                </id>
                <discriminator column="subclass" type="character"/>
                <property name="birthdate" type="date"/>
                <property name="color" not-null="true"/>
                <property name="sex" not-null="true" update="false"/>
                <property name="weight"/>
                <many-to-one name="mate" column="mate_id"/>
                <set name="kittens">
                        <key column="mother_id"/>
                        <one-to-many class="Cat"/>
                </set>
                <subclass name="DomesticCat" discriminator-value="D">
                        <property name="name" type="string"/>
                </subclass>
        </class>

        <class name="Dog">
                <!-- mapping for Dog could go here -->
        </class>

</hibernate-mapping>


o run query and obtain a collection of objects:

List mates = session.createQuery(
        "select mate from eg.Cat as cat inner join cat.mate as mate")
        .list();
/* I do not remember the exact syntax */

Whilst with SQL, one just runs a query and obtains the rows directly:

Statement stmt = connection.createStatement();
ResultSet rs = stmt.executeQuery(
        "select m.* from cats c join cats m on m.uid = c.mate_id");

I do not see much economy here.

Other disadvantages I've briefly mentioned before:
o serious performance degradation in comparison to running straight SQL;
o rigid navigational structure defined in the XML mapping.

What if I want to obtain three cats with matching colors? I'd need to add additional links in the network. In SQL, I'd just use a join with the appropriate predicate. So essentially, in order to make new queries possible, one needs to change the network "schema".
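
For example (using the CATS table from the mapping above), the ad hoc query needs nothing but a predicate:

-- three distinct cats sharing a color; no pre-declared association needed
select c1.uid, c2.uid, c3.uid
from cats c1
join cats c2 on c2.color = c1.color and c2.uid > c1.uid
join cats c3 on c3.color = c2.color and c3.uid > c2.uid;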

"Unnecessary select statements" sounds like every critique of every high-level computing technique from the first high-level languages ("unnecessary machine instructions"), to garbage collection, to databases themselves ("unnecessary disk reads").

I do not think the high-level-language overhead argument is fair. "Unnecessary" should be interpreted as compared to running straight SQL (which is a high-level language) in order to obtain identical results.

As for performance, your reported result doesn't match my experience, or that of my clients. I'd be interested to see your benchmark if you have it.

The tests were run about 8 months ago and not preserved. However, from my previous message (10 roundtrips vs. one), at least one of the major reasons for performance degradation should be pretty obvious. Admittedly, there are tricks like caching (with its associated problems), prefetching the entire collection of children (using "left join fetch") and such, but the point is one just does not need to go through these contortions in order to obtain essentially the same result with much better performance. Besides, we had a class of hierarchical queries which were literally unusably slow when implemented in Java code and executed through the ORM layer.
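
For reference, the "left join fetch" trick looks roughly like this in HQL (based on the Cat mapping above); it collapses the per-parent child selects into a single outer-join select:

from eg.Cat as cat
    left join fetch cat.kittens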

You originally claimed to be "genuinely curious" about the advantages of mapping relational data into an object model. Various people have made honest attempts to explain those advantages, and put them in context. If you were indeed genuinely curious, I think you would agree that some light has been shed on those advantages. However, your posts give the strong impression of someone mainly interested in finding someone to argue with about a pet peeve.

I am genuinely interested in alternatives to what we do now, namely, using stored procedures or XML files in order to avoid sprinkling Java code with SQL statements. I think the stored procedures approach is arguably the best we can have today. However, I am quite open to suggestions other than a relations <-> object network mapping, for the reasons I tried to explain in my messages. I probably should have stopped discussing the ORM suggestions as soon as I realized that what's offered is a network model, but I thought it would be reasonable to offer some argument for why I think such a mapping is unsatisfactory; otherwise it would have looked like, as you put it, "a pet peeve".

To sum up, our reasons for not using a network ORM were in this specific order:
o Network schema rigidity
o Unacceptable performance
o the rest (caching issues, concurrency problems, insert batching)

Code is data?

And sometimes, data and behavior aren't that easy to distinguish from each other

Sorry, but I'll rant a bit...

Both code and data are bit patterns in the computer. Both code and data influence the computer's behaviour. True, but not very interesting.

The difference is in the statistical properties. A 100 megabyte file of comma-separated values (e.g. stock ticker data) can be read by a very simple program; a moderate-sized C source file requires a compiler, an OS and lots of libraries to make any sense.

I prefer to have a sharp separation between code and data. A program should be only about behaviour. Data should be off to a side, in as simple a format as possible, NOT intermingled with code.

(For a simple example, the text of error messages in Windows is split away from code, because of internationalization issues.)

Yes, this is a very non-Lispy and non-FP point of view. For me, this has gotten to the point where I view every use of a collection class as suspect; and this has helped me develop programs a lot. OO techniques make much more sense if you say to yourself, "data doesn't belong here", and "the runtime structure of the program should be as static as possible, and shouldn't reflect the data the program is operating on".

You don't want numbers to be objects with identity. This is bad for you. (Is this bignum equal to this other bignum?)

You don't want strings to be opaque objects with methods. This is bad for you. (Does match(regexp,string) belong in Regexp or in String?)

You don't want EventSource.addListener to add a listener to an internal collection, and EventSource.removeListener to remove it. This is bad for you. (A friend of mine recently had to chase some complex memory leaks due to interactions between GUI state, caching, weakrefs and GC in this scenario. All those problems could be solved by a simple SQL DELETE FROM, if the data were explicit.)
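
A sketch of the explicit-data version of that last example (the table is hypothetical):

create table listener (source_id int, listener_id int);

-- addListener / removeListener become ordinary data manipulation:
insert into listener values (42, 7);
delete from listener where source_id = 42;   -- nothing left to leak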

Yes, data and behaviour are hard to distinguish from each other. But distinguish we must. The relational model is (so far) the most effective way: if something fits in the relational model well, it's data, and doesn't belong in code.

"As is" vs. "Should be"

Sorry for the late response; I was taken away from civilization for several days.

In the OP I actually meant modelling ORM as it is practiced in current Java framework(s), as opposed to modelling it as it should be done. Hence the interest in mutable cells.

The most interesting issue to me is how to map between two stateful models, one having explicit pointers and the other (equality) constraints on keys, and whether bridging this "impedance mismatch" can be factored separately from the state issues.
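
A minimal sketch of the two sides as I see them (hypothetical, just to fix the terminology):

// O side: state is a graph; links are explicit mutable pointers.
class Dept   { String name; }
class Person { String name; Dept dept; /* pointer */ }

// R side: state is a set of tuples; the "link" is an equality
// constraint between key values, enforced by the DBMS:
//   create table person (id int primary key, name varchar(80),
//                        dept_id int references dept(id));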

Oz model?

Probably the only productive way to describe ORM is via embedding both O and R sides into some common formalism. I wonder, whether I should go as far as using general categories (just kidding), or more specific formalisms like constraint languages (along the lines of The Oz Programming Model, or some of research by Vijay Saraswat). I guess I am facing the usual trade-off between generality and depth...

DB/PL: What's the problem?

I am sorry to be getting here late, but nobody let me know that this discussion was going on here. There are so many points of view expressed here that it would take hours to respond to or sort them all out. And even then we probably wouldn't be able to agree because we are all "feeling different parts of the elephant". I'll try to come back and send in some more detailed comments, but for now let me just point out a new paper that I did recently called "Programming Languages & Databases: What's the Problem?". It covers some of the issues raised here. I would very much appreciate comments.

-- William Cook

Nice paper

I like it.

I do have one comment to offer, though... You pretty consistently use Java-like iterative code snippets as example queries... I can't copy/paste from the PDF for some reason, but there are numerous examples.

I find this interesting, first of all because that's by no means the most common way that database queries are written. Most database queries are (I think) written in a declarative relation-oriented language like SQL. Second, in an exploration of database/language integration, it seems a bit early to be implicitly committing to a particular programming model.

I realize, of course, that this is just syntax and that the implied semantics of your for/if/etc. are really more like comprehensions than iteration. So it's really not a complaint. I just thought it was interesting, since it seems to imply that you believe a priori that the "most natural" way to program with relations is to iterate over their elements.

Or, why not use OO...

botch, see reply in thread above.

Damnit! Why does the "reply to this comment" link seem to come and go?

For those of you who think

For those of you who think the relational model is the thing - how about these issues?

In RDBMS - the database VENDOR defines the types. In OO the programmer can define new types. This is a big part of what OO is all about.

Joins - The point is not that joins are "fast". The point is that joins are O(n^2). Try doing deeply nested/interconnected subcomponents with an RDBMS.

Basically what people do with RDBMSs is simplify: entities are defined in terms of canned datatypes (Int, Float, String, Date). Complex relationships are not represented, since the joins would be unwieldy.

Complex relationships are

Complex relationships are not represented, since the joins would be unwieldy.

Mainly due to lack of academic interest, which is now changing. The relational model does not support encapsulation, as you say, but formally expressing relationships on the public members of an interface is now possible: Active Patterns in F#.

That would be me

For those of you who think the relational model is the thing ...

That would be me.

In RDBMS - the database VENDOR defines the types. In OO the programmer can define new types.

In SQL, the SQL vendor defines primitive types such as int and char, and the developer defines application-specific types built up from this via create table. In OOPLs, the language designer defines primitive types such as int and char, and the developer defines application-specific types via class. Seems closely comparable to me.
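
Side by side, the parallel I have in mind (an illustrative pair, not a claim of full equivalence):

// SQL: an application-specific type built from vendor primitives:
//   create table point (x int not null, y int not null);

// Java: an application-specific type built from language primitives:
class Point { int x; int y; }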

The point is that joins are O(n^2).

Where did you hear that?

The term "join" describes a logical operation, not a specific implementation technique; it has no associated Big-O value. It would make as much sense to say "functions are O(n^2)." Sure, some specific instance of a function implemented with some specific algorithm might be, but that doesn't tell us anything about the general case.

Join's Big-O performance depends on what query plan is chosen, which in turn depends on a lot of things such as available indexes, etc. In practice, I have found n^2 joins to be vanishingly rare.
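
To make that concrete (a generic example, not from any benchmark above): the same logical join can run under very different plans.

-- Without an index on orders.customer_id this join may fall back to
-- repeated scans, roughly O(n*m); with it, a typical plan is an
-- index nested loop, closer to O(n log m).
create index orders_customer_idx on orders (customer_id);

select c.name, o.total
from customers c
join orders o on o.customer_id = c.id;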

Try doing deeply nested/interconnected subcomponents with an RDBMS.

I have done -- a lot, actually. I continue to be amazed at how well they work.

Very very very OT...

... but the horizontal rule you used is obnoxiously wide. (Many apologies)

For those of you who think

For those of you who think the relational model is the thing ...

That would be me.

In RDBMS - the database VENDOR defines the types. In OO the programmer can define new types.

"In SQL, the SQL vendor defines primitive types such as int and char, and the developer defines application-specific types built up from this via create table. In OOPLs, the language designer defines primitive types such as int and char, and the developer defines application-specific types via class. Seems closely comparable to me."

Well, in Java, for instance, you have the whole JDK class library with vendor supplied types which are extensible, including their behavior. Building Entities out of the primitive types is quite a bit more restricted, more like structs than classes.

The point is that joins are O(n^2).

"Where did you hear that?"

Not the best way to put it. Actually what I've found is that it's pretty easy to run into joins that require a "table scan".

"The term "join" describes a logical operation, not a specific implementation technique; it has no associated Big-O value. It would make as much sense to say "functions are O(n^2)." Sure, some specific instance of a function implemented with some specific algorithm might be, but that doesn't tell us anything about the general case."

"Join's Big-O performance depends on what query plan is chosen,which in turn depends on a lot of things such as available indexes, etc. In practice, I have found n^2 joins to be vanishingly rare."

Try doing deeply nested/interconnected subcomponents with an RDBMS.

"I have done-- a lot actually. I continue to be amazed at how well they work."

More than, say, 5 tables? It's been a while, but I found that you really couldn't trust the optimizers past 4- or 5-table joins.

I've lost track of what your critique is of

Well, in Java, for instance, you have the whole JDK class library with vendor supplied types which are extensible, including their behavior. Building Entities out of the primitive types is quite a bit more restricted, more like structs than classes.

Okay. So your point is about the binding of functions to data structures, rather than about defining new types per se?

Actually what I've found is that it's pretty easy to run into joins that require a "table scan".

If the expression you're evaluating needs to inspect all the members of a collection, then the execution plan will need to inspect all the members of a collection. Sometimes table scans are what's required.

What would you say to someone who critiqued Java by saying that sometimes they had to iterate over an entire ArrayList?

More than, say, 5 tables?

Sure. Nine, ten, eleven, whatever. What is your concern? Have you ever had a case where methods of more than say five classes were on the stack at once? Would you accept it if I suggested that was a problem with OOPLs?

...I found that you really couldn't trust the optimizers past 4- or 5-table joins.

Again, I am unclear what you are critiquing here. Is it the relational model? The SQL standard? A specific DBMS you have used? "The optimizer" you are referring to above is only a feature of that last one.

Objections

Regrettably, your comments reflect the usual lack of knowledge prevalent in the industry about what the relational data model is and how vendors implement it (not as badly as one might think).

Well, in Java, for instance, you have the whole JDK class library with vendor supplied types which are extensible, including their behavior. Building Entities out of the primitive types is quite a bit more restricted, more like structs than classes.

There is both theoretical work ("The Third Manifesto"), as well as vendor implementation of user-defined types in the relational database. E.g. UDTs have been supported by Oracle and DB2 for at least a decade.

More than, say, 5 tables? It's been a while, but I found that you really couldn't trust the optimizers past 4- or 5-table joins.

You keep saying that, but it's actually not true at all. Oracle/DB2/Microsoft SQL Server can easily perform joins with dozens of tables -- it is nothing special. The 4-5 table join myth persists from the time when Sybase had that limitation, but that was like 15 years ago!

Please restrict the

Please restrict the discussion here to relevant programming language issues. Please keep the tone civil if possible.

"The relational model does

"The relational model does not support encapsulation"

Actually, I would say the relational model is exactly the opposite of encapsulation: the user is free to use the data in ways unplanned by the designer.

Encapsulation is better than nothing

Encapsulation is better than nothing. Certainly having private data members that are protected by setters and getters that can validate changes is better than unprotected structs. However, this manual facility is definitely not as good as having a declarative way to enforce integrity, such as that provided by the relational model.
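
For instance (a hypothetical schema), the declarative version states the invariant once and the DBMS enforces it on every write from every application, whereas a setter guards only one code path:

create table account (
    id      int primary key,
    balance numeric not null check (balance >= 0),
    owner   int not null references person(id)
);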

....Okay. So your point is

....Okay. So your point is about the binding of functions to data structures, rather than about defining new types per se?

Well, that's how you define a new type, right? You need data and operations. In the SQL Date objects, there are "fields" underneath the representation, right? But you can't subclass and have access to those, or define something similar yourself. Tables don't seem to be types to me.

...What would you say to someone who critiqued Java by saying that sometimes they had to iterate over an entire ArrayList?

Well, if they had to iterate over it (which in the RM would be the collection of all objects of that class) to find a contained object, that would be very unusual, and might cause the program not to perform adequately.

...Again, I am unclear what you are critiquing here.

It seems to me that some advocates of the relational model (which I actually think is very good) don't acknowledge that there are limitations imposed by that model. For instance if you always have a set of all instances of a particular type and you have to retrieve contained objects from that set, that can impose a performance penalty, which is important (and if this isn't theoretically a limitation, it seems to be in practice).

... Is it the relational model? The SQL standard? A specific DBMS you have used?

I feel the model has to stand by its implementations, so I am using a real-world example, in this case Oracle. And what I found was that you can't just start doing more than 4 or 5 joins without a good chance of having unexpected performance problems.

What are you looking for?

I feel the model has to stand by its implementations...

Then I'm not sure what you're expecting anyone to say. Proponents of the relational model are often the first to point out that most/all "implementations" of it, that is RDBMSs, are poor precisely because they don't actually implement the model. Rather they implement a subset, and then add their own features. I guess you could say the model has no implementations to stand on...?

I don't think any of this is on topic for LtU though, is it? You could make an analogy between the relation model and RDBMSs to a language and compilers, but I think that's stretching things a bit.

Pls. clarify what the

Pls. clarify what the following means:

For instance if you always have a set of all instances of a particular type and you have to retrieve contained objects from that set, that can impose a performance penalty, which is important (and if this isn't theoretically a limitation, it seems to be in practice).

An example of this would be nice:

And what I found was that you can't just start doing more than 4 or 5 joins without a good chance of having unexpected performance problems.

It is a very odd statement, perhaps reflecting a lack of real-life experience with the database. The Oracle optimizer is not perfect, but it is pretty good even with poorly written SQL, and "4 or 5 joins" is no problem for it, as I said before.

....Encapsulation is better

....Encapsulation is better than nothing. Certainly having private data members that are protected by setters and getters that can validate changes is better than unprotected structs. However, this manual facility is definitely not as good as having a declarative way to enforce integrity, such as that provided by the relational model.

I think encapsulation is a trade off, between the class designer and the class user. It seems to mostly be to the advantage of the designer, as it prevents him from being blamed if the client breaks his class, and also allows him to evolve his class without concern for uses he didn't anticipate. Unfortunately, for every library class designer, there are millions of users, so I would say that the user's interest in re-using the class in unforeseen ways outweighs the designer's interest. This seems to be the idea in the relational world, with emphasis on user understandable keys, and being able to create, retrieve, update, delete any data, subject to integrity constraints.

....It is a very odd

....It is a very odd statement, perhaps reflecting a lack of real-life experience with the database. The Oracle optimizer is not perfect, but it is pretty good even with poorly written SQL, and "4 or 5 joins" is no problem for it, as I said before.

An odd statement? It's my experience. I'll stop being lazy and check it myself, but I will be surprised if the optimizers have improved as much as you say.

....There is both theoretical work ("The Third Manifesto"), as well as vendor implementation of user-defined types in the relational database. E.g. UDTs have been supported by Oracle and DB2 for at least a decade.

Yes, I see they are there now in Oracle. Thanks for the pointer.

...Pls. clarify what the following means:

Well, this is the RM, right? If I want a Person object, I retrieve it out of the Persons table, which is the set of all Persons. So if I have a contained Person object, I will have to find it in that set, as opposed to more direct ways of retrieving it.

Relational model

Well, this is the RM, right? If I want a Person object, I retrieve it out of the Persons table, which is the set of all Persons. So if I have a contained Person object, I will have to find it in that set, as opposed to more direct ways of retrieving it.

The relational model deals with tuples, not objects. To get a specific tuple, or a subset of tuples, you'd use the relational algebra 'select' operator. I do not understand what you mean by "more direct ways of retrieving it".

"More direct way"

refers to holding a pointer which refers directly to the object (whether by machine address or some other means), rather than by constructing a query.

Of course, a fully-constructed (but not executed) query which retrieves the tuple in question is functionally equivalent to a pointer in many ways. Performance may be different, especially if the data lives in an RDBMS running in a different process or on a different machine than the application, but the effect is the same. (And the performance penalty for an inter-process query is paid regardless of the database technology.)

"More direct" refers to the fact that pointer deferencing often requires only a few machine instructions to "get to" an in-memory instance of some data, whereas a query will take longer to execute.

At any rate, this thread appears to be degenerating into Yet Another Object/Relational Flamewar. We only need tablizer to now make an appearance...

[Admin]

As far as I can see this discussion has gone totally off-topic for LtU. Please take it offline.

A critique of the SQL database language

In RM raison d'etre, vc wrote:

Unless I am mistaken, the implication is that there is a better alternative to the RM and the companies use RDBs just due to legacy reasons.

The actual reason is that there is simply no better alternative as yet. Surely, the network model is not one, as was discovered some 30 years ago (Communications of the ACM, Vol. 13, No. 6, 1970), for well-known reasons like application-dependent access paths, the lack of a query algebra and the associated problems, etc.

There is an interesting interview with Michael Stonebraker in the latest issue of ACM Queue. The interviewer is Margo Seltzer of Sleepycat Software:

SELTZER
One of the big arguments, if I recall correctly, was that you could prove things about the relational model. You could make strong mathematical statements. Is that important in building systems or in designing and developing this kind of database software?
STONEBRAKER
If you look at what Ted Codd originally did with the relational model, and you compare it with SQL, you can prove almost nothing about SQL. In fact, there's a terrific paper by Chris Date (A Critique of the SQL Database Language, ACM SIGMOD Record, 1984), that basically spent page after page, in area after area, explaining why SQL has terrible semantics. I think we've drifted far away from Ted Codd's original clean ideas.

Better data types

I first learned to play with databases on Access -- simple data models, almost no "application" layer. Then I got to work on a real project, using MySQL (and PHP), which left such a bad impression that I've resolved never to use them again. I then used Postgres, and loved the clean engineering, with the ability to build your own datatypes, operations on them, and even algebraic equations or custom indexes to optimize performance. However, you had to define the datatypes in C, bastardised with Postgres macros -- not horrific to most, but did I mention that I basically now program exclusively in Haskell? Experience with Haskell has taught me the value of a good datatype -- most of my programs begin life as a type definition. Making good types should be easy, not hard, and programmers should be encouraged to do so. The relational model, IMO, puts too much emphasis on its tuples, and not enough on what goes inside them. I'd love to see ML-style data definitions being allowed.
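
For what it's worth, later PostgreSQL releases do let you declare simple composite types in plain SQL (a minimal sketch; still a product type, not an ML-style sum):

create type complex as (re double precision, im double precision);

create table signal (
    id     int primary key,
    sample complex
);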