Database Abstraction Layers and Programming Languages

From time to time I like to return to the issue of database integration, only to once again remark that the difficulty of creating good database APIs (as opposed to simply embedding SQL) is the result of the poor programming facilities provided by most programming languages (e.g., no macros for syntax extension, no continuations or first-class functions to handle control flow, etc.).

Why return to this topic today? Jeremy Zawodny argues on his blog that Database Abstraction Layers Must Die!

Along the way he says,

Adding another layer increases complexity, degrades performance, and generally doesn't really improve things.

So why do folks do it? Because PHP is also a programming language and they feel the need to "dumb it down" or insulate themselves (or others) from the "complexity" of PHP.

Ouch!

Why do we need an abstraction layer anyway?

The author uses an argument I hear all the time: If you use a good abstraction layer, it'll be easy to move from $this_database to $other_database down the road.

That's bullshit. It's never easy.

Double ouch, but true enough. Databases are like women (can't live with them, can't live without), and getting rid of one can be as painful as divorce...

So what's the solution? Surprise, surprise: use a library. But isn't that an abstraction layer? Of course it is.

What Jeremy advocates is plain old software engineering and design. Everyone should do it. I can't believe anyone does anything else.

But wait. I just told you it's hard to build such a library, since programming languages make the design of such libraries hard (e.g., should you use iterators, cursors, or return record buffers? should your library's database access routines be as flexible as a select statement?). So we design libraries that aren't very good, but hopefully are good enough.
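
To make the design question concrete, here is roughly the kind of choice I mean, as a Java sketch (all the names are invented, and neither option is being endorsed):

    // Option 1: a cursor -- rows are fetched lazily and memory stays bounded,
    // but database concerns (advancing, closing, column names) leak to the caller.
    interface RowCursor extends AutoCloseable {
        boolean next();                      // advance to the next row, if any
        String getString(String column);     // read a column of the current row
        void close();
    }

    // Option 2: a record buffer -- the whole result is materialized up front
    // into a fixed, typed record.
    final class CustomerRecord {
        final int id;
        final String name;
        CustomerRecord(int id, String name) { this.id = id; this.name = name; }
    }

    interface CustomerQueries {
        RowCursor findCustomersByCity(String city);                       // streaming, untyped
        java.util.List<CustomerRecord> loadCustomersByCity(String city);  // typed, eager
    }

Neither shape is obviously right, and neither comes anywhere near the flexibility of an arbitrary select statement.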

And that's the question I put before you. We all know about coupling and cohesion. We all know about building software abstractions. Are our tools for building abstractions powerful enough for this basic and standard abstraction: the database access abstraction layer?

By the way...

1. SQL is itself, of course, a database access abstraction layer of the best kind: a language...

2. SchemeQL. Enough said.

Spurious Generality

Part of the fallacy here is that "database access" is a straightforward, homogeneous set of functionality.

DBMSs are all different, and individual schemas and configurations are all different; the commonality among them is pretty thin.

So trying to build a "one-size-fits-all" data access layer seems to me to be a fool's errand. (Not to say I've never been a fool myself ;-) )

On the other hand, abstracting data access for one's individual application is essential, IMHO. That is to say, at the higher levels of my app I should only know that there are "information getters": I give them the info I have and they return the info I need.

That this happens to come from a database is an implementation detail that I shouldn't be thinking about at that level.

This level of abstraction is already available in any language with modules, classes, and other implementation-hiding mechanisms.
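
For instance, the whole thing can be as plain as this (a Java sketch; the OrderInfo name, the orders table, and the SQL are all invented for illustration):

    import java.math.BigDecimal;
    import java.sql.*;
    import javax.sql.DataSource;

    // The rest of the app sees only this: give it the info you have, get the info you need.
    interface OrderInfo {
        BigDecimal totalOwedBy(String customerId);
    }

    // One implementation happens to be backed by a relational database,
    // but that is invisible to callers of OrderInfo.
    final class JdbcOrderInfo implements OrderInfo {
        private final DataSource ds;
        JdbcOrderInfo(DataSource ds) { this.ds = ds; }

        public BigDecimal totalOwedBy(String customerId) {
            String sql = "SELECT COALESCE(SUM(price - paid), 0) FROM orders WHERE customer_id = ?";
            try (Connection c = ds.getConnection();
                 PreparedStatement ps = c.prepareStatement(sql)) {
                ps.setString(1, customerId);
                try (ResultSet rs = ps.executeQuery()) {
                    rs.next();
                    return rs.getBigDecimal(1);
                }
            } catch (SQLException e) {
                throw new RuntimeException(e);
            }
        }
    }

A flat file or an in-memory fake could satisfy the same interface, which is the point.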

Thanks

Thanks for the compliment...

I think experience with complex applications like the ones I have in mind might have led you to a better understanding of the generality required from the kind of database access layer I am thinking about. It's not one-size-fits-all, but it's also very far from simple. I am talking about one application, but an application that provides flexible services and easy extensibility.

And by the way, you can build such a library even in Assembler. But that doesn't mean that's the best language choice, so discussing language features is relevant as usual.

Uncomplimentary?

Thanks for the compliment...

I must be missing something here...

database access layer

Perhaps we need to be more explicit about what we mean by this expression. It is commonly overloaded to mean both a general framework for database access, and an application-specific library for handling the app's data needs.

Perhaps you even mean something else.

In my experience, the frameworks (e.g. Hibernate) simply pass the complexities off to a "configuration" level, and often hobble what you can do with your DB to a common denominator of functionality.

The best thing we could do to help with this problem would be to define a truly standard data language that all DBs used, though history hasn't been kind to this idea.

I think I know what it is

I must be missing something here...

I am very thin-skinned when I think I am being accused of "Spurious Generality". It just happens so often that it gets on my nerves...

Apologies

I apologize if my post seemed hostile. It was not intended that way, nor was I directing my comments at you individually.

I suppose "Spurious Generality" IS a bit grumpy-sounding. I guess it is just my (not-so) inner curmudgeon shining through. ;-)

Refuting Zawodny

In my experience, the frameworks (e.g. Hibernate) simply pass the complexities off to a "configuration" level, and often hobble what you can do with your DB to a common denominator of functionality.

I was going to raise Hibernate as a counter-example to Zawodny's claims.

Although it requires a configuration level to define the mapping between language objects and the database schema, the point is that the mapping is defined in one place, and then doesn't need to be dealt with all over the application.

This provides the opposite of what Zawodny is claiming: it reduces complexity and clutter at the application level, it's roughly neutral on application performance, and it's a net win for developer productivity. It generally does improve things. In addition, contrary to Zawodny's claim, it does provide independence from the underlying database, and the degree of independence can be controlled by the developers - it can be a tradeoff, but it's a controllable one.
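
To make that concrete, here's a sketch against a recent Hibernate version, using JPA annotations where the configuration-file setup mentioned above would use XML mappings (the Order class, ORDERS table, and hibernate.cfg.xml contents are all invented):

    import java.util.List;
    import javax.persistence.*;
    import org.hibernate.Session;
    import org.hibernate.SessionFactory;
    import org.hibernate.cfg.Configuration;

    // The table/column mapping is declared exactly once, on the class
    // (the same information could live in an Order.hbm.xml file instead).
    @Entity
    @Table(name = "ORDERS")
    class Order {
        @Id
        @GeneratedValue
        @Column(name = "ORDER_ID")
        private Long id;

        @Column(name = "PRICE")
        private double price;

        @Column(name = "PAID")
        private double paid;

        // Hibernate accesses the fields directly; accessors omitted here.
    }

    class UnpaidOrdersExample {
        public static void main(String[] args) {
            // Connection details are assumed to live in hibernate.cfg.xml.
            SessionFactory factory = new Configuration()
                .configure()
                .addAnnotatedClass(Order.class)
                .buildSessionFactory();

            try (Session session = factory.openSession()) {
                // Application code talks about Order objects; no table or column
                // names appear outside the mapping above.
                List<Order> unpaid = session
                    .createQuery("from Order o where o.price > o.paid", Order.class)
                    .list();
                System.out.println(unpaid.size() + " unpaid orders");
            }
            factory.close();
        }
    }

If a column is renamed, the annotation (or the one mapping file) is the only thing that changes; the query and the rest of the application stay put.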

However, it appears Zawodny is talking about PHP, and no one is solving the sorts of problems with PHP that are being solved with Java and Hibernate. I can well imagine that the limitations of PHP could make developing a credible database abstraction layer problematic.

The best thing we could do to help with this problem would be to define a truly standard data language that all DBs used, though history hasn't been kind to this idea.

Tools like Hibernate ultimately can't completely eliminate the semantic gap between the programming language and the data query language. Having a better standard data query language won't really help here.

From the perspective of a developer in a given language, a much better solution is to have the data query language be more tightly integrated into the programming language. SchemeQL provides an example of this, but it's still hobbled by having to deal with SQL on the back end, and as a result it isn't really able to rise above SQL in terms of the abstraction power that can be achieved in database queries. So you're still stuck with the basic model of doing a query within the pragmatic limits of the query language, then further processing the results in the programming language, with a sharp line between what can be done easily in each environment.
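
In practice the split looks like this (plain JDBC; the orders table and the overdueByPolicy rule are invented for illustration):

    import java.math.BigDecimal;
    import java.sql.*;
    import java.time.LocalDate;
    import java.util.ArrayList;
    import java.util.List;

    // The query language handles the part it is comfortable with (the price > paid filter);
    // the irregular business rule runs afterwards in Java.
    class ReportSplit {
        static List<String> flaggedOrders(Connection c) throws SQLException {
            List<String> flagged = new ArrayList<>();
            try (PreparedStatement ps = c.prepareStatement(
                     "SELECT order_id, duedate, price, paid FROM orders WHERE price > paid");
                 ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // This side of the line never makes it into SQL.
                    if (overdueByPolicy(rs.getDate("duedate"),
                                        rs.getBigDecimal("price"),
                                        rs.getBigDecimal("paid"))) {
                        flagged.add(rs.getString("order_id"));
                    }
                }
            }
            return flagged;
        }

        // Stand-in for a rule too irregular to express comfortably in the query language.
        static boolean overdueByPolicy(Date due, BigDecimal price, BigDecimal paid) {
            BigDecimal outstanding = price.subtract(paid);
            return due.toLocalDate().isBefore(LocalDate.now().minusDays(30))
                || outstanding.compareTo(new BigDecimal("1000")) > 0;
        }
    }

Everything inside overdueByPolicy is invisible to the query planner, and everything inside the SQL string is invisible to the compiler; that is the sharp line an SQL-backed embedding inherits.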

The problem with the integrated language solution is that it would have to be different for each language, and it would, as Ehud points out, require significant improvements in the programming languages themselves. Which means that, short of an amazing stroke of genius and luck, we won't see any movement in this area for a long time to come.

Pragmatics

Imagine, if you will, a system with 500 tables, each with 20 columns. How many access routines are you willing to have in your API? Design something that makes sense.

And oh, let's make it type-safe if we can...

I don't know about the scalability you're asking for but...

HaskellDB is type-safe, flexible, and reasonably efficient. It's not really industrial-strength yet (you have to recompile any time you change the database, and error messages aren't terribly clear), but it does do away with the "impedance mismatch".

Pragmatics

I thought about this for a while; my first answer was code generation, and my second was an embeddable query language. That's quite a fork in the road.

One of the problems with code generation (at least the naive solution I had in mind) is that it would result in a "flat" API, whereas with that many tables it's likely that there are all sorts of module boundaries that could be introduced; probably multiple layers of modularity, in fact. The way it breaks down is unlikely to be obvious, and there may not be an unequivocally "right" way to do it, so there's (hard) design work to be done.

Even if you can use code generation to magic up a huge array of type-safe data access methods, you still have the problem of how to express the larger-scale design. So code generation moves the problem into the domain of the PL (you've "hidden" the connections, cursors, queries and so on), but doesn't make any of the complexity go away, or even make it significantly more tractable. What's more, there are issues of policy (what kind of cursor or locking to use, transaction boundaries, etc.) that may not be decidable globally and once and for all, so the "client" of the API may still have to deal with the low-level specifics of data access. It starts to look like generating all those methods hasn't really bought you all that much.
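
For concreteness, the kind of thing the naive generator I had in mind would spit out looks like this (all names invented):

    import java.math.BigDecimal;
    import java.sql.*;

    // One of ~500 generated classes; nothing in it says how it relates to the other 499,
    // and the caller still supplies the Connection -- and with it every policy decision
    // about transactions, locking, and cursors.
    final class OrdersTable {
        static OrderRow findByOrderId(Connection c, long orderId) throws SQLException {
            try (PreparedStatement ps = c.prepareStatement(
                     "SELECT order_id, price, paid FROM orders WHERE order_id = ?")) {
                ps.setLong(1, orderId);
                try (ResultSet rs = ps.executeQuery()) {
                    if (!rs.next()) return null;
                    return new OrderRow(rs.getLong(1), rs.getBigDecimal(2), rs.getBigDecimal(3));
                }
            }
        }
        // ...plus findByCustomerId, updatePaid, deleteByOrderId, and so on,
        // repeated across 20 columns and 500 tables.
    }

    record OrderRow(long orderId, BigDecimal price, BigDecimal paid) {}

Multiply that by 500 tables and the type safety is real, but the design questions -- where the module boundaries go, who owns the Connection and the transaction -- are all still open.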

Code generation--

I, too, wouldn't choose code generation for this problem.

I don't know...

Something in your message sounded insulting to me. I wrote something insulting back, but edited it on reflection. Bygones, I hope.

You are right that without talking about specifics this discussion leads nowhere. I am talking about an application-specific data access layer, but with a mind to extend it to something more general. I am not talking about any existing framework, or about available frameworks in general.

How would you write this data access library so that changes in the db and/or application logic will be easiest to handle?

An example structure

How would you write this data access library so that changes in the db and/or application logic will be easiest to handle?

That's the tricky part, and partly why I'm not so hot on Hibernate and friends; things are always changing. ;-)

The pattern I'm using these days with good success looks like this (in Java):

There is a family of classes that encapsulate statements in the database language. These know how to set any variables in the statement, including any necessary conversions. At this level you "speak the database's language", i.e., the programmer simply uses their understanding of the specific DBMS and schema to get the result they want. A DBA can help here if need be.

Another family of classes knows how to convert the results of queries into application domain objects. These are often surprisingly simple and quick to write.

The actual use of these statements and converters is hidden from the rest of the app by "loaders" and "savers". The rest of the app doesn't have to know anything about the database (even whether it exists or not). This helps to reduce the effects of change both ways across the barrier.

Yes, there are a lot of these classes with a large database, but each class is quite simple, and our team can use existing Java knowledge and database knowledge to quickly change them as needed. We routinely have major schema and app changes and can generally synch them up quite quickly.

There is no need to encode the metadata from the database in some third configuration language or format. I don't have to make any assumptions about how tables and columns map to app data structures or types; I just write simple converters as needed.
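
In outline, the three families fit together something like this (a trimmed-down sketch with an invented orders table and invented names, not production code):

    import java.math.BigDecimal;
    import java.sql.*;
    import java.util.ArrayList;
    import java.util.List;
    import javax.sql.DataSource;

    // Family 1: a statement class -- speaks the database's language and sets its own variables.
    final class UnpaidOrdersStatement {
        private static final String SQL =
            "SELECT order_id, price, paid FROM orders WHERE customer_id = ? AND price > paid";
        ResultSet run(Connection c, String customerId) throws SQLException {
            PreparedStatement ps = c.prepareStatement(SQL);
            ps.setString(1, customerId);
            return ps.executeQuery();   // the statement is closed when the connection closes
        }
    }

    // Family 2: a converter -- turns result rows into application domain objects.
    final class OrderConverter {
        Order convert(ResultSet rs) throws SQLException {
            return new Order(rs.getString("order_id"),
                             rs.getBigDecimal("price"),
                             rs.getBigDecimal("paid"));
        }
    }

    // Family 3: a loader -- the only thing the rest of the app ever sees.
    final class OrderLoader {
        private final DataSource ds;
        OrderLoader(DataSource ds) { this.ds = ds; }

        List<Order> unpaidOrdersFor(String customerId) {
            List<Order> result = new ArrayList<>();
            try (Connection c = ds.getConnection();
                 ResultSet rs = new UnpaidOrdersStatement().run(c, customerId)) {
                OrderConverter converter = new OrderConverter();
                while (rs.next()) result.add(converter.convert(rs));
                return result;
            } catch (SQLException e) {
                throw new RuntimeException(e);
            }
        }
    }

    // The domain object itself knows nothing about the database.
    record Order(String id, BigDecimal price, BigDecimal paid) {}

When the schema changes, the statement and converter for the affected tables change; when the domain model changes, the loaders and converters change; the rest of the app is untouched either way.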

Business Software Coops

Somewhat related: Shareware Grows Up. There is a similar government coop too.

Relational Databases for Normal People

Article by David Coursey.

RDBMS lost to marketing?

Success is a lousy teacher (to paraphrase Bill Gates), and therefore we should not conclude that the database system, as the unit of engineering, deploying, and operating packaged database technology, is in good shape.

The paper Rethinking Database System Architecture: Towards a Self-tuning RISC-style Database System claims that RDBMSes (I mean the marketed implementations) themselves are becoming a choking morass for developers.

Abstraction layers are, at best, an LCD approach

My biggest problem with nearly all of the data access libraries/layers is that they don't make much use of what the database does well, which usually includes:

* Replication

* Constraints

* Stored procedures

Admittedly, these make data access even less portable between databases, but there are good reasons that databases were invented, and they can handle huge applications.

The other thing that ends up bugging me far more than it should is that databases were originally intended to serve as an external data store that could be shared among a number of related applications. Most data access libraries are used, if not intended, solely as a way to serialize and store objects for a single application. Very little thought is given to making the data useful outside the application, even though, in my experience, data and metadata (broadly construed) frequently outlive the application that generated them.

Right tool for the right job

I think there is a general misunderstanding in the software industry of the purpose of a DBMS, or at least an RDBMS. As you say, there are specific features that they have by definition (though a good definition eludes me right now. Anyone?). One such feature is application independence, which among other things enables reporting applications to do their job.

Many applications have simpler requirements that do not call for a DBMS, and in such cases a DBMS's extra features become a source of complications. For example, if an application simply needs to persist objects, then a database will always be a poor choice no matter how good a framework there is. I think it's a lack of recognition of this that leads to the ridiculous claims that Prevayler made (makes?), for example. It may or may not be an excellent persistence engine, but comparisons to databases are apples and oranges. That it exists at all I see as proof of a common misuse of databases.

To come back to Ehud's original question, I think even the common commercially used languages have sufficient power of abstraction to deal with databases or other lesser (in terms of features) persistence engines. I'm not sold on code generation, though. I also think there is a lot more confusion and misunderstanding about data and data storage right now than there is about programming languages and language design (the cynic in me points to the far greater role of industry in the database field than in PLs).

Architecture

The "separate application and database" architecture is the root of all these problems, right? The application itself is excluded from participating in the database transactions because of latency, and that's why we need a separate batch-mode query language to tell the database how to do everything itself. That's a lot of added complexity.

On the other hand, if the database is a part of your application then your queries are just subroutines and your persistent/transactional records are just abstract data structures. That's lovely. Mnesia is like this, and I have the impression that others like Berkeley DB and Gemstone are too.
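
To show the shape I mean in Java terms (a toy in-process sketch; a real system would use an actual storage engine like the ones above rather than a bare map):

    import java.math.BigDecimal;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Persistent and transactional in a real system; here just an in-process map
    // standing in for the record store.
    final class OrderStore {
        record Order(String id, String customerId, BigDecimal price, BigDecimal paid) {}

        private final Map<String, Order> orders = new ConcurrentHashMap<>();

        void put(Order o) { orders.put(o.id(), o); }

        // "Queries are just subroutines": no query language, no wire protocol, no batching.
        List<Order> unpaidOrdersFor(String customerId) {
            return orders.values().stream()
                .filter(o -> o.customerId().equals(customerId))
                .filter(o -> o.price().compareTo(o.paid()) > 0)
                .toList();
        }
    }

The "query" is an ordinary method the compiler can check and the profiler can see, and the records are ordinary data structures.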

So why d'yall use these SQL servers, anyway?

Safety

The application itself is excluded from participating in the database transactions because of latency

Safety is a key reason for the separation. Moving the db to its own process isolates it from all the wacky things your app may be doing with threads and memory.

Safety

True, and definitely an important consideration. But this safety can also be provided at the language-level, as with the Mnesia database in Erlang, to keep the benefits of integration.

'Cause we have to...

So why d'yall use these SQL servers, anyway?

- Because there is already significant investment in them in many organizations.
- Because someone wants to use standard reporting tools against them.
- Because the organization wants to store data that is used across applications and organizational divisions.
- Because in many organizations applications come and go, but data must live forever.
- Because your boss/client/grand-high-poobah said so. ;-)

Open-source Databases

Indeed, I was wondering whether open-source databases like Firebird, PostgreSQL, and MySQL might permit that sort of thing, heretofore restricted by closed-source distribution.

Enterprise Architecture

I think I may have mentioned this book before, but it is worth mentioning again, especially since it talks about these issues at length:

Fowler, Patterns of Enterprise Application Architecture

Some may be put off that he is an "OO guy", but the concepts he discusses can be easily generalized to other styles.

I think he does a good job of laying out the options and discussing their trade-offs. I don't agree with him 100%, but I think anyone interested in this problem domain would get something out of reading it.

yuck

Double ouch, but true enough. Databases are like women (can't live with them, can't live without), and getting rid of one can be as painful as divorce...

I have no use for sexist crap like this in a programming weblog.

To Map or Not to Map

First of all, having spent about eight years developing server-side Java with a variety of data access approaches, ranging from hand-rolled to J2EE BMP and CMP to totally dynamic/metadata-driven approaches to Hibernate, I can say with considerable confidence that if you aren't using Hibernate you're working too hard, and I'd be willing to bet money that your solution won't scale as well as Hibernate does. Yes, I include hand-written JDBC code in that claim.

The claim that tools like Hibernate enforce a lowest-common-denominator approach to data access reveals the ignorance of the writer: even if you've never used Hibernate, a quick look at the documentation reveals that it supports a wide array of association types, a two-level caching regime, lazy or eager associations, a query language, stored procedure access, and much more. There are plug-ins for popular IDEs to facilitate the development and maintenance of the configuration files, which are relatively straightforward but nevertheless very powerful.

It's no accident that Hibernate concepts inform the design of the emerging EJB 3.0 specification.

But all of that's beside what seems like the more interesting point, which is: why isn't there better support for data manipulation at the language level? Along those lines, I recommend that interested parties see <http://equi4.com/vlerq.html>. This effort combines Jean-Claude Wippler's decades of work on the MetaKit database engine and a variety of vectorized processing engines to craft a unified whole. It should appeal both to people interested in tiny, embeddable languages and those interested in what could happen if the relational calculus were taken seriously.

Just felt like mentioning...

you may even point out the Cr

you may even point out the Criteria library. Criteria allows you to write things like:

    idx1 = SQLTable.new("orders")
    q1 = (idx1.price > idx1.paid) &
         (idx1.duedate < ...)        # right-hand side truncated in the original

    q1.limit = 5
    q1.order = :ASC
    q1.order_by = idx1.name, idx1.age

    q1.select()

and spit out SQL92 like:

    SELECT * FROM orders WHERE ((orders.price > orders.paid) AND (orders.duedate < ...))

Note that Ruby does not have macros, and that Criteria does not let you do everything (but I may be wrong about this).

What does everyone do about very complex reports?

I hope this is on topic - I'm assuming this issue is likely to come up for a lot of people - perhaps even the original poster.

The application I'm working on has reports that examine and summarize large numbers of records, and complex business logic is involved. I would love to move the records out of the database, convert them to objects, and put the business logic in that layer, but I highly doubt the performance would be sufficient.

Choice two would be to generate the query using objects, have them inject any needed business logic into the query, and send that to the server. However, the tools I've seen for this are a lot more limited than writing stored procedures directly on the server (which is poison, architecturally).

I would like to find a tool that lets you build up an SQL query using the relational calculus and can then generate the query text. Something along the lines of Query.AddTable("foo"); Query.JoinNew("bar", "foo", JoinType.Inner)... I believe I could layer the reporting on top of such a tool. I am working in C#. Am I going to have to write this tool myself? Does anybody know of such a tool? It would have to be extensible, or else very, very complete, since I will probably end up needing pretty much the full power of the underlying DB's (SQL Server) stored procedures.
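
To be clearer about what I'm after, even something this small would be a starting point (sketched in Java only because it's shorter to write here; I'd do it in C#, and all the names are made up):

    import java.util.ArrayList;
    import java.util.List;

    // A minimal query builder: accumulate relational operations, then render the SQL text.
    final class Query {
        private final List<String> tables = new ArrayList<>();
        private final List<String> joins = new ArrayList<>();
        private final List<String> predicates = new ArrayList<>();

        Query addTable(String table) { tables.add(table); return this; }

        Query innerJoin(String newTable, String onCondition) {
            joins.add("INNER JOIN " + newTable + " ON " + onCondition);
            return this;
        }

        Query where(String predicate) { predicates.add(predicate); return this; }

        String toSql() {
            StringBuilder sql = new StringBuilder("SELECT * FROM ")
                .append(String.join(", ", tables));
            for (String j : joins) sql.append(' ').append(j);
            if (!predicates.isEmpty()) {
                sql.append(" WHERE ").append(String.join(" AND ", predicates));
            }
            return sql.toString();
        }
    }

    // Mirroring the calls sketched above:
    //   new Query().addTable("foo")
    //              .innerJoin("bar", "bar.foo_id = foo.id")
    //              .where("foo.total > 0")
    //              .toSql();
    //   => SELECT * FROM foo INNER JOIN bar ON bar.foo_id = foo.id WHERE foo.total > 0

Building text this way buys composability, not safety, and anything approaching the full power of stored procedures still means embedding raw SQL fragments, as the where(...) call does here - which is partly why I'm asking whether a more complete tool already exists.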

For me database independence is less of a concern than it could be, especially since I'm actually trying to improve an application that has most of the business logic in SQL Server stored procedures, so there is no independence at all now.

I'd also welcome commentary on whether this seems like a good design.