Viability of a static type system (like ML) for a relational language?

I'm building a relational language. I like the idea of a static type system, but after toying with the interpreter for a while I wonder how feasible it is.

Here, I think a "type" is the whole heading of the relation (even scalars like "1" are relations), and most things work out fine, but the join operator causes trouble: it may or may not merge headings depending on the kind of join. So what I'm saying is that I feel the relational model generates new types on the fly, which can't be specified in advance.

So, if I have queries like:


city = Rel[id:Int name:Str population:Int]

city where .id=1              -- type unchanged: Rel[id:Int name:Str population:Int]
city select [.id]             -- type: Rel[id:Int]
city select [toString(.id)]   -- type: something like Rel[id:Str]
city union city               -- type unchanged: Rel[id:Int name:Str population:Int]

I see the type of each relation change. It's impractical to write out every combination by hand, but I like the idea of typing most of it. Can the compiler be made to work with this, similar to ML?

How do the SQL languages work? Are they dynamic?

Row polymorphism

I suspect what you want can be expressed with row polymorphism, at least up to dependent types.

If I understand your language:

city where .id=1 means "all the cities x for which x.id=1"

city select [.id] means "all the records x of type [id:Int] for which there exists a city y such that y.id=x"

Here the type is implicit in your language but I made it explicit in the description.
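
To make that concrete, here's a rough sketch of those two operations as Haskell list comprehensions (the City record is purely illustrative):

data City = City { cityId :: Int, cityName :: String, cityPopulation :: Int }
  deriving (Eq, Show)

-- city where .id=1 : the heading is unchanged
whereId1 :: [City] -> [City]
whereId1 cities = [ y | y <- cities, cityId y == 1 ]

-- city select [.id] : the heading shrinks to [id:Int]
selectId :: [City] -> [Int]
selectId cities = [ cityId y | y <- cities ]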

So now your problem is join. If I understand correctly there are two joins: an exterior product type and an interior product type (on reading the Wikipedia article there are even more ... :)

The exterior (cross) join is just the Cartesian product. The interior join has the same type; however, pairs of records in which a nominated field, occurring in both records, has unequal values are thrown out.

And now there is a related type, in which one of the duplicated fields is thrown out, since both values are now equal. So the types:

exterior product : [a:A; b:B] x [a:A; c:C] -->
[left: [a:A; b:B]; right: [a:A; c:C]]

interior product: [a:A; b:B] . [a:A; c:C] -->
[a:A; b:B; a:A; c:C]

unified interior: [a:A; b:B] o [a:A; c:C] -->
[a:A; b:B; c:C]
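
A rough Haskell rendering of those three signatures, with concrete record types standing in for the rows (all names are illustrative; a real encoding would use row types):

data AB  = AB  { abA :: Int, abB :: String } deriving (Eq, Show)
data AC  = AC  { acA :: Int, acC :: Bool }   deriving (Eq, Show)
data ABC = ABC { a :: Int, b :: String, c :: Bool } deriving (Eq, Show)

-- exterior (cross) product: every pair, both sides kept whole
exterior :: [AB] -> [AC] -> [(AB, AC)]
exterior xs ys = [ (x, y) | x <- xs, y <- ys ]

-- interior product: keep only the pairs whose shared field agrees
interior :: [AB] -> [AC] -> [(AB, AC)]
interior xs ys = [ (x, y) | x <- xs, y <- ys, abA x == acA y ]

-- unified interior: drop the duplicated field, since both values are equal
unified :: [AB] -> [AC] -> [ABC]
unified xs ys = [ ABC (abA x) (abB x) (acC y) | x <- xs, y <- ys, abA x == acA y ]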

If you use row polymorphism in the style of Daan Leijen

http://research.microsoft.com/pubs/65409/scopedlabels.pdf

(which my language Felix does), then you get an ordered join type like the interior product type above, with duplicated fields; however, a reference to a duplicated field returns the first field only, so the apparent type is that of the unified interior (a left join).

You may want to download Felix and try it out, since the static typing is already implemented. In fact it may be possible, since Felix has a user-defined grammar, to define your language in Felix as a Domain-Specific Sub-Language. Felix also happens to include SQLite 3 built in, but the current access is via C-level calls. If this works ... well, forget the interpreter: you get compiled binaries.

HList

See the HList paper, which shows the start of how to embed typed relational algebra in Haskell. Generally HM types, type classes with multiple parameters, and functional dependencies are sufficient (and type families may be used instead of functional dependencies).
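
For readers who haven't seen the paper: the core device is a list whose element types are tracked in its own type. A minimal modern sketch of the idea (not the library's actual definitions):

{-# LANGUAGE DataKinds, GADTs, TypeOperators, KindSignatures #-}
import Data.Kind (Type)

data HList (ts :: [Type]) where
  HNil  :: HList '[]
  HCons :: t -> HList ts -> HList (t ': ts)

-- a "tuple" whose heading is carried entirely in the type:
cityRow :: HList '[Int, String, Int]
cityRow = HCons 1 (HCons "Moskva" (HCons 11503501 HNil))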

There is a more complete implementation, using some prototype code that deals with the complete relational algebra, which I can dig out if anyone is interested.

HList paper focussed on SQL

I have had a quick look at the paper you specified, and I see it as somewhat problematic in that it is dealing with SQL (specifically SQL92), which is not relational; as in, no version of SQL is relational. It fails to meet the various restrictions or provide the facilities required by a relational algebra.

For your extended version that you mention above, do you handle transitive closure as part of your design, as well as treating each tuple as an unordered set of attribute values within the unordered set of tuples defining the relation?

I have a problem with the design using lists at the user level, as this tends to suggest that the tuples and the relations are ordered rather than unordered.

If we have two tuples {a:5, b:"a"} and {b:"a", a:5}, the first thing is that they are equal. They are the same tuple, since it is irrelevant in what order the attributes are displayed. Secondly, the relation that arises from these two tuples is a relation that has only one tuple in it. If we create two relations, using the first tuple in one and the second in the other, then both relations are equal. Does your HList implementation ensure that this is true?

Lastly, does the implementation automatically assume that all attributes contribute to the "key" if no smaller set of attributes is specified?

Archaeology

I have done some digging through my old code, and have started to update it to work with a recent release of GHC, as some things have changed. I am going to put it on GitHub when it is tidied up.

To answer the question: it uses the Type-Indexed-Rows code from the HList paper (at least one version does; there is an older one as well). It does a kind of row polymorphism, in that the order of columns does not matter. No explicit equality between type-indexed rows is used, but the implicit equality (matching the query against the schema) should be row-polymorphic too.

Here's a sample query from a well known example:

moveContaminatedAnimal :: DAnimalType -> DCntdType -> DFarmId -> Query ()
moveContaminatedAnimal animalType contaminationType newLocation = do
    a1 <- table animalTable                                          -- open the animal table
    a2 <- restrict a1 (\r -> r!AnimalType `SQL.eq` animalType)       -- animals of the given type
    c1 <- table contaminatedTable                                    -- open the contaminated table
    c2 <- restrict c1 (\r -> r!CntdType `SQL.eq` contaminationType)  -- matching contamination type
    j1 <- join SqlInnerJoin a2 c2 (\r -> r!AnimalId `SQL.eq` r!CntdAnimal)  -- join on the animal id
    j2 <- project j1 (CntdAnimal .*. HNil)                           -- keep just the id column
    doUpdate animalTable                                             -- move the matching animals
        (\_ -> AnimalLocation .=. toSqlType newLocation .*. HNil)
        (\r -> r!AnimalId `SQL.elem` relation j2)

All the types are inferred from the code, but, for example, join' from the TIR version has this type:

join' :: (TIR.Union t u v, Monad m, MonadState SelectState m)
    => SqlRelationOp -> SqlReference t -> SqlReference u -> (v -> SqlTyped Bool) -> m (SqlReference v)

Edit: The HList paper calls them Type-Indexed Products (TIPs), I think.

HList paper covers a lot more than SQL

Hi Bruce, I suggest you take a much deeper look at HList; particularly Type-Indexed Products and type-labelled lists, which are type-level attribute name-value pairs.

The HList paper covers a lot of ground, and Keean is quite right to volunteer it. (Thank you again Keean, Oleg and Ralph for your pioneering work over a decade ago.)

The part of the HList paper covering SQL is more of a proof of capability that the ideas can adapt to SQL. I'd say "despite" SQL being only somewhat relational.

Re Bruce's question of TTM-style tuples (or relations) being order-independent: yes, HList techniques can easily check for equality independent of order. (But note this isn't native Haskell equality-testing.) Alternatively, HList could re-sequence the attributes within one list according to the attributes in another. Then you could use Haskell's native equality-testing.

To implement a statically type-safe relational model, I've used type-labelled HLists to represent tuples; then the re-sequencing trick above to get the tuples to the same native Haskell type; then Haskell's Set library to hold the set-of-tuples aka relation.
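
A stripped-down sketch of that last step (plain Haskell stand-ins, not the HList encoding): once tuples are normalised to a single attribute order they share one native Haskell type, and Data.Set can hold the relation, collapsing duplicates for free.

import qualified Data.Set as Set

newtype Id a   = Id a   deriving (Eq, Ord, Show)
newtype Name a = Name a deriving (Eq, Ord, Show)

type CityTuple = (Id Int, Name String)

city :: Set.Set CityTuple
city = Set.fromList [ (Id 1, Name "Moskva")
                    , (Id 1, Name "Moskva") ]   -- duplicates collapse: Set.size city == 1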

HList keys to a relation

To answer Bruce's question about keys: look at the Coddfish project (which is sadly now defunct). It is built on top of HList.

Since attributes are types (as type-level indexes to HLists), "keys" are part of the type of a relation. (They're "phantom types" in Haskell or ML parlance.) Now the operators on relations can do type-level inference for the keys of the result relation. [As TTM asks for.]
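
A rough sketch of the phantom-type idea (illustrative only, not Coddfish's actual encoding): the key attributes ride along in the relation's type and cost nothing at run time.

{-# LANGUAGE DataKinds, KindSignatures #-}
import Data.Kind (Type)
import qualified Data.Set as Set

data Id    -- attribute names used purely as type-level labels
data Name

newtype Rel (keys :: [Type]) tuple = Rel (Set.Set tuple)

-- Id is declared as the key of this relation, in the type alone:
type City = Rel '[Id] (Int, String)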

Coddfish actually held Functional Dependencies at type level, which meant that FD inference was more tractable.

Building a Relational Language?

First a simple question and then a comment.

The question for your consideration is: What relational algebra are you basing your language on? If it is SQL, then go read TTM (the Third Manifesto) for a possible starting area and then get involved with the TTM group at TTM@thethirdmanifesto.com.

It may well give you an insight into the complexity of the problem space. There are a number of people there who can give you some partial advice on your development work.

However, there are a variety of different opinionated points of view there and you will need to determine for yourself what you want to have applicable to your situation. Though I still have the messages downloaded to my mail system, I don't get involved with any of the discussions at this time.

Yep I have read about TTM,

Yep, I have read about TTM, Trans-relational, and a lot of related stuff. I don't want to make another SQL, but a proper relational language (relational bags, not sets).

I *wish* to recreate the experience of the dBase/Fox family, only this time not tied to a specific storage engine.

I too am building a relational language, called Andl

My relational language is called Andl. See here. Welcome to the club.

Yes, static typing is perfectly possible. The join operator does indeed create a new relational type, as does any projection. Andl does full type inference on those types.

Good answers to your questions about types may be found in the Third Manifesto. See here.

You might want to join the mailing list.

From what I recall of the

From what I recall of TTM, I don't remember it saying much about types beyond that a relation is a heading * set of tuples. Each "column" has a type, but I don't remember anything more that would be useful for implementing the type checker of a language.

But the answers here make a lot of sense. A compiler with static types really has two evaluators: one for the values and another for the types, and in this case it needs to compute the types just as it computes the values. It's kind of obvious, because type checking is another pass ;)

What throws me a little is that I don't like type inference; not because it isn't amazing, but because the experience for the user is unfriendly (IMHO):

The types are solved, but stay visually hidden from the user (until they try to use the IDE to tell them what something is).

So I prefer to annotate the types in an explicit way, or be fully dynamic (Pascal and Python are my favourites!).

But the requirement to build types on the fly works against pre-typed code and pushes towards inference or dynamic typing. :(

Explicit types in relations

Here's a user-friendly way to make types explicit, and type inference tractable:


city = Rel[ (Id 5, Name "Moskva", Population 11503501) ]
-- those attribute names are newtypes
newtype Name a = Name a

(You could put `.=` between each attribute name and value, if you really want to mimic relational languages; but it would just be function application.)

The newtypes are punning the attribute name that the user sees/enters with the type name. And the newtype gives us type-level labelling/indexing into the tuple. And because it's a newtype we don't pay any run-time cost.
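
For concreteness, here's one runnable rendering of the sketch above in plain Haskell (the Rel wrapper and the extra newtypes are my illustrative additions):

newtype Id a         = Id a         deriving (Eq, Show)
newtype Name a       = Name a       deriving (Eq, Show)
newtype Population a = Population a deriving (Eq, Show)

newtype Rel t = Rel [t] deriving (Eq, Show)

city :: Rel (Id Int, Name String, Population Int)
city = Rel [ (Id 5, Name "Moskva", Population 11503501) ]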

There's a handy library on Hackage that will convert 'flat' tuples to HLists and back again.

Joins

I think the point was that join needs to take two relation types and return a new relation type composed from the input relation types, and that writing out the full types of the relations makes for a poor programming experience. Doing this without boilerplate requires type inference, which some people don't like.

I get the impression the OP is looking for a way of typing relations that does not involve inference or long types. I am afraid that in this regard they are out of luck. The type of a relation is the type of all its columns, and a join really does return a new type.

The only choice is to write the relation types in full everywhere, or learn to accept type inference.

Well, maybe the answer is

Well, maybe the answer is to also provide relational operators for the types, so for example:


fun InvoiceAndCustomers() = Customer UNION Invoice do
    return query {customers join invoice select customers.*}

A bit less verbose, but still kind of duplicated.

More Explicit

Yes, that is more explicit than relying on type inference.

Having recently been doing some work in Python, coming from Rust immediately before, I found it tedious not knowing what type a function returned without looking it up. This was made harder by the fact that the import statements gave no hint of which one was adding which class to the scope.

So for me your solution might be acceptable: you don't even need relational operators, just 'set' operators and type-level sets. It would all depend on being able to quickly find the type-level operator definitions.

If you look at my Haskell code linked below, you will see all top-level functions have type signatures, and they use almost exactly this idea, except the type-level operators are type classes with functional dependencies (you could also use associated types). So Haskell can infer the type of a join, or you can annotate it, as you prefer.
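
For example, a type-level UNION can be sketched with closed type families in modern GHC (illustrative only; the code linked below predates type families and uses type classes with functional dependencies instead):

{-# LANGUAGE DataKinds, TypeFamilies, TypeOperators #-}
import Data.Kind (Type)

type family If (c :: Bool) (t :: [Type]) (f :: [Type]) :: [Type] where
  If 'True  t f = t
  If 'False t f = f

type family Member (x :: Type) (xs :: [Type]) :: Bool where
  Member x '[]       = 'False
  Member x (x ': xs) = 'True
  Member x (y ': xs) = Member x xs

type family Union (xs :: [Type]) (ys :: [Type]) :: [Type] where
  Union '[]       ys = ys
  Union (x ': xs) ys = If (Member x ys) (Union xs ys) (x ': Union xs ys)

-- With attribute names as types, a signature can now say
-- something like Rel (Union CustomerCols InvoiceCols)
-- without spelling out every column.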

Type inference for relational operators

mamcx, to answer your direct questions:

Yes, SQL is dynamic, and usually interpreted, so it doesn't do static type inference. (It supports a form of compiled code called 'stored procedures', but these typically do not support the full polymorphism of ad-hoc queries.) At the type level, SQL is an incoherent mess. It's an incoherent mess on many other levels too ;-).

Yes, if you regard the structure of a relation and its tuples at the type level, then each relational operator needs to do type inference. You need a very powerful type-level language to even get close. HList needs MPTCs, overlapping instances, and type-level functions (in recent versions), and even then it is on a knife-edge of type incoherence. The paper is very honest about the difficulties they encountered.

It's possible to implement the join operator type-safely. I have done so using the HList techniques I mentioned above. But even for relations with a few attributes, the number of (overlapping) instances is horrendous. Template Haskell might help, but the approach is really not scalable.

OTOH, I've not seen any other approach that gives static type safety without amounting to the same thing. The forthcoming records proposals for Haskell and (say) Nikita Volkov's records proposal (with lots of TH support) still boil down to type-level (or kind-level) indexing. The Lens approach can't do relational join, AFAICT.

Typing Relational Algebra in Haskell

I implemented the full relational algebra around the time the HList paper was written, as a test of the techniques, and it was pretty good. I did not find it horrendous, and I don't know why you think it would not scale.

The library I developed allowed you to define the DB schema in Haskell and it controlled creation and updating of the underlying database for you.

Here's a sample schema:
https://github.com/keean/Haskell-Webservice-Framework/blob/master/Lib/Relational/FamDb.hs

Here's some relational code using that schema including some relational operators like join:
https://github.com/keean/Haskell-Webservice-Framework/blob/master/Lib/Relational/Main.hs

Here's the library definition. It's a bit messy because it is using an early version of the HList concept, before the concepts had been properly categorised into a sensible library. There is an unfinished updated version of the library in the repo as well that uses the TIP from the HList library:
https://github.com/keean/Haskell-Webservice-Framework/blob/master/Lib/Relational/HSQL.hs

Here's the incomplete version of the library using the TIP (renamed to TIR here for some reason):
https://github.com/keean/Haskell-Webservice-Framework/blob/master/Lib/DBC/HSQL.hs

If I were looking at this again now, I would probably try and take the best ideas from both versions of the library implementation.

Typed logic programming

You may find inspiration from the type systems of logic programming languages like Mercury or Lambda Prolog.

The type system in The Third Manifesto is pretty good

The Third Manifesto describes a type system with value semantics, generated types for tuples and relations, and a bias in favour of strong typing and type inference. It works well for Andl.

Sorry, that's not *natural* join

Sorry Keean, in the RM when we say join we mean *natural* join.

What your library demos is SQL's INNER JOIN: a theta-join by equality testing between columns with different names from different tables.

If an AnimalId in the Animal table is the same Id as a ContaminatedAnimal then:
a) give them the same attribute name;
b) join them with natural join.

Natural join is pervasive in relational databases. (By which I do not mean SQL databases -- as Bruce quite correctly pointed out.)

Furthermore the sort of join I had in mind when saying "not scalable" is as per my parameterised polymorphic newtypes above.
c) natural join should match on the type-level label alone
d) then 'improve' the types of the payload to be the same.

(That is, given one relation with an attribute type `Name a` and another with an attribute type `Name b`: match those wherever they appear in the tuple; improve by `TypeCast a b`.)

Now HList mechanisms can do all that. What's not scalable is the number of instances it needs, and the number of instance constraints.

Natural joins are unnecessary, cross join is better

It can do inner, outer, left, right and natural join depending on the parameter passed, although I don't think the version of the code I found has that implemented. In any case the code to do the join would be in the library and not visible to the user.

The algorithm is simpler than you suggest: first intersect the types to get the 'joined' columns; then the difference between the union and the intersection gives the other, non-joined columns. This could be provided as a single type-level operator to the user.

I think improvement is not necessary, because it is a requirement that the data has the same meaning in all tables in which it is used, so it must be DFarmName in all tables for a join to pass type checking. The "D" stands for Domain, so the type constructors are there to strongly enforce domains and prevent bad joins. You could always use a plain String type if you don't want domains, but I think that would be a bad idea.
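
Continuing the closed-type-family sketch from my earlier comment (reusing the Member and If families defined there), that heading computation might look like:

type family Intersect (xs :: [Type]) (ys :: [Type]) :: [Type] where
  Intersect '[]       ys = '[]
  Intersect (x ': xs) ys = If (Member x ys) (x ': Intersect xs ys) (Intersect xs ys)

type family Difference (xs :: [Type]) (ys :: [Type]) :: [Type] where
  Difference '[]       ys = '[]
  Difference (x ': xs) ys = If (Member x ys) (Difference xs ys) (x ': Difference xs ys)

-- joined columns:     Intersect a b
-- non-joined columns: Difference (Union a b) (Intersect a b)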

In fact any join can be expressed as a cross join (full join) along with an extra restriction and projection, so mathematically only one form of join is necessary. By convention this is often natural join, but a relational algebra based on cross join seems a better option to me, as it separates the concerns. So in terms of relational algebra (and not SQL specifically), using cross join as a basis is perfectly acceptable, if not better. Codd's original relational algebra used cross join.
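
A small runnable illustration of that claim on plain lists of tuples (the row layouts are hypothetical): natural join as cross join plus a restriction and a projection.

cross :: [a] -> [b] -> [(a, b)]
cross xs ys = [ (x, y) | x <- xs, y <- ys ]

restrict :: (a -> Bool) -> [a] -> [a]
restrict = filter

project :: (a -> b) -> [a] -> [b]
project = map

-- rows: (animalId, animalName) and (animalId, contaminationType)
naturalJoin :: [(Int, String)] -> [(Int, String)] -> [(Int, String, String)]
naturalJoin as cs =
  project (\((i, n), (_, t)) -> (i, n, t)) $
  restrict (\((i, _), (j, _)) -> i == j) $
  cross as cs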

Codd, cross and natural

Re the library code for natural join, is it possible to get a look? Is it on Oleg's HList repositories?

any join can be expressed as a cross join ... with ... restriction ... projection.

To express natural join you also need RENAME.

Codd's original relational algebra used cross join.

With all due respect to Codd's terrific insights ...

Hmm. His original algebra was very confused about positional access to columns (as with mathematical relations) vs. access by "Domain" (per your DFarmName) vs. access by column name -- what he called 'roles'. With Codd's original, it's not possible to define natural join in terms of cross, because he underplayed column names/roles and didn't include a RENAME operator. That confusion has come all the way through to SQL today.

I'm not sure what concerns you're trying to separate in preferring cross-join. The reason for data modellers preferring natural join is precisely that there's only one concern (the Id for the animal: one animal, one Id, one attribute name.)

Rename and Explicitness

Rename seems somewhat unnecessary, as duplicate columns from the cross join can be removed by projection, and you can keep either column name; but in any case rename is already there, so all the components needed to implement a natural join exist. Do a cross join, restrict on the intersection of columns, project either of the intersected columns plus the union of the non-intersected columns, and rename away the table prefixes.

I don't think any of this is in the HList library, as it is an application of HList not part of the library itself. I will have a look and see if I can find another backup with the relevant code.

My reasons for not using natural join come from a practical database background. Very rarely do columns have the same name in both relations you want to join on. The most common join is on the ID, and normally a relation's own unique ID is just plain ID, but a relation referencing it needs to distinguish the name, so it would call it 'FarmID'. This occurs in many places: in parental relationships my name would be 'Name', but a reference to my name in a child's relation would be 'FathersName'. To me a natural join seems like a nice hack, but in reality I would rather see explicit restrictions so I know exactly which columns are being joined on. It goes back to the earlier comments about explicit types being preferable to implicit ones.

Rename is not needed if the design is done right

Your comments are perfectly understandable for many commercial databases, built with SQL very much in mind, but that's not the only way to do things.

If a database is designed with a full data dictionary, so that every column in every table has a unique name and meaning, then the rename problem just goes away. Every join is 'natural'.

Without even going that far, it's easy enough to adopt the standard that every foreign key and every joinable field is uniquely named.

Unique column names.

If every column has a unique name, then natural join will join no columns (because it joins on the columns that have the same name in both relations).

Cross joins are not quite enough

A natural join that preserves attributes from both arguments is strictly equivalent to a cross join plus a restriction and a projection. It can produce more output rows than input, and the algorithm has to be some kind of nested loop.

All varieties of outer joins are equivalent to a natural join and a union. In a strictly relational language such as Andl with no nulls, a different approach is needed (and there are several to choose from). This one is SQL only.

A natural join that preserves only left-side or right-side attributes is a semijoin. It can never produce more output tuples than input, and the algorithm can be a single loop and index lookup. SQL doesn't really have this, but it can be faked with IN or EXISTS.

The converse of a semijoin is an antijoin, which returns the complement of the semijoin, and uses a similar algorithm. SQL doesn't really have this either.

The algorithms are extremely simple; see the Andl source code.
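
As a sketch (not the actual Andl code), both algorithms are one pass over the left relation with a pre-built index of the right:

import qualified Data.Set as Set

semijoin :: Ord k => (a -> k) -> (b -> k) -> [a] -> [b] -> [a]
semijoin ka kb xs ys =
  let idx = Set.fromList (map kb ys)            -- build the index once
  in  filter (\x -> ka x `Set.member` idx) xs   -- single loop plus lookup

antijoin :: Ord k => (a -> k) -> (b -> k) -> [a] -> [b] -> [a]
antijoin ka kb xs ys =
  let idx = Set.fromList (map kb ys)
  in  filter (\x -> not (ka x `Set.member` idx)) xs   -- the complement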

Cross join is enough

A relational algebra requires five operators: restrict, project, union, intersection and cross join; with these it is a complete relational algebra, able to express any relational query. All the other flavours of join are convenience functions and can be expressed with just those operators. That's the maths theory, nothing to do with any application like SQL.

Five are not enough

The relational algebra requires 6 operators, but the precise set can vary. The set I recall is SPJRUN (select, project, join, rename, union, negation) but there are other equivalents. You omitted Negation, which is essential. See http://webdam.inria.fr/Alice/.

There is no reason to prefer a particular join (Codd quotes the Cartesian product, theta join, equijoin and natural join), but by convention natural join is chosen as the core if attributes are named and Rename is available.

From an implementation point of view, providing a single join is neither convenient nor does it assist in choosing a suitable algorithm. In this respect any join might claim to be 'enough', but each should be provided in accordance with its value to those who wish to use it.

Sufficiency

My point is that you can implement the other joins using the five basic primitives, so they do not need to be primitive but can be implemented as a library in the language.

Negation can be done with set difference (from the universal set). However, you are right in that in my haste I picked the wrong pair: it should be union and difference. With the operators you quote you cannot do set intersection or difference.

I think that, as with functional languages and type systems, 'naming' is considered a meta-operation mathematically, so it is not included in the five primitive operations. I don't think you need renaming with a cross join, as you can explicitly address any 'column'. Do you have a counter-example for the five primitives?

Counter-example: need for RENAME

Self-joins for a relation -- such as a parent-child hierarchy -- need RENAME to avoid name clashes.

Suppose I want a result showing a node, its parent, all its children. What are the attribute names for those three? For at least one of the three we need to invent a new name.
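
A sketch of the counter-example on plain tuples: joining a parent-child relation with itself forces us to invent fresh attribute names for the same underlying attribute (the names here are made up).

type Node = (String, String)   -- (name, parentName)

nodeParentChildren :: [Node] -> [(String, String, String)]
nodeParentChildren r =
  [ (name, parent, childName)
  | (name, parent)           <- r   -- the node itself
  , (childName, childParent) <- r   -- second copy of r: the "renamed" side
  , childParent == name ]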