Best default sequence?

Most languages have a default concrete data type for collections (arrays in C and Perl, lists in Lisp and Haskell, etc.) that gets used more often than it probably should. By "default", I mean the data type the language favors: better syntax, better pattern matching, better names for constructors (think ":" and "[]" or "cadr"), historical reasons, predominance in tutorial material, and so on. But for new languages, what criteria should we use when choosing an appropriate default? For a new functional language, I'm thinking about using catenable deques (with O(1) append, cons, and snoc) or democratic sequences (or some other balanced tree with O(lg n) complexity for almost all operations). Is there something out there that might make for an even better default? Or is the current batch of defaults already close enough to optimal?

(And please feel free to opine about different collection defaults in imperative and logic languages as well).


Re: Best default sequence

As you say:

... a default concrete data type for collections ... that gets used more often than it probably should.

Ideally providing a library with a variety of data structures with different algorithmic properties ought to do it. But people do tend to choose the default.



If the choice were between, say, skew binary random-access lists, which support both list- and array-like operations, and catenable deques, I'd choose skew binary random-access lists. I see random access to elements as more important than fast concatenation.
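For the curious, here is a compressed sketch of Okasaki's skew binary random-access lists (the names and helpers are mine, not taken from any particular library): a list of complete binary trees whose sizes follow the skew binary number system, giving O(1) cons/head/tail and O(log n) lookup by index.

-- Sketch of a skew binary random-access list.
data Tree a = Leaf a | Node a (Tree a) (Tree a)

type RList a = [(Int, Tree a)]   -- (size of tree, tree)

-- O(1): merge the two front trees when they have equal size.
cons :: a -> RList a -> RList a
cons x ts@((w1, t1) : (w2, t2) : rest)
  | w1 == w2  = (1 + w1 + w2, Node x t1 t2) : rest
  | otherwise = (1, Leaf x) : ts
cons x ts     = (1, Leaf x) : ts

-- O(log n): skip whole trees, then descend into one of them.
lookupR :: Int -> RList a -> Maybe a
lookupR _ []            = Nothing
lookupR i ((w, t) : rest)
  | i < w     = lookupTree w i t
  | otherwise = lookupR (i - w) rest
  where
    lookupTree _ 0 (Leaf x)     = Just x
    lookupTree _ _ (Leaf _)     = Nothing
    lookupTree _ 0 (Node x _ _) = Just x
    lookupTree s j (Node _ l r)
      | j <= half = lookupTree half (j - 1) l
      | otherwise = lookupTree half (j - 1 - half) r
      where half = s `div` 2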



An interesting variation is described in Hinze's Bootstrapping One-sided Flexible Arrays. In this implementation one can, at array creation time, trade lookup time for update time.

After having looked into the

After having looked into the question, I have come to the conclusion that catenable deques actually could have good random-access time, as they have a sort of list structure. They are unbalanced and heterogeneous, but I think you could make use of that to get better random-access (and possibly insertion) time. I think in practice you could get something like logarithmic time.

2c++

Bjarne and the C++ community in general usually advocate that std::vector be the default container in C++ because it supports amortized O(1) append, O(1) random access, and satisfies the List concept, as well as being the most space efficient data structure available. The primary defect is the O(N) insert.

I myself have always been a fan of tree structures, and I'm particularly a fan of the rank-ordered tree, which gives O(log N) for all standard operations (insert, remove, search), including indexing. From what I can tell, it's essentially the same as your Democratic Sequence. Unfortunately, it's not entirely appropriate as a general data structure, because it imposes an ordering requirement on the contained type in order to get log N search. However, searching could probably be weakened to linear time in order to remove the ordering constraint (which sounds like what the Democratic Sequence does). Of course, the main defect of rank-ordered trees is that they are some of the fattest imperative data structures around with high overhead per item. If the contained items themselves are large, then this isn't so much of an issue, but storing small objects in such a container can easily cost more in overhead than the contained data itself.
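As a rough illustration of the indexing idea only (not a full rank-ordered tree, and with rebalancing omitted), a node can cache the size of its subtree so that lookup by position steers left or right in time proportional to the depth; a real implementation would also keep the tree balanced so that the depth stays O(log N).

-- A tiny sketch: size-annotated nodes give O(depth) positional indexing.
-- Rebalancing (needed to keep depth O(log N)) is left out.
data Tree a = Leaf | Node Int (Tree a) a (Tree a)   -- Int = subtree size

size :: Tree a -> Int
size Leaf           = 0
size (Node n _ _ _) = n

-- Smart constructor that maintains the cached size.
node :: Tree a -> a -> Tree a -> Tree a
node l x r = Node (size l + 1 + size r) l x r

-- Index by position: compare against the size of the left subtree.
index :: Tree a -> Int -> Maybe a
index Leaf _ = Nothing
index (Node _ l x r) i
  | i < size l  = index l i
  | i == size l = Just x
  | otherwise   = index r (i - size l - 1)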

It is interesting to note that VB's default container, Collection, is a conventional hash table. Java seems to take the philosophy that hashing is very important, by mandating that all object types have a hashCode() method. By default I usually use Vector in Java, and HashMap for associative data.

In C++, I don't really consider any of the containers to be a "default", because the iterator interface makes them all more or less equally easy to use (including built-in arrays). So I just spend two seconds thinking about which operations are the most important, and choose appropriately. It would be nice if you could do the same in most FP languages, but I don't think the Iterator concept is something that jibes well with the FP way of doing things.

how can the std::vector have

How can std::vector have O(1) append but O(N) insert? Couldn't you just create a new std::vector with one element and append that to get O(1) insert?

Re: append vs. insert

Maybe I'm missing something here, but append != insert at all, right? O(N) insert makes sense because it has to traverse the list to find where to insert (assuming it is being done based on sort order, rather than by a given index, of course). Append would only be tacking onto the final position.

Heh

First, suggesting that insert() should "just create a new std::vector" is quite amusing to this C++ programmer. But the problem is that C++ makes strong guarantees about the performance of its containers, which constrains the possible implementations. In particular, C++ guarantees that the elements in a std::vector are contiguous in memory (meaning that taking the address of one element will return an address that is "next to" the address of the next element in the vector). This means that creating a new vector at all implies copying the existing vector, which we all know is O(N). Also, what might not be clear is that when I say "append() is O(1)", I mean "appending one element", not "appending another vector". In fact, concatenating vectors is O(N), because of the copying (which is implied by contiguity).

C++ spends a lot of time copying things, which is a bit ironic, because that's what FPs do as well. The difference is that FPs copy things because they are immutable, whereas C++ copies them because they are values rather than references. It may seem ironic that a language obsessed with performance would do something so potentially expensive as copying everything, but the fact is that value types can be allocated on the stack, making them extremely fast to create and destroy, and this more than makes up for most of the additional copying. Also, C++ goes out of its way to elide copies, even to the extent of introducing possible inconsistencies to allow certain elisions.

Also, it should be noted that append() is not O(1) on every call. It is only O(1) when amortized over very many calls. When the backing store needs to be resized, a call to append() can cost O(N) (because the elements must be copied to a new location), but because the time between resize events grows exponentially, the amortized cost to resize is constant.
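Concretely, with a doubling growth policy (any fixed growth factor gives the same asymptotics), appending n elements triggers reallocation copies of sizes 1, 2, 4, ..., n/2, which sum to less than n. So the total work for n appends is bounded by roughly 2n element writes, i.e. O(1) per append when averaged over the whole sequence of calls.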

Re: memory layout

Good point. The original question was about a new functional language, so I'd assume that memory mapping is not an issue that would be exposed to the software developer. That's a big difference between language philosophies which we should keep in mind / ask up front: what are the requirements or preferences?

Oh! I thought insert was "app

Oh! I thought insert was "appending one element". In that case what does insert mean?

inserting means inserting a n

Inserting means inserting a new element at a position other than the end.

Do you mean doing an destruct

Do you mean doing a destructive update on an element, or inserting an element between two adjacent ones?

The second

The semantics of insert() are that v' = v[0..i-1] ++ [e] ++ v[i..n-1], where i is the insertion point, n is the size of the vector, and e is the element to insert. Sorry for not making that clear.

Maps

I would vote for a) maps, b) maps, and, uhm c) maps. It's a functional language, no? ;-)

Program to the interface

Personally I prefer the Python (and Java 5, and I think Common Lisp) approach of making all built-ins work with any sequence. For...in, indexing, slicing, and list comprehensions all work over any sequence, though I suppose Python favors lists by giving them easier creation syntax and making them the output of list comprehensions.

If you must choose a default sequence, I'd go with STL deques for functional languages. Deques give you O(1) cons & snoc, car & rac, cdr & rdc (as long as you let the result share structure with the original - otherwise you always need to copy, which takes O(N)), and nthcdr, with linear-time insert and append. And deques are generally very memory efficient - you don't waste more than a constant, vs. up to 50% with vectors and a straight 50% with lists.

For imperative languages, the common operations are indexing, iteration, and insertion at the end, so I'd just use a vector. Simple, efficient, and effective.

Unfortunately, STL deques are

Unfortunately, STL deques are not persistent and actually don't allow sharing. All operations are destructive updates by necessity. Deques are actually implemented as a sequence of arrays, which makes the O(1) complexities a bit of a lie, since the control block itself cannot do everything in constant time.

If a functional default sequence is sought after, I vote for finger trees: amortized O(1) cons, snoc, cdr and rdc, O(log n) append and split, can be adapted to be indexed or ordered sequences, and of course are purely functional.
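For concreteness, here is a minimal sketch using Data.Sequence from the containers package, which is implemented as a finger tree and exposes essentially the operations listed above:

import qualified Data.Sequence as Seq
import Data.Sequence (Seq, (<|), (|>), (><), ViewL(..), viewl)

example :: Seq Int
example = (0 <| Seq.fromList [1, 2, 3]) |> 4      -- cons and snoc: amortized O(1)

appended :: Seq Int
appended = example >< Seq.fromList [5, 6, 7]      -- append: O(log n)

main :: IO ()
main = do
  print (Seq.index appended 3)                    -- positional lookup: O(log n)
  case viewl appended of                          -- take apart the front ("cdr")
    x :< rest -> print (x, Seq.length rest)
    EmptyL    -> putStrLn "empty"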

But actually the question is misleading: Give Haskell views or (almost as nice) pattern guards, and no default sequence is needed anymore.

Re: no steekin' default sequence

Pretty please elaborate - my understanding of views leads me to not understand/agree: making a view from a list to a hash won't magically change the performance characteristics, right? View-of-list-as-hash will still have crappy lookup because I thought at best a view can only give you the same performance as the original construct, unless there is some up-front work that gets done to totally convert the things from a list to a hash, whereupon you'd have to figure out amortized O().

It's just about the interface

In an ideal world, we would program to the interface of an abstract sequence and use whatever operation is appropriate. Then we could use the same code with stacks, queues, deques, maps, or whatever. Heck, we could even persuade a profiler to find the optimal sequence implementation for the use case at hand.

This ideal world depends on one thing only: a uniform interface. Unfortunately, an important part of the interface to the built-in list is pattern matching at the front. No other sequence can do that; the equivalent code using null/head/tail is simply too ugly to be used much.

Therefore, we need views. Using a view, I could pattern match at the front of any sequence. Pattern guards allow this, too, at the cost of some more syntactic noise. On the other hand, pattern guards are available today, views are not.
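To make that concrete, here is a small sketch of "pattern matching at the front" via a pattern guard, using Data.Sequence as a stand-in for an arbitrary sequence (the view is provided by viewl rather than by built-in pattern syntax):

import qualified Data.Sequence as Seq
import Data.Sequence (Seq, ViewL(..), viewl)

-- Sum a sequence by repeatedly matching its front through a view,
-- never touching the concrete representation directly.
sumSeq :: Num a => Seq a -> a
sumSeq s
  | x :< rest <- viewl s = x + sumSeq rest
  | otherwise            = 0

main :: IO ()
main = print (sumSeq (Seq.fromList [1 .. 10 :: Int]))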

Some thoughts:

Perhaps you could do some form of type inference to find the most effective of a range of implementations for a specific use pattern. For example, if you create a sequence and only do random access on it, you could very easily infer that it would be most effective as an array. That way you don't have to choose one standard implementation.

Another thing might be hybrid structures: different parts of a sequence might have different implementations. And of course you might combine these approaches.

my thoughts exactly

I've also been wondering how far we could get with using some sort of inference system to let the compiler decide on an implementation for a generic interface (I was thinking in terms of Java).

As far as I have read, actual implementations of Lisp/Scheme don't use basic linked lists (one value, one pointer) for the implementation of their lists anyway, they use more compact structures such as several values and one pointer per node. In other words, a combination of arrays and linked lists.

Anyway, I'm also curious to know what the gurus think of basic aggregate data structures. I see a pattern:
There are 'collections' of the same type but arbitrary length (lists in Haskell, ML), collections of the same type but fixed length with random access (arrays), and 'collections' of arbitrary data types but fixed length (tuples in many functional languages, structs in C). But there is no collection of arbitrary data types with arbitrary length (except Lisp... but that's because of a different emphasis on types)?

CDR coding

As far as I have read, actual implementations of Lisp/Scheme don't use basic linked lists (one value, one pointer) for the implementation of their lists anyway, they use more compact structures such as several values and one pointer per node. In other words, a combination of arrays and linked lists.

Right, it's usually done using a clever little trick called CDR coding.

Efficient Lisp implementations traditionally use type tagging whereby some of the least significant bits in a word representing a value are reserved for tagging purposes.

These tags are used for a variety of purposes (such as storing fixed-precision integers and byte characters directly in a word, eliminating further indirection) and CDR coding is one of the traditional uses: whenever accessing the CDR of a pair, you first check whether a special bit pattern is set in the least significant bits of the CAR. If it is, you know that the CDR value immediately follows the CAR in memory. Otherwise the CDR pointer follows the CAR in memory. (This latter case corresponds to the usual unoptimized pair representation.)

This complicates the code for the CAR and CDR mutators a bit but it's a worthwhile optimization. The reason this trick works so well is that most Lisp implementations use a copying collector whereby temporally sequential allocations end up located sequentially in memory. In other words, if you do (LIST 1 2 3) then the cells containing 1, 2 and 3 will immediately follow one another. By simply setting the special bit pattern in the cells for 1 and 2 and adding a nil cell after 3, you have a compactly represented list.

It just occurred to me that if you used a depth-first copying collector (most implementations use Cheney's breadth-first algorithm), you could dynamically compact lists using the above trick during collection. The idea is that a depth-first copy of a list places all the cells on the "spine" (i.e. the cons cells of the list) sequentially in memory, making the list a candidate for CDR coding.

Cdr-coding? Actual implementations?

Cdr-coding was prevalent in the early generations of lisp machines and back when memory was in short supply. These days, most implementations use a two-pointer cons cell. Even in the later generation lisp machines, the cdr-code check was taking more time than you got back in memory locality gains (what you're actually trading off in a cached architecture). Also, in most current implementations running on pipelined machines, pipe stalls for switching on the cdr-code are just as bad as memory load stalls due to the list taking twice as much space.

Today, as instruction caches become bigger (and so, with proper compilation, can hold the various forward jump targets, or with speculative execution of all cdr-code paths at once) you might -- might -- get some benefit from cdr-coding again. But it would be something you'd have to benchmark to make most Lispers believe it.

In any case, AFAIK, no current lisp or scheme implementation uses cdr-coding.

Not quite cdr-coding...

but I believe Stalin converts lists to vectors when the size is statically known. This is just what I've picked up from c.l.s. I don't think anyone apart from Jeff Siskind really knows what Stalin does...

Lua does this

The "default collection" in Lua is the dictionary, but the VM recognizes "array-like" uses of dictionaries (all integer keys, not terribly sparse, etc.) and implements the dictionary efficiently as an array in many cases. Obviously if the usage pattern changes, it can transparently "promote" the array to a full dictionary.

I believe you could take this much further.

(I'm surprised Rici hasn't mentioned Lua yet... ;)

Re: transparently "promote"

MS has a nice page on how to pick your collection. Note the hybrid dictionary which will transmogrify for you as it sees fit.

Rici mentions Lua

Actually, the Lua implementation is more interesting than that (and than the HybridDictionary mentioned above) because it represents collections as the composition of an array and a hash table, where either part can be empty and the rebalancing is done on demand: essentially, before expanding either part.

It's a nice compromise -- but it is, at the end of the day a compromise. Like democratic sequences, it attempts to offer "not awful" performance for a large variety of use cases, with the understanding that there is probably a much better implementation for any given use case.

Interesting question

I don't have an answer but that's a very interesting question!

All the tricks with accumulators and reverse and so on do seem like a fairly ridiculous way to write programs just for lack of a fast append or snoc.

(Deja-vu, did someone else say the same thing recently?)

Tail pointer...

But it is easy to maintain a tail pointer if you repeatedly need to add to the end of a list. Just because tconc happens not to be in the standard doesn't mean that it is more than a few lines of code to write. :-)

You should provide all possible solutions.

Arrays have O(1) access and O(N) insert time, while lists have O(1) insert and O(N) access time; and then there are trees/maps (sorted or hashed) and sets (again sorted or hashed), both balanced or unbalanced, with in between insert/access times. Therefore it would be better if you provide all of the above, or your language has a mechanism for providing all of the above. Even arrays need not be language elements, since they can easily be constructed by recursive functions over homogeneous tuples.
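In Haskell terms, a minimal sketch of "provide them all as libraries" might look like this, with the standard containers and array packages supplying the trade-offs listed above (the complexities in the comments are the documented ones for those structures):

import qualified Data.Map as Map        -- balanced binary tree: O(log n) insert/lookup
import qualified Data.Set as Set        -- balanced binary tree set
import Data.Array (listArray, (!))      -- immutable array: O(1) indexing, O(n) update

main :: IO ()
main = do
  let xs  = [1, 2, 3 :: Int]                 -- plain list: O(1) cons, O(n) indexing
      arr = listArray (0 :: Int, 2) xs       -- array built from the list
      m   = Map.fromList (zip xs "abc")      -- map keyed by the list's elements
  print (arr ! 1, Map.lookup 2 m, Set.member 3 (Set.fromList xs))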

Views

Views (in Haskell or Scala) allow the default sequence to be a bit more abstract.

Different issues

It seems to me that there are several separate issues here, which shouldn't be confused.

The first concerns the data structures you provide programmers. Naturally, different applications require different collection data structures (lists, trees, hashes). A good language should be able to provide these via libraries. Nowadays, you'd find these in the standard library, of course.

This issue isn't really about language design.

The second issue is about using a specific collection interface (or a couple of them) as an integral part of the vocabulary of the language, thus essentially making it one of the language constructs.

For example, most functional languages use lists in this way; Awk is based on associative arrays (hashes) etc. Thus, in functional languages, you find list comprehensions, list pattern matching, quoting and unquoting etc.

Not only do we know how the choice of lists influences other constructs, we have many years of research behind us about how to make these mechanisms effective and efficient (e.g., deforestation).

The second issue is the more interesting one, from a PL design perspective, of course.

A third issue, which seems to be the main focus of this thread is the actual data structure used to implement the list interface (i.e., ADT) that is built into the language. This should be hidden from the programmer, in most cases.

Re: integral part of the language

Are we allowed to say "a good language, or a good macro system, would let you make any particular collection library behave like an intergral part of the language" and turn it into a macro / metaprogramming issue? :-)

Personally, I don't think tha

Personally, I don't think that a "good language" implies one specific set of choices regarding fundamental issues. There are many sweet spots.

I think a good compact language might not meet your criteria, and might still be "good" for its intended purposes.

Table Oriented Programming

Some folks think that we should use DB-esque tables under it all, arguing that we get more flexible systems.

Relations

I'd vote for a relation as a first class data structure. Something along the lines of:

CREATE TABLE my_table (...)

my_table = Table.new (...)

SELECT * FROM a_table, another_table WHERE ...

result = selectFrom: a_table And: another_table Where: [...]

I've wished for such a thing numerous times, but it's really hard syntactically.

I'd love having relations as a data structure.

Just don't make me write in SQL to access them, especially if I'm talking to a local data structure and not to an RDBMS.

Why not?

It looks to me like LINQ is a pretty good direction. How would you improve on it?

Relational Programming

Syntactically, I think LINQ does a decent job. I think the major stumbling block is the fact that the relational model is at odds with the principles of data hiding and encapsulation that are central to most modern languages. However, I think there is something important to be learned. At some level, one can think of a relation as a structured type. But whereas the OOP model only gives us one operation on types (subclassing), the relational model gives us the full range of set functions (intersection, union, projection, etc.).

So if we are writing OOP classes for Customer and Account, we don't usually think of creating a composite class CustomerAccount which is the INNER JOIN of the two types, even though this is a perfectly natural thing to do in a database query. Part of the reason we don't think of doing such a thing is because the resulting type (relation) is not usually named in the query. Which leads to the interesting conclusion that to better support Relational<->OOP "relations", what we really need are *anonymous types* which are created by more powerful *type operators*. Just as functional programming was born of the notion of making functions first-class values, thereby allowing easily-expressed inline anonymous functions, "Relational Programming", to coin a term, can only be facilitated by allowing types themselves to become first-class values, and thus allowing type composition and anonymous types. Of course, the subtleties are numerous. Just because A accepts some messages and B accepts others does not by necessity mean that A op B will accept the union of the two message sets (or even the intersection). Deciding exactly what it means to compose two types under a relational operator is one of the biggest challenges to creating such a framework. But I think that this is exactly what is missing from LINQ, and why LINQ is only a stepping stone to the full integration of the relational and OO models.
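As a rough sketch of the idea (the Customer and Account types here are made up for illustration), the INNER JOIN itself can of course be written over in-memory relations today; the pain point the comment describes is that the result row type has to be declared by hand rather than arising anonymously from a type operator:

-- Hypothetical record types standing in for the Customer and Account
-- classes mentioned above.
data Customer = Customer { custId :: Int, custName :: String }
data Account  = Account  { acctOwner :: Int, balance :: Double }

-- The join is easy; the "CustomerAccount" row type has to be spelled
-- out manually (here, just a pair of fields).
customerAccounts :: [Customer] -> [Account] -> [(String, Double)]
customerAccounts cs as =
  [ (custName c, balance a) | c <- cs, a <- as, custId c == acctOwner a ]

main :: IO ()
main = print (customerAccounts
                [Customer 1 "Ada", Customer 2 "Grace"]
                [Account 1 100.0, Account 2 250.0])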

I believe this is the future of language design, however, as higher-order programming will increase in demand in response to the demand for more flexible and powerful libraries. Higher-order functions are nice, but higher-order types are even better. Higher-order statically checked types are the future.

You might like to take a look

You might like to take a look at HList and some of the work done on building an object system in Haskell based around it - it takes some type class hackery, but the sort of thing you describe is already possible. More generally, higher-order types are well understood in systems without type inference, and pretty common in languages with it (the term "type constructor" isn't a coincidence).

Even so...

I have a feeling that Haskell isn't to the point where VB and C# programmers will say: "Gee, this is a great database interface language!" The key, I think, is to come up with abstractions that are so easy to use that you don't have to have written a Ph.D. dissertation on lambda calculus to understand what you wrote. Part of the problem with functional languages is that there is a high barrier to entry compared to imperative languages in terms of the background knowledge required to use the language as it's designed. Yes, you can express programs more powerfully, but sometimes programmers just want something as simple as Nullable<T>.

Wow...after reading this paper on HList, I have to say that it's pretty clear that FP has not yet caught up to OOP. I don't know any serious database programmer who would choose to use HList over even VB. I can see nothing elegant or powerful about that approach. Frankly, it reminded me very much of template metaprogramming and the nightmare that can be. I can appreciate the value of strong static typing, but I have to wonder if all the noise is worth it. If this is the best we have for static metaprogramming support, I'd say there's a lot of room for improvement.

LINQ

I feel that LINQ is designed for "enterprise applications": the DB schema is fixed for you, updates are done through stored procedures, etc.

I was thinking about a more "agile/scripting" scenario. Creating tables on the fly, inserting and deleting data - does LINQ do that? Basically, I want to create a relation as a lexically-scoped local variable, join it with another relation that was passed in as a parameter, and return the result.

LINQ == flexible

From what I can tell, LINQ is surprisingly flexible. Consider this example, for instance. It illustrates how you can use any collection that satisfies a certain interface as a source relation. I have to admit, it looks like Microsoft is really taking the lead on something interesting (meaning not just a Microsoft-proprietary technology) for once.

Monads anyone?

Looks like we'll be dragging programmers over to monads sooner than we all thought...

import Control.Monad

main = print $ do a <- [0, 2, 4, 5, 6, 8, 9]             -- draw a from the first "relation"
                  b <- [1, 3, 5, 7, 8]                   -- draw b from the second
                  if a < b then return (a, b) else mzero -- keep only pairs with a < b


LINQ

You can do the same kind of thing as easily in any language that supports list comprehensions, or even in Lisps with the AMB operator. LINQ-queries-as-list-comprehensions isn't very exciting, IMO, and downright nasty for a lot of tasks for which list comprehensions would be appropriate in other languages.
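For comparison, here is the monadic query from a few comments up written as a plain list comprehension; the two forms desugar to essentially the same thing:

main :: IO ()
main = print [ (a, b) | a <- [0, 2, 4, 5, 6, 8, 9]   -- first "relation"
                      , b <- [1, 3, 5, 7, 8]          -- second "relation"
                      , a < b ]                       -- the WHERE clause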

One of the interesting things about LINQ that I don't think has gotten a lot of air time here is the supporting infrastructure in C# 3.0, such as anonymous types, type inference and structurally inspectable lambda expressions ("code-as-data").

The latter in particular is at least semi-novel for a non-Lisp language. (A code-walking macro in Dylan could probably do the same thing.) When you see something like "where a

An XML query processor would likely just invoke this method as a predicate, treating it as a black box. But a SQL query processor can actually pull apart the syntax tree of the method body, and inspect it to construct the WHERE-clause of a SQL query that can be passed to an RDBMS. (This is only possible for predicates that map nicely onto SQL WHERE-clauses, of course.)

Note that this code-walking feature only seems to apply to "lambda expressions", which are new to C# 3.0; earlier versions had anonymous functions, which seem to differ only in syntax and the way in which they participate in type inference. I'm not sure what the deal was with that design decision but it's probably a backward compatibility kludge. C# already has its share of those; "yield return" anyone? Ugh.

Queries != data structures

Of course you can use any collection as a source relation. But this is a bit off topic - the topic being data structures, not queries. There's a subtle difference :-)

In other words, I want a relation as a data structure. For example, I don't see how to do the following with LINQ:

INSERT INTO a_table SELECT a_field, another_field FROM another_table WHERE ...

Please don't make me manually append customers to a customer list. I want a logical relation, not a physical list. (Does a list raise a "duplicate primary key exception", rolling back the whole INSERT?)

Not sure

But looking here, in particular at the DLinq doc, it seems that it should be able to do what you want.

Relational and/or Object-Oriented (or There and Back Again)

Interestingly enough, the Next Big Thing™ (according to my Database Management professor) is object-oriented data modeling using UML for designing relational database structures. I did a lab assignment using Rational Rose to generate SQL statements and it was quite useful. It provided sensible abstractions for (what I'm told are) common relational idioms.

Now, (and I paraphrase) the textbook writers just need to write about it and the tool developers just need to improve the tools.

UML and ORM

Now, (and I paraphrase) the textbook writers just need to write about it and the tool developers just need to improve the tools.

Huh? All the "cool stuff" described in that whitepaper is exactly the same old object-relational mapping that people have been doing for years and years. UML seems like a pretty big red herring; it's useful as a graphical notation for sketching class layouts and associations but that's entirely orthogonal to the mapping issue they discuss. (Note that I am referring to the small UML subset that most people seem to know, not the full notation in all its baroqueness.)

Hmm...

All the "cool stuff" described in that whitepaper is exactly the same old object-relational mapping that people have been doing for years and years.

Is it? This is the first database class I've taken, so I can't say I'm even proficient in the subject. After reading a little bit about it, it sounds like there are definitely connections between the two.

UML seems like a pretty big red herring; it's useful as a graphical notation for sketching class layouts and associations but that's entirely orthogonal to the mapping issue they discuss.

The important aspect of data modeling, however, is using the UML class diagram to directly generate SQL. According to my professor (since I haven't done too much in this myself), everybody has been using the Entity/Relationship (E/R) model to design their databases. The UML-to-Oracle-SQL mapping was only introduced into Rational Rose in the last couple of years. And, after reading the text and doing a lab myself, I can see how there are differences between E/R and UML, often subtle or tricky.

Data modelling

The important aspect of data modeling, however, is using the UML class diagram to directly generate SQL.

According to my professor (since I haven't done too much in this myself), everybody has been using the Entity/Relationship (E/R) model to design their databases.

If anything, it's my experience that many programmers often start with an object model and map it onto a relational model using a few standard rules of thumb: classes become tables, inheritance is typically handled in a couple of different ways depending on various trade-offs, associations are mapped onto relationships, etc. In other words, I don't think most programmers use entity-relationship modelling outside of a subset that maps semi-nicely to the network model (I summon thee, vc! Cthulhu fhtagn nyarlathotep! :)).

If your relational model has been produced in this way, ORMs can present a reasonably convenient object-oriented interface to that model. In any case, the ORM typically handles most of the low-level SQL generation. Some, like Rails's ActiveRecord, do most of the SQL generation but require the programmer to write a bunch of SQL manually to handle certain cases efficiently. Others, like Hibernate, have their own higher-level query language and don't ever require the programmer to write any SQL (though it might be convenient to do so).

I honestly don't see what problem the UML-to-SQL tool in Rational Rose is supposed to solve. Like all the other problems Rational Rose claims to solve, this one seems to be a non-problem. :)

OT

Is this discussion relevant to the current thread?

OT

No, you're right. Sorry about that.