Why Lists?

I just want to ask the rather naive question: Why are lists the most popular builtin aggregate data structure in FP langs? I understand the historical precedent for lists, and I understand that lists are very well understood. But really, there is nothing magical about lists, and computationally, they aren't particularly cheap. It seems to me that the basic interface lists provide that FP consumes is head(), rest(), append(), and splice(). There may be a few others, but my point is that there are lots of data structures that can support this interface. So why not make the interface intrinsic, rather than the data structure, and support lots of different structures intrinsically (arrays and trees come to mind)?

car, cdr, and cons

The rest is just syntactic sugar. Any data structure that implements these three operations is by definition a List (think Interfaces). As for why lists are popular: (a). they are easy to understand and implement; (b). they can be used to emulate all the other kinds of structures - trees and arrays; (c). they can be used for metaprogramming - programs can be understood as lists of symbols; (d). they can be manipulated recursively.

(d). they can be manipulated

(d). they can be manipulated recursively.

Maybe this is the most important reason. An inductive definition of a data type gives rise to a "recursion schema" for defining functions over the data type, just as there is a corresponding "structural induction schema" for proving theorems about the data type. (Curry-Howard, anyone?) Binary trees and lists are perhaps the non-trivial data types with the simplest inductive definition.
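
To make that concrete, here is a minimal Haskell sketch (the Tree type is just an illustration, not something from the posts above): the fold for each type simply follows the shape of its inductive definition.

data Tree a = Tip | Node (Tree a) a (Tree a)

foldList :: (a -> r -> r) -> r -> [a] -> r
foldList f z []     = z
foldList f z (x:xs) = f x (foldList f z xs)

foldTree :: (r -> a -> r -> r) -> r -> Tree a -> r
foldTree f z Tip          = z
foldTree f z (Node l x r) = f (foldTree f z l) x (foldTree f z r)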

Not only lists

Everything you say can as well be applied to, say, trees. I don't see any compelling argument for lists here.

Because lists can be linked together,

whereas arrays need to be copied.

J uses arrays

J is the fp language that everyone seems to forget exists, yet it's likely second to Erlang in terms of real world usage.

J is based on arrays (though they're called "lists" in J lingo). You can index in O(1) time, etc.

True, but annoying

It actually turned me off to J, because it looked like it was designed more for numerics than general-purpose programming. It suffers from the "One True Data Structure" problem that most of the other FP langs do. Would it be so hard to intrinsically and naturally support arrays, lists, and trees all at the same time?

Now you're getting into personal beliefs.

You originally asked why fp languages all used lists, so J is a counterexample. Also see Backus's original "fp" from the 1970s, and the followup "FL." For a more recent language with native support for both arrays and lists, look to K.

I think the main reason you don't see arrays in typical lambda-based fp languages is because it's hard to make them both pure and fast, unless you have good support for whole array operations.

Misquote

You originally asked why fp languages all used lists

I said no such thing, and I defy you to quote where I did.

I think the main reason you don't see arrays in typical lambda-based fp languages is because it's hard to make them both pure and fast, unless you have good support for whole array operations.

I think this is a good answer. So how does J address these issues? Is there a fundamental set of array operations that would let an FP support arrays as first-class citizens?

Pure arrays

I think the main reason you don't see arrays in typical lambda-based fp languages is because it's hard to make them both pure and fast, unless you have good support for whole array operations.

Why is this the case? I remember someone saying something similar about hashtables. What are the reasons why it is difficult to have efficient pure arrays? Is it update in the presence of aliasing? (Would linear types help there?) If so, doesn't that apply to every data-structure?

More support for whole-array operations would be good. In general, more support for whole-container operations (whether set-operations, relational algebra, or folds, maps etc) would be good for any language. (And/or, perhaps, generalised query/comprehensions as in LINQ or monads?)

Aliasing

What are the reasons why it is difficult to have efficient pure arrays? Is it update in the presence of aliasing?

Well, yes.

Consider building up an array of length n, element by element, as you would in a typical recursive algorithm operating on sequences. Unless you have some kind of guarantee that the partially built array is unaliased (whether by linear type annotation or automatic inference by the compiler), you'll need to copy the array as you append each element. Hence constructing it takes O(n^2) time.
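
A minimal sketch of that cost, using the standard Data.Array API (the helper names here are made up):

import qualified Data.Array as A

-- Pure "snoc": every existing element is copied so the old version stays valid.
snocCopy :: A.Array Int a -> a -> A.Array Int a
snocCopy arr x = A.listArray (0, hi + 1) (A.elems arr ++ [x])
  where (_, hi) = A.bounds arr

-- n appends, each O(current length): O(n^2) overall.
buildNaive :: [a] -> A.Array Int a
buildNaive = foldl snocCopy (A.listArray (0, -1) [])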

Fair enough

You can also use ref-counting to determine at runtime if a structure is shared or not. Tcl does this for its lists (arrays) and dictionaries (hashtables), as it ref-counts values anyway. I'm happy with that, although making sure that a value only has one reference can involve some contortions (such as K-combinator abuse).

Even without O(1) update, arrays/hashtables are useful just for O(1) indexing, although perhaps the argument for making them a fundamental data-type of a language is weaker in that case.

Reference counting

You can also use ref-counting to determine at runtime if a structure is shared or not. Tcl does this for its lists (arrays) and dictionaries (hashtables)

Reference counting would definitely work well for the kind of scenario I sketched. I don't know why I hadn't considered that.

Version tree arrays (v-arrays)

I think that ref-counting (to avoid copying when ref-count is 1) is likely to be a bad idea. A probably better idea is to use version tree arrays or v-arrays. You can find a description and an implementation of them in Paulson's ML for The Working Programmer. Paulson credits v-arrays to Holmström and Hughes, but I haven't read the original paper. The basic idea of v-arrays is to maintain an up-to-date copy of the entire array and a version tree of (single cell) update nodes. The root of the version tree is an unused update node and simply points to the array.

When you read an element from a v-array, you follow the chain of updates until you either hit the root, and access the actual array, or an update node to the cell you want to read. When you want to write to the array, there are two possibilities. The simplest possibility is to simply create a new update node that contains the new value for the cell that you updated. Alternatively, you can re-root the version tree so that the new update node becomes the root.

Reads and writes to the root version are O(1) and have just a single extra indirection compared to traditional array representations. Reads to other versions of the array take time proportional to the number of updates. Writes to other versions can be done in O(1) time by simply adding a new update node.

Version tree arrays should give good performance when arrays are mostly used in a single threaded fashion.
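
A minimal sketch of the simple (non-re-rooting) variant in Haskell (the names are mine; re-rooting needs mutation under the hood, so it is left out):

import qualified Data.Array as A

data VArray e = Base (A.Array Int e)     -- the root: the actual array
              | Diff Int e (VArray e)    -- an update node: index, value, parent version

vread :: VArray e -> Int -> e
vread (Base a)     i = a A.! i           -- O(1) at the root
vread (Diff j x v) i                     -- otherwise walk the chain of updates
  | i == j    = x
  | otherwise = vread v i

vwrite :: VArray e -> Int -> e -> VArray e
vwrite v i x = Diff i x v                -- O(1): older versions remain readable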

Shifting arrays?

Alternatively, you can re-root the version tree so that the new update node becomes the root.
To me, this sounds like a zipper. I wouldn't be surprised if the next thing we hear from Oleg is shift/reset-based fully persistent arrays.

Aha, this post may be relevant. It's really just one step from environments to arrays, isn't it?

Copying isn't the issue

Because imperative languages have the same problem, and they deal with it just fine. You just have to create an array with more capacity than is needed. If you grow the array exponentially (increasing the size by a factor of 1.5 to 2), then building it up element by element takes amortized O(n) time in total, including copying to new storage (with doubling, the copies cost 1 + 2 + 4 + ... + n, which is less than 2n).

Yes it is

That solution is imperative; those arrays are not pure/persistent. They are destructively updated. If you did that in a functional language it wouldn't be functional anymore: after doing that you can't access the original array. To be able to do that, you need to copy the array first.

No it isn't ;)

The point is that the additional capacity is invisible to the language. It's just an implementation trick. The only thing you're "destructing" with the update is storage that has been allocated but not used yet. I don't see how it's any different from appending to a list by adding a node at the end rather than copying the whole list.

Err

I don't see how it's any different from appending to a list by adding a node at the end rather than copying the whole list.

Assuming cons lists, how are you going to add a node at the end of a list without modifying the last node of the list, making its 'next' point to your new node? Unless the list is unaliased, referential transparency does in fact require copying the whole list unless you play games with zippers (and I don't think this is what you had in mind).

So...

If I want to create a list of N elements by cons-ing one element at a time, N lists will get created, even if I discard all the N-1 intermediate lists?? Ouch!

Appending

I think you're missing the distinction between (cons new-head old-list) vs (append old-list (list new-last)). You were talking about the latter ("adding a node at the end of a list") which, yes, does in fact involve making a copy of old-list. (Edit: People often avoid appends by playing games with accumulators and reverse.)
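
In Haskell terms, a trivial sketch of the asymmetry and of the accumulator-and-reverse workaround:

consFront :: a -> [a] -> [a]
consFront x xs = x : xs        -- O(1): the old list is shared as the tail

snocBack :: [a] -> a -> [a]
snocBack xs y = xs ++ [y]      -- O(length xs): the spine of xs is rebuilt

-- Produce results in order without repeated appends: cons onto an accumulator,
-- reverse once at the end (O(n) total instead of O(n^2)).
mapAcc :: (a -> b) -> [a] -> [b]
mapAcc f = go []
  where go acc []     = reverse acc
        go acc (x:xs) = go (f x : acc) xs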

Hmm...

I wonder if this is something that noobs miss often...

Quick and Dirty implementation

You can define a pointer to the last used array cell, so you can preallocate n cells and append to the end, copying only when you want to append to an already used cell.

Sure, if you are only growing the array at the end

Arrays are terrible at doing random insert(). For example, keeping text as a single contiguous array is awful if you're writing a text editor, to take one real-world example. A rope is a much more natural data structure, and if you want to handle undo operations cleanly, a functional data structure also has a lot of advantages (even in an imperative language).

In the case of building an array by appending at the end, you can often achieve similar or better performance by using a list. If you're using the usual exponential allocation strategy, then you are prepared to have a peak memory consumption of more than twice the size of the array. If a pointer is no larger than an array element, then a list has less memory consumption. (Think of the moment where you are reallocating the array, say to 1.5 times its former size. At that moment you have the original array, and another one 1.5 times as big: total 2.5 times the size of the array allocated.)

One useful strategy is building with a linked list, and then (if contiguous random access is needed) allocating an array to the precise size required and copying the list into it. This is precisely O(n), and has roughly the same peak memory consumption as the exponential growth strategy.
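
With the standard Data.Array API that strategy is a one-liner (a sketch):

import Data.Array (Array, listArray)

-- Build as a list (possibly lazily), then make one exact-size copy at the end.
fromListExact :: [e] -> Array Int e
fromListExact xs = listArray (0, length xs - 1) xs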

But there are lots of real-world, code-monkey type problems where:

  1. random access is not needed; and
  2. it's handy to be able to share partial structures

In those cases, the habit of forcing everything into a contiguous array is counterproductive.

Its the lack of sharing that

It's the lack of sharing that makes persistent updating of aliased arrays slow. When you cons a value onto a list, you just create a new pair of pointers to the old list and the new element. You never copy anything. The two structures share the same tail. If you wanted to do that with an array, you would have to create a new array. If you had spare buffer space at the end of the array you could just write there, but that would destroy the old array, which would be bad if it is aliased. But with linear typing you could guarantee that it's not aliased, and therefore you could have arrays that are both pure and fast. I think Clean has these.

Pure Arrays and update

Many of the reasons that random-access arrays aren't heavily favored in FP languages have been cited; chief among them is that a naive implementation of (non-destructive) update takes O(n) time, requiring the entire array to be copied.

The array problem can be broken up into two sub-problems:

* Initialization of an array. The standard imperative algorithm for this is to iterate over the array's elements, setting them one at a time. In other words, to initialize an array of size n, n destructive updates are required--if each update requires a copy, this is O(n^2).

* Mutation of an existing array (which may be aliased).

In the first case, there are many ways to deal with this. An array-being-initialized is a linear object (no aliases until the constructor finishes), so update-in-place is a permissible optimization, giving O(n) to initialize an array. A type system which could prove this would be beneficial.

A second way to deal with array initialization (and bulk array modifications) is with better linguistic support. Part of the problem is that an imperative construct (a loop over the elements) is being used. Instead, we need linguistic constructs to manipulate arrays as a whole, rather than on a per-element basis.

Better linguistic constructs for array initialization include:

* Array comprehensions--similar to list comprehensions.
* Atomic array copy
* Atomic array copy with changes--for example, array B is a copy of array A but with the 3rd element set to 5.
* Atomic list-to-array copy--given a list, produce (atomically) an array containing the same elements.
* A sparse version of the above; where a list of (index, value) pairs is used to populate the array (and other elements are given a suitable or specified default value).

For the second case--assuming that an aliased array needs to be updated--it need not require O(n). Use of a two-level structure allows purely functional single-element update in O(n^(1/2)); a three-level structure allows single-element update in O(n^(1/3)); in both cases preserving O(1) random access (though as the number of levels increases, constant factors get rather large). If the number of levels is allowed to vary with the size of the array, you end up with a tree. There are several papers on this topic, including one by Chris Okasaki. (Functional array implementations are not covered in PFDS, however.)
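
A toy sketch of the two-level idea (the names are mine, not taken from those papers): the outer array holds chunks of roughly sqrt n elements, so an update copies one chunk plus the outer spine while indexing stays O(1).

import qualified Data.Array as A

data Chunked e = Chunked
  { chunkSize :: Int                           -- roughly sqrt of the total size
  , chunks    :: A.Array Int (A.Array Int e) }

index :: Chunked e -> Int -> e
index (Chunked c outer) i = (outer A.! (i `div` c)) A.! (i `mod` c)

update :: Chunked e -> Int -> e -> Chunked e   -- copies one chunk and the outer spine
update (Chunked c outer) i x = Chunked c (outer A.// [(q, inner A.// [(r, x)])])
  where (q, r) = i `divMod` c
        inner  = outer A.! q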

What is an Array?

Is an array the ability to access an element by an index? Sounds like a dictionary (tree, hash, etc...) to me. In which case, an update can be handled not too inefficiently with a node swap within the tree.

If, however, you define arrays not by the interface, but rather by the implementation - i.e. contiguous memory - then the updates do get rather expensive.

I think the definition is tha

I think the definition is that you can access an element by an integer index in constant time. There don't actually seem to be any operations that can't be implemented on any sequence, which makes the choice of which sequence to use moot if you don't consider the implementation.

A List is one form of a Tree

But I'd maintain that lists are simpler than the more general trees in terms of both implementation and recursion - unless you want to narrow the type of tree down from n-ary to binary (or unary). Starting from the top of the tree, one must recurse down to each of the leaves. For a unary tree, there's just two possibilities (child leaf or null) - but then this is just a list in disguise. For a binary tree, there are two branches that must be followed on the recursion and each can either be null or a child leaf - which means that there's 4 patterns to match at each node instead of just two. And for an n-ary tree, there has to be some way to handle the collection of child nodes at each node (perhaps a List (unary tree) of child nodes at each node?).

(Whoops. Meant to be in response to "Not only lists" above)

Nested Lists are Trees

'((1 2 3) 2 3)

Structurally, yes

But then, car and cdr won't traverse that as a tree. It will traverse it as a list, which corresponds to doing something like a right-most depth-first traversal without backtracking. Which means that to traverse it as a tree, you need to assume that it's a tree and do extra work. On the other hand, the STL doesn't care if you give it iterators into a list, an array, a tree, or a number generator. It will traverse it in whatever manner the iterator has defined, and the builtin iterators for the tree-based structures traverse the entire tree. What would be nice is something like the iterator abstraction, where you can make car and cdr descend the tree without having to think of your structure as a tree (until you wanted to). And it would be really nice if such functions were part of the language, thereby making tree-structured lists first-class citizens.

Lists

But then, car and cdr won't traverse that as a tree. It will traverse it as a list, which corresponds to doing something like a right-most depth-first traversal without backtracking. Which means that to traverse it as a tree, you need to assume that it's a tree and do extra work. On the other hand, the STL doesn't care if you give it iterators into a list, an array, a tree, or a number generator.

CAR and CDR are the low-level accessors to cons cells. People may (and do) build abstractions on top of these, ya know. You can argue about the performance penalty imposed by building more complicated structures out of cons cells but what you're arguing above seems like a strawman--it's very easy to build generic iterator protocols in most functional languages. I do think that some Lisp-like languages are behind the times in these regards. If you want to have a look at a more modern Lisp, take a look at Goo.

car & cdr "low-level"??

Are you sure? I swear I've seen tons of FP code pattern-matching by slicing the first element off a list and comparing it to different cases. I'm sure you could call an Iterator-like abstraction function to give you the right traversal, but nobody does that because they just use lists. And they just use lists because that is the easiest thing to use. If it were as easy to traverse a tree, perhaps people would consider using them more often. And I don't buy the "just low-level" argument. The fundamental operations for iterators are begin(), end(), ++, *, and ->. You can build abstractions on top of those, but the whole point of the iterator interface is to be able to use that level of abstraction directly. And most people do.

Abstraction

I'm all for more powerful out-of-the-box abstractions, especially iterator protocols. The reason you see people "slicing the first element off" via pattern matching is that inductive function definition via pattern matching is well suited for working with inductively defined data types such as trees and lists. People write functions that process e.g. binary trees in exactly the same way, and I've seen a lot of functional programs that work with trees rather than just lists.

Regarding my "low-level" comment: people do often end up using CAR and CDR directly. My point was that if you are using cons cells to build, say, a trinary tree data type, a sane programmer is not going to force the users of the data type to manually call CADDR or whatever. Rather, he is going to expose an interface that operates at the appropriate level of abstraction (TREE-LEFT, TREE-MIDDLE, TREE-RIGHT, etc). In any case this applies more to Lisp. In languages like Haskell and ML, algebraic data types and pattern matching make it easy to define your own data types and write functions that operate on them without having to work with cons cells.

Literally Low-Level

In the specific case of Lisp's "car" and "cdr," "low-level" needs to be taken literally: these were "contents of the address part of the register" and "contents of the decrement part of the register" for an IBM architecture whose number escapes me just now, and really were a single instruction each.

While it's true that this made lists "fundamental" to the LISt Processing language, things certainly didn't stay that way for very long. Even very early dialects of Lisp such as MDL quickly adopted new data types (Stu Galley: "MDL was Lisp 1.5 with data types"), and some of the earliest object systems written ("Flavors") were in Lisp. By the time Common Lisp was de facto standardized in 1984, it had more data structures and algorithms built-in than you could shake a stick at, and it was the first ANSI-standardized object-oriented language.

Looking at languages from the ML family (and I include Haskell here), I think we've started to see, over the last few years, greater application of folds and unfolds ("origami programming") to structures other than lists, as the category-theoretic notions behind such folds and unfolds ("catamorphisms" and "anamorphisms") percolate to the surface. One of the reasons that I think Oleg's Zipper (vs. Huet's) is important is precisely because it derives from a traversal interface rather than the underlying data structure, providing a shockingly universal construct for deriving an "update cursor" into any immutable data structure from its traversal interface!

So I think FP is getting where you want it to be... right about... now. :-)

All the more reason to use Lists!

Actually, there's a lot of overlapping functionality between lists, trees, arrays, dictionaries, such that if you implement one, you can get to the other without too much trouble. Though in general terms, I wonder whether the idea of Collections is the most general data structure.

Performance

This is the reason FP purists and imperative code monkeys don't get along. Yes, the only difference between lists, trees, arrays, maps, etc. is the complexity guarantees of their operations. The problem is in that little word "only". When you are solving toy problems that easily fit into a small memory footprint, that word is wholly justified. When you are dealing with thousands or millions of objects/records, that word betrays an ignorance of what is important to the people who pay the bills. Believe it or not, many people actually *do* have to think about space and time efficiency, and *cannot* afford to hope that they have properly converted their algorithm to tail-call form such that TCO *is guaranteed* to be applied. Those people care very much whether you have an actual tree or hashtable or "just" an emulation. Speaking of which, a hashtable requires fast indexed access, and is one of the more important containers in a good programmer's toolbox. FP language designers should keep that in mind when deciding how well to support arrays. Arrays might not be as elegant as lists, but they are far more important in big applications and apps where speed really matters.

Performance

The distinction I find most useful is not FP vs. imperative but battle-hardened vs. not.

Erlang is an example of a battle-hardened functional language & implementation. Prehistoric Erlang programmers have met with lots of hard problems like the ones you allude to: bad execution speed, bad I/O handling, no fast trees/hashtables, slow GC, no way to handle millions-of-element datasets, lack of documentation, etc. Thankfully for us they were working on important systems and they had to roll up their sleeves and solve these problems.

Life is pretty easy for modern-day Erlang programmers because most of the problems we face have been solved before and we only have to think about the unique parts. We would have to be doing something pretty exotic to hit a problem that would require hacking the language or runtime system.

This cuts even in unsettling directions. I have heard a few serious Common Lisp hackers admit that, damn, it was so much easier to do this in Emacs Lisp instead :-)

Hmm, what sorts of things are

Hmm, what sorts of things are easier to do in Emacs Lisp than in Common Lisp?

Well, I use keyboard macros all the time in ways that would be cumbersome in CL, but I'm sure that's not what you mean.

Tree more fundamental

Since the tree is the more general structure, and arrays are more general than scalars, shouldn't FPs support K-ary trees with N-arrays as elements as the fundamental data structure and let users pick a default where N = K = 1? Let that be a challenge to the FP community to bring such support to your languages. ;)

Algebraic DTs and Iteration

First, most (statically typed) functional languages let you define new data types very easily. Lists are possibly overused (but more on this later), but they certainly aren't the only thing available. So defining my binary tree with whatever the heck I want as an argument is simply
data Tree a = Empty | Leaf a | Branch (Tree a) (Tree a)
I could easily make it more flexible. Heck I could abstract it up the wazoo,
data FTreeAlg f a x = Empty | Leaf a | Branch (f x)
With newtype Fix f = In (f (Fix f)) we could reproduce binary trees with newtype Dup a = Pair a a and type Tree a = Fix (FTreeAlg Dup a), or how about type NAryTree a = Fix (FTreeAlg Array a) (Array here being a simplified type standing in for the actual Haskell array type).

Next you comment that in C++ you "just use iterators" and call it a day. *Lists are the FP iterators*, especially in a lazy language like Haskell (in most cases; sometimes lists won't do, but often they do). So I write one function over lists and different traversals over a tree that flatten it to a list, and I can combine them together simply, and each is useful on its own as well. Lists are also very flexible in this sense: they can be considered almost as first-class "chunks" of loops.

[Edit: I also want to note that lists aren't particularly built-in to Haskell. Apart from some minor syntactic sugar, the list type is just a library definition (data List a = Cons a (List a) | Nil).]

Algebraic DTs

So what it comes down to, is that most FP languages have syntactic constructs for list manipulation on top of the language constructs one can use to build and manipulate any other algebraic data type.

It is easier to type [1;2;3] than it is to type 1::2::3::[] (or even cons(1, cons(2, cons(3, nil)))) so mostly one uses data types with such syntactic support. So why don't all languages have direct syntax for tree data structures?

I think what David is trying to get to is some syntactic support for interfaces instead of specific data types. So [1;2;3] could be used for lists as well as for arrays, etc. This would still be restricted to very list-like structures of course (as you could guess from the interface definition)...

Exactly

For instance, in addition to car and cdr, what if languages supported a builtin like first^, which I define as:

first^ (a b c) =
a, given a is an atom
first^ a, otherwise

Similarly, rest^ (a b c) =
cdr (a b c), given a is atomic
cons (rest^ a) (b c), otherwise

I just picked the names arbitrarily, but the point is that just adding these two functions as builtins gives you depth-first, in-order traversals of trees. How hard is that? The reason you don't see functions like this in Lisp is, of course, because you have to use lists as records as well as containers, making such functions more or less useless. But in a language where it is easy to distinguish between containers and structured objects (make containers subtypes of Iterable?), it seems like these would be useful builtins. To round it out, you could define pre- and post-order traversals like so:

first (a b c) =
x, such that x is the first atom in the list

rest (a b c) =
(x), such that (x) is (a b c) with first removed

first> (a b c) =
x, such that x is the first non-atom in the list

rest> (a b c) =
(x), such that (x) is (a b c) with first> removed
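
A rough Haskell rendering of first^ and rest^ (the SExp type and the names are mine; like the definitions above, it assumes every sublist contains at least one atom):

data SExp = Atom Int | L [SExp]

-- The leftmost atom, descending into sublists (depth-first, in order).
firstUp :: [SExp] -> Int
firstUp (Atom a : _)  = a
firstUp (L inner : _) = firstUp inner

-- The same structure with that atom removed; exhausted sublists are dropped.
restUp :: [SExp] -> [SExp]
restUp (Atom _ : xs)  = xs
restUp (L inner : xs) = case restUp inner of
                          []     -> xs
                          inner' -> L inner' : xs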

Depth-First?

Head and Tail (first and rest) do not have an exact analogue in the world of trees. You might implement such a function as depth-first, but there's no particular reason why that makes any more sense, from an iteration standpoint, than breadth-first traversal.

Anyhow, most tree libraries allow one to access a tree as a list - perhaps a performance penalty is involved, but no more so than defining an iterator for a tree as you've asked for above (that is - you've defined 2/3rds of the List interface for the Tree - car and cdr).

"How hard is that?"

Not hard at all. In fact it is so easy in most languages there is no reason for "first^" to be a built-in. In C++, iterators are not built-ins. In Haskell it is trivial (and has been done) to make a type class of such operations. The most pleasant approach for Haskell uses multiparameter type classes (which are admittedly "non-standard", but there are other approaches as well): class Sequence seq elem where first :: seq -> elem; rest :: seq -> seq. Other languages have other mechanisms. The mechanism you seem to desire is (a special case of) a basic tool of software engineering: defining interfaces independently of implementation, and just about every language supports it. (Incidentally, part of the point of my previous post is that using lists as an intermediary is just such a form of abstraction. Still more incidentally, what I described is part of the idea that inspired the OO design pattern called "Transfold".)
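
Spelled out a bit further (a sketch only; the uncons-style method and the in-order tree instance are my own illustration, not a standard library):

{-# LANGUAGE MultiParamTypeClasses, FlexibleInstances #-}

class Sequence seq elem where
  next :: seq -> Maybe (elem, seq)   -- "first and rest", with emptiness handled

instance Sequence [a] a where
  next []     = Nothing
  next (x:xs) = Just (x, xs)

data Tree a = Leaf | Node (Tree a) a (Tree a)

instance Sequence (Tree a) a where   -- in-order traversal via left rotations
  next Leaf                      = Nothing
  next (Node Leaf x r)           = Just (x, r)
  next (Node (Node ll y lr) x r) = next (Node ll y (Node lr x r))

toListS :: Sequence s e => s -> [e]  -- written once, works for any instance
toListS s = maybe [] (\(x, s') -> x : toListS s') (next s)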

I don't see what you are getting at. What is it that you think is missing? Simply that these aren't hardwired into the language? or that it is not standardized?

[edit] Responding directly to your top-level post: I illustrated in my previous post that lists are not "built-in" to Haskell except for some trivial syntactic sugar (ironically, arrays are built-in to Haskell insofar as you couldn't reasonably define them in terms of the constructs provided by the language). This is true for most languages near Haskell e.g. the MLs. Lisp is not the "world of functional programming" let alone the Lisp 1.5 you seem to be referring to. My previous post also gave reasons why lists are popular. Finally, replacing "lists" with "arrays" the same argument could be made about mainstream imperative languages. For example, neither C++ nor Java do what you describe, but both certainly have such "derived" interfaces.

Both

I don't see what you are getting at. What is it that you think is missing? Simply that these aren't hardwired into the language? or that it is not standardized?

Before the STL, people would commonly hand-roll their own containers, even if other people wrote perfectly good ones for them to reuse. People will use the mechanisms that are most naturally supported by the language they are using. My point is that generalizing the list interface to support other fundamental structures will make it feel just as natural to use those as it does to use lists. You can write libraries on top of anything, but unless they are standardized and ingrained in the culture of that language, people will take the path of least resistance. I'm not interested in what *can* be done, because the answer to that question is mostly boring. I'm interested in what *is* done, and every tutorial you find for just about any FP lang will talk almost exclusively about lists (and possibly arrays, if the language doesn't treat them like a red-headed stepchild). It's obvious that trees simply are not as common as lists. In C++, on the other hand, you are just as likely to see a map or a set as you are a list or a vector. That's because the language makes it just as easy to use any of the standard containers. And the fact that you have to transform other structures into lists also proves that many languages cater to lists better than other structures.

Haskell practice

1) Many of the tutorials I see with Haskell don't tend to use sets and arrays admittedly, but they do tend to use more complex tree-like data structures because they are so easy to define.

2) The flatten-to-list approach is not necessary, as I demonstrated with the Sequence typeclass.

Is Sequence standard?

Because from what I can tell, that is the List-as-Iterator interface I've been talking about. My argument is that people should code to Sequence, rather than to List. And Haskell should *provide* standard containers that offer different complexity guarantees, rather than letting people write them. Then there would actually be a reason to use Sequence. My point is that language design is as much about *culture* as it is about *capability*. Each language has to define its set of idioms for solving the most common problems. And then those idioms must be instituted as standardized mechanisms. That is how languages evolve. It doesn't really matter if they are builtins or libraries (I actually prefer libraries to builtins, if they can result in the same syntax). It *does* matter if a programmer has to wonder if (s)he can use them in a given program.

the list *is* the common sequence interface

If all you want to do is traverse a sequence (as you would do with a pair of iterators in C++), you simply convert the sequence to a list, then fold over the list or deconstruct it by repeated pattern matching. What you seem to be missing is that this is *the* *same* as the iterator approach. The list is never built completely, but lazily while it is consumed. Shortcut Deforestation might even optimize it away completely. No slow operations are involved: pattern matching at the front is O(1). Lists are exactly the common interface you are missing.

Of course there's a place for other containers. Modern Haskell provides pure as well as imperative arrays as primitives and sets and maps in the common libraries. To traverse them, you convert all of them to lists. Not because of culture, but because it is fast, correct and convenient.
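
For instance, with the maps and sets from the common libraries (a small sketch, assuming the containers package's Data.Map and Data.Set):

import qualified Data.Map as Map
import qualified Data.Set as Set

-- Traverse by converting to a list; laziness produces the list on demand.
sumValues :: Map.Map k Int -> Int
sumValues m = sum [v | (_, v) <- Map.toList m]

firstOver100 :: Set.Set Int -> Maybe Int
firstOver100 s = case filter (> 100) (Set.toList s) of
                   (x:_) -> Just x
                   []    -> Nothing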

Example

Ok, help me understand this. Let's say I want to call this simple function:

res = map f mydata

Now map can be defined like so:

map f [] = []
map f (x:xs) = f x : map f xs

But suppose that mydata is a tree, represented as a list of lists. So you're saying that I have to convert mydata to a list first? Would that be something like so:

res = map f (makelist mydata)

If so, my whole point is that this makes List the preferred structure, because everything that's not a list has to look like one. Whereas, if map were defined like this:

map f [] = []
map f x = f (first x) : map f (rest x)

and you could define first and rest differently for each data type, then you wouldn't need to convert anything to a list, as long as it conforms to this interface. Pardon my pseudo-Haskell, but I hope my point is clear.

Functional iterators

But suppose that mydata is a tree, represented as a list of lists. So you're saying that I have to convert mydata to a list first?

If I understand you correctly, you are implying that you would rather that you instead obtain some kind of iterator to mydata. It's possible I'm wrong about this but it seems to me that in a lazy functional language like Haskell there is essentially no difference between "makelist mydata" and "makeiterator mydata".

Let's think about it in C++ terms since your idea of iterators seems to come mainly from there. An STL sequence ("forward-and-backward") iterator is something that essentially looks like a linear list to a client. What are its virtues over constructing an actual list to represent the same sequence? The main one is that you don't waste the space in copying everything, which is only possible because ++ is a destructive/mutative operator. In a purely functional language, you would have to make the equivalent of ++ return a new iterator. So you _would_ end up effectively allocating as much space. In a lazy language like Haskell, you wouldn't end up paying an up-front cost in either time or space for parts of the list you don't traverse--just like C++ iterators.

Not quite iterators

I recognize that iterators don't translate perfectly to FP. My point isn't to force an imperative concept on FP. My point is to generalize the use of the List concept. I really don't care if you spell it "makelist" or "makeiterator". I consider it a flaw that you have to adapt your data structure at all. My proposal is that the Haskell list notation to slice a list into car/cdr be extended democratically to apply to any data structure that wishes to support the car/cdr interface. I tried to spell that out by using explicit function calls in my modified definition of map, but I would be perfectly happy if rather the list notation were to call user-defined functions on the passed-in data structure.

Let's review. Typical definition of map:

map f [] = []
map f (x:xs) = f x : map f xs

The problem with this definition is that it gratuitously assumes that the argument is a flat list. I say "gratuitously" because you can make a trivial change to it that will work with *any* data structure that provides the appropriate interface. Here is a suggested change:

map f [] = []
map f x = f (first x) : map f (rest x)

So instead of just taking car x, we call some specially-named function like first, which anyone can overload for their custom data type. This is a generalization of car that is implicit in the default list-based syntax (x:xs). You similarly generalize cdr so that map will now work perfectly well with any data structure, given it has suitably overloaded rest. Now, it would be a pain to go back and rewrite all those standard higher-order functions to use this generalization. So a better idea is this:

Simply define the list-based syntax to implicitly call the specially-named functions first and rest. So when you write:

map f (x:xs) = f x : map f xs

All the x's get translated to (first x) and all the xs's get translated to (rest x), and the (x:xs) expression itself gets translated to x. You then define first and rest for List, and anyone who wants to use the List interface (such as a tree container) simply overloads first and rest for their container type. Now you don't need to do this:

res = map f (makelist mydata)

Because mydata presumably already conforms to the "Sequence" concept (that is, it overloads first and rest). What that means is that if there is more than one conversion to list, you would specify a canonical conversion that would List-ize the data structure, and encode that in the overloaded first and rest functions. Instead, you just write your code like so:

res = map f mydata

Whether mydata is a flat list, a tree, a graph, or something entirely different that implements Sequence. It's not really a matter of efficiency. It's a matter of expressiveness. Lists are nice and great and they aren't going to go away. But I think a simple change like the above would go some ways towards levelling the playing field for other data structures. But perhaps I just really don't understand Haskell well enough to know what I'm talking about?

The way I see this interface as being similar to the Iterator concept is that I see first as being somewhat analogous to begin(), and rest() as being similar to ++ and end() (again, it's not a direct analogy). The point being that List is nothing special. What is special is the way you iterate over lists. In the C++ world, you do that with begin(), ++, and end(). And in the FP world, you do that with car/cdr. The analogy is that the conventional definition of map is written something like this:

map f (List x:xs) = f x : map f xs
Set s = ...
res = map f s // Error! s is not a list! Boom!

When it could be written like this:

map f x = f x.begin() : map f x.rest()
Set s = ...
res = map f s // Yes! s satisfies the interface!

I know that is an unholy mixing of C++ and Haskell that will have some of you showering in holy water for the next week, but I hope I got my point across.

P.S. I realize that first and rest are just adaptors that convert a data structure into a list. My point is that it should be hidden in the machinery of the language rather than forced on people because List has special status.

Views

Apparently there is work on doing exactly that in Haskell. There was a link in the "default sequence" thread on views in Haskell.

Interesting

Unfortunately, I don't grok enough Haskell to fully appreciate the discussion. Is the point that I can do something like this:

c = JoinList ...
res = map f c

Because JoinList has defined isEmpty, first and rest? If so, that is cool indeed!

Functor?

Perhaps you are asking for Functor?

One advantage of a Functor typeclass is that you can define separate ways to traverse a structure. You can map a tree depth first, breadth first, or whichever way you like.
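
For example, a minimal sketch of a Functor instance for a homegrown tree type:

data Tree a = Leaf a | Branch (Tree a) (Tree a)

instance Functor Tree where
  fmap f (Leaf x)     = Leaf (f x)
  fmap f (Branch l r) = Branch (fmap f l) (fmap f r)

-- fmap (+1) myTree maps over the tree directly, no conversion to a list needed.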

Big if

That's a rather big if you start off with. Of course not, you want to be able to insert elements into datatypes too, right?

Trees are omnipresent

It's obvious that trees simply are not as common as lists.

Although I agree that lists are somewhat overused in FP, I don't see trees underused. Almost all non-trivial algebraic datatype definitions in practice form trees in one form or the other, and they certainly are omnipresent in functional code. Also, I believe that "real" code uses (tree-based) maps and sets much more often than you seem to assume.

However, you rarely see "generic" tree data structures (except as implementation detail of other generic data structures, like e.g. said maps and sets). I suppose this is simply because specialised tree types are so easy and natural to define and use with pattern matching that there is little need to have something generic that tends to fit specific application domains only approximately. Consider abstract syntax trees, for example - you wouldn't want to use some generic tree representation for them, because that would not encode enough of their structural invariants in the types (plus, it would actually make working with them more cumbersome).

Lists on the other hand are the canonical inductive representation of sequences, such that there is little need to special-tailor them every time.

More explicit

I suppose this is simply because specialised tree types are so easy and natural to define and use with pattern matching that there is little need to have something generic that tends to fit specific application domains only approximately.

Depends where you are coming from. Yes, lists are omnipresent, such that they also are supported with some syntactic sugar in functional programming languages. Uhm, and yes, trees are also easy enough to define in, well, all FP languages. My feeling is that 'updatable' functions (and sets) are very good candidates to also support with some syntactic sugar in most FP languages, and I am actually quite surprised they aren't yet.

As an example, besides having [] as syntactic sugar as in Haskell/ML, it would be nice if functions and sets were also supported natively in FP languages. Something along the lines of having {} for sets and => for functions.

Below some examples how that may look like:

-- build a singleton set
singleton::T -> {T}
singleton t = {t}

-- the size of a set (would be better to support that natively, say O(1))
size::{T}->Int
size {}      = 0
size (x u S) = 1 + size S

-- flattening of a set of sets to one set
merge::{{T}}->{T}
merge {}              = {}
merge ({}       u S1) = merge S1
merge ((x u S0) u S1) = x u (merge (S0 u S1))

-- explicit insertion of key value into a function
insert::A -> B -> (A=>B) -> (A=>B)
insert x y f = (x => y) u f

-- the total of a bag of integers modeled as a function 
total::(Int=>Int)->Int
total {}             = 0
total ((x => y) u f) = x * y + total f

-- the inv of a function
invert::(A => B) -> (B => A)
invert {}           = {}
invert (x => y) u f = (y => x) u (invert f)

-- composition of two functions, g is treated as a function
compose::(A => B) -> (B => C) -> (A => C)
compose {} g             = {}
compose ((x => y) u f) g = (x => g y) u (compose f g)

-- a domain of a function
domain:: (A=>B) -> {A}
domain {} = {}
domain ((x => y) u f) = union {x} (domain f)
(Of course, the trick in the above examples is that 'u' models some, say, O(log N) insertion, lookup, and pattern match.)

Well, uhm, this is actually how I think of functions/sets. I rolled my own trivial size-balanced maps in ML to do that. Of course, it would be nice to have syntactic sugar for that, which I hope to implement someday in the PL I am working on, ETA: approx. 4 years though ;-].
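
For comparison, some of the examples above can be approximated today with Data.Map from the containers library (a balanced tree, so insert and lookup are O(log n)); the syntax is just less pleasant. A small sketch:

import qualified Data.Map as Map

-- 'total' over a bag of integers, modeled as a map from element to count.
total :: Map.Map Int Int -> Int
total f = sum [x * y | (x, y) <- Map.toList f]

-- 'invert' a finite map (on collisions, later keys win).
invert :: Ord b => Map.Map a b -> Map.Map b a
invert f = Map.fromList [(y, x) | (x, y) <- Map.toList f]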

FGL vs inductive sets and maps

Your examples remind me of FGL, which is based on the idea of inductive graphs.

Sets in CSPm

The machine-readable version of Hoare's CSP process algebra, which was developed by Bryan Scattergood as an input language for the FDR refinement-checker, includes an embedded functional language inspired by Haskell and Gofer. One of the nifty features that Scattergood included in this embedded language was built-in sets, in a style similar to the one you are advocating. See here for a brief guide to the syntax and operations (which include set comprehensions!). See here for an example use of CSPm sets - a CTL model-checker written (well, ok, hacked) in CSPm.

Sadly, CSPm isn't usable outside of FDR (although I bet it'd make a nifty concurrent functional language if anyone ever implemented a version that wasn't dependent on FDR, and that supported things like I/O - might be a fun project if I ever get enough spare time). But I often find myself wishing that other languages supported those set literals and set operations as built-ins. They're very convenient, and hard to give up once you get used to them.

Sets, Bags, and Mobiles

This is a great idea, but it takes more than syntactic sugar to do properly. Pattern matching needs to account for axioms like commutativity, associativity, idempotency, etc. Some term rewriting systems like Maude do support this, but I haven't come across proposals to put it into regular FP or OO languages.

I know, but problem or design choice?

I gather you hint at, like, what is the meaning of:

has_one_dom::(Int=>Int)->Bool
has_one_dom ((1 => y) u f) = True
has_one_dom _              = False

has_one_ran::(Int=>Int)->Bool
has_one_ran ((y => 1) u f) = True
has_one_ran _              = False

has_one_dom ((2 => 1) u (1 => 2) u {})

I think this is one example where I would actually prefer to just handwave axioms away and just go for the silly operational semantics that has_one_dom some_function is true if by chance the pattern matches a key/value pair where the key equals one. (EDIT: the example)

Seems like I always mention this...

The Kleisli Query System's CPL language included support for sets, bags and lists. Since it's a query language, it only includes a limited form of structural recursion over each of these types, which I believe encapsulates the collection axioms. IIRC, there's a low-level recursion operator that leaves the programmer free to screw up (by folding a set over a non-commutative operator, for example), but the more commonly used comprehension syntax is safe.

CPL is a very cool little language, and the Kleisli papers are pretty good reading. I think a successor to CPL would make a really great shell language, and if it was adapted to do stream processing, I think it could be a big success, especially now that so many people are familiar with comprehensions/generators.

Sets

There's always SETL.

lists important because of proof by induction?

As a newbie, I've been mulling this question myself. I think one reason lists are so useful is that it is convenient to reason about them. In other words, you prove something about one item, then you prove the same thing about the 'next' item, and all of a sudden you have proven a certain property about the list as a whole. I also get the feeling that lists are undervalued because most of us procedural programmers simply think of them as a linked-list type data structure. Lists, in functional programming, are more general, and often the implementations backing them are more efficient than simple linked lists.

Induction

I already mentioned this in my first post to this thread. Lists and binary trees are special in being among the simplest inductively defined data types. The "induction schema" that you outline is dual to the "recursion schema" that people use to write functions over lists. This quality is also part and parcel of what makes lambda calculus nice to work with, both for writing proofs about lambda calculus and functions over lambda calculus terms. The combinator calculus has an even simpler inductive definition but is hard for humans to work with by hand. Lambda calculus strikes an almost perfect balance.

A Vote for Sets (or Relations)

Lists, Vectors, or Trees all impose structure that often isn't needed to understand the problem. I have seen too many programs that use Vectors when Sets would suffice. I always advocate using the most flexible data structure that suits my needs, so that the compiler (and maintainer) knows the important features of what I am using the aggregate data structure for.

I vote for Sets as the builtin aggregate data structure.
1) The Set interface can be implemented in a multitude of ways that best performs for a given purpose
2) A set of identically typed objects is a relation, which is a generalization of a Map.

Even though Sets as the builtin aggregate data structure are nice, a builtin relational algebra would be even better.

Relations would point not towards Sets...

...but rather towards tuples as the base structure - a set could then be a collection of like tuples.

Problem vs Design Space

I really think that lists are just there because they have a really intuitive and easy implementation (they are the simplest of all recursive container data structures). But that is design space.

However, given how I think of functional programming, I mostly seem to manipulate sets/predicates/functions, and often (say for memoization or explicit lookup) this boils down to needing tree based maps rather than lists (since the former are the more natural translation of functions for performance reasons). Tree based maps are also a better choice for implementing pure 'manipulatable' functions with reasonable performance.

So, uhm, I actually think that it would make more sense for FP languages to natively support maps with some syntactic sugar than lists. And I feel that the historical reason for natively supporting lists is actually one of confusing design with problem space.

Probably this just surfaces now since people want to build more complex applications than ten years ago in a functional manner?

Why lists? Five reasons.

I will give you five reasons.

  1. Tradition: Lisp has them. Prolog has them.
  2. Simplicity: The unary encoding of the natural numbers (data Nat = Z | S Nat) is about the simplest recursive datatype. Lists are about the simplest parametric recursive datatype.
  3. Pedagogy: Because of the above, most basic texts concentrate on lists. Because basic texts have a larger audience, most people concentrate on lists.
  4. Technical: Sequences are a pretty basic and important concept. In math, when you need a sequence you usually use free monoids. But to process such a sequence, you need to provide an associative operator satisfying unit laws, and this involves a proof obligation. Lists are a normalized form of such sequences, with a lawless elimination scheme, so there is no proof obligation. This makes it easier to implement lists in a safe way, and can make some proofs shorter (though it makes others longer).
  5. Mindset: Once you understand the recursive structure of lists, you tend to see that head-tail structure in other things, especially if you have been exposed to more FP than math. For example, when I see a proof involving natural numbers I often see the base case and the successor case immediately. It seems "natural" to me, but I know that this structure is artificial: the natural numbers also form a free monoid over the one-element set, so you could decompose a proof into cases involving 0, (+) and 1, provided you follow some rules. Similarly with groups, rings and many other algebras. But once you conceive of some data as having list-structure you are more inclined to model it using lists, so lists start to seem more important, even essential. (It is because I know this notion of latent structure is an illusion, incidentally, that I take exception to using the word "latent" in "latent types" and "latent syntax".)

lisp lists == python lists??

I'm new to both languages so I'm curious about the following: lisp/haskell lists are recursive structures, python lists don't seem to be defined recursively. The question is, are they different? In other words, can all the examples and work related to recursive definition of lists in functional text books, papers, etc. be applied to python lists?

It seems that there is a diff

It seems that there is a difference in that Python defaults to an imperative use of lists; the append method does a destructive update of the list, for example. I don't know Python well enough to say whether functional/persistent operators exist as well, except for subscripts (which can only be used for deconstruction).

While python's list is implem

While Python's list is implementation-dependent, CPython uses a dynamically resizing array for its list, similar to std::vector in the STL. Also, I believe slicing is done by copying the subarray, as opposed to just returning the sublist as is done in most FP languages.

Because they're the most natural?

With algebraic data types, lists are by far more natural than arrays. There's probably a categorical argument to be made that they are _the_ simplest structure for storing collections of things.

For example, if you consider that Peano arithmetic is the simplest way to express all of the natural numbers, then lists are the perfect analog of Peano numbers: either a list is empty, or it's an element attached to another list:

Number = 0 | 1 + Number

List (of something) = Empty | Cons something List

0 becomes the empty list, Cons becomes the algebraic '+', and the constructor for the parameterized datatype is just '1'. I really don't know enough category theory to go deeper into that, but I'm pretty sure that such an argument has consequences for things like fold, unfold, etc.
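
Written out in Haskell, just to make the analogy concrete:

data Nat    = Zero | Succ Nat
data List a = Nil  | Cons a (List a)

-- A Nat is a list that carries no elements: forgetting the elements of a list
-- gives its length.
len :: List a -> Nat
len Nil         = Zero
len (Cons _ xs) = Succ (len xs)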

Why arrays can't be like that?

Array (of something) = Empty | Cons something Array

Cons something to an array creates a new array with all the elements of the previous array plus the new element. An array is nothing more than a tuple, actually. The only real difference with using lists is performance: lists do not have to be copied.

Just another example where an implementation detail defines the terminology used in later advanced stages.

I guess you can say that its

I guess you can say that it's the implementation/performance that distinguishes CS from mathematics. If you didn't mind arbitrarily bad performance or space/time complexities, you could just solve every problem with a naive nondeterministic search, and there would be no need for all these algorithms and datatypes.

Array and list interfaces versus implementations

Maybe it would be helpful to distinguish the list and array interfaces from the list and array datatypes.

Here's how I might design it:


class List l where
    cons :: e -> l e -> l e
    nil  :: l e
    head :: l e -> Maybe e
    tail :: l e -> Maybe (l e)

class Array a where
    array  :: (Int -> e) -> (Int,Int) -> a e
    elemAt :: Int -> a e -> Maybe e
    update :: Int -> e -> a e -> Maybe (a e)

The Maybes are there to make the functions total; it might make more sense to throw exceptions in actual code.

Actual lists and arrays can implement both interfaces, although they'll have different performance characteristics. There are other datatypes designed to implement both interfaces reasonably well.

If you wrote, say, foldr using head and tail, it would work for any type that implemented List. Unfortunately, pattern matching is tied to the implementation, rather than the interface.
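
For instance (a sketch, assuming the Prelude's head and tail are hidden so the class methods above are in scope):

-- foldr written against the List interface; any instance gets it for free.
foldrL :: List l => (e -> r -> r) -> r -> l e -> r
foldrL f z xs = case (head xs, tail xs) of
                  (Just x, Just rest) -> f x (foldrL f z rest)
                  _                   -> z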

An array is nothing more than a tuple, actually.

Well, that depends on what you mean by "array" and "tuple". In strongly-typed languages, arrays can support integer indexing because every element has the same type. The elements of a tuple may have different types, so you need separate projection functions.