The fundamental difference between Sets and Lists?

While pursuing an implementation of persistent functional databases, I'm struggling with the difference between Sets and Lists.

My question to Lambda the Ultimate is: what's more fundamental, Set membership or the relative order between List elements?

I tend to prefer Sets over Lists - because Sets are more succinct, and because ordering can be expressed on top of Sets (although somewhat artificially).
Consider the following Set A:

A = {5, 1, 3}

Imposing order on Set A can easily be achieved by comparing and sorting all its elements (lexically), giving:

a = order_in_list(A) = (1, 3, 5)

Conversely, mapping Lists to Sets cannot be achieved this way, because Lists may contain duplicate elements, hence:

b = (1, 2, 2, 3, 5)
B = make_set(b) = {1, 2, 3, 5}
b' = order_in_list(B) = (1, 2, 3, 5)

which gives the unsatisfactory: b' != b
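
As a quick sanity check, the two mappings are easy to sketch in Haskell with Data.Set (the names are the post's, translated to Haskell conventions):

  import qualified Data.Set as Set

  -- make_set forgets order and multiplicity; order_in_list sorts a Set
  -- back into a List. Data.Set already stores elements in ascending order.
  makeSet :: Ord a => [a] -> Set.Set a
  makeSet = Set.fromList

  orderInList :: Ord a => Set.Set a -> [a]
  orderInList = Set.toAscList

  -- The round trip drops the duplicate, so b' /= b:
  --   orderInList (makeSet [1,2,2,3,5])  ==  [1,2,3,5]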

But things get even worse if we try to map the following (unsorted) List to a Set and vice versa:

c = (9, 3, 3, 1, 5, 4)
C = make_set(c) = {9, 3, 1, 5, 4}
c' = order_in_list(C) = (1, 3, 4, 5, 9)

which gives us the even worse result: c' != c (now multiplicity as well as order is lost).

Yet if Sets are more general than Lists, how then can we uniquely map c to C?
One possible mapping would be the following:

C' = make_ordered_set(c) = {(0,9), (1,3), (2,3), (3,1), (4,5), (5,4)}

and vice versa:

c'' = order_in_list(C') = (9, 3, 3, 1, 5, 4)

Giving us c'' = c - the equality we want.
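
In Haskell the position-tagging trick looks roughly like this (a sketch; pairs compare lexicographically, so sorting by position recovers the original order):

  import qualified Data.Set as Set

  -- Tag every element with its position: no two tagged elements collide,
  -- and the original order survives the trip through the Set.
  makeOrderedSet :: Ord a => [a] -> Set.Set (Int, a)
  makeOrderedSet = Set.fromList . zip [0..]

  orderInList :: Ord a => Set.Set (Int, a) -> [a]
  orderInList = map snd . Set.toAscList

  -- orderInList (makeOrderedSet [9,3,3,1,5,4])  ==  [9,3,3,1,5,4]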

But with this approach we are still stuck with the (position, value) List pairs inside the Set. Not too surprisingly, these can easily be gotten rid of with the following generic mapping:

P = ($position, $value)
p = make_ordered_pair(P) = {{~, $position}, {!, $value}}

where '~' and '!' are reserved symbols that are unique and cannot be re-used except for the creation of List pairs.

Now that we have established a perfect mapping from Lists to Sets and vice versa, we can conclude that Sets are more fundamental - not a big surprise if you think of Sets as the basic building blocks of all of mathematics.

But one interesting operation remains: the concatenation of Lists. Let's see if the previous mapping is capable of expressing the concatenation of Sets. First, let us consider the simple case:

A = {4, 1, 3}
B = {6, 2, 1}

One algorithm for the concatenation of A and B would be the following:

C = concat(A, B) = {{4, {6, 2, 1}}, 1, 3} - i.e. nesting B into the greatest element of A.

However, if we concatenate C with itself a million times, this algorithm creates a nested structure a million levels deep - not an ideal solution, especially if we want to map such enormous Sets to Lists in at most O(log n) time per operation.

An alternative algorithm would be to pair the greatest element of A with *all* elements of B. This would give:

C' = concat2(A, B) = {4, 1, 3, (4,6), (4,2), (4,1)}

But again, by applying this scheme we instead end up with very big elements (length > a million).
So the choice is between allowing very deep nesting of Sets, or allowing the growth of enormous elements (under repeated concatenation).
My choice would be the growth of enormous elements - because concatenating elements can be relatively cheap.
But how can this be cheap? How can we ever expect to concatenate humongous elements cheaply?
Consider the concatenation of simple Strings, in Java or in any other industrial-strength language - such Strings are immutable. But the naive Java implementation of immutability causes a lot of inefficiency: concatenating two Strings always involves copying both Strings to produce the new one, so:

A = {This is a text of a million characters....}
B = {A billion characters of text will follow....}
concat(A, B) -> in Java this will cost you O(1,000,000 + 1,000,000,000)

But what if you could concatenate A and B in O(log(min(|A|, |B|))) operations? This can be done in practice, albeit not easily.
One route to fast concatenation is described in the recently discussed paper:
Finger Trees: A Simple General-purpose Data Structure by Ralf Hinze and Ross Paterson.

Another potential solution is:
Ropes: An Alternative to Strings by Hans-J. Boehm, Russ Atkinson and Michael Plass.
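
For a concrete feel of the finger-tree approach: Haskell's Data.Sequence is an implementation of the Hinze-Paterson structure, and its (><) operator concatenates two sequences in O(log(min(n1, n2))):

  import Data.Sequence (Seq, fromList, (><))

  a, b, c :: Seq Char
  a = fromList "a text of a million characters ..."
  b = fromList "a text of a billion characters ..."
  c = a >< b   -- O(log(min(|a|, |b|))), not O(|a| + |b|)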

Regardless of this discussion, both articles are very much worth reading - especially because they embody the true functional style we all love.


Probably me being naive

But wouldn't tuples be a preferable building block for both sets and lists? I'd think the more typical kind of persistent data - the kind of thing that functions tend to get combined with - would be a tuple like:

("abc", 42, 42, [1,2])

Of course, if your lists or sets are not homogeneously typed, perhaps it's not an issue?

Ordered tuples as sets

I'm not sure if I really understand this thread, but you can define ordered pairs (and then general tuples) in terms of sets. From p. 133 of The Haskell Road to Logic, Maths, and Programming:

Ordered pairs behave according to the following rule:
(a, b) = (x, y) ⇒ a = x ∧ b = y
This means that the ordered pair of a and b fixes the objects as well as their order.

Then, a bit later they define ordered pairs in terms of sets:

(a, b) = {{a}, {a,b}}

and leave as an exercise to prove that the above property of ordered pairs also holds for this set encoding. They then extend this to general ordered tuples (over some base set A), using the following definition (pp. 135–6):

  1. A^0 := {∅}
  2. A^(n+1) := A × A^n

Where A × B is the Cartesian product (i.e. {(a,b) | a ∈ A ∧ b ∈ B}). Using this encoding, the representation of (1,2,2) would be {{1}, {1, {{2}, {2, {{2}, {2, ∅}}}}}} (from Wikipedia; I don't currently have the concentration to determine if that is correct).
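
For the curious, the whole encoding fits in a few lines of Haskell; the recursive set-of-sets type is my own scaffolding, but pair and tuple follow the book's definitions:

  import qualified Data.Set as Set

  -- A set whose elements are themselves sets - "sets all the way down".
  newtype S = S (Set.Set S) deriving (Eq, Ord)

  empty :: S                -- ∅, which also serves as the 0-tuple
  empty = S Set.empty

  pair :: S -> S -> S       -- (a,b) = {{a}, {a,b}}
  pair a b = S (Set.fromList [S (Set.singleton a), S (Set.fromList [a, b])])

  -- A^(n+1) := A × A^n: an (n+1)-tuple is a pair of a head and an n-tuple,
  -- so the encoding of (x,y,z) is pair x (pair y (pair z empty)).
  tuple :: [S] -> S
  tuple = foldr pair empty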

So you can certainly define ordered tuples in terms of unordered sets. You can also implement sets in terms of lists, so I would guess that neither is more fundamental than the other.

Not exactly optimal

While you certainly can do this, it isn't exactly ideal. In practice, having "pairing" as a fundamental operation from the outset is both useful and doesn't really lose you much. See for example local set theories/intuitionistic type theory.

Sometimes it is nice to be able to reduce everything purely to set theoretic foundations. Sometimes, however, doing so does more to obscure than elucidate what is actually going on.

Yes, but the question was

Yes, but the question was 'which is more fundamental', not 'which is most useful/efficient/optimal'.

Neither is more fundamental

You can build one from the other and vice versa. It's all a matter of what you want to perceive as the appropriate building blocks.

Yes. That's what I said, too.

Yes. That's what I said, too.

I don't understand the problem

But if I do, then lists are more fundamental, because they can hold any elements without having to have their equality defined.

And because there is an obvious implementation of lists with particular efficiency bounds, while for sets (with different operation costs) it's not that obvious.

Interesting observation

Thanks for your interesting observation about equality. Still, I think that almost all Set operations can be implemented as efficiently as their List counterparts. However, there are some notable differences I have problems with.

Set.
'put(Value) -> Set
'slice(Value) -> (Set, Set)
'contains(Value) -> boolean
'union(Set) -> Set
'diff(Set) -> Set

(Note that slice returns two Sets, one with elements greater and one with elements smaller than the given slice Value).

List.
'put(long, Value) -> List
'slice(long) -> (List, List)
'get(long) -> Value
'concat(List) -> List

Note that the union and diff operators cannot be mapped to Lists easily. Similarly, concatenating Sets is also difficult to visualize.
Yes, mapping Sets to Lists and vice versa is certainly doable. However, this still leaves me with cross-mapping all operations (efficiently).
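
For what it's worth, both interfaces already have efficient realizations in Haskell's standard containers - Data.Set (balanced trees) and Data.Sequence (finger trees) - so each operation set is implementable in O(log n) per operation; it's only the cross-mapping that remains awkward. A sketch:

  import qualified Data.Set as Set
  import qualified Data.Sequence as Seq

  -- The Set interface, via Data.Set:
  put :: Ord a => a -> Set.Set a -> Set.Set a
  put = Set.insert

  slice :: Ord a => a -> Set.Set a -> (Set.Set a, Set.Set a)
  slice = Set.split            -- (smaller, greater), the pivot itself dropped

  contains :: Ord a => a -> Set.Set a -> Bool
  contains = Set.member

  union, diff :: Ord a => Set.Set a -> Set.Set a -> Set.Set a
  union = Set.union
  diff  = Set.difference

  -- The List interface, via Data.Sequence:
  putL :: Int -> a -> Seq.Seq a -> Seq.Seq a
  putL = Seq.insertAt

  sliceL :: Int -> Seq.Seq a -> (Seq.Seq a, Seq.Seq a)
  sliceL = Seq.splitAt

  getL :: Seq.Seq a -> Int -> a
  getL = Seq.index

  concatL :: Seq.Seq a -> Seq.Seq a -> Seq.Seq a
  concatL = (Seq.><)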

Fundamental operations

Fundamental operations on lists are the following (in a Haskellish syntax):

  • cons :: a -> List a -> List a
  • nil :: List a
  • isEmpty :: List a -> Bool
  • first :: List a -> a
  • rest :: List a -> List a

They are all O(1), and everything else can be defined from them.
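
To illustrate, here is a bare-bones list type with exactly those five primitives, plus concatenation defined from them alone (a sketch):

  data List a = Nil | Cons a (List a)

  nil :: List a
  nil = Nil

  cons :: a -> List a -> List a
  cons = Cons

  isEmpty :: List a -> Bool
  isEmpty Nil = True
  isEmpty _   = False

  first :: List a -> a
  first (Cons x _) = x

  rest :: List a -> List a
  rest (Cons _ xs) = xs

  -- Everything else follows from the primitives, e.g. concatenation:
  append :: List a -> List a -> List a
  append xs ys
    | isEmpty xs = ys
    | otherwise  = cons (first xs) (append (rest xs) ys)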

OTOH operations on sets in various settings are more diverse. One problem is with the operations required of the elements. Theoretically equality is sufficient, but it leads to O(n) running times for various operations which could be done in O(log n) time if the elements had more structure. It's thus common to require ordering or hashing.

Another problem is that listing all elements of an unknown set is non-deterministic, so the results of operations are either somewhat ambiguous (iteration over all elements delivers them in an unspecified order) or overconstrained (e.g. a sorted order makes sense only if the representation is based on the ordering; there is no obvious order of elements for sets implemented as hash tables).

Sets might be more convenient as the base in mathematics, but sequences are more important in programming languages.

Mathematical perspective

Before you reinvent the wheel too much: there is a whole branch of (foundational) mathematics called Set Theory, and amongst its first concerns is building up all the data structures needed by mathematicians out of nothing but sets. If you skim the first few chapters of a book on axiomatic set theory for definitions, you should find a lot of these tricks.

There are quite a few ways of representing lists using sets, but generally you start with the idea of a Cartesian pair: (a,b) defined to be the set {{a}, {a,b}}. [A little thought should convince you that {{a}, {a,b}} = {{c}, {c,d}} as sets if and only if a = c and b = d, that is, (a,b) = (c,d) as pairs.]

You can then build lists out of these pairs in the usual cons-pair style, using the empty set as nil, so the list [a,b,c,d] is defined as (a, (b, (c, (d, {})))).

Another typical method for mathematicians is this: First define the natural numbers in terms of sets, like so:

0 := {}
1 := {0} = {{}}
2 := {0,1} = {{},{{}}}
3 := {0,1,2} = {{}, {{}}, {{},{{}}}}
and so on

Then define a function/mapping as a set of Cartesian pairs (a,b) of domain and codomain elements, with the required uniqueness condition: whenever both (a,b) and (a,c) occur in the set, we actually have b = c, and so they are the same pair. Then define a list as a function whose domain is some particular natural number. So [a, b, c] is {(0,a), (1,b), (2,c)}.
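
In programming terms, a "list as a function with domain n" is just a finite map keyed by 0..n-1. A quick Haskell sketch of the correspondence, with Data.Map standing in for the set of pairs plus the uniqueness condition:

  import qualified Data.Map as Map

  -- A list encoded as a function whose domain is {0, ..., n-1}.
  type FnList a = Map.Map Integer a

  encode :: [a] -> FnList a
  encode = Map.fromList . zip [0..]   -- [a,b,c] becomes {(0,a),(1,b),(2,c)}

  decode :: FnList a -> [a]
  decode = Map.elems                  -- values in ascending key order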

In fact, you can go on to define not just the natural numbers but a whole class of infinite ordinals in this way, and this allows for various infinite lists using the same method. A 'stream', for example, would be a function from the first infinite ordinal, omega. This offers more generality than the cons-pair definition.

Now you can also do the opposite and take lists as the fundamental data structure and try to represent everything in terms of them - a finite set could be represented as an ordered list with no repeats, say - but this isn't so convenient for mathematicians who deal with all sorts of weird infinite sets, some of which don't necessarily have an order on them. The concept of a set starts to look more fundamental than a list when you start dealing with infinities, I think.

Of course, thinking about things this way is quite different to thinking about actual physical data structures from a computational point of view - the mathematician has no concern for how fast it is to compute things with these data structures, just in being (theoretically!) able to boil down high-level theorems and concepts into statements about sets. And of course a mathematician doesn't particularly care that it's impossible to store an infinite set all at once on a computer :)

Slight tangent:
What I'd like to see would be a programming language which takes designations like 'list' and 'set' to be higher-level conceptual definitions, and is capable of reasoning about them on that level (using some category theory, perhaps?). It would choose the optimal physical data structure representation to use, or perhaps allow you to specify one, as a separate lower-level optimization decision. This seems more commonplace in the world of databases, but it would be nice if it could extend to more general-purpose languages too. Anyone have any pointers on that?

Data structure compilation

The idea you present in your slight tangent is a very appealing one. I think the relational viewpoint is one possible way to generalise the datastructures.

For instance: Suppose we have a relation of type Int -> Char. If we tend to use the relation by supplying the integer AND we have an additional constraint that the relation uniquely assigns a character to an integer, then we may as well compile the relation to an array. This kind of global constraint can be expressed with universal quantifiers.

Arr: Int -> Char

ForAll x:Int . ForAll y,z:Char . (Arr x y AND Arr x z) => y = z

I think part of the trick would be looking at a substantial number of ways in which the relation is actually used in a code base and trying to incorporate them as constraints on the data structure. Then you can simply have a library of data structures that satisfy various conditions. Not that this would be a simple task, of course!
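
As a toy illustration of the compilation step (compileRelation and its preconditions are inventions for this sketch): once the functionality constraint holds and the domain is a contiguous range, the relation collapses to an array with O(1) lookup:

  import Data.Array (Array, array, (!))

  -- Given the bounds of the domain and the graph of the relation as pairs,
  -- build an array; the constraint guarantees at most one Char per Int.
  compileRelation :: (Int, Int) -> [(Int, Char)] -> Array Int Char
  compileRelation = array

  -- Querying the compiled relation by supplying the integer:
  lookupCompiled :: Array Int Char -> Int -> Char
  lookupCompiled = (!)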

the standard way of doing

the standard way of doing this is, I think, to pretend that sets are more fundamental (paying due respect to set theory), but provide an optimized implementation of lists "under the hood".

ditto for integers, etc.

and eventually you give up and write something practical and simple :o)

Neither is more "fundamental"

Mathematicians tend to favor sets, because they are more convenient for making logical arguments than lists. Computer Scientists tend to favor lists, because they are easier to implement efficiently. Sets have reasonable implementations in terms of lists, but mathematically popular definitions lead to atrocious implementations of lists in terms of sets.

Finite sets are the equivalence classes of finite lists modulo order and repetition. The typical way to do computations on equivalence classes is to use a "mod" function that picks a canonical member of each equivalence class, and use this canonical member to represent the entire equivalence class. Thus to compute in the ring of integers mod 42, one could compute the remainder after every addition or multiplication. In the case of lists and sets, one possible "mod" operation is sorting the list and removing all duplicates.

So, if you are given some type X, and you want to compute in some quotient type X / ~, a mod function should have two properties:

idempotency:       mod(x) = mod(mod(x)) 
well-definedness:  x ~ mod(x)            (for all x in X)
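
For lists modulo order and repetition, the sort-and-remove-duplicates "mod" is a one-liner in Haskell, and both properties are easy to check by hand (a sketch):

  import Data.List (group, sort)

  -- Canonical representative: sort, then keep one element per run of equals.
  canon :: Ord a => [a] -> [a]
  canon = map head . group . sort

  -- idempotency:      canon (canon xs) == canon xs
  -- well-definedness: canon xs differs from xs only in order and repetition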

There is no way to create functions between lists and sets with all the properties you desire. Lists and Sets have two slightly different takes on the world.

Constructing Polymorphic Programs with Quotient Types

The paper here with the above name is an interesting one I ran across quite a while ago that I don't believe has ever been mentioned on LtU. One thing I find interesting about it (in a kind of meta way) is how little research there is on this topic in a programming setting. Incidentally, there are tons of other interesting papers at the above link, old and new.

Comprehending Queries

Torsten Grust's thesis may be of help here. Also, don't try to fit all needs into a single data structure. Instead, provide a way to define arbitrary high-level algebraic (or co-algebraic) data structures, and define additional functors to map these to your persisted data structure.

Lists for computers; either, conceptually.

The mathematical argument between Lists and Sets takes place against the empty background of conceptual space, and each can be defined in terms of the other with the appropriate mappings.

However, on modern computers the background is normally linearly addressed, single-dimension memory, so Lists (or arrays) are obviously closer to the 'fundament' of said architecture.

Being data structure neutral is all well and good, except underneath it all you only have one data structure.

This all assumes that performance is at least a marginal concern.

However on modern computers

However, on modern computers the background is normally linearly addressed, single-dimension memory, so Lists (or arrays) are obviously closer to the 'fundament' of said architecture.

Well, arrays are. Lists and arrays are not the same datatype.

(The term 'list' is sometimes used differently in different contexts though.)

Three graduate students were

Arguments could be made in favor of each datatype being more fundamental. Which is it? I'll leave that up to you to decide.

Three graduate students were arguing over what was the more fundamental data type. The first graduate student, Paul, claimed that the array was the most fundamental datatype.

"Arrays are clearly more fundamental, because memory is arranged as a linear sequence of cells. Therefore arrays arrays are a very close representation of the structure of memory." Paul argued.

Alice, the second graduate student, disagreed. "No, you're just not thinking very well today. The list is obviously the more fundamental datatype, because each memory cell can contain the address of another memory cell. Certainly the list represents the underlying memory much better."

Adam, the third graduate student, was quite surprised at the ignorance of the other two. "The set is the more fundamental datatype. This is obviously true, because memory consists of a set of addressable cells. Certainly the set is the datatype most closely resembling the structure of memory."

If data exists in memory and no one is there to interpret it, does it have structure?