Small is Beautiful: the design of Lua

Roberto Ierusalimschy discusses the ideas behind the design of Lua, especially the pervasive use of tables and coroutines.

Video, Slides [PDF] and Talk Abstract.

Note: The video is in Windows Media format, but it works with QuickTime Player on the Mac if you have Flip4Mac WMV installed. I can open the video using File->Open URL.


Would be great if they post

Would be great if they post it on, say, blip.tv. Or even their own YouTube channel. But I guess they are a bit too concerned with the video escaping into the wild?

Eventually

They usually do eventually show up on iTunes and YouTube. This is just hot off the press, I guess :)

Another good one from the same colloquium is by David Ungar on Self.

Self and Self: Whys and Wherefores [iTunes]

Self and Self: Whys and Wherefores [YouTube]

(September 30, 2009) David Ungar, from IBM Research, discusses how his experience in computer science has led him to the conclusion that even if your ideas succeed, the real legacy is the people.

The Lua-C API...

...is really clever!

Really clever?

It may be clever from an implementation standpoint (I don't know), but what always puzzled me about it is how unwieldy it is, given that Lua is designed as an embeddable language. Manually pushing arguments on a stack when calling Lua from C seems very tedious. (Supposedly, most apps use Lua the other way, though: calling C from Lua.)

The popularity of Lua's FFI

I don't like to use the stack API. It is very lightweight, though. I spent a bit of time trying to think about a system of contracts that bridged the Lua FFI (i.e., so that it could be expressed both in C and Lua). There is not much to the FFI to make this kind of thing easy, except the fact that paradigmatic use of the FFI does not generate any malloc/free events.

What I find most attractive about Lua's FFI is that the internal decision about how to represent closures makes them transparent to the FFI, through upvalues.

Beyond this, I think the good press the FFI gets is in large part due to the fact that Lua offered the C/C++ world a useful, thread-safe, fully reentrant interpreter before anyone else did. So that's more to do with good properties of the whole language implementation than of the FFI in particular.

Simplicity is not a count of data structures.

It seems reasonable to treat sets as just a special case of maps, since their operations are so well aligned. But sequences seem sufficiently different that trying to treat them the same way is less "simple" than having a dedicated sequence concept.

For example, sequence concatenation is a very basic operation. But if you're trying to say that your only data type is the table, you need to define concatenation as a table operation. How would you define the operation such that it makes sense for maps and sets as well?

Even though the underlying implementation is the same, I think people's brains still make the distinction between maps and sequences. Trying to merge the two concepts in the language results in a net increase in complexity. Are there benefits that I'm not taking into account (other than ease of implementation)?

Indexing Lua Tables by Natural Numbers

Natural-number indexes are treated specially by the implementation of Lua's tables, so sequences stored in them take the same space as plain arrays.
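
A small illustration (the array-part/hash-part split is an internal detail of the stock PUC-Rio implementation):

-- Consecutive integer keys from 1 live in the table's internal array part;
-- other keys go to its hash part.
local t = {}
t[1], t[2], t[3] = "a", "b", "c"   -- array part
t.name = "example"                 -- hash part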

I'm talking about the model, not the implementation.

I didn't mean to imply anything about the execution efficiency of their setup. I realize they use techniques to make sure that if you're using something like a sequence, it operates as efficiently as a dedicated sequence implementation.

I was wondering if other people felt that having the sequence concept be just a special case of the table concept did actually reduce complexity. To me it seems that sequences do not fit into the table model cleanly and it would be simpler to model them separately (as opposed to sets, which seem close enough to maps that it seems reasonable to merge the concepts).

see Clojure, for instance

Personally I don't have an opinion either way, but maps in Clojure support sequence ops, if that is of any relevance.

Details?

I looked through some of Clojure's docs but couldn't see what you were referring to.

For example, do maps have a concatenate operation that would, if we had two sequences encoded as maps, do the right thing?

I may have missed your point, but...

"Because collections support the seq function, all of the sequence functions can be used with any collection."

I think some views are ok...

Curse my inability to navigate Clojure's doc website. My guess is that when you view a map as a sequence, map[K,V] becomes either seq[K], seq[V], or seq[tuple[K,V]], all three of which I'm ok with. Java's collection library does something similar.

What I don't like is Lua's setup where seq[V] is map[Nat,V]. 'concat' is now this slightly awkward-looking operation where you have to renumber the keys in the second map and then merge the two. This is not a natural operation for maps. Sure, you can encode sequences with maps, but there seems to be no encoding of sequence concatenation with operations that make sense for maps.
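
For what it's worth, here is what that looks like as a Lua sketch (concat here is a hypothetical helper, not a standard library function):

-- Concatenate two sequences-as-maps: copy the first, then renumber the
-- second table's keys past the end of the first and merge.
local function concat(a, b)
  local result = {}
  for i, v in ipairs(a) do result[i] = v end
  local n = #a
  for i, v in ipairs(b) do result[n + i] = v end
  return result
end

print(table.concat(concat({1, 2}, {3, 4}), ","))  --> 1,2,3,4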

Sometimes, of course, the operations match up. With Lua's encoding of sequences, "seq.length" is simply "map.length". "seq.map" is "map.map_values". I think the more elegant solution is to define certain relationships between sequences and maps (like Clojure and Java do) instead of saying that they're the same thing.

Not quite...

Sometimes, of course, the operations match up. With Lua's encoding of sequences, "seq.length" is simply "map.length".

Not quite. In fact, in Lua the notion of "map.length" becomes another "slightly awkward-looking operation"... From the reference manual:

The length of a table t is defined to be any integer index n such that t[n] is not nil and t[n+1] is nil; moreover, if t[1] is nil, n can be zero. For a regular array, with non-nil values from 1 to a given n, its length is exactly that n, the index of its last value. If the array has "holes" (that is, nil values between other non-nil values), then #t can be any of the indices that directly precedes a nil value (that is, it may consider any such nil value as the end of the array).
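
A quick illustration of why this is awkward (the exact result is implementation-dependent, which is precisely the point):

-- With a "hole", # may return any border of the table.
local t = { 1, 2, nil, 4 }
print(#t)  -- prints 4 or 2, depending on the table's internal layout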

There are many things that I find quite elegant and beautiful about Lua, but I'm still not so sure about this one.

What I don't like is Lua's

What I don't like is Lua's setup where seq[V] is map[Nat,V].

Almost every book I've read with a definition of "sequence" in it has defined it as exactly that.

'concat' is now this slightly awkward-looking operation where you have to renumber the keys in the second map and then merge the two.

That has always been the definition of concat.

Most programmers are not used to thinking of sequences in this way. Nonetheless, the sequences they have been using conform to this definition.

But 'concat' doesn't make sense for maps.

I don't think, as you are suggesting, that I'm uncomfortable with the connection between seq[V] and map[Nat,V]. For example, I'm ok with having seq[V] export a view that is essentially map[Nat,V] with read-only keys.

But 'concat' is being defined over maps, even though it only makes sense for the maps that are actually sequences. I don't think this is a simplification over having a distinct sequence concept.

a weird priority reversal

The problem, for me, is that this builds sequences on top of numbers and sequence operations on top of arithmetic — not just in the language implementation, where that's probably unavoidable, but in the user's semantic model. But sequences are, I think, much more basic to algorithms than numbers are. There are lots of programs you can write without numbers, but very few that you can write without some kind of sequence.

The usual non-number-dependent definition of a sequence is that it is either empty, or contains a first item followed by a sequence. Consider how much more natural the definition of concat is when defined that way: concat [] x = x; concat (x : y) z = x : concat y z. And a sequence defined that way allows any number of operations on it to be written lazily.
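
Transcribed into Lua, under the (non-standard, illustrative) convention that a sequence is either nil or a cons cell:

local function cons(head, tail) return { head = head, tail = tail } end

-- concat [] x = x; concat (x : y) z = x : concat y z
local function concat(xs, ys)
  if xs == nil then return ys end
  return cons(xs.head, concat(xs.tail, ys))
end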

I think Lua's choice is defensible, but its disadvantages are real.

Sequences are annotated numbers

Note that as soon as you have sequences, you have numbers as well, as they are definable as sequences of elements of the unit type.

You can make formal the troubling similarity between the two following recursive datatypes:

data Nat
  = Z
  | S Nat

data List a
  = Z
  | S a (List a)

Granted, there is a huge step between this (List a) type and (Map Nat a).

I was first warned of this correspondence by Okasaki's Purely Functional Data Structures book; it's the idea of "numerical representations", which can be used to derive elaborate structures such as Finger Trees (lists are built on *unary* numbers, which is quite a boring way to represent numbers).

The wider idea of building inductive datatypes by "annotating" other ones is explored by Conor McBride's work on "ornaments".

Yes, but

The very simple interface you've written out is actually adequate and superior for many algorithms on sequences, but hopelessly inadequate for efficient arithmetic. The correspondence is beautiful and interesting but not really relevant to the question of whether it makes sense to import all the complexity of arithmetic into a program when all it really needs is "succ" and "zero?".

Non-Inductive datatypes

If those data definitions are as in Haskell (not well-founded inductive data), then Nat isn't really the type of Natural numbers. In fact, I think most of kragen's objections stem from the fact that Lua indexes sequences with ordinary inductive Naturals.

But let's not throw the baby out with the bathwater. I have found the viewpoint of collections as indexed over an index set to be incredibly fruitful.

Dispensable fundamental notions

I think Kannan's point about concat shows that the notion of sequence supported by Lua's tables falls short of the one provided by Lisp and even C. Hash tables plus a special sequence-length operator don't seem to give the kind of full-blooded notion of sequence that programmers expect. We can put something similar another way: from the algebra-of-programming point of view, hash tables are a deficient notion of collection, because I see no clean way of extending CCCs to contain finite maps (say T[A,B] is the object representing the finite maps from A to B) such that some reasonable algebraic condition on collections holds; e.g., observe that T[A,-] fails to be a monad over the category for most types A.

But does this really matter in the fundamental way that you and Kannan argue? When you say "But sequences are, I think, much more basic to algorithms than numbers are", my question is: why? Lua programmers seem to be quite happy with their unalgebraic notion of collection. We've got an odd notion of basic algorithmic construction if it is one we can manage quite well enough without.

It's about interfaces

Why do I think sequences are much more basic to algorithms than numbers are? Because I've written lots of useful programs with no numbers, but very few with no sequences. Turing machines have a sequence, but no numbers. The λ-calculus doesn't have either one, but when McCarthy (RIP) set about making a practical notation for recursive functions based on it, he added sequences, but not numbers. Cellular automata have sequences, but no numbers. Post machines have sequences, but no numbers. You can, of course, reduce a Turing machine to a counter machine, which has numbers but no sequences, because numbers and sequences are isomorphic; but nobody does, because it's a totally unnatural way to program, even more than the Turing machine itself. "Hello, world" has a sequence in it, but no numbers, unless you give main a return value. Real digital computers don't have numbers in them; like Turing machines, they have sequences of symbols, typically bits, which we then add higher-level logic to interpret as numbers.

C programmers manage quite well enough without polymorphism or strong type-checking, but that doesn't make it beneficial.

I really like Python's version of sequences. It supports lazy sequences, which can be generated on demand by a coroutine, which can (Haskell-like) be written inline as an expression. Any object can implement the "iterable" protocol, and thus be used any place in the language that a sequence is expected. This means I can write things like this:

# Print longest line in foo.c, using only memory enough for two lines
# plus a fixed-size file input buffer:
print max(open('foo.c'), key=len)

# Inline coroutine ("generator expression") ensures only two vectors
# are needed; xs, cs, and the return of Numeric.cos are vectors of 
# 1280 numbers.
cs = sum(Numeric.cos(xs * ii) for ii in harmonics)

# Iteratively compute the first eigenvector of matrix, without using
# the matrix math abilities of Numeric.
def occupancies(matrix, occupancy):
    while True:
        yield occupancy
        occupancy = [sum(a * b for a, b in zip(occupancy, column))
                     for column in zip(*matrix)]

In D, a similar concept (Alexandrescu's "range" object) allows a more natural implementation of all of the STL's algorithms. Some of them are defined as custom adaptors, allowing them to run lazily and be composed within a single expression, like Python's iterators; but it can express a larger range of algorithms, including lazy reversal, constant-space aliased slices (including of linked lists), and so on. You can wander around in UTF-8 strings, including files, in constant time, even though you can't efficiently count their characters.

None of this is on the table if the first thing you want to do when you get a sequence is to count its length, and your later accesses to it are by a numerical index.

Sequences and essentials

I'm really after things that are "core" to the notion of algorithm: I thought that by sequence you were talking about the notion of sequence you get with, say, Lisp and Prolog, and which is essentially given to you by constructs such as Simula's simsets. Clearly Lua's tables do offer a usable notion of sequence (a map from integers to values) in a wider sense, with less need for handwaving than for cellular automata.

It's not the deficiencies of Lua's notion of table that stop the language from being extended to marry its notions of table and collection: this could easily be done by modifying the semantics of for loops to look for, say, an iterator function in a table argument. But that kind of extension would be in tension with Lua's minimalist philosophy, and an obvious deficiency of any small language will be that it lacks many things.
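
In fairness, something close to this is already expressible without changing the language, since the generic for only needs something callable. A sketch, not a proposal:

-- A table can act as its own iterator by carrying a __call metamethod;
-- the generic for calls it each step until it returns nil.
local function iterable(seq)
  local i = 0
  return setmetatable(seq, { __call = function(self)
    i = i + 1
    return self[i]
  end })
end

for v in iterable({ "a", "b", "c" }) do print(v) end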

C programmers manage quite well enough without polymorphism or strong type-checking, but that doesn't make it beneficial.

This missed the point of what I wrote. Type-checking is an algorithmic property of some programming languages, but the fact that a program passed type checking is not part of the algorithm that it expresses.

So I was looking for a sense in which the choice of tables might make Lua a poor candidate for expressing certain algorithms. But I guess I missed your point, which was maybe about which constructs are better for programming languages to have.

None of this is on the table if the first thing you want to do when you get a sequence is to count its length, and your later accesses to it are by a numerical index.

This is well put. Note that Lua does encourage use of iterators.
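
For instance, both sequence and map traversal go through iterator functions rather than explicit index arithmetic:

for i, v in ipairs({ "a", "b", "c" }) do print(i, v) end  -- sequence traversal
for k, v in pairs({ x = 1, y = 2 }) do print(k, v) end    -- map traversal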

I was wondering if other

I was wondering if other people felt that having the sequence concept be just a special case of the table concept did actually reduce complexity.

I feel that way. I have read plenty of math books in which a sequence is defined as a kind of set. It's how I think of them.

Hard not to think more

Hard not to think more abstractly once you have laziness, generators, ... .

The 'Everything is a Table' Model

Lua tables are more than just arrays and maps: they are also structs, environments, objects, modules/packages, sets, etc. The issue I think comes down to just being able to treat conceptually different entities with a uniform interface.
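
A compressed sketch of those roles (all the names here are illustrative):

local point = { x = 1, y = 2 }               -- struct/record
local seen  = { lua = true, self = true }    -- set (member -> true)
local args  = { "a", "b", "c" }              -- sequence
local Account = {}                           -- object/"class" via metatables
Account.__index = Account
function Account.new(balance)
  return setmetatable({ balance = balance }, Account)
end
function Account:deposit(n) self.balance = self.balance + n end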

The advantage of that ability is that you can write generic code that operates on the one uniform interface, and then use it on all the specific cases you might encounter without further thought.

I think having everything be the same construct at some level simplifies the language and standard library definitions and reduces the number of things you have to know. From this angle we can draw an analogy to encodings of data in the lambda calculus; even though booleans, numbers, cons cells, and so on are different, being able to express them all as the same thing (higher order functions) leads to the simplicity that makes LC so useful.
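
To make the analogy concrete, here is the same trick sketched in Lua rather than the lambda calculus, with booleans encoded as higher-order functions:

local tru = function(t, f) return t end   -- Church-style true
local fls = function(t, f) return f end   -- Church-style false
local function cond(p, a, b) return p(a, b) end

print(cond(tru, "yes", "no"))  --> yes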

I'm afraid I can't think of any good concrete examples of how the Lua approach reduces complexity in code, but as a Lua programmer, I know I definitely appreciate it, and take full advantage whenever I can.

The issue I think comes down

The issue I think comes down to just being able to treat conceptually different entities with a uniform interface.

This does seem nice at first blush. It would be nice if your functional language could auto-lift type constructors to functions so you wouldn't need to wrap them in a lambda, and if your record labels were first-class and were also auto-lifted as selector functions, and if object messages were also first-class and message dispatch was just function application, and so on.

If every language abstraction could be auto-converted to a lambda you seemingly gain some power, though perhaps you'd lose something elsewhere. In the above, function abstraction and application are the functional "uniform interface" that would replace Lua's primitive table operations.

Sets vs. Maps and Concatenate

Imagine a set as a mapping from undefined to each value in the set.

A sequence needs ordered keys, so to use this as a sequence would require remapping the keys from undefined to particular values.

Since a set is unordered, any ordering of unique keys will be satisfactory.

Concatenation is an operation upon sequences, so you could implement it as a re-key and merge operation upon maps.

Now you can concatenate multiple sets to produce an arbitrary ordering of the union of members.

You could then rekey that sequence back to all-undefined keys, turning it into a set again, which would be the union.

I don't see any theoretical problems here (except that you need to support one-to-many mappings).
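
In ordinary Lua, where a set is more commonly keyed the other way around (member -> true), the re-key step of that operation might look like this sketch:

-- Assign fresh integer keys to a set's members, in arbitrary order,
-- yielding a sequence that can then be concatenated.
local function set_to_seq(set)
  local seq = {}
  for member in pairs(set) do seq[#seq + 1] = member end
  return seq
end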

How would that help with

How would that help with adding a sequence interface on maps?
You would still need a separate sequence data type, for when you want sequential concatenation ...

Separate Sequence...what

Not a separate type, but a separate function, to "massage" your sequence into a form compatible with general map concatenation.

You should have many such "massage" functions, depending on how you want the sequences put together.

All of these functions should be usable on compatible maps and sets, in a way that does not affect the invariants they care about, but does affect the sequence's invariants, for instance those relating to insertion order.