data locality and data structures

In the past few months I've been studying programming languages, compilers, and so on. I'm surprised that the basic data structures used to implement various language constructs are given so little importance.


Many PL/compiler books mention tuples, records, trees, linked lists... but generally don't describe their performance trade-offs.


Simple choices matter: replacing a basic linked list with one that stores several keys per node can greatly improve performance (and only one paper I've read mentions that most implementations actually use such a technique).


Huge performance gains in languages such as Fortran and APL seem to come from data locality as well (although I can't be sure, since I didn't get any straight answers from the relevant Usenet newsgroups... or I didn't know how to phrase the question correctly).


Now I have collected plenty of books and papers that talk about lexing, parsing, type theory, functional vs. imperative, etc., but nothing that goes into detail about having the right set of fundamental data structures. For example: isn't it better to use a 'struct' to keep the attributes of an object together in memory (since they are likely to be used together), whereas the attributes of a large relation should be more like a 'linked' data structure to allow quick addition/deletion of attributes (basically adding relational algebra as a first-class component of my mini-language)? And if I have a list of a million records (tuples with named attributes), isn't it much better to store the name/position correspondence once, as part of the list definition, rather than a million times with each instance of the record?


Data structures are, obviously, a separate subject, but surely they deserve more attention in compiler/PL books. Anyway, what do the experts on this forum think? Did I miss something obvious?


It depends...

The quick answer is that there is no one right answer. Each choice leads to a variety of tradeoffs among space, speed, and convenience for the compiler writer. Does the topic need more coverage? Probably. But, from a language design and implementation point of view, does it warrant that amount of attention? Probably not.

In general, the interesting part of language design is the extent to which the run-time data representation leaks into the user's experience of writing code. All else is simply performance hackery. And you have data structure books to help you with the decision of the run-time data organization.

As for the actual implementation: the generation and manipulation of your data storage elements should probably live in a separate module where they can be easily changed if you decide to use a different storage layout or add a variant. From this point of view, if the choice you made isn't giving you enough performance, simply rewrite the relatively few lines of code needed to re-implement that part.

I would have agreed with you

I would have agreed with you before I actually started learning about implementation. Textbooks such as 'Modern Compiler Design' by Grune et al. and 'Modern Compiler Implementation in C' by Appel spend some space discussing how to implement activation records, the best ways to parse files, etc. These matters could also be abstracted away but are important enough to warrant several chapters.


Specifically regarding data structures, the internal structure of a linked-list node may be an ignorable detail, but something like the contiguous memory allocation of an array must play a central role in any language... no? Further, such organization of data becomes even more important in languages that try to ease distributed programming.

Implementation vs Specification

These matters could also be abstracted away but are important enough to warrant several chapters.

Well, remember those are books about writing a compiler not designing a language. Implementation details matter when you are implementing, but they shouldn't matter when you are specifying.

Trying to constrain the implementation when you are specifying overly constrains the solution space and makes it harder to make good trade-off decisions at implementation time.

Further, such organization of data becomes even more important in languages that try to ease distributed programming.

Oz, for example, is a language with unusually good built-in support for distributed programming, and it manages to specify semantics at a reasonably high level. In fact, I would argue that in distributed programming, assuming too much low-level detail in your specification is especially bad, since you shouldn't have to constrain the private workings of a remote machine, or else you ruin the scalability and independence of your distributed system.

Assumed knowledge

It makes sense to me that a compiler book would only give descriptions of compiler-specific data structures and techniques (activation records, methods of parsing) but leave other topics untouched. I've been reading a book about databases. The book only covers database-specific data structures (logs, pages, buffers, etc.) and techniques (locking mechanisms for transactions, etc.).

I would have agreed with you before I actually started learning about implementation...

So the problem you're faced with is that you need a lot more information than is contained in the book? The same is probably true for any book on complex software - databases, operating systems, programming languages, etc. For example, to implement a database I need to be familiar with locking primitives, I/O and networking protocols, general purpose data structures, memory management, distributed programming techniques and a lot of other stuff! I just have to fill in holes in my knowledge by consulting other sources - I guess this is the position you're in too.

Data structures are, obviously, a separate subject, but surely they deserve more attention in compiler/PL books. Anyway, what do the experts on this forum think? Did I miss something obvious?

I'm not an "expert", but I think data structures should be assumed knowledge. Unless you're talking about something particular to compiler/PL programming?

Garbage collection helps

At least with things like linked-list or tree-based structures. See Cache Performance of Fast-Allocating Programs as evidence.

fundamental structures

having the right set of fundamental data structures. For example: isn't it better to use a 'struct' to keep the attributes of an object together in memory (since they are likely to be used together), whereas the attributes of a large relation should be more like a 'linked' data structure

Yes, this implies that a language should have at least these two building blocks with which to construct efficient data structures for lists, tables, relations, etc.

  • structs, basic tuples or objects, for locality
  • pointers / references / links
Lisp seems to get by without the former, but at least the car and cdr are local to one another.



Although perfectly relevant to interpreting programs, the performance of more complex data structures has been addressed more often from the hardware side, with its hierarchy of memory from registers to disk, than from the language-implementation side. It's somewhat analogous to the way arithmetic was originally a big concern of language implementers. Now the implementer essentially passes along what the chip provides, or faces a substantial performance hit.

Lisp has structs!

And objects! And multidimensional arrays! It isn't just a list-only language.

update

On another reading, I found that 'Modern Compiler Design' does include tips such as efficiently implementing an extensible array by growing it by a constant factor of its size whenever it fills.
More relevant to LtU, on Robert Harper's page there is a presentation, "Types and Programming Language", which mentions substructural types. Apparently these types define data layout! Now I have to go find material related to such types.

Broken record

At the risk of being one of the tedious cheerleaders for Pierce's work ;-), the book he recently edited, ATTAPL, has a chapter on substructural types.