Fundamental Flaws in Current Programming Language Type Systems

I'll start with the following statement (which different parties, from different perspectives, may or may not consider valid):

Static type systems and dynamic type systems are two sides of a single coin.

In the past I have raised a number of questions in this forum regarding types and programming languages that have, in the main, only been answered by those with strongly held viewpoints. I have spent much time over the last few years considering the viewpoints thus exhibited.

The viewpoints expressed have been one of the following:

Static types are the best - various reasoning
Dynamic types are the best - various reasoning
Soft typing systems - an amalgamation of the best of both worlds - various reasoning.

What has been unsatisfactory about the answers and the discussions that have ensued is that the subject of real world systems, and the real world situations they are deployed in, has not been closely examined. By this I mean that hardware related errors, system software related errors and user caused errors (including deliberate attacks) have not been brought into the discussion in relation to how the various kinds of type system can help under these conditions.

Putting aside anyone's particular belief as to which kind of type system is best for programming languages, my question is:

In what areas do each of the above type systems fail to provide the programmer with the tools (language constructs, abstraction facilities, etc.) to actually deal with real world troubles, problems and conditions?

PLT should be about improving what we can actually do with the computing machinery that we use.

With the wide range of people in this forum, what areas of research are being looked at, or not looked at, in this regard?

An example problem is an application that works correctly on a single processor system but exhibits somewhat random behaviour on a dual (or more) processor system due to the seemingly random interactions/responses of a third party library or component.

Please, I am not interested in ideological wars over this.


Type Dynamic

Static types are the best - various reasoning
Dynamic types are the best - various reasoning
Soft typing systems - an amalgamation of the best of both worlds - various reasoning

Ignoring various complaints one might have about this classification (in particular, the class of "static type systems" is much too diverse to throw it all in one pile, esp. considering dependent types and the like, which seem relevant to your question), it is missing one important alternative:

Static Typing plus type Dynamic - amalgamation of the best of both worlds - various reasoning

This is very different from "soft typing" and makes entirely different trade-offs. In a nutshell, it opens up static typing to dynamic typing without sacrificing type soundness, while soft typing opens up dynamic typing to static checking without sacrificing untyped programming.
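To make the trade-off concrete, here is a minimal sketch in Haskell, whose Data.Dynamic library is one well-known realization of type Dynamic (the bag and firstInt names are purely illustrative):

 import Data.Dynamic (Dynamic, toDyn, fromDynamic)

 -- A statically typed list whose elements carry their types at runtime.
 bag :: [Dynamic]
 bag = [toDyn (42 :: Int), toDyn "hello", toDyn (3.14 :: Double)]

 -- Projecting back out of Dynamic is checked at runtime: a type
 -- mismatch yields Nothing rather than undermining soundness.
 firstInt :: [Dynamic] -> Maybe Int
 firstInt ds = case [i | Just i <- map fromDynamic ds] of
   (i : _) -> Just i
   []      -> Nothing

The static type discipline stays intact everywhere else in the program; only the explicit injections (toDyn) and checked projections (fromDynamic) cross the boundary.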

Excepting typeful

Excepting typeful programming (including type-based metaprogramming), types don't provide any 'tools' to the programmer - they only provide safeties on other language constructs, abstraction facilities, etc. provided by the programming language.

Most type-systems focus on operation safety (i.e. guaranteeing that no 'undefined' behavior occurs in the system), but type-systems can be aimed at almost any invariant (guaranteeing that only one reference to an object exists (linear types), guaranteeing some degree of security in communication, guaranteeing hard realtime constraints in DSP, etc.). I think that, if you look, you'll find plenty of discussion on how "system software related errors and user caused errors (including deliberate attacks) have," in fact, "been brought into the discussion in relation to how the various kinds of type system can help in these conditions."

Admittedly, hardware errors don't really come into it. There isn't much a type system can do to prevent or reconcile hardware errors at the myopic scope of a single program. When distributed programming becomes more common, we may see variants on 'linear' and sub-structural typing aimed to guarantee certain degrees of redundancy and survivability, but most programming languages themselves don't yet provide the 'tools' (abstraction facilities, language constructs, etc.) to even start considering this. At minimum, we'd need the ability to indicate (in abstract) that a critical behavior is prepared to occur 'on other hardware in case of hardware failure', and that requirement is independent of the whole dynamic/static debate.

Anyhow, this topic has a trollish smell to it. It might just be paranoia from long experience with people starting up typing discussions without any substantive claims. Is there a reason you are asking the question?

There isn't much a type

There isn't much a type system can do to prevent or reconcile hardware errors at the myopic scope of a single program.

You might take a look at Project Zap.

Project Zap

Interesting. I'll definitely take a look at the papers there.

Typing for Concurrency

An example problem is an application that works correctly on a single processor system but exhibits somewhat random behaviour on a dual (or more) processor system due to the seemingly random interactions/responses of a third party library or component.

I'm not sure that the rest of your question is well-framed, but this part struck me as potentially interesting. In my experience, concurrency heisenbugs arise from memory violations (which we clearly know how to get rid of), compilation errors (many compilers mis-handle memory barriers, but types aren't the answer) and concurrency guard failures (something is done without the appropriate lock/semaphore/whatever being taken).

The last problem is something that might conceivably be helped by dependent types and/or region types, and I think it's a potentially useful thing to pursue. That said, I think it has almost nothing to do with the static/dynamic continuum. Your example question presupposes that failure at runtime is bad, and that says that we're looking for a compile-time (i.e. static) check here. The problems I see are (a) working out all the technical details, which some people are already trying to do, and (b) reducing the result to something that ordinary humans can use in practice, which is a very hard problem that is under-acknowledged in research funding systems around the world.

Pluggable and Optional systems missing.

I've been tracking some of Mr. Rennie's earlier threads and have found plenty of threads with comments on exactly this subject that cover many things that static typechecking may handle that dynamic typing cannot. One such comment was dedicated to TyPiCal and its ability to locate race conditions and identify deadlock hazards.

I believe (but cannot prove) that dynamic type checking is pretty much limited in scope to checking 'myopic' conditions on code such as 'is this next operation well defined?'. The only undesirable property that can be prevented in dynamic typing is the execution of undefined behavior. Anything that requires a check broader than a single operation (such as checking a whole expression, a whole function, or even whole program flows) really is the domain of static typing.

So, if the question becomes "In what areas do each of the above type systems fail to provide the programmer with [assurances]?", the answer for dynamic typing is 'most areas'. You'll not find dynamic type systems that can assure properties such as real-time constraints, memory footprint guarantees, absence of deadlocks and race conditions, containment of information, capability and containment of authority, etc.

Of course, many of these conditions might be provable if one is able to annotate the code for external theorem provers (e.g. 'pluggable' static type systems). One might even be able to annotate assumptions for the optimizer, or allow the 'pluggable' type-systems to also operate as code-transforms that can perform or suggest optimizations statically (which would be helped by the annotations being part of the AST and the pluggable type systems being part of the standard program pipeline). So it seems that 'pluggable post-processing pipelines with language-supported annotations' is missing from Bruce's organization of type system classifications.

There are some advantages of including some form of 'type system' directly in the language. One is to check for certain invariants across 'module' boundaries (where a 'module' may be introduced in a pre-compiled form by a third party). Another is in support of typeful programming, where the type system can be reflected back into the runtime behavior or syntax of the system in support of DSLs and new language features (as sometimes happens with Ruby meta-object protocol and C++ templates at opposite ends of the dynamic-static scale).

An interesting question, I think, is what advantages embedding a particular static type system directly into the language can offer, assuming the language already has constructs for plugging in post-processors (e.g. in a manner as convenient as importing modules) that can readily report errors, associate them with pre-macro-expansion locations in the source, and perform or suggest optimizations or properties to post-processors further along in the static pipeline.

One thing that some type systems provide

With type classes you can define a generic + with identity element 0. There is one + for numbers, with the identity 0. There is another + for vectors, with the zero vector (0,0,...). This means that you can define a generic sum for summing a list. If you sum an empty list of numbers you get 0, but if you sum an empty list of vectors you get (0,0,...). In a dynamically typed language you don't know whether an empty list is a number list or a vector list, so you can't define a generic sum. A bigger example is QuickCheck: it is able to generate random values in the domain of a function by using the type of the function.
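A minimal sketch of this in Haskell follows; the class name Summable and the Vec2 type are illustrative (the standard Monoid class plays the same role):

 class Summable a where
   zero :: a
   add  :: a -> a -> a

 instance Summable Int where
   zero = 0
   add  = (+)

 newtype Vec2 = Vec2 (Double, Double) deriving Show

 instance Summable Vec2 where
   zero = Vec2 (0, 0)
   add (Vec2 (a, b)) (Vec2 (c, d)) = Vec2 (a + c, b + d)

 -- The element type alone selects the right zero, so even an empty
 -- list sums correctly: gsum ([] :: [Int]) is 0, while
 -- gsum ([] :: [Vec2]) is Vec2 (0.0, 0.0).
 gsum :: Summable a => [a] -> a
 gsum = foldr add zero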

Are there ways to define a generic sum in a dynamically typed language?

Contextually inferred

Contextually inferred parameters have advantages regardless of whether or not those parameters are typed... though I agree that if you're already performing the analysis necessary to infer parameters it doesn't make much sense to skip out on at least some static typing and safety verification.

Generic sum can be done in a dynamically typed language. Either the sum function has a parameter explicitly for the seed used in the fold, or a common 'identity' value is included with the default dynamic type dispatch rules such that 'object o OP identity i = o' for all operations (which could easily be done by overriding 'DOES NOT UNDERSTAND' and such).
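For concreteness, the second approach can be modeled in a typed setting by reifying a dynamic language's runtime values as one universal type (DynVal and its constructor names are illustrative):

 data DynVal = DNum Double | DVec [Double] | DIdentity deriving Show

 -- '+' dispatches on runtime tags; the identity is absorbed by any
 -- operand, which is what overriding 'DOES NOT UNDERSTAND' achieves.
 dynAdd :: DynVal -> DynVal -> DynVal
 dynAdd DIdentity y       = y
 dynAdd x DIdentity       = x
 dynAdd (DNum a) (DNum b) = DNum (a + b)
 dynAdd (DVec a) (DVec b) = DVec (zipWith (+) a b)
 dynAdd _ _               = error "runtime type error"

 -- A generic sum needing no type information: an empty list folds to
 -- DIdentity, which any later addition absorbs.
 dynSum :: [DynVal] -> DynVal
 dynSum = foldr dynAdd DIdentity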

The solutions you mention,

The solutions you mention, adding a parameter to sum or extending the + operation for a common identity value, aren't good. Adding a parameter to sum doesn't really solve the problem and breaks abstraction barriers. Other functions that are calling sum (for example a function that averages a list) will have to take an extra parameter too. Extending the + operation with a new common identity value for vectors and numbers isn't any good either in my opinion. First, you shouldn't have to change low level operations to be able to define sum (breaks abstraction barriers) and second you now have two zero values per type, zero the number and zero the common identity.

The best solution I can think of is annotating lists (at run time) with the type of their elements. List<Integer> is the compile time list type in a statically typed language. In a dynamically typed language you could have a constructor EmptyList(Integer) that returns an empty list of integers. This solves the problem because you can now access the identity value with Integer.AdditiveIdentity. This "solution" throws away a lot of the benefits of dynamic typing.

So is there a real solution for this problem?

Value Judgements

Other functions that are calling sum (for example a function that averages a list) will have to take an extra parameter too.

That isn't really the case. Two points: (1) There are ways for languages to support implicitly threaded parameters if that is what one needs. Metadata can be attached either to constructs or to computations. (2) In a manifestly statically typed language, there is no advantage whatsoever provided by the static typing's implicit 'sum', since other functions calling 'sum' will already be forced to 'take an extra parameter' for the data types.

Extending the + operation with a new common identity value for vectors and numbers isn't any good either in my opinion.

You should really stick with technical judgements rather than that sort of value judgement.

First, you shouldn't have to change low level operations to be able to define sum (breaks abstraction barriers)

I agree, but I don't see how the example I proposed requires "changing low level operations", so I feel the argument is moot.

and second you now have two zero values per type, zero the number and zero the common identity.

A zero value is only a summation identity. The 'identity' value would not be a 'zero'; it's simply an 'identity value' - a unique object or value with its own properties.

The best solution I can think of is annotating lists (at run time) with the type of their elements.

That is an option, of course. Similarly, you could have each list be able to overload its own 'sum' and 'product' operations. Those two solutions are about equivalent.

(1) There are ways for

(1) There are ways for languages to support implicitly threaded parameters if that is what one needs. Metadata can be attached either to constructs or to computations.

Can you give an example?

(2) In a manifestly statically typed language, there is no advantage whatsoever provided by the static typing's implicit 'sum', since other functions calling 'sum' will already be forced to 'take an extra parameter' for the data types.

Type classes hide this extra parameter.

You should really stick with technical judgements rather than that sort of value judgement.

I disagree. Programming language design is based on taste, not just facts. And the explanation of my opinion is in the next sentence.

I agree, but I don't see how the example I proposed requires "changing low level operations", so I feel the argument is moot.

If you add an identity value you have to add a rule to + so that x + Identity = x. So you have to change + to be able to write sum. This also means that you have two equivalent values for addition: 0 and Identity. You probably want x * Identity = 0, otherwise you're breaking the algebraic properties of + and *.

Example ways to attach

Example ways to attach meta-data to computations: environment variables, thread local storage, language-supported continuation contexts, hidden implicit parameters as utilized in Haskell monads.

Example ways to attach meta-data to constructs: extra properties attached to an object, adding extra maps or tables from construct reference to associated meta-data, or adding a 'metadata' object to each object.

There are more, of course, but that's what I can come up with in a few minutes.
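As one concrete sketch of implicit threading, GHC's ImplicitParams extension lets a fold seed travel through callers without appearing in their argument lists (sumWith and average are illustrative names):

 {-# LANGUAGE ImplicitParams #-}

 -- The seed is threaded implicitly; callers of sumWith need not
 -- mention it, yet it is still statically tracked in the type.
 sumWith :: (?seed :: a) => (a -> a -> a) -> [a] -> a
 sumWith op = foldr op ?seed

 average :: (?seed :: Double) => [Double] -> Double
 average xs = sumWith (+) xs / fromIntegral (length xs)

 main :: IO ()
 main = let ?seed = 0 in print (average [1, 2, 3, 4])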

Type classes hide this extra parameter.

It is type inference that is hiding the extra parameter. Type classes have nothing to do with it.

Type classes are essentially open functions from type->function. In a manifestly statically typed language, you could expect to explicitly include the type in the type class. Consider that C++ templates can effectively serve as type-classes due to support for template specialization (which allows them to be 'open' functions):


 #include <list>
 // Open mapping from a type to its operations, via specialization.
 template<typename T> struct sum_class;
 template<> struct sum_class<int> {
   static int zero() { return 0; }
   static int add(int a, int b) { return a + b; }
 };
 template<typename T>
 T sum(const std::list<T>& lT) {
   T r = sum_class<T>::zero();
   for(typename std::list<T>::const_iterator it = lT.begin(); lT.end() != it; ++it)
     r = sum_class<T>::add(r, *it);
   return r;
 }

Type classes could easily be designed for manifestly typed systems, so long as you can produce open functions from type->function.

Since I was referring explicitly to 'manifestly' statically typed systems, I must have been excluding type inference. Inference of arguments is useful, as I mentioned above, but it should be considered as a separate feature from static typing.

I disagree. Programming language design is based on taste, not just facts.

Regardless, arguments on LtU shouldn't be based on taste.

If you add an identity value you have to add a rule to + so that x + Identity = x.

You have to add a rule to the language, yes. In a dynamic language, this isn't such a big deal... no different from adding the rule to add vectors together to produce vectors.

This also means that you have two equivalent values for addition: 0 and Identity.

Precisely where does this cause a problem?

You probably want x * Identity = 0, otherwise you're breaking the algebraic properties of + and *.

I'll agree that if you're aiming for your language to remain utterly true to algebraic properties you should probably favor static programming languages... and either functional languages or logic languages.

But keep in mind that programming languages, while oft inspired by math, are not beholden to it. What is x*null, or Integer x * Duck d? How does a 'null pointer exception' or 'runtime type error' fit into the algebraic properties of '*'? Why does 32767+1 equal -32768 for some implementations of C and not for others?

Why should I care that (x*Identity) = x at the same time as (x+Identity) = x? In Smalltalk, the model isn't math; it is 'x receives message '* Identity' and replies with x'.

Does 0==(0,0,...)? Does 0 ==

Does 0==(0,0,...)? Does 0 == Identity? Does (0,0,...) == Identity? Does 0==(0,0,...)? If no then equality isn't transitive and if yes, then:

a = new Hashtable
a.insert(key: 0, value: "foo")
a.insert(key: (0,0,...), value: "bar")
a.get(key: 0) => "bar"

If you want this behavior then you have to change the hash function for 0 and (0,0,...) to produce the same hash.

Assume we want 0 != (0,0,...) and we want equality to be transitive, so 0 != Identity. What does Identity + 0 produce? If Identity + 0 == Identity then Identity + a == a no longer holds. If Identity + 0 == 0 then 0 + a == a no longer holds. As the authors of SICP would say: "This is a serious complaint.".

Comparing x*null with x+0 is comparing apples to oranges. x*null doesn't have meaning in math, x+0 does. Math in programming should behave like math to be able to reason about programs.

Regardless, arguments on LtU shouldn't be based on taste.

(that's an opinion btw) I think they should. What people find easy/simple/beautiful/logical is critical. Programming languages are designed for humans to use.

To your first four

To your first four questions: no, no, no, no.

If no then equality isn't transitive

Since (0 != Identity) and (Identity != (0,0,...)), transitivity is not violated when (0 != (0,0,...)).

What does Identity + 0 produce?

0.

If Identity + 0 == 0 then 0 + a == a no longer holds.

I'm not seeing the logic for that claim.

In any case, in mathematics '0 + a == a' only needs to hold when 'a' is a number.

Comparing x*null with x+0 is comparing apples to oranges. x*null doesn't have meaning in math, x+0 does

Depending on the domain (e.g. x could be a tuple or graph), 'x+0' might not have any meaning in math. And, in programming, 'x+0' can also throw a 'null pointer exception'.

Math in programming should behave like math to be able to reason about programs.

We reason about programs using the tools (types, semantics, etc.) to reason about programs.

You seem to want the ability to reason about math language in programs using the tools to reason about math while ignoring that the context is programming and not math. I don't consider it an undesirable goal, and it might be well applied to a DSL, but there will forever be borderline cases where math is not programming, where 'a*0' vs. '0*a' can be the difference between a program that never terminates and a program that terminates immediately, etc.

Regardless, arguments on LtU shouldn't be based on taste.

(that's an opinion btw) I think they should. What people find easy/simple/beautiful/logical is critical. Programming languages are designed for humans to use.

Besides being practical, it is policy. An argument or claim based on taste is not substantive. See Item 4: Provide context and substantiate claims, and 4b: avoid ungrounded discussion.

When you've developed statistics about people's tastes and what they find 'beautiful' or 'easy' (as opposed to merely 'familiar') then feel free to link them into an argument.

I'm not seeing the logic for

I'm not seeing the logic for that claim.

Here it is:

Identity + 0 = 0
a + 0 = a
so take a = Identity: Identity + 0 = Identity
so Identity = 0

re: statistics. I doubt that all your claims are backed by solid research. It's simply not a practical way to design languages. Maybe I should have used "common sense" instead of "opinion".

Commuting Common Sense

For a person who loves math so much, you do play fast and loose with the logic. I notice that the original '0 + a = a' has become 'a + 0 = a'.

In any case, I'll mention that, like its commutative cousin, 'a + 0 = a' is only valid in math when 'a' belongs to certain classes of values. It isn't a universal mathematical truth.

re: statistics. I doubt that all your claims are backed by solid research.

A claim doesn't need to be backed by solid research to be substantive, but it needs a lot more grounding than 'cuz that's the way I like it' or 'cuz I said so'. Extraordinary claims require extraordinary evidence. Less extraordinary claims still require evidence, or at least the ability to point to some evidence when asked.

When you start making claims about how a particular language design is 'beautiful' or 'easy' or 'simple' or about how 'that's the way it should be', you'll invariably run into idealist bastards like myself who demand evidence that the way you think things should be has any reason to influence the way they think things should be.

If you believe something 'beautiful' but can't pony up the evidence, you're just asking for a juvenile 'is so!' 'is not!' 'is so!' 'is not!' type of argument with no substance whatsoever. Since we're all professionals here on LtU, we'd be much better off avoiding such arguments by avoiding (as much as we can) claims that aren't defensible.

Maybe I should have used "common sense" instead of "opinion".

I'm of the opinion that "common sense" is a legendary base of knowledge that probably doesn't exist and that, if it exists, almost certainly says very little on the subject of programming language design.

My thoughts and comments for the responders

My thanks to Andreas Rossberg, David Barbour, Noam, shap and Jules Jacobs for your responses. I have spent some time thinking about the various responses given.

Firstly, David, this is not a troll - the reason I am asking the question is to find out the limitations of current type models (thank you Andreas and David for your additional suggestions of groupings) in relation to writing applications in the "real world". All of the discussions/papers that I have seen/read have been focused on what each model allows you to do and achieve, but nothing on the areas in which they fail to provide a benefit to the programmer, other than

on exactly this subject that cover many things that static type checking may handle that dynamic typing cannot.

What I am interested in is the limitations of all or any system in current use and if there is any active research that is trying to extend these boundaries for "real world" application development.

For example, you provide your view that type systems are about assurances and only about assurances, hence an obvious limitation. You also provide your view of the limitations of dynamic type checking. However, every static type checking system is a dynamic system, analysing and generating new types as the program text is read and generating the appropriate code for execution. So in this example, it is about more than assurances - assurances are just one of the available outcomes.

Thank you for pointing out TyPiCal. An obvious limitation of TyPiCal is the inability to determine deadlock conditions in systems that have dynamic numbers of processes over an extended period of time, as the analysis is still based on static analysis of the program text, which is not necessarily the situation in human-computer or computer-computer interacting systems.

David you make the following statement

types don't provide any 'tools' to the programmer - they only provide safeties on other language constructs, abstraction facilities, etc. provided by the programming language
and in doing so, from my perspective, you seem to be advocating that it is not the business of programmers using the language in question to be able to use this as a part of their applications or application development processes. Is this intended?

One of the interesting facets of static type systems is that types are meta-information related to the language and not values that can be manipulated in the language. I have read a number of papers that discuss why this should be so - all of them have the same underlying premise that a type containing all types must also contain itself, which leads to a logical inconsistency (fair enough).

One of the underlying assumptions of the above is that the map is the territory. I have observed this phenomenon in very many different areas of study (from PLT to quantum mechanics to astrophysics). An opposing view would say that the map is not the territory. An example of this would be that the values within the domain type represent the various other types of values and are not those types.

Would this lead to a different model for type theory? I don't know, and that is part of the question relating to the limitations of current type theory and type checking systems. I do not have the expertise to look at this, it is not my area, and there are many on this forum alone who are certainly more qualified to study this.

Shap, thank you for your comments about the difficulty and the underfunding of research. I am not presupposing that failure at runtime is bad; what would be bad is not being able to handle the failure. From my perspective, the question is what limitations in current type systems (static, dynamic, etc.) prevent their use in dealing with heisenbugs (love the term, Shap - thank you).

Noam, thank you for pointing out Zap; more reading to do.

Over the years, my questions in this forum have been motivated by the view that PLs and PLT are tools in the toolbox with which I can solve problems presented to me in a manner that is appropriate for the circumstances at the time. Static typing is one tool, and in languages that have it, I can solve particular problems. Dynamic typing is another tool, and in languages that use it, I can solve other problems. But as a developer I find there are limitations to these tools.

I find that those who advocate one species of typing over another just don't seem to see that each of these tools has its appropriate place in the toolbox. I can use a hammer to drive in a screw but it is a lot of hard work; I can use a screwdriver to drive in a nail and this is also a lot of hard work.

The entire field of PLT is still very young and there is still much to look at in how the tools provided by PLT can be put to general use in program development.

So I still go back to my original question:

In what areas do each of the above type systems fail to provide the programmer with the tools (language constructs, abstraction facilities, etc.) to actually deal with real world troubles, problems and conditions?

Again, my thanks to those who have responded, I greatly appreciate your time.

"Real world"

In this comment you use the phrase "real world" twice. The real world is a very, very big place. Even if we just focus on the real programming world, we have everything from microcontrollers to ultra-high performance computing clusters; we have code for one-off shell chores and code for life-critical medical equipment control; we have user-written game scripts and we have financial decision support and trading systems; game simulations and airplane navigation systems.

The real world is staggeringly vast and the economic cost/benefit analysis of any particular formal proof strategy is going to be different for every corner of it. Thus there is not, nor can there be, a simple answer to any question based on the "real world."

Now, which corner of the real world do you want to talk about?

Real World

James, I agree with your comments in relation to the scope. However, after 30 years (from undergraduate days till now), I find that at its core programming is about solving the problem at hand (irrespective of how large or small), and my question is related to the limits of what can be done and what can't be done.

An example unrelated to programming, is that my brother and sister-in-law drive 50 tonne mining trucks and my son-in-law drives a 1 tonne ute. Both are vehicles that carry materials from one place to another using the same fundamental techniques/technology, but a limitation of both is that I can't use either to carry goods from here to Antarctica.

Putting aside any cost benefit analysis, what are the limits currently? Our knowledge can be, and has been, increased by asking what the limits of any particular methodology/technology are, or why we do things a particular way. My question may not be phrased in the most appropriate way.

To go back to my toolbox analogy from my last response, for a craftsman, the tools in the toolbox are continually being added to because the existing tools are not always the right tool for the job, even if he has to make them himself. For evidence of this, look at any specialist woodworking magazine and the tools that are shared there.

So, what are the current limitations and what areas and ideas can be suggested that are not being looked at so as to be able to expand the boundaries?

Not just about assurances

you provide your view that type systems are about assurances and only about assurances

I thought I was pretty clear in describing that type systems provide two facilities: assurances and typeful programming. Indeed, my first sentence, the latter half of which you quote in your reply, started with an important condition: "Excepting typeful programming (including type-based metaprogramming) [...]"

Assurances provide protections (static or dynamic) against various forms of program faults.

Typeful programming reflects types back into the syntax (as more than annotations... e.g. for automatic conversions and statically determined polymorphic dispatch) or into the runtime behavior of the system (including virtual methods or multi-methods, runtime polymorphism, meta-object protocols, etc.). Thus these features may also be at different ends of the static/dynamic scale. With static typeful programming this meta-data might not become part of the runtime, but it still can be manipulated (often to powerful effect) by a skilled programmer.

However, I will note that the majority of 'typeful programming' benefits can be achieved (often more flexibly) by other mechanisms than use of types (e.g. an extensible syntax on one side, whole-program-transforms on the other).

And assurances can also be achieved by mechanisms other than forcing the type-system directly into the language (i.e. one could use pluggable type-systems to perform arbitrary analysis under varying assumptions). The main area I see a problem with this is in partial or separate compilation, in which the 'assumptions' one makes can become easily violated.

However, every static type checking system is a dynamic system analysing and generating new types as the program text is read and generates the appropriate code for execution.

Poking a slight hole in that characterization: "as the program text is read" is generally false. Static type analysis, or at least the majority of it, can only occur "after" the program text has been read. And that difference can be critical.

But I'll agree that you can run a static type analysis on a runtime program. Unfortunately, doing so will usually require freezing the program.

Not just about assurances

Thank you, David, for your correction regarding your statement. My reading of the emphasis was that assurance was the goal.

In regard to your second major paragraph, what you say is true, but!!, rarely are these facilities given to the programmer to use as part of his toolbox. You make the statement that a skilled programmer can manipulate this meta-data, which can be true (as seen within many of the papers referenced on LtU). However, I will maintain that there are very few languages that allow the programmer the same levels of freedom that the compiler/language has for manipulation of this data.

The following paragraph, I think, hits one of the nails on the head, and my response is: Why? Is it an inherent limitation or is it a limitation due to current practices and theory?

David, in this regard, this point you make is answering the original question I raised. Thank you.

I agree that various mechanisms can be used to generate assurances. In this area, the question that can be raised is: if type information is available at one point, why is it not maintained at all points thereafter, particularly for partial/separate compilation situations, or made available at load time and/or runtime? I do recognise that this would increase the complexity of the resulting program. Is this not useful information? Please don't get me wrong, I am not advocating any position here; I am simply asking a question so that appropriate clarity can be found.

My characterisation "as the program text is read" includes any transforms of the textual data into other internal structures (as the analysis still needs to "read" this information). Sorry for any confusion on that point.

Yeah, well

if type information is available at one point, why is it not maintained at all points thereafter, particularly for partial/separate compilation situations, or made available at load time and/or runtime?

Types are erased for performance; there is no other answer. Object modules may maintain types for different top-level, and sometimes local, entries (as most ML and Haskell implementations do).

It's a trade-off. If the parser and semantic analyser for a language are fast enough, and one is not hindered by IP concerns, there is no real reason not to skip the intermediate translation to object files and just distribute the sources.

I guess VB, and most Basics, actually did that.

This is an implementation issue, not a language design issue. A language will not grow in expressive power, or be more useful, if types are not erased during intermediate steps in the compilation process. I don't think this addresses your original question.

[Stated differently: If your language is such that you 'cannot' erase types when translating to object modules; well, you're stuck to building a compiler where that information is preserved.]

Yeah, well

Marco, yes, they are erased for performance in static systems but not in dynamic ones (different approaches). But why? I started at a time when the big grunting mainframes were far, far less powerful (in many ways) than the current generation of desktop machines. With the power under the hood we have today, this should not be an issue.

The trade-off is there: we distribute sources, or a binary format that retains type information; and yes, this is an implementation issue.

The expressiveness of the language could go up because it could now allow the programmer additional facilities to manipulate this information. For partial/separate compilation, one additional facility is to allow verification of type information between modules, if you like. Another is that it becomes possible (with the appropriate values) for the programmer to manipulate these modules. Food for thought.

At any rate, 'tis time to hit the pillow as I only got up to get a drink of water not respond to the various posts. 'Tis 2:30 am AESDT.

Power is not absolute

It still takes a long time for most languages to compile a million lines of code. I guess there is your answer.

One other reason for type erasure

might be for obfuscation; particularly for software which is sold and deployed to third parties.

One of the initial (and still somewhat relevant) objections to Java, at least early on, was the ease with which .class files could be decompiled back into source code; many shops consider the source to be Valuable Trade Secrets and Intellectual Property.

This was a bit of a shock, at first, to houses used to C/C++; translation of either to assembly does a far better job of obscuring the meaning of the program than Java compilers do; hence the popularity of Java obfuscators and such.

Now, whether or not obfuscation is a good thing is a question of ethics as opposed to technology--but some vendors prefer to keep their secret sauce secret. :)

Limitations of Typeful Programming

the majority of 'typeful programming' benefits can be achieved (often more flexibly) by other mechanisms than use of types (e.g. an extensible syntax on one side, whole-program-transforms on the other).

The [above] paragraph, I think, hits one of the nails on the head, and my response is: Why? Is it an inherent limitation or is it a limitation due to current practices and theory?

The basic reason that other mechanisms can be more flexible is that any given type system - including the one integrated into your favorite language - will, by nature, be limited in its algebra and the properties it both describes and can effect upon the program.

Support for whole-program transforms and syntax manipulation aren't so limited, but making these facilities generic creates quite the exercise for the language designer.

Types vs. Type Descriptors

all of them have the same underlying premise that a type containing all types must also contain itself, which leads to a logical inconsistency (fair enough)

Russell's paradox is the set of all sets that do not contain themselves. I'm pretty certain the set of all sets can include itself without inconsistency. Same, I suspect, for the type of all types.

Anyhow, as programmers we are limited to working with type-descriptors (values that describe types) as opposed to directly working with types (which are logical entities that only exist after a logic is enforced between type-descriptors and program constructs).

Keeping this dichotomy in mind often helps clear up issues when it comes to 'types containing all types'. It is, after all, often straightforward to construct a type-descriptor of all type-descriptors in a manner that can be used to describe itself.
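As a minimal sketch of the dichotomy in Haskell (TyDesc and its constructors are illustrative names), type descriptors are ordinary values, and nothing prevents one descriptor from describing the descriptor type itself:

 -- A value-level algebra of type descriptors.
 data TyDesc
   = TyInt
   | TyList TyDesc
   | TyPair TyDesc TyDesc
   | TyDescriptor             -- describes the type of descriptors itself
   deriving (Show, Eq)

 -- Ordinary value manipulation; no paradox arises, because these are
 -- descriptions of types, not the types themselves.
 descOfDescList :: TyDesc
 descOfDescList = TyList TyDescriptor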

Type vs Type Descriptors

David, my question to you on your statement is: why are we limited to working with type descriptors and not with types? You describe type descriptors as values; why are types not considered values as well, along with characters, code, environments, booleans, numbers, sets, lists, etc.?

I think this is the part that just doesn't make any sense. For any type that we can imagine, as programmers, we use some form (whether it be a glyph or a series of glyphs or a binary code or electrical signals) to represent an abstract value that we consider to be part of an abstract grouping called a type or class or whatever else someone might want to call it. My question on your comments comes back to this: what is the difference between doing the same for the abstract values in an abstract grouping called Type?

Types as values

There are languages, and logics, where types can be treated as values.

However, in such a setting, you cannot check all terms against all types statically. I personally like the discrepancy between static analysis and dynamic execution. (If I want runtime checks, I will write them down as assertions, and keep the boundary between static analysis and dynamic evaluation clear.)

There are also logics for programming languages which have a type of types. It just becomes a bit more esoteric, and people might wonder about the assurances given by the type system.

There isn't one good answer, because, well, there isn't. Fundamentally. Either you like static checks, and you're stuck with a type system where not all you want is expressible, or you don't care too much about it, and you can consider other languages.

Maybe you want to look at Epigram and Sage?

RE: Bruce's question to Me

David, my question to you on your statement is: why are we limited to working with type descriptors and not with types?

Types are not values and have no physical representation in a programming system. This is true in much the same sense that 'correctness' of a program has no physical representation in a programming system. Types are logical entities, not values.

However, we can describe points in a type algebra by use of values. The class of values used to describe types may, appropriately, be called 'type descriptors'. Like all other values, type-descriptors possess both logical and physical representation, and thus are subject to computation or manipulation.

You describe type descriptors as values; why are types not considered values as well, along with characters, code, environments, booleans, numbers, sets, lists, etc.?

A value is something that can be packaged up into a message and communicated or stored. This makes values immutable, and essentially requires that they be logical views of a representation.

Objects are not values, but may be represented by a pair of values (one value to identify the object, one value to represent the state of that object).

Environments are not values, either, at least not in the general sense.

For any type that we can imagine, as programmers, we use some form (whether it be a glyph or a series of glyphs or a binary code or electrical signals) to represent an abstract value that we consider to be part of an abstract grouping called a type or class or whatever else someone might want to call it.

Not always. As programmers we often use 'types' that possess no algebraic representation or direct support from the language. You may have heard of this called 'Duck Typing' - if it talks like a duck and reacts like a duck, it's a duck. In Smalltalk, you might have a 'Coordinate' object be any object that responds to 'x' and 'y' messages.

My question on your comments comes back to this: what is the difference between doing the same for the abstract values in an abstract grouping called Type?

You are free to represent a point in a type-algebra as a value called a type-descriptor. You are even free to create type objects by dividing the type-identifier (e.g. the name) from the point it describes in that space (nominative typing vs. structural typing). But 'types' in PLT are more than descriptions of types; type-descriptors have no meaning except as they are applied in an environment, and environments are not values.

RE: Bruce's question to Me

David,

You make the statement

Types are not values and have no physical representation in a programming system. This is true in much the same sense that 'correctness' of a program has no physical representation in a programming system. Types are logical entities, not values.
which can be applied to integers, characters, trees, lists, etc. Your following paragraph then goes on to say
However, we can describe points in a type algebra by use of values. ...
We do exactly the same for any other domain. There is no fundamental difference.

You then make a distinction between objects and values, and environments and values, and all of this is based (from my perspective) on the proposition that a value has a reality that cannot change, and so anything that can change is not a value. This is one viewpoint, certainly, but is it an accurate viewpoint, or even a complete viewpoint? Is this paradigm making a distinction where it is not warranted?

We are and should be free to incorporate any domain into the language of discourse as first class values - but there seems to be a resistance to this. This suggests to me that the paradigm that types or objects (or whatever else takes your fancy) are not values is a limitation for type theory, and that in the long run, for the boundaries of type checking to be expanded, this paradigm may need to be dispensed with. This itself may be an area worth investigating in relation to type theory if it hasn't already been done.

Sorry David, maybe I'm just stupid and just don't have a clue, but the last two paragraphs seem to be trying to make a distinction that isn't there (at least to my mind). Again, what is so special about an environment that it is not a value?

Correct me if I am wrong please, but it's like you make a distinction between a list of the integers from 1 to 10 which you cannot change and a list of the integers from 1 to 10 that can be changed. One is a value and one is not. I can do everything to the first that I can do to the second except update it. Please help here: what are you saying precisely?

I'll continue looking at the various responses when I get back from mowing my yard.

Types and Type Descriptors

[...] which can be applied to integers, characters, trees, lists, etc.

This is not true. Values are logical views of representations, so every value in any programming system will necessarily have a representation associated with it. Values can be fully understood in terms of a view and a representation.

Types, however, can only be understood in terms of their environment.

----

However, we can describe points in a type algebra by use of values. ...

We do exactly the same for any other domain. There is no fundamental difference.

Yes, like every other domain: we describe 'types' in terms of 'type descriptors', we can describe 'cars' with 'descriptions of cars', etc. But 'descriptions of cars' are not 'cars', and 'type descriptors' are not types.

There is no fundamental difference.

You then make a distinction between objects and values, and environments and values, and all of this is based (from my perspective) on the proposition that a value has a reality that cannot change, and so anything that can change is not a value. This is one viewpoint, certainly, but is it an accurate viewpoint, or even a complete viewpoint? Is this paradigm making a distinction where it is not warranted?

Correct: A value cannot change. This is not the root proposition, of course. The root proposition is that a value can be communicated in physical space (which requires that values be immutable and of finite representation).

The 'viewpoint' you describe - that 'anything that can change is not a value' - is accurate, but is obviously incomplete (seeing as it does not indicate what is a value). You denigrate the proposed 'view' of values as a mere paradigm; unfortunately, it is a paradigm enforced by what we know to be physically possible regarding communication and storage of information.

We are and should be free to incorporate any domain into the language of discourse as first class values - but there seems to be a resistance to this.

You can incorporate domain descriptions into a language, but unless the domain is language itself (math, programming constructs, etc.) one usually cannot incorporate domains in a first-class manner.

This suggests to me that the paradigm that types or objects (or whatever takes your fancy) are not values is a limitation for type theory, and that in the long run, for the boundaries of type checking to be expanded, this paradigm may need to be dispensed with.

You will be free to 'dispense' with it when you change the physical reality of what can be communicated and stored within a computation system. While you're at it, I'd like for you to investigate into making 'correctness' and 'robustness' into first-class language values ;-)

Again, what is so special about an environment that it is not a value?

In the general case, the environment for a computation is very mutable. You can package up a 'state' for an environment into a value (such that the environment can be restored to that state), but doing so is similar to packaging up the 'state' of an object. The resulting value is not the environment or object, and (without outside guarantees) might not even reflect state of the object or environment even moments after being produced.

Correct me if I am wrong please, but it's like you make a distinction between a list of the integers from 1 to 10 which you cannot change and a list of the integers from 1 to 10 that can be changed. One is a value and one is not. I can do everything to the first that I can do to the second except update it.

I'm not certain what you're imagining here. If you are thinking 'list-object' when you say 'list', then a 'list that you cannot change' does not imply 'an unchanging list' because someone other than you might be able to change the list-object. If you aren't thinking 'list-object' when you say 'list', then the latter (a list that can be changed) does not exist.

When you say you can do everything to the first that you can to the second except update it, I suspect that includes observing the first for updates (e.g. via polling or publish-subscribe), so I suspect you refer to two objects, the former of which you simply lack the capability to update.

Values possess intrinsic identity, so the value [1,2,3,4,5,6,7,8,9,10] is identical to the current state of both these objects. Not just equivalent to. Identical with.

Objects are not values. You really couldn't say the two list objects you mention are 'identical' unless updating the latter list resulted simultaneously with an update to the former list, and observing the former list for changes also results in observing the latter list.

Type Is Not A Type

See e.g.:

http://portal.acm.org/citation.cfm?id=512671

Or freely accessible:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.7073

One thing you can do to evade the problems discussed in these articles is to introduce an infinite hierarchy of types. This is what Coq does, for example.

Type is a Type

I'll take a look at those pages, but (from the abstract) it only insists that the type system is undecidable in the general case.

From the abstract

Mh, from the abstract of the first paper ("Type Is Not a Type"):

We apply Girard's techniques to establish that the type-of-all-types assumption creates serious pathologies from a programming perspective: a system using this assumption is inherently not normalizing, term equality is undecidable, and the resulting theory fails to be a conservative extension of the theory of the underlying base types. The failure of conservative extension means that classical reasoning about programs in such a system is not sound.

Requirements & Design

In what areas do each of the above type systems fail to provide the programmer with the tools (language constructs, abstraction facilities, etc.) to actually deal with real world troubles, problems and conditions?

Since you ask a very general question, I'll give a general answer. One thing I think current systems really fail to support are requirements.

Software is mostly built for a purpose. The purpose is translated into requirements (hard and soft, performance and interfaces, concurrency, etc., etc.).

You can't normally guarantee, or even discuss, whether specific pieces of software originate from design decisions, specific requirements, or some global purpose. If there is one stumbling block to the development of good software, then I think this is it.

Requirements & Design

Marco, it is a general question and thank you for your response.

Looking at requirements and design, this is an area where much needs to be done to even get decent documentation let alone anything else.

What thoughts do you have about where type systems may be of benefit here, or are they unable to be of any benefit in their current manifestations? I ask since my original question is about the limitations of type systems.

No real thoughts

If you want to look at limitations of type systems, then you can see that a lot of research goes into building type systems where specific types of requirements can be enforced statically (robustness, resource usage, concurrency, etc.).

Like "A Type System for Safe Region-Based Memory Management in Real-Time Java", or "A Path Sensitive Type System for Resource Usage Verification of C Like Languages" can be seen as attempts to bridge the gap between time requirements, resource requirements, and a language, such that these requirements can be statically guaranteed.

I would find it interesting to see type systems that model, for instance, failure rates/risk analysis of components. This would be nice on top of, say, an Erlang-like language.

At the moment, to me, your question somehow boils down to: For which types of requirements don't we have type systems?

[I am being kicked out of the library now. Uh, *click*]

[Actually, I didn't find really good examples. I thought it should be rather trivial to encode, say, real-time constraints into types. I am sure I read some stuff... At some time...?]

Garbage collection

Every Tuesday night I manually take the garbage that's accumulated over the week to the street for collection. I consider this a real world problem for which state of the art type systems (AFAIK) offer little or no help.

Garbage Collection

Thank you Matt, I needed that chuckle. For me it is every Thursday night, after I have sorted the recyclables from the non-recyclables and placed them into the correct bins. Though in principle, maybe something could be done.

First Class Garbage

Maybe you should make your garbage into a first-class domain value in a language with explicit memory management, that way you could 'delete' it.

First Class Garbage

David, do you have children? If you do, I think that they would find their daddy very funny.

I'm still laughing as I write this. It would be a bit way out there for my kids, but I'll keep it in mind all day, and when I need a chuckle, I'll reread it.

Hahahahahahahahahahaha - brilliant. Thank you.

Over the heads of Children

is the domain of programming-language related humor. I'm afraid that children, if I had any, would probably find such a joke to be mystifying.

I'm surprised you found some amusement in that comment, but I do appreciate it.

The liveness of an object might be considered a type attribute

On the other hand, the whole point of liveness analysis is that the type of garbage is irrelevant. :)

No, the fact that you have to sort the garbage and haul it to the curb--or even place it within wastebaskets in your house--is manual memory management. The big smelly truck that drives by once a week is analogous to the OS reclaiming a page that the application has marked as free.

If true municipal garbage collection were implemented, several times a second garbagemen (or women) would enter your house, place all occupants (and objects) there in suspended animation, read your minds to determine which articles are important and which are not, and remove some fraction of the ones which are no longer desired. (A few undesired objects may be left behind permanently; as long as they don't smell bad this should be of no concern to the occupants, as the amount of such garbage is in most cases bounded.) To you, unwanted objects would merely disappear from perception and memory--as you are suspended during these collection cycles--you wouldn't notice.

In some cases, objects in the house might move, as all the garbage is shifted to one room for easier removal. But you wouldn't notice that either, as your memory would be modified so that moved objects would be remembered in their new locations. (If the article was desired but missing, like my wife's car keys :), it would still be present but in another unknown location).

According to industrial engineering studies of the above procedure, it is speculated that the above procedure would be expedited by having a larger house. Unfortunately, this would not increase the actual living space, as most of the rooms would be cluttered with garbage; and some would not be perceptible to the occupants in any case.

Unfortunately, municipal waste collection isn't as sophisticated as computer science, and the above results are all hypothetical. The moral of the story is, you still have to manually sort your trash and haul it to the curb.

Lack of Expressiveness

While interesting as a means of structuring access to data defined in terms of a set of primitive structures (lists, arrays, dictionaries, etc.), current type systems can't express the sorts of things that you need to express when writing things like databases and operating systems.

In those domains the exact layout of data in memory is extremely important, and the lack of a sufficiently expressive type system condemns modern high-level languages to a marginal (ie. "extension language") sort of existence.

Examples:

1. Packed, varying data structures
2. Graphs built with self-relative or base-relative offsets
3. Compound objects - a graph allocated in a contiguous subheap, possibly with external references

Until we have type systems that can tackle these sorts of problems, high-level language implementations will continue to be C programs.

BitC

We spend plenty of time talking about BitC.

Beyond C?

BitC is indeed interesting, but I don't believe its type system goes beyond that of C.

Not clear what you mean

The BitC type system incorporates type classes, parametric polymorphism, and effect types, which is a big step up from C, so presumably you have something else in mind.

Can you say what that might be?

Some Examples

Most of the type structures you mention seem to be built on top of what C provides rather than extending the basic vocabulary. Here are some examples of the sorts of memory structures I'd like to describe:

1. contiguous varying structures

Basically a sequence of (length,variable-value) pairs packed together. You locate the Nth field by parsing and skipping the preceding fields. This is a common implementation for a relational row.

Example: field 1 contains "foo", field 2 is null, field 3 is "bar"; the bytes look like this:

4 f o o 0 0 4 b a r 0
Assuming field lengths are encoded in 1 byte (each length here counts the value's bytes including a trailing 0). For extra credit, a bit ladder. Define utf8 as a type.
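
For reference, a sketch of what walking this structure looks like in C today, under the byte layout above (the point being that the type system, not the programmer, ought to own this):

    #include <stddef.h>

    /* Walk a packed row of (length, bytes...) fields: each length is one
       byte and counts the value's bytes including its trailing 0, and a
       length of 0 means the field is null. Returns a pointer to the Nth
       value, or NULL for a null field. */
    static const unsigned char *nth_field(const unsigned char *row, size_t n)
    {
        while (n-- > 0)
            row += 1 + row[0];      /* skip length byte plus payload */
        return row[0] ? row + 1 : NULL;
    }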

2. same with null bitmap

The preceding structure represents null as a (length,value) pair with a length of 0. If compression is an issue, you might prefix the structure from #1 with a bitmap with one bit for each field. Only those fields for which a bit is set are present and non-null.

3. offsets, indirect lengths

A bucket page for a hash table might be represented as a fixed size page which starts with a variable array of entries. Each entry consists of two (length,offset) pairs, one describing the key, one describing the value. The offset is relative to the start of the page and the length is the number of bytes found at that offset.

Key capabilities are the use of (small) offsets instead of pointers and storing length separately from the data it describes. Commonly, we have a single length which describes multiple data, arrays of descriptors, what have you.

The motivation is minimizing disk writes: generally one update to the hash table touches 1 bucket.
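
A C sketch of that bucket page (widths and names made up for illustration; note that the invariant tying each offset to the page is invisible to the type system):

    #include <stdint.h>

    /* A (length, offset) descriptor: off is relative to the start of
       the page; len is the number of bytes found at that offset. */
    struct desc { uint16_t len, off; };

    struct bucket_page {
        uint16_t    nentries;
        struct desc entry[];    /* entry[2*i] = key, entry[2*i+1] = value */
    };

    /* Locate the i-th key; the value is found the same way via entry[2*i+1]. */
    static const unsigned char *
    nth_key(const struct bucket_page *pg, unsigned i, uint16_t *len)
    {
        const struct desc *d = &pg->entry[2 * i];
        *len = d->len;
        return (const unsigned char *)pg + d->off;
    }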

4. self-relative offsets

For any sort of shared graph (ie. in a file or a network buffer, mapped/read into an arbitrary location), a self-relative offset is a natural representation for a pointer. The desire is to read or map a contiguous expanse of data and have immediate (ie. O(0)) first-class access to it.
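
The C idiom, for concreteness (a sketch; nothing here is checked by the compiler, which is exactly the kind of obligation a richer type system could take over):

    #include <stdint.h>

    /* A self-relative "pointer": the target lives at the address of the
       field plus the stored offset, so the encoding stays valid wherever
       the enclosing buffer is mapped. 0 encodes a null link. */
    typedef int32_t selfrel;

    static void *sr_deref(selfrel *p)
    {
        return *p ? (char *)p + *p : 0;
    }

    static void sr_set(selfrel *p, void *target)
    {
        *p = target ? (int32_t)((char *)target - (char *)p) : 0;
    }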

5. sub-heaps

Imagine a hash table which stores values which are themselves trees. The trees need to be contiguous (and compact), so they are allocated in their own sub-heaps.

Obviously, the implementation of "heap" is beyond the scope of the type system, but things like transitive closure are not. This is a generalization of cdr-coding.

For example, to send a chunk of graph over the network, you might define a transitive closure, apply this to an unconstrained graph in the heap, producing a contiguous graph in a buffer which could then be sent to another VM and used immediately.
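
For concreteness, a C sketch of that closure step for a binary tree, using the self-relative links sketched under #4 (bump allocation, bounds checks omitted, names illustrative):

    #include <stdint.h>
    #include <stddef.h>

    typedef int32_t selfrel;                /* as sketched under #4 */
    static void *sr_deref(selfrel *p) { return *p ? (char *)p + *p : 0; }
    static void  sr_set(selfrel *p, void *t)
        { *p = t ? (int32_t)((char *)t - (char *)p) : 0; }

    struct node    { int key; selfrel left, right; };
    struct subheap { char *base; size_t used; };

    /* Copy the tree rooted at n into the sub-heap. The copy is contiguous
       and position-independent, so the buffer can be written to disk or
       sent to another VM and used immediately. */
    static struct node *pack(struct node *n, struct subheap *h)
    {
        if (!n) return 0;
        struct node *c = (struct node *)(h->base + h->used);
        h->used += sizeof *c;               /* bump allocate */
        c->key = n->key;
        sr_set(&c->left,  pack(sr_deref(&n->left),  h));
        sr_set(&c->right, pack(sr_deref(&n->right), h));
        return c;
    }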

Ah. I see. Some of this can

Ah. I see. Some of this can be expressed with low-level dependent types, which we are considering. Some, of course, can't.

In any case, I now think I understand the kind of thing you mean -- or at least one of the kinds. Thanks.

While the ability to handle

While the ability to handle memory layout is critical for integration with hardware (memory mapped IO, BIOS bootstrap, etc.), I don't believe explicit support for memory layout is as critical to operating systems as you imply. There is a lot of evidence to suggest that the vast majority of any operating system can be written in a higher-level language.

I'm very much of the opinion that expressing data layout in memory should not be a task associated directly with types. Instead: specification of data layout should (a) be optional (since we don't care most of the time even in an OS), (b) be expressed independently of type information, and (c) support automatic translation between representations of the same information.

I suspect that declarative metaprogramming for specification of data layout for compilers (possibly specified as part of driver descriptions) will serve much of what we need. Some sort of codec-driven I/O (codecs being specified and applied separately from types) will likely serve the remainder of the need.

Examples please?

Can you cite even one credible piece of evidence supporting your assertion about memory layout in operating systems? By "credible", I mean that (a) an actual experiment using a real implementation of a serious (as opposed to research toy) operating system has been constructed in the manner you suggest, and (b) credible performance metrics using real-world workloads (again as opposed to toy benchmarks) yield a measured performance overhead that is 5% or less.

It could be that there is evidence out there that I have failed to notice, but this is an area that I try to track closely. I am aware of many indicative efforts: works that seem to suggest that more credible evaluation is warranted. I am not aware of any such credible evaluation actually having been performed.

I think that there are places where compiler-driven memory layout looks very promising, and there is even some work suggesting that lazy functional languages can do much better than C/C++ in some domains. But from experience and long study, I'm very skeptical that operating systems or databases are among those domains.

Can you cite even one

Can you cite even one credible piece of evidence supporting your assertion about memory layout in operating systems?

Certainly. I can point at any number of other large compiled software projects, written in Haskell or OCaml for example, where performance is quite reasonable despite not using programmer-specified memory layouts. I can further point out that the majority of code in, say, the Linux operating system does not rely on having a particular memory layout.

By "credible", I mean that (a) an actual experiment using a real implementation of a serious (as opposed to research toy) operating system has been constructed in the manner you suggest, and (b) credible performance metrics using real-world workloads (again as opposed to toy benchmarks) yield a measured performance overhead that is 5% or less.

I'm not particularly concerned about your personal definition of "credible", especially if it's going to lead to an argument over a definition of "research toy".

I can only repeat something Tim Sweeney has said regarding video games: I'd gladly trade 10% overhead for 10% productivity. I'd trade more than that for the ability to break an OS down into services such that I don't need to distinguish between 'kernel' and 'application'. I think we too often trade a local improvement in the OS for a loss of reliability, security, and eventually speed or simplicity in user-space.

Since you're wanting your baby, BitC, to compete as a systems language (for Coyotos), I can understand you feeling a bit defensive about claims that imply some of your fundamental, language-driving decisions might not be the best ones long-term.

In any case, my assertion above isn't that programmers shouldn't have control over memory layout. It is that such control should be optional, and should be applied separately from the type system. As a programmer, control over representation doesn't bother me when I want it... but it bothers me a lot when it means a lot of my code becomes hand-translations between physical representations of the same information.

It could be that there is evidence out there that I have failed to notice, but this is an area that I try to track closely. I am aware of many indicative efforts: works that seem to suggest that more credible evaluation is warranted. I am not aware of any such credible evaluation actually having been performed.

The codec approach isn't a logic-driven approach; it simply says: "this input/output will be represented with that structure, and here are the encode/decode functions to translate between the two". An optimizer can decide whether to perform a translation, specialize the code receiving it to that representation, or pass around a virtual (decoder/structure) pair like polymorphic data. It hasn't seen much use in operating systems as of yet.

The declarative approach to memory layout is still early in its application, but looks very promising.

I think that there are places where compiler-driven memory layout looks very promising, and there is even some work suggesting that lazy functional languages can do much better than C/C++ in some domains. But from experience and long study, I'm very skeptical that operating systems or databases are among those domains.

I didn't suggest lazy functional languages for OS use. I think whatever language we use for operating systems 'of the future' will need to support a powerful and composable concurrency model where reliability and productivity are achieved... possibly at what you'd consider a magnificent expense in terms of efficiency (e.g. doubling of temporal costs).

Any sort of transaction support pretty much nullifies whatever benefits you'd hope to achieve by favoring mutable memory, which in turn diminishes the benefits of controlling memory layout. Further, I am convinced that support for transactions (as per software transactional memory or transactional actor model) is the only viable (composable, reliable, safely allowing for priority-driven preemption, etc.) approach to programming large massively concurrent systems - including operating systems.

We'll earn the speed back by using a bunch of CPU cores. We'll gain productivity benefits by avoiding lock management (locks are too challenging to get right, especially with a modular system) and through reduced-size component vocabulary (because transactions allow complex negotiations with simple vocabularies as opposed to having ever-larger vocabularies so that whole negotiations can be crammed into one message).

Citations, please!

What is striking here is that you once again allege to have citations, but you don't actually give any. Until you do, you are engaged in hand waving and strong assertion. Which is fine, but let me go put on my hip waders...

Just to be clear, I'm very eager to see a credible evaluation!

Concerning research toys, I'm not interested in discrediting valid comparisons. But we all know that the devil is in the last 10%, and I get really annoyed when people implement the low-hanging 90% of something in a safe language, measure it with a home-brewed and absurdly non-representative benchmark, and then declare victory for their pet position. You should too! I understand the limits of research budgets, but the best that can be achieved with this type of measurement is indication that more careful measurement is in order. I consider that a very valuable research result, but it is completely improper to evaluate a favored case and then claim a general result.

In any case, my assertion above isn't that programmers shouldn't have control over memory layout. It is that such control should be optional, and should be applied separately from the type system. As a programmer, control over representation doesn't bother me when I want it... but it bothers me a lot when it means a lot of my code becomes hand-translations between physical representations of the same information.

I think that having a compiler do layout optimization is a marvelous thing, and I certainly don't intend to reject that possibility in general. What I insist on is the ability to specify. But I do believe that advocates of compiler-driven layout tend to neglect the issue that compilers are just as bug-ridden as anything else, and that assurance is generally not improved by adding magic to the compiler.

Trading 10% performance for 10% productivity is a straw dog. So far as I know, that trade has never been observed in the wild. That's why I want a citation!

And in all seriousness, what is your view on the tradeoff between assurance and complexity in compilers?

you once again allege to

you once again allege to have citations, but you don't actually give any

In my original statement I say "there is a lot of evidence that the vast majority of an OS can be written in a HLL". By "the vast majority" I mean "stuff other than memory-mapped IO and HW drivers, passwords and private data, and a few other things that probably need special consideration in the language". By "a lot of evidence" I mean "pretty much every large compiled software project out there". Those achieve what I consider very reasonable performance... often no more than 30% slower than C, and rarely but occasionally faster than C or much smaller than C. And this with only a fraction of the effort spent optimizing the code.

Contrary to your assertions that C+5% is the mark to achieve, I don't believe this speed loss is a big deal... especially considering that, should we ever support transactions or versioning, most of the speed benefits you're currently seeing from fine-grained 'mutable' memory will evaporate.

But I'll admit that I don't have any citations that match your definition of 'credible'. What I consider reasonable performance vs. what you consider reasonable performance will probably have us bashing heads with one another.

In practice I use the codec approach on a regular basis, albeit implemented using value-objects in C++, to very good effect. It provides a logical view of a data representation with a simple translation at the periphery. I've applied it in system software often enough, albeit not to a full operating system.

The declarative metaprogramming approach is something I've only seen done in a toy example by other people. What I saw was very promising, but far too early to call anything but a toy, so you're free to ignore my handwaving on that subject.

We all know that the devil is in the last 10%, and I get really annoyed when people implement the low-hanging 90% of something in a safe language, measure it with a home-brewed and absurdly non-representative benchmark, and then declare victory for their pet position. You should too!

Agreed. OTOH, I get annoyed when people attribute the success of a language like C to a particular property of it without demonstrating that said property is the critical one. If someone says: "memory layout is critical to success" then claims as evidence that "no language that doesn't support memory layout has achieved mainstream success", I feel a great deal like replying: "Then, by that reasoning, is the use of semicolons critical to success as well?"

I understand you're aiming to support explicit memory layout because you associate memory layout with approaches that have succeeded before. But, as I said, I'm unconvinced it is as critical to success as has been asserted on this page and elsewhere, and I've written enough RAID drivers to believe I can claim that attempting to thread particular representations through code (which must be specialized) becomes a major productivity hit (due to code specialization and end-user translation efforts)... unless you apply something like the codecs approach.

And, as I noted, mutable memory vs. transactions: I've made a choice, and I am fully convinced it will work for operating systems. Perhaps I'll prove it at some point.

I think that having a compiler do layout optimization is a marvelous thing, and I certainly don't intend to reject that possibility in general. What I insist on is the ability to specify.

I've no objection to 'ability to specify' at critical points (I/O especially) so long as it doesn't come along with 'requirement to specify', such as it does if representation is part of the type system.

But I do believe that advocates of compiler-driven layout tend to neglect the issue that compilers are just as bug-ridden as anything else, and that assurance is generally not improved by adding magic to the compiler.

I'm all for the ability to inject extra post-parse pre-compile stages into a program (i.e. for plug-in type systems, optimizers, etc.) to help one achieve a little extra 'magic' at need. It's just that sort of thing that can make declarative approaches to memory layout quite practical - i.e. the language needn't start with such a capability; it can be tacked on and updated in its own library.

And in all seriousness, what is your view on the tradeoff between assurance and complexity in compilers?

One can often achieve greater assurance by reducing the complexity of compilers. Simpler systems are easier to verify. Am I right to suspect you speak of the tradeoff between guaranteeing that the translation from source is correct and achieving optimizations at the cost of greater complexity?

After having worked with extensible languages, I'm of the opinion that assurance is a better default choice, especially when aiming to achieve safety or security properties as a critical part of the language. I'm tempted to assert that the compiler should be able to provide a proof that the code was translated correctly, and based on this article that demand doesn't seem to run too contrary to optimization.

types not semicolons

I get annoyed when people attribute the success of a language like C to a particular property of it without demonstrating that said property is the critical one. If someone says: "memory layout is critical to success" then claims as evidence that "no language that doesn't support memory layout has achieved mainstream success", I feel a great deal like replying: "Then, by that reasoning, is the use of semicolons critical to success as well?"

Good grief! I'm not drawing a statistical correlation between semi-colons and system programming, I'm pointing out that one of the things that I commonly do when system programming is far more difficult in a high-level language with a strict type system than in C with a permissive type system. Neither language provides support for what I need, but C allows it while higher-level languages actively oppose it. This bears very simply and directly on the failure of HLLs in this domain and seems to be exactly the sort of thing the original poster was asking about.

types not representation, either

I understand why data representation is often critical for system programming (HW integration/mmio, FFIs, certain security concerns, and various hand-optimizations). I also do system programming. But I still feel you misattribute it.

It isn't the "permissive" nature of C's type-system that is helping you out. It's the "prescriptive" nature. Ignoring the nasty issues regarding integers of implementation-defined width, types in C determine memory layout.

And it isn't the "strict" nature of HLL type-systems that hinder you. It is, rather, the inability to specify memory layout. You can't even specify a bitfield to be a contiguous block of memory, which you need to be able to do (at prescribed addresses) to support memory-mapped IO.

Data layout doesn't need to be a type-system issue. Why not allow representation to be specified declaratively? It is often the case that a single data layout will support a broader spectrum of values than the current type of a variable, so why not allow types and representations to be specified independently? It is also often the case that a data layout will support a narrower range of values than the type of a value (e.g. an int32 representation doesn't carry all integers), so I suspect it would be useful to formally recognize such distinctions and create an 'Int32' type for modulo 2^32 arithmetic and have the 'Int' type result in a runtime exception when 2^31 is shoved into a 32-bit signed integer box. Why not?
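
To make the Int32/Int distinction concrete, a small C sketch (helper names are hypothetical): the same 32-bit box, two different contracts:

    #include <stdint.h>
    #include <stdlib.h>

    /* Int32: arithmetic is defined to wrap modulo 2^32. */
    static uint32_t int32_add(uint32_t a, uint32_t b)
    {
        return a + b;                 /* wraps by definition */
    }

    /* Int (unbounded type) stored in a 32-bit representation: storing a
       value the box can't carry raises the "runtime exception". */
    static int32_t int_box(long long v)
    {
        if (v < INT32_MIN || v > INT32_MAX)
            abort();                  /* stand-in for a typed exception */
        return (int32_t)v;
    }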

It is to this question that your answer is a bit like correlating semicolons and system programming. You say: but it hasn't been done that way before. Well, that's fine, if it's true. You know the C-approach to memory layout works because it's been done and succeeded before. Even elephants tend to tread in the footprints of the elephants before them - it's solid ground.

No... wait... you said: but that approach hasn't achieved "mainstream success" before. Oh. Well, clearly, mainstream success is the best measure of technical merit. I should have known.

But if one is going to suffer a 10 year "research toy" phase in something like Coyotos that is ultimately going to become yet another 'lessons learned' project that can't compete with worse-is-better systems like the future Windows 9 and Linux 2.8, it isn't unreasonable to try a few things differently if only to achieve a few more 'lessons learned', and taking that sort of risk is the only way to have any hope of achieving "Mainstream success". New operating systems won't ever catch up on drivers except by either porting them or finding better ways to write them. If you're trying to find better ways to do everything else, why not drivers?

I'm pointing out that one

I'm pointing out that one of the things that I commonly do when system programming is far more difficult in a high-level language with a strict type system than in C with a permissive type system.

Out of curiosity, and recognizing that it may not yet be possible for you to answer, where would you put BitC on the spectrum between "high level language with a strict type system" and "able to do yucky OS stuff"?

The proof is in the doing, and there is no proof yet, but I'm curious if anybody else thinks we're anywhere near a sweet spot.

Semicolons rule!

Still no citations...

You write:

By "a lot of evidence" I mean "pretty much every large compiled software project out there".

Then you should surely be able to cite at least one without any difficulty.

From your earlier posting:

I can point at any number of other large compiled software projects, written in Haskell or OCaml for example, where performance is quite reasonable despite not using programmer-specified memory layouts.

Respectfully, I don't think you can. Tell you what. Set aside my 5% goal. Can you cite even one example of a quantitative comparative benchmark in which a serious attempt was made to perform an apples to apples comparison? I accept that the decision point for speed/productivity/reliability trade-offs may be different for different users. In light of which, I'll settle for defining "credible" as "evaluated in such a way as to compare two functionally comparable software artifacts."

The issue that has my knickers in a knot here is things like people comparing apache to a trivial, single-threaded html server. The fundamental challenge in something like Apache is to deal with internal concurrency, HTTP connection reuse, and CGI and/or other transforms on the delivery path. These motivate most of the complexity of Apache. Getting strong performance on mere file delivery is absolutely trivial. Claiming performance that is comparable to Apache for this task is simply stupid. If your task doesn't require CGI, Apache isn't the right tool to compare against. So when people compare their favorite language by implementing a trivial web server, the honest comparison would be to compare it to an equally simple web server written in C. I have yet to see this done in real publications.

When people like you (or I) make statements about comparative performance, our words carry the weight of our reputations. The damage when you (or I) make sloppy statements is consequently quite high. Given our respective reputations, the damage in your case is rather higher than in mine. :-)

In the discussion at hand, I suspect that the only metric you actually have for "good enough" is "in your personal opinion." Which is a fine metric for deciding what you should use, and perhaps even what I should use, but a weak metric by which to evaluate language alternatives.

I'm only at odds with compiler-driven memory layout to the extent that I am very concerned about assuring the compiler. That is probably a solvable technical problem, but I do wish that people might include assurance in their trade-off arguments more often. Admittedly, that's a hot button of mine.

Can you cite even one

Can you cite even one example of a quantitative comparative benchmark in which a serious attempt was made to perform an apples to apples comparison?

Ensemble seems to fit the bill. They had a system originally written in C (Horus), which they then re-implemented in OCaml, and it improved the performance IIRC.

Then you should surely be

Then you should surely be able to cite at least one without any difficulty.

Or I could just tell you to google 'ocaml performance'. Oh, look, a hit - comparing a purely functional OCaml ray-tracer vs. a mutable-state C++ tracer.

I notice you like to create lots of white-papers that you can 'cite', even if all you're doing is speculating on a bootstrap environment. That's your dog, not mine.

If your task doesn't require CGI, Apache isn't the right tool to compare against. So when people compare their favorite language by implementing a trivial web server, the honest comparison would be to compare it to an equally simple web server written in C. I have yet to see this done in real publications.

Agreed, though I've seen it done in publications that you might not consider "real". Similarly, if people implement a web-server supporting many connections, that also needs to be compared against an implementation in C on the basis of such things as reliability, resistance to DOS attacks, and performance under massive loads. And so we end up with comparisons like this one because apache isn't designed for that level of concurrency, and C doesn't make it easy.

Unless your programs give equal priority to concurrency, security, performance, robustness, graceful degradation, etc., the benchmarks will be unfair.

The C language and its ilk are well known for giving good 'localized' performance, but I've never been impressed with the cost of that 'local' performance when considering the entire system (runtime libs, separating memory spaces, heavyweight threads, complex security, etc.)

our words carry the weight of our reputations. The damage when you (or I) make sloppy statements is consequently quite high. Given our respective reputations, the damage in your case is rather higher than in mine. :-)

You feel that your reputation is in the dumps? Because mine, as I understand it, is hovering roundabout zero.

I'm only at odds with compiler-driven memory layout to the extent that I am very concerned about assuring the compiler.

Well, you make a trade off between assurance in one place and assurance everywhere else. If you only care about assuring the compiler, and not everything else, the best way to go about it is to tell everyone that they're going to be programming in, say, Ook language from now on.

I'm afraid your reference

I'm afraid your reference (the ray-tracer) is very misleading. This is a toy ray-tracer, suited only for shoot-out style benchmarking.

The real-time ray-tracing community is fairly unanimous in its use of C++, one reason being the ability to optimize using SIMD instructions and the other being tight control of memory layout (cache-lines, etc.).

Apples to apples was the

Apples to apples was the demand. The C++ ray-tracer was also a simple toy, just as requested.

Based on your reply, a quick look discovered CamlG4 (OCaml SIMD library) and a doc that I can't link directly entitled 'SIMD Vectorization of Straight Line Code' (Stefan Kral) where people, unhappy with C performance, grabbed themselves a Prolog program and had it compile the assembly for them - exactly the sort of declarative approach I hope we eventually achieve on a regular basis (except I'd like to see it happen without escaping the language environment and without sacrificing security/safety).

Of course

It's not like anyone actually *likes* to mess around with C/C++, is there? We're pretty much all screaming for a better language, but I'm afraid there aren't that many alternatives out there.

"Better" is too difficult to

"Better" is too difficult to measure.

Bling WPF sort of works like

Bling WPF sort of works like this: rather than write HLSL code to express your pixel shader, you write C# code that generates one. The C# code mostly looks like the HLSL code you would have written with a slight distinction between staged (runs on the GPU) and unstaged (runs during code gen) code.

And if I recall correctly,

And if I recall correctly, there are other examples that got really big advantages by exposing concurrency opportunities when running on cell processors, but I don't recall the citation right at the moment.

Optimization Tools

CamlG4 is just a wrapper for calling C routines that run loops over a single SIMD instruction. The overhead for doing this means that your loops must be substantial - i.e. the arrays must be big. Obviously, it would be much better if OCaml supported SIMD (short) vector types and could make use of the SIMD instructions.

Stefan Kral's work is very worthwhile, trying to extract SIMD (short) vector parallelism from straight-line code. This is quite useful - but one would wish that compilers could do this themselves, so that one didn't have to rely on code-generation tools.

Various responses

I notice you like to create lots of white-papers that you can 'cite', even if all you're doing is speculating on a bootstrap environment.

It is true that my group produces a lot of white papers. This is because we feel that documenting rationale in depth is an important part of sustainable research, and because specifications can't really be published in conferences. It doesn't seem likely that more comprehensive disclosure is a bad thing in research.

But you do make an important point about citations, because people out there do sometimes use citations of white papers inappropriately. When we cite our own white papers, we cite them for what they are: explanations of what we had in mind or how a design works. Unless those papers incorporate appropriate evaluation (which may be qualitative or quantitative, according to the claim) we do not cite them as any sort of evidence that our ideas are correct.

And so we end up with comparisons like [YAWS vs. Apache] because apache isn't designed for that level of concurrency, and C doesn't make it easy.

Perhaps unintentionally, you are using selective data to support your presuppositions. The method of the experiment described there definitely does not represent any typical usage scenario or purport to be comparable to any standard reference benchmark. Selective benchmarks can prove anything, so when taken in isolation this measurement does not support any credibly generalizable conclusion. Because the experimental method is described in insufficient detail to permit reproduction, it isn't even a publicly useful point experiment.

The author acknowledges in the "comments" section that their attributed rationale is speculation. That speculation may or may not be informed. From the description, which lacks any supporting rationale, I can't tell and neither can you. From what is reported, we have absolutely no idea whether the result is a language result, an OS result, or simply a bug in one implementation or the other. A proper evaluation would have taken this conjecture and tested it to determine whether it accounts for what is happening, and either confirmed or refined the conjecture as appropriate.

From what I do know about other measurements of the underlying network implementation in Linux, I'm currently leaning toward "bug", but I'd be quite happy to be wrong. Given just how remarkable the result seems to be, and the fact that the group is clearly a research group that engages in research publication, it is surprising that no followup has occurred. Perhaps it is simply that the paper isn't published yet.

So on the whole, your example is a perfect illustration of my point about non-credible benchmarks. No systematic evaluation has been performed here, but the existence of a single, unqualified point experiment leads insufficiently critical readers to infer grand support for their pet views. It does not follow that your views are wrong; merely that they are as yet unsubstantiated.

And as I have made clear elsewhere, I would be delighted if your supposition were true. But I want to see a credible demonstration before I jump off of technologies that I know will solve my problem.

...you make a trade off between assurance in one place and assurance everywhere else...

That is an interesting conjecture. In this case, we have two cases that involve qualitatively different changes. I agree that reducing the specification burden on general programs should simplify them, and should consequently improve our confidence in those programs — and simultaneously our ability to formally analyze them. That would be very valuable. If we can get that, and at the same time establish formally that the automated layout mechanisms of our magic compiler are correct, then that would be very powerful. For a great many applications, that would be a very substantial improvement, and I have never intended to say otherwise.

But simplicity (and its associated confidence) that comes at the cost of violating functional requirements is not an improvement, and there really do exist applications that have tight-constant-bound space requirements. And not just arbitrary bounds. Most critical applications must be shown to execute correctly within a specified constant amount of memory.

For such applications, layout optimization can be fatal (to humans, and I mean that literally). Unless the effect of the optimization on the memory requirements of the program can be strictly and carefully quantified, the optimization must be disallowed. In these applications, predictability is more important than performance. The "catch" is that these applications also tend to have challenging performance requirements, and one of the few tools that we currently have to give us predictable performance is layout control.

Your supposition seems to be that conflating layout control with type increases program complexity, and therefore lowers the confidence that we should have in those programs. That supposition is both plausible and empirically testable, but I'm not aware of published empirical results. I would expect such evaluations, if any, to have been done in the software engineering community, most likely in the context of aerospace or aeronautical engineering.

My [unvalidated] expectation is that the effect is less pronounced than you believe. I certainly don't believe that mere boxing or unboxing of otherwise identical data structures has much, if any, impact. In each case the confidence we have (if any) derives from type information, and our ability to do type analysis for both kinds of structures is comparably complete and comparably credible.

The more interesting conjecture here would seem to be tied to mutability, which allows us to build (for example) qualitatively different kinds of data structures. I do not, for example, know of a fixed-space functional AVL tree implementation.

In critical systems, the mitigating factor is that use of complex data structures is viewed with overriding suspicion, and subjected to either extensive test or formal verification or both. Because such strong techniques are applied, it is difficult to know how the comparative assurance results in the two cases actually play out.

For general purpose applications, my basic belief is that mutation and mutation-dependent data structures are often a bad idea, but that they are not always avoidable. This is the heart of why BitC takes a straddling position on this issue. But you'll also see if you examine the language that the design bias is in favor of immutability, because that is the default in the absence of contrary declaration.

Re: Yaws vs. Apache Perhaps

Re: Yaws vs. Apache

Perhaps unintentionally, you are using selective data to support your presuppositions.

... You do mean my presupposition that "Unless your programs give equal priority to concurrency, security, performance, robustness, graceful degradation, etc. the benchmarks will be unfair", right? Right?

Not quite. I meant the

Not quite. I meant the seeming implication that YAWS outperforms Apache in some interesting way. If you intended that link as an example of how not to do comparison, then I completely misread what you were saying.

Even on the basis of a straight concurrency comparison, that data point is so inadequately described and so completely non-representative of either normal case or standard benchmark case use that it isn't meaningful. Or if it is meaningful, we can't figure out how, because the measurement isn't described well enough to determine what is (allegedly) being measured.

Your supposition seems to be

Your supposition seems to be that conflating layout control with type increases program complexity, and therefore lowers the confidence that we should have in those programs.

No, this doesn't have to do with confidence in programs. It has to do with separation of concerns. Conflating layout control with type reduces program flexibility - the ability to change layout issues independently of changing types. Said conflation may also easily hinder programmer productivity by requiring programmers to specialize code to a representation, though you've bought productivity back in BitC via use of Type Classes.

Anyhow, you make a lot of assertions about hard memory limits and realtime constraints (I'm familiar with those, having done a summer internship helping certify code for airplane touch-displays). Those issues are important to me, too... which is why I've not been saying we should leave such issues entirely up to the compiler.

But, whereas you've been pursuing approaches where programmers can guarantee constant memory by giving programmers more control over the implementation-details of the program, I've been pursuing (... okay... I'll correct that to 'pondering and researching') approaches where one declares such constraints at higher level computation spaces, and supports a declarative reduction of code to meet these goals via logic programming with search strategies supplied by programmer annotations and suggestions from expert systems.

What I want in the end is for the compiler to hand me both a product and a formal proof that the specified properties are correct (assuming that specified axioms in its hardware database are also correct).

And I'm rather convinced that the correct approach to this is to turn certain language priorities around: give the programmers less 'absolute control' and more 'soft control', allow programmers to make 'suggestions' as to memory layout such that these suggestions can be used to reduce search costs (search being the time spent bumbling about heuristically looking for possible answers), and make program specifications more declarative. Which is what I've been saying all along.

Not that I can provide citations that will make any of this available to you today or in mainstream projects. I can only tell you that researching what is being done with those "research toy" systems in 'Declarative Metaprogramming' is quite a ride.

BitC will work, I agree. But it feels like a kludge. I just want something more than that.

In any case, you continue your approach and I'll continue mine. Perhaps, ten years from now, you'll appreciate my efforts and I'll appreciate yours.

BitC will work, I agree.

BitC will work, I agree. But it feels like a kludge. I just want something more than that.

From your tone, this may surprise you, but so do I. The problem from my perspective is that I think "something more than that" is still 10 years of research and another 5 years of integration away, and I'm not content to wait. BitC is an interim solution; nothing more. Of course, so was C.

As to appreciating efforts, I think that you have taken my comments here wrongly. I do appreciate the ideas you are putting forward, and it would be marvelous if they pan out. I have articulated some valid concerns about assurance, but that should be taken under the heading of "issues not to forget" as opposed to "fundamental objections".

Where I have been very hard on you is in demanding that you substantiate claims. That's important whatever your approach is, and whatever the context in which you do your work. Unsubstantiated claims undermine your credibility when you try to pitch your work, but I actually think that is a secondary concern. The main concern is that people who are in the habit of hand waving very often find themselves believing their own claims (at least operationally). If the claims aren't right, that tends to destroy the effectiveness of their work.

So I'm hard on you in this regard not because I dismiss your work, but because I think it has promise. In order to realize that promise, I think you must abandon habits that lend themselves to self delusion.

But they are conflated!

Conflating layout control with type reduces program flexibility - the ability to change layout issues independently of changing types

As a practical matter, type and layout are conflated by every HL language implementation. A cons cell has one physical structure, ditto for an array, dictionary, object. Furthermore programmers know this, and choose the appropriate representation (list vs. array) for the task at hand. Limiting the choice of data structure simply moves complexity elsewhere.

Every language implementation that I've ever looked at has a hard coded static object system rooted at "struct object" and a much more limited dynamic system built on top of that. Using the high level language constrains the programmer to use only the notion of indirection made palatable to the type system by the VM. The result is that we get lots of really stupid classes like "list_item" (come on, nothing is a list item) and 20 years of unfulfilled promises that some sort of generic template mechanism will condense all that glop into the single piece of code that would be trivial to write if only we could tell the type system what a list descriptor (offset of next and previous fields) is.

This isn't like the transition between assembler and compiled languages. In that case, compiled languages (C mostly) were expressive enough to remove the need for assembler, and very soon thereafter they became good enough that hand-crafting assembly simply wasn't economical. In the realm of data structures, we haven't made any progress in 30 years and the difference in expressiveness is still so huge that arguing about performance is silly.

This is computer programming so everything can be done in terms of everything else. If you want to limit yourself to arrays, list and dictionaries as primitive data structures, you certainly can. But, when you get done, are you any closer to the truth? If it "doesn't matter" why work on it?

As a practical matter, type

As a practical matter, type and layout are conflated by every HL language implementation.

Well, for simplicity of implementation it is often the case that an implementation picks a few simple representations, but types are only 'conflated' with their representations if all implementations are required or expected to pick them in a consistent manner.

programmers know this, and choose the appropriate representation (list vs. array) for the task at hand

That would be all fine and dandy if such local "task at hand" decision-making was able to consistently accomplish more globally desired computation properties.

Limiting the choice of data structure simply moves complexity elsewhere.

That is true when you're discussing essential complexity. But I suspect there's a great deal of accidental complexity associated with unnecessary choices and choices presented to us in non-uniform manners.

We certainly don't need 'int8', 'int16', and their brethren... what we need is the ability to specify a contiguous block of bits of some size and desired alignment, then 'view' it as an integer. A compiler can easily enough recognize some representation of [bit*32,contiguous,aligned(16bit)] and choose the appropriate hardware words.
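
The 'view' half is already expressible in portable C (a sketch); the declarative [bit*32,contiguous,aligned(16bit)] specification is the half C cannot say:

    #include <stdint.h>
    #include <string.h>

    /* View 32 contiguous bits as an unsigned integer without committing
       the enclosing structure to "being an int32": the copy is how C
       spells a bit-block view, and compilers lower it to a single load. */
    static uint32_t view_u32(const void *bits)
    {
        uint32_t v;
        memcpy(&v, bits, sizeof v);
        return v;
    }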

In any case, I'm not demanding we limit the choice of representations, only that we separate representation from type. E.g. a collection of codepoints could be represented in UTF-8 and another in UTF-16LE. If the representation is conflated with type, you might not be able to pass these two collections to the same set of functions, and there'd potentially be a great deal of extra work appending one to the other if the language doesn't support a third representation for collections that operates as a logical append.

Every language implementation that I've ever looked at has a hard coded static object system rooted at "struct object" and a much more limited dynamic system built on top of that.

I've a feeling you'd see some more interesting things if more general purpose programming languages had a bootstrapped compiler written in the same language. Languages that lack direct support for memory layout would not fall back on memory layout of other languages; instead, they'd create systems to intelligently choose a layout for a type.

In the realm of data structures, we haven't made any progress in 30 years and the difference in expressiveness is still so huge that that arguing about performance is silly.

When it comes to representation for hardware IO, I fully agree that improved representation over that of C is fully warranted. I do a lot of network IO stuff. I'd like to be able to express packet-headers including recognizing extended packet-headers, variable-width fields, etc.

But I digress. My statement was that we shouldn't conflate representation with types, not that we shouldn't support expression of representation issues.

This is computer programming so everything can be done in terms of everything else. If you want to limit yourself to arrays, list and dictionaries as primitive data structures, you certainly can. But, when you get done, are you any closer to the truth? If it "doesn't matter" why work on it?

Indeed. So long as you're willing to live in a Turing-tarpit and sacrifice non-functional properties such as performance, error detection, reliability, security, etc. then everything (all 'functional' things) can be done in terms of everything else.

The goal of a language designer is to provide just enough features to achieve useful non-functional properties while constraining oneself enough to simplify implementation and reduce accidental complexity.

This doesn't get one "closer to the truth", but it still matters. Language design isn't about truth; it's about utility.

For me, this is all rather

For me, this is all rather vague. A concrete example of what you have in mind might be helpful.

Concerning compiler bootstrap and memory layout, the sheer number of self-hosting compilers indicates that the empirical evidence is against you.

What is it, exactly, that you have in mind here?

This is all rather vague. A

This is all rather vague. A concrete example of what you have in mind might be helpful. What is it, exactly, that you have in mind here?

I'm responding to vague accusations. Pardon me if I must provide broad answers.

Concerning compiler bootstrap and memory layout, the sheer number of self-hosting compilers indicates that the empirical evidence is against you.

Scott was describing language implementations as typically having a fixed 'struct object' representation with slight variations. As I understand it, he is saying: 'Everything is a __fill_in_the_blank__' is how most HL languages implement structures, where the 'blank' might be filled with such things as CONS cells, Lua tables, etc. And due to this common phenomenon he stated that types and representation are 'conflated'.

A simple concrete counter-example to Scott's claim would be the 'Integer' implementations in most languages that support arbitrary-sized integrals and automatically change representations. But I agree this doesn't represent the sort of systematic differentiation that I was describing as possible. A simple example of manipulating representation for purpose of optimization is the sort of automatic currying and uncurrying done to optimize application of functions to tuples - a single type with two different representations based on local purpose.

I don't imagine many of those 'sheer numbers' you mention actually meet the criterion I specified of a compiler being written in and for an HLL where memory layout can't be explicitly specified. Mostly I said this to implicitly exclude interpreted languages and avoid the condition where programmers lazily fall back on some sort of 'struct' provided by an implementation LLL (E.g. if you compile to C or from C you're more likely to fall back on the limitations of the design-decisions of C.) I probably should have additionally specified a language where some sort of type or dataflow analysis is already performed for some reason. Without knowledge of dataflow, one cannot safely perform optimizations by operating on local representations; one-repr-per-type is probably too high a granularity to really benefit much from differentiation (though I've heard of avoiding 'every tagged union value is a cons cell' by folding certain tags into their pointers).

Anyhow, I'll try to work up a set of concrete example problems... say having some initial data (network packets, file input, strings) of known representations on the system IO periphery then taking these known external representations into the computations.

Type != Representation

I have to say that I think this idea has merit.

A pair or a record in my mind is a logical grouping of data, it is an interface over some sort of physical representation. In most cases, the code shouldn't have to care if the data is adjacent and/or aligned or not. I say most - because at some level, the compiler needs to know about locations in memory, and several optimizations would be impossible without this knowledge.

High-level languages seem to operate on the logical level, which is good enough for most code - but when you need to concern yourself with physical representation, there is usually not much you can do. And that's when the only option is to use C, C++ or assembly.

Are object representation specifications also types? Can we not just include representations as a kind of type? Can representation specifications be dynamic? How should we "type" systems with representation specifications?

Separating type and

Separating type and representation is an interesting idea. From a high level, this requires specifying injection and projection functions to/from some standard layout object that the compiler understands. Using said functions, the compiler can then automatically transform between type-equivalent but representation-differing data.
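
In C terms, the shape of such a pair might look like this (a sketch only; taking the codepoint-sequence example raised earlier, with illustrative names):

    #include <stddef.h>
    #include <stdint.h>

    /* One representation of the logical type "sequence of codepoints".
       encode injects the logical value into bytes; decode projects bytes
       back out. Code written against the logical view need not care
       which representation backs a given buffer. */
    struct representation {
        size_t (*encode)(const uint32_t *cps, size_t n, unsigned char *out);
        size_t (*decode)(const unsigned char *in, size_t len, uint32_t *cps);
    };

    /* Two representations, one type: */
    extern const struct representation utf8_rep, utf16le_rep;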

Codecs

That [edit: minus the 'official layout the compiler understands'] is the essence of codecs: an injection function (encodec) and projection function (decodec) combined with other relevant information.

Unfortunately, it takes a rather intelligent system to automatically break down a monolithic encodec or decodec in order to figure out how to:

  • perform a persistent update
  • perform a mutation update (as an optimization or for languages that support mutation) - inserts and appends, especially - without converting the whole structure
  • automatically identify live memory regions for garbage collection
  • read just a small property of a large value without converting the whole structure
  • efficiently iterate across structures with variable-sized cells (e.g. a Huffman encoding or UTF-8) without copying the structure
  • etc.

Since compilers lack this intelligence, codec approaches to separate type from representation must be specified with several support functions to help out with specific tasks. This set of functions often ends up looking a bit like an abstract base class or a rather complex Haskell type class.

Anyhow, codec-driven approaches are one of the two options I mentioned for dividing expression of type-concerns from expression of representation-concerns, useful for when representation (memory-mapped or serialization) needs to be specified on an IO periphery.

A compiler using codecs alone is free to convert codecs, specialize functions to particular codecs (effectively the same as specializing for a typeclass), etc. in order to trade off between space and time costs, so another mechanism is needed if programmers also need to have the compiler promise realtime or hard memory constraints. (I'm okay with this... expressions of computation concerns should also be separate.)

I think it interesting that Codecs may generally be composed (Type A represented in terms of Type B represented in terms of a block of bits) so long as the derivation of the 'utility' functions will automatically follow such composition.

Codec

I just checked Wikipedia:

A codec is a device or computer program capable of encoding and/or decoding a digital data stream or signal. The word codec is a portmanteau of 'compressor-decompressor' or, most commonly, 'coder-decoder'.

Since we're not dealing with compression in this context, I'd prefer words such as "layout", emphasizing that we're dealing with memory layout.

PKE.

Coder-Decoder

'Coder-decoder' (as 'codec' is "most commonly" used) doesn't imply compression, but does imply a transform in view from a representation to a semantic understanding of it. Also, as I noted, codecs are best applied on the system IO periphery... where the input might be some sort of memory-mapped IO... but could also be serialized or executed by other mechanisms. I think the term entirely appropriate.

But I'd rather not battle over definitions in a forum such as this one.

The vast majority of uses of codec

are, it seems, in the context of data compression.

Speaking of which, here's an interesting challenge for a type system (or a layout system, given the position that types and representations are independent things):

Develop a type/layout system which can describe an MPEG-2 transport stream. It is permissible to assume that the pointer or whatever to the start of the bitstream is valid (the bitstream can be "unboxed").

To pull this off, the layout system will need to support:

* Access to non-byte-aligned data.
* Structured data packetized in to various fixed-sized chunks.
* A way of specifying sentinel values/magic cookies; MPEG is full of these.
* Variable-length entities, with length encoded in various ways (either an end marker, an explicit indication of length in the header, or however-many-instances-can-fit-in-what-is-left-of-a-packet).
* Named algebraic sums (suitable for pattern matching).

Have fun! :)
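
For scale, here is the easy part in C: the fixed 4-byte packet header (field widths per ISO/IEC 13818-1). Even this can't portably be declared as a C bitfield overlaying the bytes, since bitfield layout is implementation-defined, so it's parsed by hand; everything in the list above is what remains:

    #include <stdint.h>

    /* MPEG-2 transport stream: 188-byte packets, sync byte 0x47. */
    struct ts_header {
        uint8_t  sync_byte;
        uint8_t  transport_error, payload_unit_start, transport_priority;
        uint16_t pid;                      /* 13 bits, straddles bytes 1-2 */
        uint8_t  scrambling_control;       /* 2 bits */
        uint8_t  adaptation_field_control; /* 2 bits */
        uint8_t  continuity_counter;       /* 4 bits */
    };

    static struct ts_header ts_parse(const uint8_t *p)
    {
        struct ts_header h;
        h.sync_byte                = p[0];            /* expect 0x47 */
        h.transport_error          = p[1] >> 7;
        h.payload_unit_start       = (p[1] >> 6) & 1;
        h.transport_priority       = (p[1] >> 5) & 1;
        h.pid                      = (uint16_t)(((p[1] & 0x1f) << 8) | p[2]);
        h.scrambling_control       = p[3] >> 6;
        h.adaptation_field_control = (p[3] >> 4) & 3;
        h.continuity_counter       = p[3] & 0x0f;
        return h;
    }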

MPEG-2

This just requires dependent types or a proof environment rich enough to prove the required properties of the representation's semantic function. It's not like you need (or want) a separate "layout language" for this.

(And if we're voting, I personally prefer "representation" or "rep" to describe these things. "Layout" brings to mind a limited format descriptor for which this kind of thing would be a problem.)

That [edit: minus the

That [edit: minus the 'official layout the compiler understands'] is essentially the essence of codecs

Well, for codecs the "official layout" is basically a byte array. For the compiler, I was thinking of an algebraic type for describing bit-precise layouts.

In any case, upon further thought, I'm not sure I agree that separating types and representations is all that useful. As long as programs are written against abstract types instead of concrete types, we gain the same benefits. The above injection/projection functions are basically injections/projections to/from an abstract type upon which client code depends.

All that's needed is to provide bit-level layout controlled by types, and for the language to encourage depending on abstractions instead of concrete types, by default (we basically need type constructor polymorphism and easier functor definition and composition). This is currently far too difficult in most functional languages, though Haskell now has views, which might encourage this practice more.

Taking an abstract object...

...and fitting it into a predetermined physical schema, is a lot like object serialization/deserialization (or pickling/unpickling, if you prefer).

Or, in many cases, like parsing text.

How would it work?

In any case, upon further thought, I'm not sure I agree that separating types and representations is all that useful. As long as programs are written against abstract types instead of concrete types, we gain the same benefits. The above injection/projection functions are basically injections/projections to/from an abstract type upon which client code depends.

Suppose you have a function foo :: Integer -> Integer, and it gives rise to representations foo :: Int8 -> Int16 and also to foo :: MyBigInt -> MyBigInt. Your plan is, if I understand it, that instead of having separate representations of foo which get called appropriately, we will have one polymorphic foo that can instantiate either of those. So it will have some big dependent type that says it works in either of those cases.

To make that modular, I think you'll want intersection types and a convenient mechanism for establishing the pieces being intersected separately (as in separately compiled). But then how do you select that you want a different (but equally behaved) term used for a computation on a particular representation? For example, restricted to {0,1}, I can replace the function n -> floor((n+1)/2) with the identity. With such a trivial example, you could hope a code generator would make the optimization, but what if it's an optimization that simplifies a term when a certain matrix is symmetric? I think you need some mechanism.

I suppose you could extend your intersection typing mechanism with a mechanism to select terms based on which part of the intersection you're in, and come up with rules of what to do when the cases overlap or you can't figure out which cases apply.

But what do you end up with? Does a high level function that wants to remain completely flexible take a hundred type constructors as parameters, so that it can pass them down through the functions it calls? Right now, I don't see it.

Suppose you have a function

Suppose you have a function foo :: Integer -> Integer, and it gives rise to representations foo :: Int8 -> Int16 and also to foo :: MyBigInt -> MyBigInt. Your plan is, if I understand it, that instead of having separate representations of foo which get called appropriately, we will have one polymorphic foo that can instantiate either of those. [...]

I was thinking along simpler lines. My proposal is essentially "code depends on interfaces, not classes", or for functional abstractions, "most code is defined in a functor (or depends on a first-class module)". This way concrete representations and/or implementations can be swapped out at will. This would seem to achieve exactly the separation of type and representation for most code that David was arguing for, without fundamentally separating type and representation for the small bit of code needed for instantiation. It's really just an abstraction problem.
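
As a rough illustration (using a Haskell type class as a stand-in for an ML signature; all names hypothetical): client code mentions only the abstract interface, and the concrete representation is chosen at instantiation.

-- The "signature": an abstract natural-number interface.
class Nat n where
  zero :: n
  add  :: n -> n -> n
  mul  :: n -> n -> n

-- Client code, written once against the signature alone.
sumSquares :: Nat n => [n] -> n
sumSquares = foldr (\x acc -> add (mul x x) acc) zero

-- Concrete representations, swappable at will.
instance Nat Int     where zero = 0; add = (+); mul = (*)
instance Nat Integer where zero = 0; add = (+); mul = (*)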

"code depends on interfaces, not classes"

I agree with this sentiment and have ideas in this direction. But this serves a different role than my formulation of representations does. Representations don't need to implement whole modules. To basically repeat my previous example: you can't represent arbitrary integers with 32-bit words, but there are lots of places where you want to represent a function on integers by a function on 32-bit words. I guess you're just addressing David's proposal where the "codec" is really an isomorphism on the type, but I think that misses many interesting cases. Functorizing code is about maximizing abstraction whereas representations are about efficient compilation.

Functorizing code is about

Functorizing code is about maximizing abstraction whereas representations are about efficient compilation.

ML modules have a fully static interpretation, so I don't think it gets more efficient than that (assuming you specialize based on that information).

Fair

That comment was more of a personal viewpoint than substantive - you could certainly optimize some things with modules. But you still haven't addressed my point that modules don't do everything you want. Also, the point I made two messages back seems to still apply. How are you going to expose the internal representations and conversions that a function uses? Example:

foo2 :: Nat -> Nat
foo2 x = f x + g x + h x

You might potentially want to change representations before or after calling f, g, and h. Are you going to make foo2 parameterized by all of these conversion functions? What about additional conversion functions that might be wanted by f, g, and h?

Basically, I'm suggesting

Basically, I'm suggesting that functors need more pervasive (and easier!) support in a language, so that all code by default would be parameterized by a signature. Depending on a concrete type/representation would be the exception, not the rule.

For example, consider the tagless staged interpreters paper, which implements 3 completely separate backends for a given signature, and the client program is simply parameterized by the signature.
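
A minimal sketch of that style in Haskell (names hypothetical, and far simpler than the paper's object language): the signature is a class, each backend is an instance, and the client program is a term polymorphic in the signature.

-- The object-language signature.
class Sym repr where
  lit :: Int -> repr
  add :: repr -> repr -> repr

-- Backend 1: evaluate.
newtype Eval = Eval { runEval :: Int }
instance Sym Eval where
  lit n = Eval n
  add (Eval a) (Eval b) = Eval (a + b)

-- Backend 2: pretty-print.
newtype Pretty = Pretty { runPretty :: String }
instance Sym Pretty where
  lit n = Pretty (show n)
  add (Pretty a) (Pretty b) = Pretty ("(" ++ a ++ " + " ++ b ++ ")")

-- The client program depends only on the signature:
prog :: Sym repr => repr
prog = add (lit 1) (add (lit 2) (lit 3))

-- runEval prog == 6; runPretty prog == "(1 + (2 + 3))"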

Depending on the intended semantics of Nat, foo2, etc. could begin executing with a native Word unsigned integer type, but when overflow is detected, switch to a BigInt. The Word module needs to be written with this in mind; I'm not suggesting it be automatic -- perhaps this is what you're objecting to?

Binary methods might also pose a bit of a problem, since one Nat may be promoted, while another need not be. I wonder if there's some overloading technique that can handle these mixed representation cases.
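
A crude sketch of the kind of module I have in mind (hypothetical names, using Word64 for concreteness): computation starts in a native word, promotion to Integer happens on detected overflow, and the binary method falls back to the wide representation in the mixed cases (no demotion attempted).

import Data.Word (Word64)

data N = Small !Word64 | Big Integer

toBig :: N -> Integer
toBig (Small w) = fromIntegral w
toBig (Big i)   = i

addN :: N -> N -> N
addN (Small a) (Small b)
  | s >= a    = Small s  -- unsigned addition didn't wrap: stay small
  | otherwise = Big (fromIntegral a + fromIntegral b)  -- overflow: promote
  where s = a + b
addN x y = Big (toBig x + toBig y)  -- mixed/promoted cases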

Basically, I'm suggesting

Basically, I'm suggesting that functors need more pervasive (and easier!) support in a language, so that all code by default would be parameterized by a signature. Depending on a concrete type/representation would be the exception, not the rule.

And I think that's a good idea, but for other reasons. I don't think that really solves the problem I want to solve.

Depending on the intended semantics of Nat, foo2, etc. could begin executing with a native Word unsigned integer type, but when overflow is detected, switch to a BigInt. The Word module need to be written with this in mind, I'm not suggesting it be automatic -- perhaps this is what you're objecting to?

I'm objecting to the fact that the overflow checking code is dynamic (with runtime cost), but is needed (assuming you favor type checking all of this stuff) because some integers are not representable by words and you need to implement the entire integer structure to have a proper functor.

[In case it hasn't been clear, a motivating example here is compiling a function Integer -> Integer down to a function, say, Int8 -> Int16, assuming you can demonstrate that the result will indeed fit into an Int16 when the parameter fits into an Int8.]

I'm objecting to the fact

I'm objecting to the fact that the overflow checking code is dynamic (with runtime cost), but is needed (assuming you favor type checking all of this stuff) because some integers are not representable by words and you need to implement the entire integer structure to have a proper functor.

Did I miss a constraint somewhere that specified dynamic checks weren't allowed?

The dynamic check is only necessary because I assumed the most liberal definition of Nat (arbitrary precision). If we were to use a more constrained type, like a number-parameterized type, we could eliminate runtime checks.

Still missing the example

Did I miss a constraint somewhere that specified dynamic checks weren't allowed?

Well, when the stated goal is efficiency you might have inferred it to at least be undesirable. Look at this example:

square :: Nat -> Nat -- Arbitrary precision (in general)!
square n = n*n

foo :: [Nat]
foo = map square [1..100] -- For *this* usage of square, optimize using representative square :: Int8 -> Int16

You can tell me that you're going to dependently type square, but I'm of the opinion that it's not practical to have precise dependent types everywhere and so I'm going to counter that such a plan won't scale up.

I must be missing something.

I must be missing something. A module that performs the above widening arithmetic based on fixed integer types provided by the language is all that's needed here (I assume that fixed width integer types have two multiplication operations, Int8->Int8->Int8 for wrapping ops, and Int8->Int8->Int16 for widening ops). The only difficulty I can see is if you're expecting the compiler to automatically generate this.

This module is only correct according to Nat's semantics for a certain class of programs, but, as I said, the user will be specifying the module anyway, so the onus is on him.

This module is only correct

This module is only correct according to Nat's semantics for a certain class of programs, but, as I said, the user will be specifying the module anyway, so the onus is on him.

Ah, this is where we diverge. I was assuming the module would be dependently typed such that only a type isomorphic to the naturals would work. Otherwise I fear this functorization idea will spread around too much onus.

The stronger typing can be

The stronger typing can be used in those contexts in which assurance is required, hence why I brought up number-parameterized types. But types get quite unwieldy the more of the behaviour you specify, which I suppose is what you were getting at.

But consider that Int8 has a widening operation, as does Int16, as does Int32, and eventually we widen into arbitrary precision integers. We can get far by a judicious choice of base types. Suppose the Nat signature was parameterized with the "larger" type for those widening operations (pseudo-ML):

type Nat(S : Nat) =
struct
  type t
  val mul:  t -> t -> t   (* wrapping operation *)
  val mulw: t -> t -> S.t (* widening operation *)
end;;

Int8 is parameterized by Int16, which is itself parameterized by Int32, which is parameterized by BigInt. Narrowing operations, like division, can cascade in the reverse direction. We obviously need something a little more flexible than ML modules for this mutual recursion.
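
In Haskell one might approximate this chain with a multi-parameter class (a sketch, names hypothetical): each carrier is tied to its "larger" type, and the widening operations cascade up to arbitrary precision.

{-# LANGUAGE MultiParamTypeClasses, FunctionalDependencies #-}
import Data.Int (Int8, Int16, Int32)

class Widening t s | t -> s where
  mulWrap :: t -> t -> t  -- wrapping multiplication
  mulWide :: t -> t -> s  -- widening multiplication

instance Widening Int8 Int16 where
  mulWrap = (*)
  mulWide a b = fromIntegral a * fromIntegral b  -- 128*128 fits in Int16

instance Widening Int16 Int32 where
  mulWrap = (*)
  mulWide a b = fromIntegral a * fromIntegral b

instance Widening Int32 Integer where
  mulWrap = (*)
  mulWide a b = fromIntegral a * fromIntegral b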

The stronger typing can be

The stronger typing can be used in those contexts in which assurance is required

Ah, but continuing my example, to use the optimized representation in 'foo', you would need to refine the type of 'square', which might not be defined in your code. I'm ok with a pay-as-you-go approach that requires the programmer to demonstrate their representation works, but I think requiring all code be precisely typed just in case someone in the future wants to pass in an optimized representation is unreasonable.

Regarding the narrowing and widening idea, I don't see it. Who decides when to use the widening operation? Figuring out what types are sufficient for a representation requires a fine grained analysis (needing dependent type-ish capabilities to establish formally). Consider that incrementing a number could require widening the representation, and so if you used a widening version of increment in a loop, you'd be at BigInt in a handful of operations.

Ah, but continuing my

Ah, but continuing my example, to use the optimized representation in 'foo', you would need to refine the type of 'square', which might not be defined in your code.

If square is dependent solely on the Nat signature, then the concrete representation used is not yet selected, so where's the problem?

I think requiring all code be precisely typed just in case someone in the future wants to pass in an optimized representation is unreasonable.

I defer judgment until "precise" is defined. ;-)

Regarding the narrowing and widening idea, I don't see it. Who decides when to use the widening operation?

I confused myself momentarily there, thinking I had a better solution. The widening operation would have to be manually selected.

In any case, this thread started because I stated that abstracting over representation without separating representation from types is possible already, if we slightly massage existing abstraction techniques with some manual specification (which you would have to do anyway if you're providing a custom representation). Do we at least agree on that?

Ah, but continuing my

Ah, but continuing my example, to use the optimized representation in 'foo', you would need to refine the type of 'square', which might not be defined in your code.

If square is dependent solely on the Nat signature, then the concrete representation used is not yet selected, so where's the problem?

Right, the problem is when Nat has a precise (dependently typed) signature, as would likely be required to establish the correctness of the code using it. Obviously, programming in a style where all of your code is open to arbitrary semantics-changing transformations affords you a bit of freedom in choosing optimizations. But I see this as trading correctness for efficiency. What started me arguing was this:

In any case, upon further thought, I'm not sure I agree that separating types and representations is all that useful. As long as programs are written against abstract types instead of concrete types, we gain the same benefits.

I have yet to see a proposal using functors that I think has the same benefits as what I have in mind.

I have yet to see a proposal

I have yet to see a proposal using functors that I think has the same benefits as what I have in mind.

Did I miss your proposal in this thread? Or was it the intersection types and/or dependent types?

Did I miss your proposal in

Did I miss your proposal in this thread?

No, I haven't given too many details - I started out supporting the general idea of "separating type and representation" that David was advocating.

The dependent type / intersection type stuff was going down the road of trying to make the functors idea work as a substitute for representations.

Tangent: Haskell does not have views

View patterns are not views.

I see, view patterns seems

I see, view patterns are a restricted form of views due to problems implementing the more general approach. Thanks for the correction!

Control over representation

The way I see it, you primarily need a semantic function from representations to values. Specification of conversions should be possible, but there doesn't need to be a canonical representation for types - failure to unify representations can be an error.
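
A minimal sketch of what I mean, in Haskell (names hypothetical): the semantic function is primary, conversions are derived from it, and failure to unify two representations surfaces as an explicit error value.

import Data.Word (Word8)

-- A representation pairs a concrete carrier r with a semantic
-- function into the value type a; no carrier is canonical.
data Repr r a = Repr
  { denote :: r -> a        -- the semantic function
  , encode :: a -> Maybe r  -- partial: not every value is representable
  }

-- One of many possible representations of (a fragment of) the naturals.
word8Nat :: Repr Word8 Integer
word8Nat = Repr fromIntegral
                (\n -> if 0 <= n && n <= 255 then Just (fromIntegral n) else Nothing)

-- A conversion between two representations of the same type;
-- a Nothing here is the "failure to unify" error.
convert :: Repr r a -> Repr s a -> r -> Maybe s
convert from to = encode to . denote from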

I personally think this is a pretty important feature missing from current languages, though I disagree with David on the issue of 'soft control' vs. 'absolute control'. I think the core language tools should provide absolute control over representation. Tools for soft control can be built upon absolute control - not vice versa.

Soft Control, Absolutely

on the issue of 'soft control' vs. 'absolute control': I think the core language tools should provide absolute control over representation. Tools for soft control can be built upon absolute control - not vice versa.

I agree that tools for soft control can be built upon absolute control, and generally not vice versa. But there is a corollary: language tools built for absolute control over implementation details cannot usefully be extended to provide soft or independent control over separate concerns represented in those implementation details.

I guess I favor, at least in principle, the route of extensible language tools, even for systems programming. I understand the necessary features are not yet available... but people are slowly working on that.

Semantics of representation annotations

What I am advancing is that annotations for selecting representations should have well defined semantics that act to construct a value representing another value in a well defined way. As these constructed values should be usable in other constructions, they are not mere hints and cannot be ignored.

This does not preclude extensibility because of a fact you mentioned in another post: representations compose. Or, stated another way, representation values can themselves have representations. Thus there does not need to be a "bottom" of primitive types to end the regression. Code can be annotated to choose one representation, and an optimizer can later be directed to choose an optimized representation of that representation. I offer that for many systems programming tasks, one wants the ability to "turn off" the optimizer and use a simple mapping of types to representations. (I'd prefer access to more than a monolithic "compiler", but that's another topic)

Optimizable implies Soft Control

Code can be annotated to choose one representation, and an optimizer can later be directed to choose an optimized representation of that representation.

If the language is free to optimize a detail that you've written as a programmer, then what you've got as a programmer is what I would call 'soft control'.

As these constructed values should be usable in other constructions, they are not mere hints and cannot be ignored.

This does not preclude extensibility because of a fact you mentioned in another post: representations compose. Or, stated another way, representation values can themselves have representations.

It's really the other concerns for which the language can't so readily be extended. You might be able to turn your 'string-on-a-blob' to a 'string-on-a-blob-on-a-rope' (thus representing a representation), but that doesn't mean you'll be able to ignore the 'commandment' to pass a value as a 'string-on-a-blob' if representation concerns are, in fact, expressed as 'absolute'.

Regarding extensibility: It's the extension of a language to support other dimensions of concern that will suffer due to 'absolute' control of implementation details in any particular dimension of concerns. I'll accept that effective expression of representation concerns does not preclude the ability to further express even more representation concerns in terms of prior representations.

But performance concerns are another issue. Representation affects performance, thus programmers often choose to control representations in order to better control performance. Unfortunately, it is extremely difficult for a programmer of a component to get a 'big picture' view of how a representation will be used within a system... so, by controlling the representation received by a particular function within the system the programmer might be able to optimize that function at the cost of making it more difficult or more expensive to call that function in the first place.

In a system where programmers have absolute control over representation, the only way for them to change global performance costs related to representation is for them to change the representation. I.e. they could decide to change their code so they now say: this function takes a 'string-on-a-blob-on-a-rope' instead of 'string-on-a-blob'. By doing so they might improve a particular codebase... but still won't possess a global view in which to best make the decision. Eventually, "that choice is 'good enough'... let's do something more interesting". And that choice probably is 'good enough' until along comes the guy who needs to do some realtime work and needs to re-implement all those functions for another representation.

In a system where 'absolute' representation is only an interface issue on the system-IO-perimeter rather than something controlled in the implementation, however, a programmer working on implementation might 'suggest' strategies such as 'consider string-on-a-blob and string-on-a-blob-on-a-rope' to the compiler, and give ropes and blobs a default place when searching for good ways to represent strings. But, what is ultimately chosen for the implementation won't necessarily be what the programmer suggested... it could depend on a number of strategies, global optimizations, dataflows from system-perimeter-to-perimeter, etc. Thus, these suggestions are 'soft control'.

And, in a soft control system, expressions of other concerns, such as whole-system-performance (e.g. for an actor-configuration or a dataflow component), might ultimately eliminate all 'suggestions' made by the programmers during the search to achieve a particular set of goals.

Of course, compilers aren't AIs (or shouldn't be). After running out of suggestions, or even before then, compilers would do well to request guided compilation from (and provide feedback to) expert systems and programmers. Rather than waving my hands about sufficiently smart compilers, I'd much prefer a stupid compiler that systematically tries suggestions as guided by programmers and external expert systems. (Usefully, expert systems might use feedback and a database to 'learn' how to best compile a project, and be able to export this data for shipment with source code.) An IDE could include some useful visualization tools for searches... I think Alice has done some similar stuff for its logic programming component.

"The language"

If the language is free to optimize a detail that you've written as a programmer, then what you've got as a programmer is what I would call 'soft control'.

One issue is, what's "the language"? If it's just the value semantics and you insist that representations be faithful, then representations aren't a part (or are a no-op) of the language. This gets back to my comment about not wanting a monolithic compiler - I think there are a number of reasons to want a separate semantics that governs some rules of compilation. It's under these semantics that representation selection occurs and where I favor absolute control.

Unfortunately, it is extremely difficult for a programmer of a component to get a 'big picture' view of how a representation will be used within a system... so, by controlling the representation received by a particular function within the system the programmer might be able to optimize that function at the cost of making it more difficult or more expensive to call that function in the first place.

Agreed, which is why optimization work done should be commensurate with your view of the system. A library designer should probably just expose a number of representation options (and maybe provide hints or heuristics as to which to use). A library user might either explicitly select which representation he wants (ideally being able to write new representations for values provided by the library), or might tell the system to search for a representation statically (providing more or less help), or might employ some lightweight mechanism to pick representations at runtime.

In a system where programmers have absolute control over representation, the only way for them to change global performance costs related to representation is for them to change the representation.

You mean in a system where programmers only have absolute control.

Of course, compilers aren't AIs (or shouldn't be). After running out of suggestions, or even before then, compilers would do well to request guided compilation from (and provide feedback to) expert systems and programmers. Rather than waving my hands about sufficiently smart compilers, I'd much prefer a stupid compiler that systematically tries suggestions as guided by programmers and external expert systems. (Usefully, expert systems might use feedback and a database to 'learn' how to best compile a project, and be able to export this data for shipment with source code.) An IDE could include some useful visualization tools for searches... I think Alice has done some similar stuff for its logic programming component.

I largely agree with everything you've written here; I also envision programmers using tools (expert systems, profilers, etc.) to help find optimal representations for their needs. But the programmer's wishes here should be followed, whether they be "use this particular representation" or just "use these heuristics to pick a best representation."

The language is...

One issue is, what's "the language"?

ideally the same thing as "the operating system" =)

In a system where programmers have absolute control [...]

You mean in a system where programmers only have absolute control.

Nope. But I do mean: "in a system where programmers, or the programmers of any source components they link in, apply absolute control" over an implementation detail... which I believe in practice identifies the same set of systems as "in a system where programmers have absolute control."

But the programmer's wishes here should be followed, whether they be "use this particular representation" or just "use these heuristics to pick a best representation."

Programmers will be providing heuristics and strategies and suggestions for a bunch of different concerns all at once. These concerns will conflict in any non-trivial project, especially those that are expressed locally within the source-code rather than less locally as part of the project (which creates association and maintenance problems).

I'm cool with a programmer saying: "these are the things I need to guarantee" and "those are the features I want but am willing to compromise on so here's a few search strategies and some heuristics". But I'm really only cool with this so long as the 'guarantees' are only demanded from a more global view of the system... e.g. a bounded-box computation space with strict definitions of expected inputs, outputs, etc.

It's when these 'implementation' guarantees, e.g. of a particular intermediate representation, are embedded (or are even capable of being embedded) in myopic source-code that is unaware of the greater scope into which the code shall later be applied that I become extremely skeptical.

Perhaps our views are compatible.

Not incompatible

In my view, you should have total control over the artifact you're building. In what I'm proposing, there isn't a way for a library designer to force anyone else to use a particular representation (short of building as his artifact a C-style library). Values can have multiple representations. If you don't like any of the representations provided by the library designer, build your own.

(which creates association and maintenance problems).

I agree there are problems to be solved here.

More a runtime issue

Scott was describing language implementations as typically having a fixed 'struct object' representation with slight variations. As I understand it, he is saying: 'Everything is a __fill_in_the_blank__' is how most HL languages implement structures...

In my experience this is correct, but it is less a compiler issue than a runtime issue. There is a strong need to keep the core of the garbage collector simple, and this tends to lead to designs in which there is a small number of "core" object layouts known to the collector, from which the compiler composes other representations. While more recent compilers emit code for use by the collector, doing so raises safety validation concerns. A middle position is what .Net does: emit a checkable object repr description, but not actual code.
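
For illustration, a toy sketch in Haskell (hypothetical names) of that descriptor idea: the collector interprets per-type layout data, which a verifier can also check, instead of trusting emitted code.

-- A per-type layout descriptor the collector can interpret:
-- which word-sized slots of an object hold heap pointers.
data LayoutDesc = LayoutDesc
  { sizeInWords :: Int
  , pointerMap  :: [Bool]  -- True at index i: word i is a heap pointer
  } deriving Show

-- e.g. a cons cell <header, head pointer, tail pointer>:
consDesc :: LayoutDesc
consDesc = LayoutDesc { sizeInWords = 3, pointerMap = [False, True, True] }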

I don't imagine many of those 'sheer numbers' you mention actually meet the criterion I specified of a compiler being written in and for an HLL where memory layout can't be explicitly specified.

I agree, but that wasn't your criterion. You wrote:

I've a feeling you'd see some more interesting things if more general purpose programming languages had a bootstrapped compiler written in the same language.

No mention of memory layout there.

In any case, I agree with you, Pal-Kristen, and naasking that separation is an interesting idea. I can see the motivating examples, so I don't feel that I really need more motivation. What I would find helpful is just an example of the kind of notation you have in mind.

Let me return to the topic at hand in a separate response.

Criterion stated in following sentence

I've a feeling you'd see some more interesting things if more general purpose programming languages had a bootstrapped compiler written in the same language. Languages that lack direct support for memory layout would not fall back on memory layout of other languages; instead, they'd create systems to intelligently choose a layout for a type.

But I can see how the confusion would arise.

Well, ain't worth arguing about. I'll wait for your separate response.

There is indirect data on this

The C language and its ilk are well known for giving good 'localized' performance, but I've never been impressed with the cost of that 'local' performance when considering the entire system (runtime libs, separating memory spaces, heavyweight threads, complex security, etc.)

The paper Deconstructing Process Isolation from the Singularity project offers some preliminary direct comparison data on this. I don't recall that they did anything comparable to shared libraries. That actually has pretty significant impact, but it's inconsistent with Dave Tarditi's "tree shaking" tricks. So for the moment, the best data we have is that the techniques are tied for performance.

In my opinion, those results tend to support the view that language-based mechanisms will win out in the end, because they present increased opportunities for copy avoidance (via data sharing) and type-based security analysis. Things like that tend to reduce the need to copy for the sake of being safety-conservative, and copies are the ultimate death of systemic performance. As a mature understanding of type-derived opportunities emerges, I expect language methods to come out ahead.

Actually, it's just that I'm a mugwump

I understand you're aiming to support explicit memory layout because you associate memory layout with approaches that have succeeded before.

My real view on memory layout is much more cowardly than this:

* It sounds nice to have, but in some places I still need the ability to do prescriptive layout. I don't think that we have any disagreement on this.
* I am very concerned about assurance. The current complexity of compilers is overwhelming, so I am hesitant about any proposal to add new complexity to compilers. In this case, I'm not up on the literature, so I'm not clear about the degree to which this type of transform is correctness-checkable.
* In high-confidence contexts, it is often required that programs operate in tight-constant-bounded space and bounded memory management variance. The techniques for memory layout rewriting that I can imagine (again, not having explored the literature very much) don't seem to lend themselves to this. This leads to a requirement that I be able to turn this class of optimization off for the types of programs that I am focused on, which in turn drives me back to needing relatively rich language support for layout specification.

My view of monads is similarly mugwump-ish. I'm actually convinced that linear typing techniques are quite a good idea. What I'm not convinced about yet is that we know how to do all of the different things that we need to do using these techniques. Until we do, or until we find some more comprehensive approach, I would prefer not to be left in the position where my choice is between unsafe expressiveness and fully safe inexpressiveness. I would prefer to have a language in which I can use the stronger constructs where possible, and fall back to the weaker constructs where I must.

A tricky issue in assurance is compiler independence. It is pragmatically important that a program, when compiled by two different compilers, generate roughly comparable code. This is never perfect, of course, but it is necessary to insulate a critical system builder from the failure of a compiler vendor. One concern I have about compiler-driven memory layout is that it seems easy for the presence or absence of a single technique to alter performance by a very large multiple. If that is true (is it?) it's a "no go" for critical systems.

In our collective haste to focus on performance (my own included), we sometimes lose sight of the importance of behavioral predictability.

The point is to avoid mutation

Any sort of transaction support pretty much nullifies whatever benefits you'd hope to achieve by favoring mutable memory, which in turn diminishes the benefits of controlling memory layout. Further, I am convinced that support for transactions (as per software transactional memory or transactional actor model) is the only viable (composable, reliable, safely allowing for priority-driven preemption, etc.) approach to programming large massively concurrent systems - including operating systems.

This is exactly backward. The whole point of extending the type system to cover physical memory layout is to avoid mutation. Transforming 20GB of memory-mapped data structure into 40GB of heap so that we may then enjoy the benefits of FP is a non-starter. What is needed is a first-class way to describe that data so that functional programs can access it without consing anything.

Transaction support and how you mutate data is a separate issue and one that there's little point in addressing until you have a viable answer to reading data.
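
For illustration only, here is the poor man's version of that in today's Haskell (no type-system support; the field offset is assumed by hand, and the function name is hypothetical): reading a field in place from the mapped bytes rather than deserializing the surrounding structure onto the heap.

import qualified Data.ByteString as BS
import Data.Bits (shiftL, (.|.))
import Data.Word (Word32)

-- Read a big-endian Word32 field directly from mapped bytes,
-- without constructing any intermediate heap structure.
beWord32At :: BS.ByteString -> Int -> Word32
beWord32At bs off =
      (fromIntegral (BS.index bs off)       `shiftL` 24)
  .|. (fromIntegral (BS.index bs (off + 1)) `shiftL` 16)
  .|. (fromIntegral (BS.index bs (off + 2)) `shiftL` 8)
  .|.  fromIntegral (BS.index bs (off + 3))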

Further, I am convinced that support for transactions (as per software transactional memory or transactional actor model) is the only viable (composable, reliable, safely allowing for priority-driven preemption, etc.) approach to programming large massively concurrent systems - including operating systems.

You are so convinced, in spite of the fact that people who actually write massively concurrent systems have overwhelmingly chosen a different tool set?

What is needed is a

What is needed is a first-class way to describe that data so that functional programs can access it without consing anything.

I agree an approach to achieving that is useful. Not certain it needs to be 'first-class' though.

Transaction support and how you mutate data is a separate issue and one that there's little point in addressing until you have a viable answer to reading data.

Those are only separate issues for the user of the language. For the language designer, input, output, transactions, collection, and protection tend to all tie into a neat little Gordian knot. Or at least you hope it's a neat knot... hairy knots tend to indicate an asymmetric language that might be easier to implement but is probably harder to use.

You are so convinced, in spite of the fact that people who actually write massively concurrent systems have overwhelmingly chosen a different tool set?

Quite so. But not "in spite of" anything. Those who wrote the massively concurrent systems of the past have not truly had the opportunity to select transaction support for coordination. To get transactions, they'd need to grab an RDBMS - an approach that is difficult to justify given both its performance hit and the translation effort.

I suspect software transactional memory is of significant benefit if one has 'cells' that aren't particularly fine-grained (as per Clojure), but I'm currently pursuing transactions at the level of 'actors' and their creation, destruction, and continuation status.

Of course, to avoid confusion, transactions themselves aren't the units of concurrency. But for any operating system, you'll have some unit of concurrency and some need to coordinate them. With a few exceptions (like restricted dataflow programming) all such systems have race-conditions, and those that don't have race-conditions are usually not the sort that may act as services (i.e. involving service discovery and arbitrary users from the outside). Transactions support ad-hoc negotiations between services and other systems... and greatly simplify certain other OS tasks.
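
GHC Haskell's STM shows the composability I'm after in miniature. A minimal sketch: two separately written updates combine into one atomic negotiation, with no locks exposed and no lock-order discipline for callers to violate.

import Control.Concurrent.STM

withdraw :: TVar Int -> Int -> STM ()
withdraw acct n = do
  bal <- readTVar acct
  if bal < n then retry else writeTVar acct (bal - n)  -- retry blocks until funds arrive

deposit :: TVar Int -> Int -> STM ()
deposit acct n = do
  bal <- readTVar acct
  writeTVar acct (bal + n)

-- Composition: both updates commit together or not at all.
transfer :: TVar Int -> TVar Int -> Int -> IO ()
transfer from to n = atomically (withdraw from n >> deposit to n)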

Those who wrote massively concurrent systems in the past most often chose something more akin to 'localized best-effort'. They might use locks or message passing or rendezvous to increase the granularity of any race-conditions, but they by no means avoid them on any larger scale. The results haven't been pretty... but sometimes worse is better.

I think a language should aim to do quite a bit better than an OS. I'm reading the BitC spec; I'm a bit curious as to how they handle concurrency (... ah... it handles concurrency with the same panache that C handles concurrency... oh well.)

Umm. No. The point might be

Umm. No. The point might be to avoid the semantics of mutation....

You get that for free when you avoid actual mutation

Any sort of mutation multiplied by billions and billions results in a system which simply doesn't scale. Like it or not, consing is mutation. The harder you push a pure functional idiom, the less functional the resulting program behaves at the OS level.

My interest is in being able to write interesting programs which have large, complex functions that really and truly are functional right down to the bare metal. To do this, I need a more expressive type system than those currently out there.

You have a funny notion of "evidence"

Within a few years of the advent of C, people had written operating systems and databases with it. "Modern" FP languages (Scheme, ML, Haskell, ...) have been around for 20 years give or take and have been applied to system programming tasks sporadically with no mainstream success.

Memory layout is absolutely critical for optimizing cache and disk interactions.

I don't disagree that memory

I don't disagree that memory layout is critical to optimizations and certain hardware IO (including disk interactions, which are often memory-mapped). I disagree only with the notion that programmers need to thread data-representation stuff throughout their programs in order to optimize a few communication details on the system periphery, and I oppose tying data representation directly into the type system. The extra responsibilities in the type system complicate things for the programmers.

If you're going to object, then please object to what I'm saying rather than to statements I've never made.

Regarding "mainstream success" I can only say that I've no reason to believe, for better or worse, that technical proficiency of a development language is even among the top ten causes for achieving it.

Certainly that would be

Certainly that would be because C intentionally only includes features that map closely to the hardware.

If you back up a bit and consider arbitrary high-level languages--Perl, Ruby, Python--then they were used for quite a lot of production work right from the start. Haskell and ML are clearly different, having taken a long time to get any traction.

There's a big difference

There's a big difference between a "production" web application and something like a database or an OS or a language implementation.

That functional languages have had little mainstream success in an application domain which consists overwhelmingly of transformations that are perfectly functional in nature (i.e. take this data from SQL and turn it into HTML) is probably indicative of something (a 5-year discussion consisting of attempts to explain monads, perhaps?), but I doubt that it can be laid at the feet of an insufficiently expressive type system. Indeed, all of the HL languages you mention offer pretty comparable control over memory layout.