STEPS 2011 Progress Report (personal computing in 20kLOC)

The recent mention of the Nile programming language reminded me to post about the STEPS 2011 Progress Report.
The overall goal of STEPS is to make a working model of as much personal computing phenomena and user experience as possible in a very small number of lines of code (and using only our code). Our total lines of code‑count target for the entire system “from end‑user down to the metal” is 20,000, which—if we can do a lot within this limit—we think will be a very useful model and substantiate one part of our thesis: that systems which use millions to hundreds of millions of lines of code to do comparable things are much larger than they need to be.
...
Our general approach is to pick an area whose parts seem to be related—for example: 2.5D anti‑aliased alphaed computer graphics—try to find the mathematical relations that cover the desired phenomena, design a “problem oriented language” (POL) that is a “runnable math” version of the mathematics, implement that language, then use it to write the program code for the target area.
...
STEPS is now to the point where it is more fun to use and demonstrate than to talk and write about. Quite a bit of the original proposal is now working well enough to give all our presentations and write this entire report for the NSF and Viewpoints Research Institute websites. In the STEPS system, this document is live: the examples are actually runnable demos; the code shown is changeable and testable, etc.

A large part of the work over the last year has gone into making the test versions more real and usable.

Previous discussion of this project on LtU includes:


Nice

I've read the different progress reports over time, and this is a very nice project indeed. Hopefully some of the domain-specific languages and, more importantly, abstraction/interfaces that they have designed will find their way into the mainstream.

There is a lot of discussion around "code bloat" (see for example this summary of a talk at Linux Conference Australia 2012 comparing the size of userland binaries in old and recent Unixes), but it's not easy to know for sure what is really "bloat" that should be removed, and what is convenience that was added for a good reason, even if it's not used as often as the core functionality. Is cutting features at the edges a good price to pay for simpler code?

John Regehr has some interesting blog posts discussing STEPS:
- Can Simplicity Scale?
- It's All About Interfaces

I believe this is also related to Rich Hickey's Simple Made Easy, arguing for paying the research cost to get simple tools.

Regehr should read Sutherland's Wheel of Reincarnation paper

He has qualitatively missed the right criteria for interfaces. Take a look at Ivan Sutherland's wheel of reincarnation paper from 1968 about the design of graphics display hardware interfaces: The Design of Display Processors.

Saying UNIX is a durable systems interface is like saying a raster blaster is a durable graphics pipeline. Neither design is bad, of course, but certainly not durable and both will share the same fate. We have yet to reach the second inflection point, after the creation of Lisp, where the 'specific interface' becomes too costly to maintain as a special case of a more general language/system.

At any rate, compositionality is king.

Non-leaky interfaces require a lot of error-checking code to ensure that each side adheres to its part of the contract.

How is it that somebody can praise a project for its succinctness and not be inspired to ask whether their traditional assumptions about size can be overturned? What is an example of 'a lot'?

Let's keep this on-topic by tying it back to STEPS. For example, how can we easily test Nile against implementation defects? Well, image manipulation algorithms have invariant properties, and we can use these invariant properties to write generic unit tests that catch basic implementation mistakes. We can potentially detect that an image transformation function is implemented incorrectly by composing it with its inverse, varying the order of composition, and comparing the results of the two sequences. Or we can employ a completely different kind of tool altogether: perceptual image differencing. And we can also sleep soundly knowing that, across languages and within languages, lines of code tend to be more strongly correlated with defect count than any other factor. Food for thought: What would pdiff look like in STEPS?
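For concreteness, here is a minimal sketch (in Python, not Nile) of that invariant-testing idea: an invertible transform composed with its inverse should give back the original image, in either order of composition. The flip/rotate functions are hypothetical stand-ins for whatever transforms a real pipeline implements, and the tolerance parameter is where a perceptual-diff threshold would go for lossy transforms.

import numpy as np

def hflip(img):
    return img[:, ::-1]

def rot90(img):
    return np.rot90(img, 1)

def rot270(img):
    return np.rot90(img, 3)

def check_roundtrip(transform, inverse, img, tol=0):
    # Generic invariant: inverse(transform(x)) should equal x within tol.
    return np.max(np.abs(inverse(transform(img)) - img)) <= tol

if __name__ == "__main__":
    img = np.random.rand(64, 64, 3)
    assert check_roundtrip(hflip, hflip, img)    # a flip is its own inverse
    assert check_roundtrip(rot90, rot270, img)   # rotations cancel one way...
    assert check_roundtrip(rot270, rot90, img)   # ...and in the other order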

Outfix notation?

Interesting read so far. On page 10 they mention that they support infix, outfix, prefix and suffix notation in their Nile language. I'm not sure what outfix notation is, and Google isn't much help.

A guess

I'm assuming "outfix" means, basically, custom brackets (e.g., bananas, lenses). Glancing at the Nile example, this looks plausible.

Not plausible, real

The absolute value function in their Bezier example, typically represented in most languages as abs(real), is denoted using standard absolute value syntax seen in any math textbook: | real |.

outfix

Example:

In linear algebra, we want to represent the magnitude of a vector as |v|.

Languages such as Maude support all of the above and mixfix.

I thought mixfix generalized

I thought mixfix generalized all the *fix operators, since it consists of a token sequence with "holes". At least that's the impression I got from Parsing Mixfix Operators.

Yes, they say that (_) is a

Yes, they say that (_) is a mixfix operator that "binds tighter than everything else". This means that "a (_) b" is mixfix and () doesn't bear any relationship to the surrounding a and b. This is a consistent interpretation.

But why stop here? Can't we turn each character of a string into an infix operator? Just consider the character "a". "a" can be written as an infix operator _a_. For each character we can write down the particular associativity to the right and to the left. _(_ binds tight to the right and not to the left. _a_b_ is an example of a mixfix operator. _+_ needs a little more elaboration.

Hmm... at this point I've already forgotten the problem I wanted to solve.

Pratt parsing

Isn't that essentially how top-down operator precedence parsing works?
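Here is a toy sketch of that connection (my own invented grammar, nothing from Nile or the mixfix paper): a Pratt parser gives each token a binding power and distinguishes how it behaves at the start of an expression from how it behaves after a left operand, which is enough to cover prefix, infix, postfix and outfix forms. It evaluates directly instead of building a tree, to keep the sketch short.

import math
import re

def tokenize(src):
    return re.findall(r"\d+|[-+*!|()]", src)

def parse(tokens, rbp=0):
    tok = tokens.pop(0)
    # "null denotation": how a token behaves at the start of a (sub)expression
    if tok.isdigit():
        left = int(tok)
    elif tok == "-":                       # prefix negation
        left = -parse(tokens, 100)
    elif tok == "(":                       # ordinary grouping
        left = parse(tokens, 0)
        tokens.pop(0)                      # drop ")"
    elif tok == "|":                       # outfix absolute value
        left = abs(parse(tokens, 0))
        tokens.pop(0)                      # drop closing "|"
    else:
        raise SyntaxError(tok)
    # "left denotation": infix/postfix operators, driven by binding power
    bp = {"+": 10, "-": 10, "*": 20, "!": 30}
    while tokens and bp.get(tokens[0], 0) > rbp:
        op = tokens.pop(0)
        if op == "+":
            left = left + parse(tokens, 10)
        elif op == "-":
            left = left - parse(tokens, 10)
        elif op == "*":
            left = left * parse(tokens, 20)
        elif op == "!":                    # postfix factorial
            left = math.factorial(left)
    return left

assert parse(tokenize("|2-5|*3+1")) == 10
assert parse(tokenize("2+3!")) == 8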

Yes

Mixfix notation is an arbitrary sequence of tokens and "holes", as you call them. As for precedence and associativity, those are orthogonal concepts, but generally mixfix operators have higher precedence due to the need for ambiguity resolution.

Actually the precedence is

Actually the precedence is low, as the "X ? Y ! Z" example demonstrates. The degenerate "(_)" mixfix with highest precedence only serves the authors' systematics.

Let me rephrase

Precedence is generally closely linked to hole positions, since operators with more than one hole can have holes at the beginning and/or the end. In this way, when thinking of infix, outfix, prefix and postfix operators as forms of mixfix, we can automatically assign a default precedence based on the hole positions of the operator. We know a postfix operator ends in a token, so there is no hole position on the outside, and so on for the other operators.

This phrasing removes my own opinion, experiences and biases.
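To illustrate the classification mechanically (the patterns and numbers below are invented for the example, not taken from Nile or the mixfix paper): treat each operator as a pattern of tokens and holes, read the fixity off whether the pattern starts and/or ends with a hole, and hand out a default precedence accordingly.

def fixity(pattern):
    # A pattern is a whitespace-separated mix of tokens and holes ("_").
    parts = pattern.split()
    starts_with_hole = parts[0] == "_"
    ends_with_hole = parts[-1] == "_"
    if starts_with_hole and ends_with_hole:
        return "infix"       # e.g. "_ + _"
    if starts_with_hole:
        return "postfix"     # e.g. "_ !"
    if ends_with_hole:
        return "prefix"      # e.g. "- _" (or "if _ then _ else _")
    return "outfix"          # e.g. "| _ |"

# Operators closed off by tokens on the outside can default to binding tighter.
DEFAULT_PRECEDENCE = {"outfix": 40, "postfix": 30, "prefix": 20, "infix": 10}

for p in ["_ + _", "_ !", "- _", "| _ |", "if _ then _ else _"]:
    print(p, "->", fixity(p), DEFAULT_PRECEDENCE[fixity(p)])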

Mixfix precedence

In my experiences with mixfix, precedence is whatever integer you annotate it to be. ;)

wow

seems too quiet here for such a remarkable project, and such a remarkable status report. are you-all too busy shouting from the rooftops about it to post? or do you not find this to be a wonderful example of Hamming's You and Your Research? (since i've been really hating how much damned CODE there is in life for a while. :-)

To be honest, there isn't

To be honest, there isn't anything technical to talk about here that I can tell. Great writing with lots of dogma, but what should we get out of it? And there is just so much there, no focus, we have to discuss their entire philosophy...too much work!

They should break their thoughts up into smaller more focused publications/blog posts/etc...then they would be easier to digest.

There are some individual

There are some individual papers here: http://vpri.org/html/writings.php. The work they are doing is very interesting and perhaps the most important work currently being done, but their publications about it are a bit sparse.

I was both unsurprised and disappointed...

...by their observation it turned out they "only" needed good design to hit their target. I reproduce it below:

The STEPS Project was originally funded by NSF from a pool allocated to investigations in “The Science of Design”. The best results so far in STEPS have been design intensive—the parts that turned out to be the most understandable wound up using the kinds of design processes associated with mathematics, and the successful parts of the rest have much in common with structural engineering design.

The most surprising reflection at this point is just how much ground was covered with the latter. The large amount of covered ground was good—in that there is a lot that needed to be done compactly, especially in the “productivity applications” area of the project—but was also a bit underwhelming because just doing a lot of work on “parts and wholes”, partitioning modules and module boundaries, and other classic well known and recommended practices, resulted in a small high functioning system.

One way to think about design in this context is that a lot can be done before “something algebraic” can be pulled out of the needed organizations. And it is tempting to stop there—because one part of the engineering criteria has been achieved—“it works pretty darn well!” But the lack of deeper mathematical integrity will show up when the scales and ranges are increased, and it will not be clear where or how to tack on the new code.

This is a damning indictment of much modern software, but that's not the most interesting part.

I also appreciate their desire for understanding the mathematical structure. I think there's a lot of beautiful mathematics hidden in the design intuitions of good programmers, and uncovering it has pretty much set my own research agenda for the last few years. (Just thinking about subject-observer has been enough to drive my last five papers...!)

At least looking at the Nile

At least looking at the Nile portion, I wasn't really able to learn much I didn't already know.

I'm with Meijer on the basic non-lesson here. The PL community is great at making little languages -- Nile seems well in the tradition of many others -- but terrible at making big ones. For example, STEPS claims awesome parallelism speedups in the graphics layer. However, my suspicion is that this is because the baseline is slow and the language is simple: a lot of effort, expertise, and edge-case handling goes into optimizing real graphics layers, so getting a 'real' speedup against them is very hard. I'm reminded of this essay on failing to seriously apply Forth.

While I agree that creativity may come out of extreme restrictions, and some notions of fundamental abstraction, I've been repeatedly left with the feeling that this project is similar to ones along the lines of "let's implement most of an OS in Haskell."

The fact remains that they

The fact remains that they apparently have a desktop OS including internet access, compilers, and word processor / presentation / spreadsheet software running in less than 20k lines of code. Even if say the graphics layer is 10x slower than a hand optimized one, that's still amazing given that it's 5000x less code. Even if the languages are trivial (which I don't think they are; OMeta & Nile are interesting in their own right), this is still great work. Of course they didn't invent a magic silver bullet that reduces the size of arbitrary software by a factor of 1000x guaranteed without effort.

BTW I'm not convinced that performance comes from raw human optimization effort. Look at the numerical computing space. The fastest discrete Fourier transforms and matrix multiplies are not directly written by humans. Rather, a human writes a program that explores a space of alternatives and then outputs the most efficient one. A fast DFT can be 1 megabyte of intricate C code, which humans are typically not able to write. A well optimized matrix multiply can be over 100x faster than the traditional triple for loop because it is custom-generated for a specific processor (so knowledge about the number of registers, the cache line width, the cache sizes, the available vectorized floating point instructions, and the number of cores can be exploited). This seems like a domain where a STEPS-like approach would work as well, with a DSL that lets you express a space of programs that the compiler explores to select the best one, instead of a special purpose custom generator for one particular problem.
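As a toy illustration of that generate-and-search approach (my own sketch, not any particular library's internals): define a space of candidate implementations, here blocked matrix multiplies with different block sizes, time each one on the target machine, and keep the fastest. Real generators like ATLAS, FFTW or SPIRAL search vastly richer spaces; this only shows the shape of the idea.

import time
import numpy as np

def blocked_matmul(a, b, block):
    n = a.shape[0]   # assumes square matrices with n divisible by block
    c = np.zeros((n, n))
    for i in range(0, n, block):
        for j in range(0, n, block):
            for k in range(0, n, block):
                c[i:i+block, j:j+block] += a[i:i+block, k:k+block] @ b[k:k+block, j:j+block]
    return c

def autotune(n=256, candidates=(16, 32, 64, 128)):
    # Time each candidate block size on this machine and keep the fastest.
    a, b = np.random.rand(n, n), np.random.rand(n, n)
    timings = {}
    for block in candidates:
        start = time.perf_counter()
        blocked_matmul(a, b, block)
        timings[block] = time.perf_counter() - start
    best = min(timings, key=timings.get)
    return best, timings

if __name__ == "__main__":
    best, timings = autotune()
    print("fastest block size on this machine:", best)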

Even if say the graphics

Even if say the graphics layer is 10x slower than a hand optimized one, that's still amazing given that it's 5000x less code

It may be unique that they're all together, but the individual tiny DSLs, as far as I can tell, are not unique.

BTW I'm not convinced that performance comes from raw human optimization effort. Look at the numerical computing space.

I agree, but you should be more critical here. To be competitive in performance, these systems (e.g., SPIRAL) encapsulate domain expertise. Likewise, beyond generating big binaries, the compilers are big themselves: each generally takes at least one thesis. I think this will go down in time, but it will take awhile (e.g., I wish I had time to explore SMT-based synthesis of polyhedral codes).

As a concrete counterpoint here to Nile is my thesis. I've been building optimizing synthesizers that generate parallel layout engines for languages like CSS, and more recently, big data visualizations. We support essentially a more modern form of attribute grammars. We already knew that attribute grammars can be used for simple layouts -- the little language problem, in terms of what Nile wants to support, is old news.

To be an actual choice for, say, a next-generation web browser, we need to balance performance (statically ordered grammars + enabled optimizations) with flexibility (higher-order attribute grammars would otherwise be the full story). For performance, optimizing a dynamic constraint system (e.g., parallel computation on a dynamic dependency graph) is doable, but uninteresting: a good sequential implementation should beat it. The question is how to generate good sequential code and parallelize that: using the best techniques we knew, including inventing several, it took us a 20,000 LOC code generator (without worrying about LOC). This is not including our extensive use of libraries. It also involves making some restrictions on what can be expressed: much of the research challenge is finding a good trade-off.

As in STEPS, our work is also grounded in the ideas of reboots and DSLs. However, we're taking care to address the lessons and needs of domain specialists. Simplifying their issues away -- the Forth and STEPS approach -- is a non-solution and something we already have a handle on.

I am not saying this is a bad / poor project. There are important lessons to be learned and contributions to be made. I suspect, for example, it can help just as the Jikes JVM did: for manipulating the 'full' computing stack, Linux is a little too unwieldy and archaic for many experiments while TinyOS is too tiny. There are other benefits as well, but we should be more mindful :)

It may be unique that

It may be unique that they're all together, but the individual tiny DSLs, as far as I can tell, are not unique.

What do you mean by unique? For them not to be unique, you'd have to have a pretty high standard of uniqueness, which would make almost nothing unique relative to its predecessors. Is Haskell unique? No, it's basically ML. Is ML unique? No, it's basically ISWIM. Was that unique? No, it's basically Lisp x Algol. Etc. Take for example OMeta. It has several unique features relative to its predecessors like Yacc: it uses packrat parsing, it can match arbitrary streams and structured data, parsers can be built up in a modular way by inheriting from other parsers, and it can be implemented in itself in about 100 lines. Nile is a bit similar to APL, but the syntax is different, it's statically typed and automatically parallelized, and the major achievement is writing a complete, practically usable graphics library in it in a couple of hundred lines.

I agree, but you should be more critical here. To be competitive in performance, these systems (e.g., SPIRAL) encapsulate domain expertise. Likewise, beyond generating big binaries, the compilers are big themselves: each generally takes at least one thesis. I think this will go down in time, but it will take awhile (e.g., I wish I had time to explore SMT-based synthesis of polyhedral codes).

Certainly. But the domain expertise can be written down very concisely as a program space. For example for matrix multiply you have basically two templates: block-wise matrix multiplication (either sequential over the blocks or in parallel) and unrolled tiny matrix multiplication. A good matrix multiply will use multiple levels of block-wise multiplication to optimize for parallelization (divide the matrix up into the number of cores), the different caches (a block should fit in the respective cache) and the inner vectorized matrix multiply. But what the optimal decomposition is can be found automatically. The same goes for DFT: the domain knowledge can be written down in half a page. I'm sure that a naive approach that optimizes for LOC won't win any performance contests, but it should be able to get pretty close.
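To make the DFT half of that concrete: the radix-2 Cooley-Tukey recursion below is essentially the whole half-page of domain knowledge; everything beyond it in a tuned library is machine-specific search and code generation rather than extra mathematics. (A naive sketch of my own; input length must be a power of two, and it is an illustration, not a usable FFT.)

import cmath

def fft(xs):
    # Radix-2 Cooley-Tukey: split into even/odd halves, recurse, recombine.
    n = len(xs)
    if n == 1:
        return list(xs)
    evens = fft(xs[0::2])
    odds = fft(xs[1::2])
    twiddled = [cmath.exp(-2j * cmath.pi * k / n) * odds[k] for k in range(n // 2)]
    return ([evens[k] + twiddled[k] for k in range(n // 2)] +
            [evens[k] - twiddled[k] for k in range(n // 2)])

# fft([1, 2, 3, 4, 0, 0, 0, 0]) agrees with numpy.fft.fft on the same input
# (up to floating-point rounding).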

As a concrete counterpoint here to Nile is my thesis. [...] the little language problem, in terms of what Nile wants to support, is old news.

Note that Nile is addressing a different, much lower-level problem: doing the graphics rasterization, not the graphics layout. Given a vector graphics representation, for example a circle or a shape consisting of Bezier curves (e.g. a glyph from a font), it draws the pixels.
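To spell out the distinction for readers who haven't met it: rasterization means turning a vector description into pixels. The toy sketch below (plain point sampling of one cubic Bezier curve, nothing like Nile's anti-aliased scanline pipeline) only shows what the problem is, not how Nile solves it.

import numpy as np

def cubic_bezier(p0, p1, p2, p3, t):
    # Standard cubic Bezier evaluation at parameter t in [0, 1].
    u = 1.0 - t
    return u**3 * p0 + 3 * u**2 * t * p1 + 3 * u * t**2 * p2 + t**3 * p3

def rasterize(p0, p1, p2, p3, width=40, height=20, samples=400):
    # Naive rasterizer: sample the curve densely and mark the pixels it hits.
    grid = np.zeros((height, width), dtype=bool)
    for t in np.linspace(0.0, 1.0, samples):
        x, y = cubic_bezier(p0, p1, p2, p3, t)
        if 0 <= int(x) < width and 0 <= int(y) < height:
            grid[int(y), int(x)] = True
    return grid

if __name__ == "__main__":
    pts = [np.array(p, dtype=float) for p in [(2, 18), (10, -5), (30, 25), (38, 2)]]
    for row in rasterize(*pts):
        print("".join("#" if cell else "." for cell in row))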

For performance, optimizing a dynamic constraint system (e.g., parallel computation on a dynamic dependency graph) is doable, but uninteresting: a good sequential implementation should beat it. The question is how to generate good sequential code and parallelize that: using the best techniques we knew, including inventing several, it took us a 20,000 LOC code generator (without worrying about LOC). This is not including our extensive use of libraries. It also involves making some restrictions on what can be expressed: much of the research challenge is finding a good trade-off.

Surely that code size would make you find the STEPS results so far all the more amazing ;) I see what you're saying, but they don't claim to win any performance contests: they merely show that they can build something that's good enough with very low LOC. Their point about parallelism was that instead of spending a million LOC on optimizing the heck out of the sequential version like Cairo, a better strategy to improve performance that does not defeat their LOC goals is to parallelize it and wait until Intel delivers more cores:

The multithreaded runtime is used by our 2D vector graphics renderer to support other STEPS subprojects while maintaining good performance. To gauge performance potential farther into the future, we have also run one of our rendering benchmarks on a 40-core machine, resulting in a nearly 30 times speedup (see figure 2). We believe that this demonstrates that implicit parallelism provides an avenue for high performance that doesn't sacrifice the minimalism and clarity that are important goals of the STEPS project.

Simplifying their issues away -- the Forth and STEPS approach -- is a non-solution and something we already have a handle on.

Certainly STEPS simplifies a lot of stuff away, most importantly backwards compatibility with existing formats (that's the point). However, comparing STEPS with Forth does not do justice to STEPS. You can't do much in Forth effectively, for example where is the graphical OS in Forth? The email client? The word processor? Compare this with STEPS: they were able to write the STEPS document in their OS on their word processor where text and images are rendered with their graphics pipeline compiled with their compilers and exported to PDF with their tools.

The most important lesson of STEPS, if they succeed, is exactly what they say it is: 99.9% of code is unnecessary to get an experience that's roughly equivalent to current personal computing, if you are willing to start from scratch. I don't think that the fact that this code reduction is achievable and how to "get a handle on it" is as obvious to everybody as to you; it certainly would surprise me, certainly before I heard about STEPS. 10x perhaps, 100x unlikely, but 1000x?

BTW, what do you mean by "SMT-based synthesis of polyhedral codes"?

What do you mean by

What do you mean by unique?

Tiny DSLs. The particular case of Nile, targeting UIs, is very well studied.

Certainly. But the domain expertise can be written down very concisely as a program space... the domain knowledge can be written down in half a page.

You're trivializing the optimizations performed by HPC specialists and over-estimating the abilities of modern compiler technology. The specialists *do* use code generation, and the synthesis/compilers community has long been trying to do what you're assuming to be solved. I agree with the vision; we're just not there.

Note that Nile is ... doing the graphics rasterization, not the graphics layout. Given a vector graphics representation

They're doing DSLs for both. Rasterization is also a well-trodden path (I didn't find it surprising enough to even comment on). E.g., Flash's API is/was pretty much built on Bezier curves and a few other primitives, and a similar intuition guides the hardware itself. I should back off a bit here in that writing small code was not an explicit goal there, but reducing to optimized kernels was.

a better strategy to improve performance that does not defeat their LOC goals is to parallelize it and wait until Intel delivers more cores

The point of parallelism is performance per Watt, so I have no idea if their results are interesting. It is easy to inflate a workload and then parallelize it. Finally, if you're talking rasterization, multicore is a misguided target (you should aim for SIMT/SIMD or better, like hw/sw-codesign).

an experience that's roughly equivalent to current personal computing

This equivalence is where I disagree. This perspective trivializes what domain users actually require. It is similar to claims about verified and secure operating systems. I'm fine with throwing away legacy specifications, but we should acknowledge that they're also glossing over basic domain abstractions/requirements.

BTW, what do you mean by "SMT-based synthesis of polyhedral codes"?

This is a little beyond LtU. Many (regular) loop optimizations can be phrased in terms of the polyhedral model, but actually building them on top of it is hard: specifying them, composing them, instantiating them, code generating them, etc. Synthesis is going through a revolution right now thanks to SAT/SMT solvers, so I think we're in a good time to rethink how to design and build such tools. In this case, we might do (and improve upon) polyhedral synthesizers by phrasing them as part of a SMT synthesizer framework. My interest came up because I've hit diminishing returns fairly early in building optimizing synthesizers: we need to tame their complexity.

Edit: If I am coming off as overly critical, my intent was to drill down to what is actually surprising and cool in the project (like an actual full system). Their work is good enough to withstand that sort of discussion :)

Tiny DSLs. The particular

Tiny DSLs. The particular case of Nile, targeting UIs, is very well studied.

Tiny DSLs are certainly not unique, I don't think anybody would claim such a thing. Do you have any references to something that has the same characteristics as Nile? I feel like your reasoning goes like this: "DSLs have been applied to UIs, so Nile is not unique". That's the same as saying "There exist general purpose programming languages, so Haskell/Scala/etc. are not unique, since they are general purpose programming languages". That doesn't say anything about the merits of the research, and neither does it say anything about the uniqueness of the actual language. It only says "there exist languages for domain X".

I agree with the vision; we're just not there.

Certainly, but the existing research is slowly approaching a more general purpose solution. That's why it's interesting.

They're doing DSLs for both. Rasterization is also a well-trodden path (I didn't find it surprising enough to even comment on).

Nile is only targeted at rasterization. They do have a DSL called Ants for layouts. They express all the layouts they need for a WYSIWYG word processor in 35 lines.

Aren't you setting extremely high goals here? Expressing something that usually takes a million lines in a couple of hundred lines is not enough, it also has to solve a fundamentally new problem (i.e. not rasterization)? Compare this with your own work on CSS layout: that is also a well-trodden path. If we apply the same high standard, that would certainly not be worth commenting on: an x% speedup is less notable than a 1000x reduction in code size. The problem of computing layout fast enough is also far easier and less interesting than drawing anti-aliased vector graphics.

This equivalence is where I disagree. This perspective trivializes what domain users actually require.

Can you give examples? STEPS is definitely not complete yet, but what they have been able to show so far already comprises a good chunk of personal computing: the OS, an office suite, email, the web, and programming tools.

In this case, we might do (and improve upon) polyhedral synthesizers by phrasing them as part of a SMT synthesizer framework.

I don't immediately see how polyhedral loop representation is related to SMT solvers or how it can use SMT solvers, but it sounds interesting.

topsy-turvy

Jules, I'm feeling trolled at this point. Naively rasterizing bezier curves is hard and so is building primitives on top of them? They built a real web browser and a rich layout language? They built an end-to-end system, but you're trivializing what actual applications -- and languages -- do.

As I've demonstrated, I and a lot of literature struggle with balancing expression with performance. Ignoring performance ignores essential forces on the application code. Their work is still interesting, but it contains non-trivial trade-offs.

I don't immediately see how polyhedral loop representation is related to SMT solvers or how it can use SMT solvers, but it sounds interesting.

Eh, that is research-level. Not really appropriate for here. Feel free to email me though!

As I've demonstrated, I and

As I've demonstrated, I and a lot of literature struggle with balancing expression with performance. Ignoring performance ignores essential forces on the application code.

Just as another thought: I've seen lots of papers on optimizing axes like performance or memory use; squeezing even a small 5% improvement seems worthy of publishing.

I haven't come across many publications where LOC has been a metric of interest, though, except in off-the-cuff comments or informal discussions of language expressiveness. STEPS seems novel and interesting purely for optimizing that metric alone. It seems like a good project that bridges academia and industry, by showing the latter what can be achieved by taking academic work seriously, i.e. building good domain tools, highly reusable abstractions, DSLs, etc.

The other novelty of STEPS from my view is rather too broad a scope for single paper publications: gluing so many disparate DSLs into a single, coherent system. I haven't come across many publications that crosscut so many different domains to show how such systems can be cohesively built.

The misuse of metrics is

The misuse of metrics is very scary; especially a single one. APL optimizes LOC. That LOC is somehow a reasonable proxy for complexity is also very controversial.

That's what I thought at

That's what I thought at first too, but if you look at their code you'll see that it's perfectly readable. They are even quite liberal with line breaks. It's nothing like the typical APL. There is also a point of code reduction where it can't possibly come from compressing the syntax alone. 1000x is well past that point.

That LOC is somehow a

That LOC is somehow a reasonable proxy for complexity is also very controversial.

I agree that it might not be the ideal metric, but it still seems like the best metric we have. There was a paper recently published that reviewed empirical data in past publications on supposed correlations with other code metrics, like cyclomatic complexity, and found that LOC was still a better predictor of complexity (or it might have been a better predictor of bugs -- can't find the study at the moment...).

I agree that readability must also be a factor of some sort. Perhaps some source analysis on how close the code is to noise might be a decent way of determining readability. There's probably some objective number beyond which programs become unreadable to humans.

As an exercise in critical

As an exercise in critical design, the project is still interesting (given this LOC constraint, let's see what we can do). All of our complexity metrics kind of suck. Don't get me wrong, they don't appear to be gaming their metric, their code still looks elegant enough.

I've found higher-order functions and heavy recursion to be problematic to understanding; basically humans get lost as they are forced to simulate the computer to understand code. This is why for loops are still more usable than recursive analogues, or why LINQ's syntax for candy-coating certain higher-order functions is appreciated. I have no way of evaluating this empirically, though I suspect it's somehow related to Hick's law on menu depth (how much do you have to remember about where you are). Doing more empirical psychology in this space would be great for designing languages and tools that support them.

Citation needed

Citation needed

On Hick's law? Really?

On Hick's law? Really?

No, for the other points

No, for the other points made in the post.

Reread my post. I "found"

Reread my post.

I "found" generally means anecdotal. I thought that this is related to Hick's law, or some other cognitive load theory, but I haven't done any experiments to validate that hunch. I made no claims since I have no proof of anything. It would be quite cool to hook programmers up to an MRI and evaluate how their brain fires up when debugging a heavily nested recursive function.

Humans getting lost?

I've found higher-order functions and heavy recursion to be problematic to understanding; basically humans get lost as they are forced to simulate the computer to understand code.

Eh? Humans have been at it since, at least, the end of the 19th century, long before the modern computer. Frege allowed the substitution of functions for variables in 1879 and ten years later Peano was defining arithmetic recursively. No need to think like a computer back then.

Funnily enough, Frege had problems with higher-order functions (Russell's paradox), and Peano didn't get his recursive definitions quite right (he didn't define inference rules, only the postulates).

In software design, its a

In software design, it's a good idea to limit indirection, call depth, and abstraction layers; it becomes easier to understand what the program is actually doing. It's no fun breaking into a 100-deep call stack of recursive function calls and indirected calls through higher-order functions!

But if we are talking about theory, recursion and HOFs are great and useful. But please, don't ask me to debug code based on recursively defined arithmetic!

One paper does not make the argument convincing

There is a lot of evidence that lines of code are a very good metric of complexity.

Please see my comments above, where I mention that defect rates tend to be relative to lines of code. It is also fairly well known that to achieve higher-than-average quality, e.g. Six Sigma quality, one must use not simply a good language but a good process and methodology. Ultimately most problems will be accidental, arising from communication.

However, as for "recent studies", well, I don't consider them that new but nevertheless you can quote the following studies:

Graves, T. L., Karr, A. F., Marron, J., and Siy, H. (2000). Predicting fault incidence using software change history. IEEE Transactions on Software Engineering, 26(7):653–661.

Evidence that changes to existing code are a better predictor than lines of code, and that so-called complexity metrics were no better than lines of code at assessing complexity. The shift here is in how we think about complexity. If something has a known defect that requires fixing, who cares how complex it is, we care instead how hard it is to fix without causing regressions. Therefore, they analyze deltas rather than sizes. However, they were arguably unsuccessful in finding a model to associate with their ideas on deltas. For example, how do you weigh deltas over time?

Herraiz, I., Gonzalez-Barahona, J. M., and Robles, G. (2007). Towards a theoretical model for software growth. In International Workshop on Mining Software Repositories, pages 21–30, Minneapolis, MN, USA. IEEE Computer Society.

Evidence that the famous McCabe Cyclomatic Complexity metric, calculated to determine a project's overall complexity, is strongly correlated with a double Pareto distribution, which matches up with Mitzenmacher's observations on the size of files in a filesystem and his Recursive Forest File Model for how and why files and filesystems grow. So the thing to note here is that Mitzenmacher's observations had nothing to do with programming; it was a discovery in another field where there is no Cyclomatic Complexity metric for PDFs or TXT files or whatever, because they're just blobs of 1s and 0s to him. The paper is also very good in that it highlights major inconsistencies in how researchers measure complexity.

Hassan, A. E. and Holt, R. C., "Predicting Change Propagation in Software Systems," In Proceedings of International Conference on Software Maintenance (ICSM 2004), Chicago, Illinois, USA, Sept. 11-17, 2004.

This paper has a number of fascinating observations, such as:

If two files change together, the likelihood that they will change together again in the future is high. (For me, what this could suggest is that although connectedness of modules does not indicate a rate at which software decays or becomes brittle, it does indicate that if you can pinpoint clusters of files that frequently change together, that coupling is an issue simply from a time-to-implement-changes perspective.) This relation holds even at more granular levels of code layout: if you have a class and make an edit in two different places, that is a predictor that the next time you edit either of these places you will need to update the other as well.

Also, Jim Whitehead of WebDAV fame has had several Ph.D students investigate the nature of bugs in order to better predict their occurrence. His paper "Toward an understanding of bug fix patterns" argues that if we look at the problem from the perspective of bug fixes and try to mine patterns in the bug fixes, such as whether the root cause was a bad conditional expression or assignment, etc., then we can observe that root causes tend to be very common within a developer and among developers. This indicates that regardless of the type of software you are writing or what language you write it in, you will still make the same dumb mistakes, just encode them differently!

J. Rosenberg, "Some Misconceptions About Lines of Code," p. 137, Fourth International Software Metrics Symposium (METRICS'97), 1997.

Source lines of code (SLOC) is perhaps the oldest of software metrics, and still a benchmark for evaluating new ones. Despite the extensive experience with the SLOC metric, there are still a number of misconceptions about it. This paper addresses three of them: (1) that the format of SLOC is relevant to how to properly count it (a simple experiment shows that, in fact, it does not matter), (2) that SLOC is most useful as a predictor of software quality (in fact, it is most useful as a covariate of other predictors), and (3) that there is an important inverse relationship between defect density and code size (in fact, this is an arithmetic artifact of plotting bugs-per-SLOC against SLOC).

For point (1) addressed by the above 1997 paper, the author essentially validates Mitzenmacher's filesystem observations made in 2003. More recently, Les Hatton has proposed a unifying principle for understanding these observations, using Shannon information theory. He models symbols in source code as representing choices the programmer made, meaning we don't have to care about the meaning of a symbol in a language, only that the programmer chose to use it. The only factors that matter are fixed size and fixed choice. Note also that in smaller systems built using languages with a large number of pre-defined symbols, the language's keyword list is an overhead and will represent fewer choices made by the programmer within the code -- instead you are measuring choices the language designer made!

In other words, searching for general principles here, the only complexity and growth metrics that matter are the file size distribution and the symbol distribution within and among files. Tying this back to STEPS and its approach of using problem-oriented languages to solve problems (creating new fixed symbols and then re-using them in interesting ways), we can predict, using Hatton's formula as our covariate, that if STEPS can really build OSes in 20,000 lines of code, the "complexity" will be mitigated to the language itself rather than the choices made by the programmer! That would be really cool to show/prove empirically :D
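A crude way to probe that prediction (my own rough illustration, not Hatton's actual model): tokenize each source file, treat every token occurrence as a choice, and measure the Shannon information of those choices; in a POL-built system, most of the distinct symbols should then live in the language layer rather than in the application code.

import math
import re
from collections import Counter

def token_entropy(source):
    # Crude tokenizer: identifiers, numbers, and single punctuation characters.
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", source)
    counts = Counter(tokens)
    total = sum(counts.values())
    bits_per_token = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return bits_per_token, bits_per_token * total   # average and total information

if __name__ == "__main__":
    with open(__file__) as f:
        per_token, total_bits = token_entropy(f.read())
    print(f"{per_token:.2f} bits per token, {total_bits:.0f} bits total")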

Cheers.

sounds valid to me

@ the "complexity" will be mitigated to the language itself rather than the choices made by the programmer!

rings intuitively true to me. i think of 2 dimensions that are required to make it turn out well: (1) the dsl/library must have a narrow enough domain that it doesn't end up bloated itself; (2) the people implementing the dsl/library must be good at doing that kind of thing. should either of those fail to be true then all you end up with is another layer of abstraction-leaky crap.

Sorry, I did not mean to

Sorry, I did not mean to trivialize your research, merely to demonstrate that if you apply the same kind of reasoning as you're applying to STEPS, that's not hard to do. Rasterizing vector graphics with acceptable performance is harder than doing layout with acceptable performance, yes, especially if you are rendering a million pixels with thousands of letters on the screen each of which is a vector graphics shape that needs to be redrawn in every frame of an animation. Of course it is a trade-off, that's the point. They're saying: by giving up X% performance and Y features that you don't really need to get the same kind of experience, you can get a 1000x reduction in code size. They are not ignoring performance, they are simply happy with acceptable performance in 400 lines, instead of max performance in 1,000,000 lines. They are exploring a radically different balance of trade-offs, but still one that leads to a usable personal computing experience.

See this for what it is

After Alan Kay's 1997 OOPSLA talk The Computer Revolution Hasn't Happened Yet, Alan conveyed regret that many did not seem to understand the points he was making. He then challenged the Squeak mailing list to think about OOP differently and build something better than Squeak. Ten years later, nobody had stepped up to the challenge, so Alan's research group at VPRI tackled the challenge he laid out:

I would suggest that more progress could be made if the smart and talented Squeak list would think more about what the next step in metaprogramming should be - how can we get great power, parsimony, AND security of meaning?

What you are missing here with your polyhedral-whatevers discussed below is that Squeak was never a fast system, just a good enough system. Bitblt has acknowledged limitations, as does the Morphic approach to graphics. Even with those limitations Squeak could still be thought of as bloated. So all the things you are worrying about are non-issues, and even when they are non-issues we still have problems building systems small today.

Answer why, rather than talking about polyhedral models.

Still the "external world" issue

> I was both unsurprised and disappointed by their observation it turned out they "only" needed good design to hit their target.

The thing is, their target is a bit low. They made a word processor in a very small number of lines, great, but call me back when their word processor can read and write ODF docs, or they have a web browser which passes the Acid3 test, etc.

The main point of the STEPS

The main point of the STEPS project is that our existing infrastructure is hopelessly bloated. They've essentially built the next iteration of Smalltalk, which has always been a 'live' system with binary serialisation. ODF, HTML, CSS, etc. are simply cumbersome, lossy serialisation formats, which the "image" has always made obsolete.

Essentially, such formats are meant to allow the exchange of data by different applications. The STEPS approach is OO: there is no data, there are no applications. Documents are self-contained virtual machines, just like everything else. The issues to address are standardising the platform so that objects are portable between implementations (eg. like existing Smalltalks) and sandboxing, so that untrusted objects (which we might call "emails" or "documents" or "games" or whatever) can't break the system.

They're also pushing in the other direction, where there isn't a "local" system which sandboxes arbitrary "remote" objects; the whole system itself is meant to be inherently distributed, with each physical machine containing enough bootstrap objects to hook into the system. This is essentially turning the network itself into the platform, as opposed to client/server, where the network is essentially unimportant (everything is server-side or client-side; nothing is, conceptually, on the network).

Also, I think it's worth

Also, I think it's worth pointing out that STEPS aims to be a simplified model of personal computing; an object for study, since existing personal computing systems are insurmountably large. It's not actually meant to be a production system (although I certainly wouldn't mind running it!)

Also, many of the people involved in STEPS have been around long enough to appreciate the fads that computing is prone to. Existing standards are snapshots of current average practice. Rather than reinvent these wheels, why not (as Alan Kay says) invent the future? :) It's common knowledge that he's less than impressed by current efforts: http://stackoverflow.com/questions/432922/significant-new-inventions-in-computing-since-1980