Impact of static type systems on productivity of actual programmers: first experiment I've seen documented.

It is disappointing that no particular results positive or negative were observed. But it is gratifying, and long overdue, that someone finally thought the question was important enough to perform an actual experiment to answer it.

http://courses.cs.washington.edu/courses/cse590n/10au/hanenberg-oopsla2010.pdf


Hmm

someone finally thought the question was important enough to perform an actual experiment to answer it.

You realize that they implemented a new OO language with static and dynamic variants to control for many of the variables? I'm not sure 'gee, why did no one bother to do this before?' is a fairly phrased question.

Yes, it is a fair question.

In light of the fact that static and dynamic languages have both been sold (with zero evidence) as tools for making programmers more productive, yes, I would say it is a fair question.

When claims are made, there is a burden of evidence to be met.

No evidence

There is likewise no evidence that computers on desktops (as opposed to mainframes in banks and brokerages) have improved worker productivity, but which of us would choose to do without them?

which of us would choose to

which of us would choose to do without them?

At a guess, a lot fewer than favor dynamic typing.

static vs. dynamic typing are increasingly fuzzy categories.

I'll admit to being an old Lisp weenie who prefers languages with the type erasure property, of the sort usually referred to as 'dynamically typed.' I dislike systems intended to make certain kinds of semantics inexpressible, but at the same time, I consider it worthwhile to be doing extensive analysis on programs and doing proofs to detect misalignment between intent and semantics. I guess I sort of have feet in both camps - and a few others, in fact. Why limit oneself to just two feet, eh?

Anyway, it seems more the rule than the exception these days for runtimes to carry and use some type information at runtime. In fact, even so-called static languages that have no well-defined semantics until their type equations are solved, such as Java, often still need to store and refer to some type information at runtime.

It isn't fair to say such languages are 'dynamically typed', but at the same time they don't fit the classic definition of static typing where type information is all boiled out during compilation either.

Likewise, when you create a program using the Stalin dialect of Scheme, the type information *is* all boiled out during compilation, but it's still a language with the type erasure property, and the type system doesn't really express semantic intent as program semantics are clear even before the type equations are solved. Stalin's static types are mainly about binary representation, not semantics as such, and it seems unfair to call the language statically typed.

So I've been thinking maybe we should be describing type systems in terms of the type erasure property (intentional vs. descriptive types), runtime polymorphism, and other semantic categories, instead of describing them in terms of presumptions about how types are implemented in compilation phases.
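The erasure criterion can be made concrete with a small sketch (TypeScript, with hypothetical function names): under a type-erasure semantics, annotations are checkable descriptions, and stripping them cannot change what the program computes.

```typescript
// A minimal sketch: the same function with and without annotations.
// Under type erasure the annotations describe the program; removing
// them leaves the semantics intact.
function areaTyped(w: number, h: number): number {
  return w * h;
}

// The "erased" variant: same text with annotations removed
// (here `any` stands in for the absence of a static type).
function areaErased(w: any, h: any): any {
  return w * h;
}

console.log(areaTyped(3, 4) === areaErased(3, 4)); // true: same semantics
```

A language without the erasure property would be one where deleting the annotations changes (or undefines) what the program means, not merely what the checker can verify.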

This may be more than tangential to the current study. If they have 'controlled' for the semantic issues I'm identifying as more essential to what we mean by these terms, leaving the main difference between their 'statically' and 'dynamically' typed dialects as just whether the typechecker runs before or after the program starts, then it's not clear that any result would directly address what is actually different about programming in these different kinds of systems.

In other words, are they measuring what they think they are measuring? Does their conclusion reflect the greater complexity/difficulty of a type system with the potential to change a program's meaning? Does it reflect the greater expressiveness/ease of a type system that *CAN* be used to change a program's meaning?

Not to mention TypeScript

Not to mention TypeScript and Dart, which exist mostly because programmers love code completion. One thing not tested in the above study is code completion, which I suspect might tip the balance to static typing (but not for the reasons most of us would think; it's not really about checking at all, but feedback).

So now we have new unsound type systems that home in on how type systems are really useful.

Static vs Dynamic Polymorphism

This is the distinction that makes more sense to me. Viewed this way, a language that supports both static and dynamic polymorphism is a superset of those that support only one (or neither).

Now if I want to enforce a static type constraint, why not? (You need it for static overloading anyway.)

So with this view we can end the argument, as clearly languages with static polymorphism (and dynamic polymorphism) are best, because they allow you to do everything the dynamically polymorphic languages do and more :-)

Of course a language that requires specific annotation to allow dynamic polymorphism would be more to my taste, but maybe a compiler switch could swap modes.
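A sketch of the two kinds of polymorphism side by side in one program (TypeScript, with hypothetical names):

```typescript
// Static polymorphism: the type parameter T is resolved at compile
// time and leaves no trace at runtime.
function first<T>(xs: T[]): T {
  return xs[0];
}

// Dynamic polymorphism: which `area` runs depends on the runtime
// class of the value, not on any static resolution.
abstract class Shape {
  abstract area(): number;
}
class Circle extends Shape {
  constructor(private r: number) { super(); }
  area(): number { return Math.PI * this.r * this.r; }
}
class Square extends Shape {
  constructor(private s: number) { super(); }
  area(): number { return this.s * this.s; }
}

const shapes: Shape[] = [new Circle(1), new Square(2)];
const areas = shapes.map(s => s.area()); // dynamic dispatch
console.log(first(areas));               // static generic call
```

On the superset view, a program like this is simply unavailable in a language offering only one of the two mechanisms.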

Unsure about type erasure as the border

ML-family languages traditionally have type erasure, and yet they are very frequently opposed to Lisp-family languages as being "statically typed", vs. Lisp-family "dynamically typed". In my mind ML is still the prototypical type system.

Besides, type erasure may depend on the runtime you implement your programming language with. For example, your runtime system may allow you to access records/structs by field name, or only by integer offsets. In the presence of implicit width subtyping of records/structs, the former gives you a type-erasing semantics, while the latter does not (you need to explicitly insert coercions to reorder record/struct fields to fit the offset mapping of the type being coerced to).
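The two representation choices can be sketched as follows (TypeScript, hypothetical names; tuples stand in for the offset-based layout):

```typescript
// Field-name access: structural width subtyping is erasure-friendly,
// because `p.x` is the same lookup whatever the static view.
interface Point2 { x: number; y: number; }
interface Point3 extends Point2 { z: number; }

function getX(p: Point2): number {
  return p.x; // looked up by name, no coercion needed
}

// Offset-based access: fields live at fixed positions, so viewing a
// wider record at a narrower type needs an explicit coercion that
// rebuilds the value in the narrower layout.
type Point2Tuple = [number, number];         // x at 0, y at 1
type Point3Tuple = [number, number, number]; // x, y, z

function coerce(p: Point3Tuple): Point2Tuple {
  return [p[0], p[1]];
}

const p3: Point3 = { x: 1, y: 2, z: 3 };
console.log(getX(p3)); // works directly on the wider record

const t3: Point3Tuple = [1, 2, 3];
console.log(coerce(t3)); // requires the inserted coercion
```

In the offset-based scheme the coercions must be inserted by a type-directed pass, which is exactly what breaks erasure.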

I strongly agree type erasure is a very important property (it is not black or white; it is important to understand to what degree our language supports type erasure), but I'm rather doubtful that it is the foremost dividing line between "typed this" and "typed that" concepts.

Why is type erasure

Why is type erasure important? Type erasure precludes efficient value representation. I'd argue that what's important is parametricity, not type erasure.

Type erasure and type-based optimization are compatible

My understanding of type erasure is that it says that the dynamic semantics of your language is observationally equivalent to one defined solely on the untyped terms. It does not preclude type-based optimizations as long as they are semantics preserving.

I don't believe that only one criterion is important when designing a language. On the contrary, I'm on the constant lookout for more "design criteria" that constitute tests for a design (but are generally to be understood as shades of grey rather than a binary pass/fail). Type erasure is one, parametricity is another, principal types... recently I've been working on full reduction as another "conceptual test" to deepen our understanding of language features.

What are the terms?

Do they have to be things you typed (clickity clickity)? Or can they be influenced by the type system, as with type classes supporting overloading of e.g. (+)? If we have this situation:

Source Language => Typed Representation => Untyped Representation

Here the first transform resolves ambiguity and is influenced by types and the second transform is type erasure. Does that count as supporting type erasure in your book?

In the middle

In the presence of type classes, the source language does not have a type-erasure semantics (... in general; see Ela). Depending on what the "Typed representation" is, it may have a type-erasure semantics if the dynamic behaviour that depends on typing has been made fully explicit. The untyped representation certainly does have type erasure.

The first language in your pipeline that has type erasure is the language you have to explain (in some way) to the users for them to understand the dynamic semantics of their programs. For this reason, you probably want this language to be as close to the source language as possible (e.g. you probably don't want to *also* elaborate into a rich optimization-expressing type system with exception types and whatnot at the exact same transformation step).

This first type-erasure language may only exist in your specification and the mind of your users; it need not exist as an explicit step in your compilation pipeline (which may merge this type-directed elaboration process with other transformation passes, although it's not quite clear what the benefits of this would be). It is however what you want the "be more explicit about what this line does" button of your IDE to pretty-print in toolboxes. So you should work on making it nice from the start.
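The elaboration step being discussed can be sketched by making the type-class dictionary an explicit value (TypeScript; the `Num` interface and all names here are invented for illustration):

```typescript
// The "type class": an explicit dictionary of operations.
interface Num<T> {
  plus(a: T, b: T): T;
}

// One "instance" per type; in the source language the checker would
// pick these silently, based on the inferred types.
const numNumber: Num<number> = { plus: (a, b) => a + b };
const numString: Num<string> = { plus: (a, b) => a + b };

// After elaboration, the dictionary is an ordinary value argument,
// so erasing the types no longer changes which `plus` runs.
function double<T>(d: Num<T>, x: T): T {
  return d.plus(x, x);
}

console.log(double(numNumber, 21));   // 42
console.log(double(numString, "ab")); // "abab"
```

Before elaboration, an overloaded `double(x)` has no erasure semantics (the types choose the instance); after it, the program above is a perfectly good untyped program.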

Agreed all around

Agreed all around

Maybe we disagree on the meaning of the term?

Well, no. Stalin, for example, is a language that has the type erasure property but still contains absolutely no runtime type tagging - its value representations are all statically solved and its code vectors generated before runtime starts.

Type erasure means that it isn't _necessary_ to completely solve the type equations in order to determine what semantics the program ought to have, so your programs will have the same semantics if you do, or don't, "erase" the type annotations from the program text. With type erasure, you _can_ do runtime type checking and detect type errors at runtime, or run the program without first deriving a complete type solution - but if your language is otherwise amenable to analysis (like Stalin, or ML) you can also do type checking or derive that complete solution before runtime.

It's a language property, not an implementation property.

Type erasure is a language property. It just means that the observable semantics are the same whether you include type information in the source code or not.

So if I have some mathematical expression, the language semantics may determine that it returns a rational number, whether I declare any types or not. If I declare that it returns an integer (I use types here in describing, not prescribing, the intended semantics of the program), then a static typechecker could tell me, no, this expression could return a fraction instead.

Now, if this is happening in, say, Common Lisp, the availability of a static typechecker, and whether or not the runtime code will have and check any typetags, does in fact depend on the implementation. But this is not what the type erasure property means.
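The descriptive reading above can be sketched (in TypeScript rather than Common Lisp, with hypothetical names): the meaning of the expression is fixed by the definitions, and a type declaration only describes, truly or falsely, what that meaning is.

```typescript
// A tiny exact-fraction type: `div` denotes a rational by
// definition, whatever anyone declares about it.
type Rational = { num: number; den: number };

function div(a: number, b: number): Rational {
  return { num: a, den: b };
}

// div(1, 3) means one third regardless of type declarations; a
// static checker comparing this against a declared integer result
// could only report the mismatch, never change the meaning.
const third = div(1, 3);
console.log(third.num, third.den); // 1 3
```

Whether a checker runs before execution, and whether values carry tags at runtime, are then implementation questions layered on top of this fixed semantics.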

Aristotelian physics

Aristotelian physics was also considered satisfying at its time. The "types" debate is at a similar prescientific stage (I'm quoting from "A Physics of Notation"): instead of checking what happens, we try to have intelligent arguments about it. There's more debate, but it's just bad philosophical debate.

You (and everybody) are entitled to be fine with the state of the art, but when we have science on this (say, in a hundred years), people will laugh at us. I also hope they'll be able to laugh at our programming languages and have something much better. (But I'm just arguing for the bit of historical perspective that Alan Kay reminds us of.)

Makes sense

One would think the results of the experiment might be changed by modulating various parameters — some that come to mind are the character of the (created) language, the character of the task, and the character of the participants. But tbh I'd broadly expect static/dynamic typing to not make much of a difference for a typical task when enough other factors are weeded out (kudos to them for finding a pretty good way to weed out a lot of factors). It does occur to me that, for experienced programmers, a big factor is likely to be whether or not the programmer likes the language, which I'd expect to correlate with whether the language caters to the programmer's strengths.

language wars

Stefan and Andreas's recent essay might be of interest.


The Programming Language Wars: Questions and Responsibilities for the Programming Language Community

Abstract:

The discipline of computer science has a long and complicated history with computer programming languages. Historically, inventors have created language products for a wide variety of reasons, from attempts at making domain specific tasks easier or technical achievements, to economic, social, or political reasons. As a consequence, the modern programming language industry now has a large variety of incompatible programming languages, each of which with unique syntax, semantics, toolsets, and often their own standard libraries, lifetimes, and costs. In this paper, we suggest that the programming language wars, a term which describes the broad divergence and impact of language designs, including often pseudo-scientific claims made that they are good or bad, may be negatively impacting the world. This broad problem, which is almost completely ignored in computer science, needs to be acted upon by the community.

The paper can be downloaded from the DL without membership.

too much time and money

I appreciate what they are pushing for, but unfortunately I don't see there being much, if any, $ that can go to it.

for the sake of argument, that this scholar followed our guidelines from the previous sections. They have provided a formal proof that the feature worked, conducted a randomized controlled trial with human beings showing the feature had a positive impact, and conducted surveys with industry partners, which collectively provided a solid foundation of evidence. From this, we would conclude that the researcher has done their due diligence, plausibly obtaining several publications and doing their job as a scholar adequately.

Doing things with people is

Doing things with people is much harder than doing things with computers. People are not very consistent or reproducible; just getting them into a lab for a study is difficult (and reformatting them to eliminate bias is against the Geneva Convention).

A bit strange?

I find their discussion of "one language to rule them all" and "unique snowflakes" to be a little strange. They assert these are logical opposites. Is that really the case? It seems to me that they come from different discussions entirely, and don't even belong on the same line.

Roughly speaking, I would say that the former comes out of mathematical discussions of models and a search for some elegant unifying logic, the latter comes from engineering discussions of views and observations of diversity in form.

Of course when we start inventing languages we might start with one view and it will most likely be as direct (one to one, if possible) a metaphor/representation/visualisation for our logic as we can devise. We don't have to stop there, as demonstrated by the various ecosystems of interconnected (even interchangeable?) languages we can now find - though these are typically ruled by bytecode VMs rather than The One Ring.

A more current (statistical) analysis

A Large Scale Study of Programming Languages and Code Quality in Github. Reasonable summary:

By triangulating findings from different methods, and controlling for confounding effects such as team size, project size, and project history, we report that language design does have a significant, but modest effect on software quality. Most notably, it does appear that strong typing is modestly better than weak typing, and among functional languages, static typing is also somewhat better than dynamic typing. We also find that functional languages are somewhat better than procedural languages. It is worth noting that these modest effects arising from language design are overwhelmingly dominated by the process factors such as project size, team size, and commit size.

Note that this is more of an

Note that this is more of an archaeological approach: study the artifacts left behind by the humans rather than the humans themselves.

Code completion

I wonder how including code completion tools that only allow valid statically typed completions would affect this.

Possibly big confounding factors

As far as I can tell, there are insufficient controls for two confounding factors that could be large: the kind of people who select languages of type X, and the kind of projects for which languages of type X are selected. There are controls for many important variables, but not personality types; and while there is a control for overall project type (application, database, code analyser, etc.), there isn't one for project type in terms of "criticality of code".

For example, if statically typed languages have a reputation for creating fewer bugs and dynamically typed languages have a reputation for being good for rapid development, then the kind of people that want to avoid bugs will use statically typed languages and the kind of people that want to do rapid development will use dynamically typed languages. People that are neutral but believe in those descriptions will choose statically typed languages for projects where bug avoidance is critical, and dynamically typed languages for projects where high development speed is critical.

This creates a bias that is very hard to counter, and with somewhat unpredictable results. If you've got a "avoid bugs" project, you are likely to do more pre-submit review, do more testing, and chose a design with less risk of bugs - which will decrease your bug count. However, you'll also be likely to more aggressively fix bugs - which will increase your bug count.

If you're going for development speed, you're more likely to leave bugs in - which will decrease your bug count. However, you're also likely to be more careless, which will increase your bug count.

These effects seem like they could easily be large enough to be significant. Unfortunately, I can't immediately think of any reliable way to determine how large they are.

Scale is the most relevant and least measurable factor

The even bigger problem with such experiments is that the benefits of types (and discipline and structure in general) increase vastly with code size, team size, complexity, heterogeneity, and age of a software project. While small projects can get away without them, large ones quickly become unmaintainable.

That is an observation that has been made over and over in practice. But it is entirely impractical to scale these variables to an interesting degree in a controlled experiment. Hence, in reality such experiments are as significant as quantifying the benefit of traffic rules based on traffic in a remote village with 5 cars.

Maintenance experiments

You could probably experiment with scale by asking people to add a feature to an existing large project, with and without static types, and perhaps with and without a large set of unit tests.

But I agree that scale is frequently a confounding factor in analyses and controlled experiments and arguments regarding the value of OOP vs. FP, static vs. dynamic typing, and so on.

There was an experiment like

There was an experiment like this at UCB if I remember correctly (I think Leo mentioned it, not sure).

Controlled experiments can unfortunately take us only so far until we can figure out how to do this without people (e.g. with computers simulating people or with animal testing).

This is a good paper

It is close to inevitable that an empirical study that tries to break new methodological ground is going to have large lacunae and make doubtful programmatic choices. Since the methodology does appear to be very novel and the discussion shows a high level of care, the author should be encouraged, not sneered at.

The rhetoric of statistics often tricks intelligent people into thinking that absence of evidence is evidence of absence. But the failure of a study to provide significant results is only negative evidence if we can be sure that the study's methodology is not only sound but complete. The failure of this study to find significant evidence does not discourage me: if there is enough interest, the good starting point this paper provides can be built upon in further, more refined studies. And these studies might provide results we prefer; if not, tough.

The features of the paper I like include:

  • Construction of a new language and a new IDE reduces programmer familiarity as a confounding factor (or in the crystal clear words of the paper "we wanted to exclude any influence...caused either by...know[ledge] of the language or...practical experience with the IDE" and see the section treating related work), and makes it easier to set up things so that the static and dynamic versions of the language are comparable;
  • The choice of students with a certain level of skills but who claimed to have never constructed a parser reduces task familiarity as a confounding factor;
  • A substantial number of students have each performed a large number of comparable tasks;
  • The section "threats to validity" (i.e., soundness & completeness of the methodology) is conscientious about possible failings. Of course the section will be incomplete; science progresses in steps.