Quantitative comparison of unit testing vs. static typing?

Has anyone seen an objective comparison, e.g., a quantitative analysis, of whether unit testing will find the bugs that an ML-ish type system would reveal? Both approaches have additional benefits, and there are many claims and anecdotal experiences about these things, but I'm wondering if anyone has seen anything more objective?

For example, the DiamondbackRuby team, among others, might have something like true-positive and false-positive rates from checking code that was only unit tested.

Edit: Ask and ye shall receive (in a tiny part): a study of the bugs found by translating some unit-tested Python apps into Haskell.

Racket's speed & practicality

I believe news.ycombinator.com runs on Racket. While Racket might not be the fastest language, it's certainly faster than Ruby & Python (often by over an order of magnitude).

I think you're confused

I think you're confused about Racket. It's much faster than paradigmatic "dynamic" languages such as Perl, Python, or Ruby. Single-threaded performance in Racket is most comparable to Clojure or Erlang, with the caveat that these are always terribly flawed comparisons. There are people building real, large-scale (1,000,000+ LOC) Racket applications.

Communication

It's much faster than paradigmatic "dynamic" languages such as Perl, Python, or Ruby.

I meant to say that it's not among the *fast* dynamic languages (Lua, some JavaScript engines), but indeed my comment was misleading.

There are people building real, large-scale (1,000,000+ LOC) Racket applications.

That's neat, and I think you should communicate more about this. I'm subscribed to the Racket blog (and, for that matter, to your blog) and I don't hear about it at all. Where can I get relevant information?

Fast

Looking at small benchmarks such as the Language Shootout, Racket is ahead of V8 in single-core performance on both x86 and x64.

Here is Neil Van Dyke talking about "Racket in the large".

I stand corrected.

I had looked into Racket performance a few months ago, and for some reason remembered the wrong thing -- maybe it was being compared to native languages, so the results were less impressive. Thanks for clarifying matters.

Works For Me :)

As far as I can tell from this discussion, the Racket contract system currently lets you express all of the invariants you get out of a standard HM-style type system (leaving aside Andreas' constructor polymorphism). Racket provides contracts for both polymorphism and existential types in the standard library.

How far does it go?

I'm curious how far this goes in practice: if you look at a large, sophisticated codebase written in ML, how many of the static guarantees can in practice be captured by contracts? Also, I wonder how much is required in the way of annotations to capture those invariants. One thing that's lovely about the ML type system is how very tersely your invariants can be written down.

I do think it would be lovely to have a contract system in ML that was well integrated with the type system, so that guarantees captured by types would be implemented and checked as they are now, and contracts that require run-time enforcement could be handled as well.

Ocaml contracts prototype

Something like the Dynamic contract checking for OCaml prototype?

Indeed!

That definitely looks interesting. Thanks for the pointer.

Squadron scramble

Looking back on it, it seems that the Squadron Scramble code actually was extensively unit tested. Five bugs were found there.

The random-number generation code probably didn't have an extensive test suite (SRFIs rarely do). Additionally, almost everything in that program is of type Real, making the type system less useful for finding errors.

Cool, good catch. For the

Cool, good catch. For the benefit of others:

Squadron Scramble This application, created by Matthias Felleisen, is a version of the multi-player card game Squadron Scramble [Kaplan 2002], which resembles Rummy. The original untyped implementation consists of 10 PLT Scheme modules, totaling 2200 lines of implementation code, plus 500 lines of unit tests. The program maintains data about the set of cards in an external XML file, which it reads on startup.

The roughly 4:1 code-to-test ratio (2200 lines of implementation code to 500 lines of tests) probably isn't a good way of quantifying unit-test quality, but it's something. I'm not sure what would be better (e.g., the automated-testing community likes branch coverage).

This is a fairly unrelated

This is a fairly unrelated remark, but while I don't think branch coverage is related to typing information per se, I have (anecdotally) found it quite similar in effect to the warnings provided by an exhaustiveness checker for pattern matching on algebraic datatypes.

In ML-ish languages (Haskell included), and in a certain range of applications, an important part of the control flow is done by potentially deep pattern matching, and non-exhaustive patterns are good markers of ignored cases (and, very often, bugs). A pattern-exhaustiveness checker, even a naive one, can be very helpful.
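
To make the analogy concrete, here is a minimal sketch in Python (the language used for the other code in this thread), assuming Python 3.11's typing.assert_never and a static checker such as mypy; the Shape example itself is made up:

from dataclasses import dataclass
from typing import assert_never  # Python 3.11+, or typing_extensions

@dataclass
class Circle:
    radius: float

@dataclass
class Rect:
    width: float
    height: float

Shape = Circle | Rect  # a sum type encoded as a union

def area(s: Shape) -> float:
    match s:
        case Circle(radius=r):
            return 3.14159 * r * r
        case Rect(width=w, height=h):
            return w * h
        case _:
            # If a new variant is later added to Shape, a checker such as
            # mypy reports that this branch becomes reachable, much like
            # an ML exhaustiveness warning for a missing match case.
            assert_never(s)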

This is also related to Andreas's remark that code evolution and maintainability are paramount. Just as adding a method to a base class in an OOP program requires cross-cutting changes to the program, adding a case to a sum type is a frequent kind of program evolution, and it is quite effectively supported by the warnings emitted by exhaustiveness checkers -- if one does not abuse fragile patterns. But for pragmatic reasons this static analysis is not directly part of "type checking" (except for OCaml polymorphic variants).

PS: I don't mean that exhaustiveness checkers can replace code-coverage checkers or conversely, only that they provide the same feeling of "ah, I forgot to handle this case again".

On the bugs of Squadron

On the bugs of Squadron Scramble in your thesis:

In one location, the string->number procedure was used, while neglecting that the result might be #f if the provided string cannot be parsed as a number. The assert procedure converts this possibility into an exception, as described above.

Four similar functions, all used as examples, call a constructor function with the wrong number of arguments. Fixing these was very simple.

Can you expand on the last paragraph here? Were these function calls example code that was not updated to reflect a change to a define-struct, presumably because the example code was never run after that change? I'm not sure whether lmeyerov wants to count these (or the insertion of an assert) as bugs found. Don't get me wrong, Typed Scheme is awesome and valuable, but perhaps the value is not in finding bugs in unit-tested code, because it seems that there simply aren't many. (You write that porting to Typed Scheme required a thorough review of the original code, which is generally a very effective way to find bugs on its own.)

Yes, the example code hadn't

Yes, the example code hadn't been updated. Ironically, given the general subject of discussion, the examples were intended for integration testing, if I remember correctly.

The thorough review was generally only needed to revise code that didn't typecheck, which didn't substantially come into play for this example (which wasn't done by me, I should point out). So the review typically didn't find bugs, in my experience.

On the larger point: finding bugs in non-trivial working software is not easy, and it is not the goal of Typed Racket, so while I think the experiment Leo proposes is interesting, it wouldn't really change my opinion about types one way or the other.

To be clear, I don't think

To be clear, I don't think the value of types for commercial software development lies in simple bug finding. In the future, types may be useful for finding more bugs (e.g., state-sharing bugs in concurrent code), but I agree with other statements here about enforcing good documentation, providing hooks for tool development, etc. My concern is that if programmers *do* think types are primarily for finding simple localized bugs, type enthusiasts are going to have serious adoption issues. I've been thinking about the adoption/sociology issues, not the traditional technical ones.

Types do provide bug finding

Types do provide bug finding as you type (in IDEs at least). With unit tests you first have to actually write the tests, and they tend not to run unless your entire file is in a valid state.

As Jules points out, I think

As Jules points out, I think that "most bugs typing would find are also found by untyped unit tests" does not indicate that typing is inefficient at bug finding, only that it does not find many more bugs than untyped unit tests do.

It could be argued (and I think it is very reasonable) that static checking has a tighter feedback loop and lower costs than writing unit tests; or at least the part of static checking that can be performed at relatively little cost, such as ML-style type inference, does. I'll ignore for a moment that the mere possibility of inference is the sign of language design choices that may have other, possibly negative, effects on overall development cost.

An obvious counter-argument is that, despite such static checking, unit tests are still needed to evaluate functional correctness (at least in the 99% of cases where a fully static proof of functional correctness is judged a bad cost/assurance compromise for the domain). What is the point of catching bugs through types if you still need to write the unit tests afterwards, and would catch the bugs at that point anyway?

One could still argue that pre-testing bug catching has a lower cost. One could also argue that knowing the code is well typed decreases the cost of writing unit tests: maybe a lot of the tests an untyped program would need become unnecessary and need not be written? I must say I have very little practical experience of unit testing in dynamic languages, and am unsure what amount of the work solely increases assurance in type-correctness.

There is this same question of transition between a static and a dynamic model when using contracts. Why would I put effort into annotating my code with static annotations if I know that I will later choose to add more expressive dynamic contract checks anyway? This conflict is lessened in languages where the transition between the static and dynamic variants of an assurance-increasing technique is facilitated. (That is, Racket, mostly.)

Testing for the type in unit

Testing for types in unit tests is unnecessary 99% of the time, because if there is a type error then normal tests will fail anyway (either because some function raises a dynamic type error or because the assertion fails). For example:

def sort(xs):
    # Quicksort, partitioning around the first element as the pivot.
    if len(xs) <= 1: return xs
    return sort([x for x in xs[1:] if x <= xs[0]]) + \
           [xs[0]] + \
           sort([x for x in xs[1:] if x > xs[0]])

Let's write a test for this:

def is_sorted(xs):
    # Every adjacent pair must be in non-decreasing order.
    return all(a <= b for a, b in zip(xs, xs[1:]))

assert is_sorted(sort([3, 2, 5, 1, 2, 3, 2]))

The chances of this assertion still passing after introducing a type error are very low. For example, if you forget to put the pivot xs[0] into a list [xs[0]], the concatenation in sort raises a type error before is_sorted even runs. If you put the pivot in too many lists [[xs[0]]], there is no type error, but the assertion still fails. In general, things have to conspire in very strange ways for unit tests to pass despite type errors. The more common case is that you didn't test a particular piece of code at all (usually error handling).

Contracts nibble at this as-you-type advantage of types. One can easily imagine a contract system that, in addition to checking values dynamically, also performs quickcheck-like randomized testing as you type. If you annotate a function with the contract int -> int, it will throw random ints at the function, and the dynamic checks will make sure the contract holds. You can also easily imagine exploiting this to do "contract inference": assign a trial contract to the value we want to infer a contract for, and decide by randomized testing whether it is a good contract. You have to do this in an intelligent way, because the space of all contracts is huge and because you want to assign the most general contract. I think it's doable by searching a space of contracts from weakest to strongest.
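
A minimal sketch of the randomized-checking half of this idea, in Python; the decorator name and contract representation here are hypothetical, not any existing library's API:

import random

def check_int_to_int(f, trials=100):
    # Hypothetical "int -> int" contract: throw random ints at f and
    # verify that every call returns an int without raising.
    for _ in range(trials):
        x = random.randint(-10**6, 10**6)
        try:
            result = f(x)
        except Exception as e:
            raise AssertionError("int -> int violated: f(%r) raised %r" % (x, e))
        if not isinstance(result, int):
            raise AssertionError("int -> int violated: f(%r) = %r" % (x, result))
    return f

@check_int_to_int
def inc(x):
    return x + 1  # checked as soon as it is defined

# A definition like x -> 1 + x + "foo" would be flagged immediately:
# every random trial raises, so the check fails at decoration time.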

For example, given the value x -> x+1, you might see the following series of tests when searching over the space of contracts C = {Any, None, Int, List[C], Func[C,C]}:

Any? passes.
  Int? fails.
  List[Any]? fails, this prunes an infinite number of candidate contracts List[C]
  Func[None,Any]? passes.
    Func[Int,Any]? passes
      Func[Int,Int]? passes
      Func[Int,List[Any]]? fails
  etc.

This way you can infer a set of "locally strongest" contracts. Perhaps instead of exploring weakest to strongest it is better to maintain a set of upper AND lower bounds on the contract and do a kind of binary search, to deal with positive and negative contract positions in a symmetric way (you'd start with Any...None, then go to Func[None,Any]...None, etc.). You probably need some additional mechanisms and a lot of engineering to make it work in practice, but I think it would be an interesting experiment. Imagine an IDE that displays the inferred contract above each function as you type. Functions with type errors would get very weak inferred contracts that you'd be able to spot. For example, the contract of the function x -> 1 + x + "foo" would be inferred as Func[None,None], meaning that the function doesn't work for any values.
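
A toy sketch of such a search in Python, over a drastically cut-down contract space (only Any, Int, and Func with a fixed Int domain; all names here are made up for illustration):

import random

class Any:
    def check(self, v): return True
    def generate(self): return random.choice([0, 7, "a", [1, 2], None])
    def __repr__(self): return "Any"

class Int:
    def check(self, v): return isinstance(v, int)
    def generate(self): return random.randint(-1000, 1000)
    def __repr__(self): return "Int"

class Func:
    def __init__(self, dom, rng):
        self.dom, self.rng = dom, rng
    def check(self, f, trials=100):
        # f satisfies Func[dom, rng] if it maps generated dom-values
        # to rng-values without raising.
        for _ in range(trials):
            try:
                out = f(self.dom.generate())
            except Exception:
                return False
            if not self.rng.check(out):
                return False
        return True
    def __repr__(self): return "Func[%s,%s]" % (self.dom, self.rng)

def infer(f):
    # Walk from weaker to stronger candidates, keeping the strongest
    # contract that the randomized tests fail to refute.
    best = None
    for c in [Func(Int(), Any()), Func(Int(), Int())]:
        if c.check(f):
            best = c
    return best

print(infer(lambda x: x + 1))  # Func[Int,Int]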

Cost is one of those fluid

Cost is one of those fluid things. I've found both static and dynamic checks to be rapid and, depending on the scenario, to have different benefits. Likewise, a lot of useful lightweight static checks need not adhere to rigid type disciplines (e.g., tab completion is still effective for dynamic languages). Many researchers are observing an inflection point in the static/dynamic design space because the cost of compute is falling much faster than program sizes and test counts are growing. This gets interesting because, for example, I might write a dynamic contract for a function and ask for it to be synthesized, or expect continuous unit testing.

For example, when hacking on the C3 web browser, I ran unit tests maybe every 5-10 minutes because most of my cores were idle anyway. If I can develop a browser in this manner today, what does that say about the design of research tools that will be adopted further out? I think it's worth being critical about the comparative and long-term costs/benefits of different approaches. For example, types will make many search-based techniques more tractable, and thus act like an optimization for tools relying upon them. That, to me, is a significant change in thinking.

Tooling will have a lot of

Tooling will have a lot of influence on this. If you can write unit tests that are automatically invoked by the IDE when the code they target changes, you can get timely feedback about your errors in a very pervasive way. Static typing gives us that for free, but it's always very limited. If we focus too much on static typing, we might ignore other tools that are equally useful in different ways.

Static types make some search tasks easy, but not all. Many properties are too complex to encode as static types, but we'd still like to "find" them; and so we have naming conventions and comments. I actually don't think formal encodings are in competition with run-time/informal encodings; we really need to understand and leverage both.

Hey, dude, it already exists.

If you can write unit tests that are automatically invoked by the IDE when the code they target changes, you can get timely feedback about your errors in a very pervasive way.

In fact, not only does it exist today, but there is no "if you can write unit tests" requirement, because an IDE testing plug-in writes them for you: Pex! Here is a screenshot where the IDE presents a test-coverage progress indicator for the unit tests it has generated so far.

Science!

[edited top-level post to reflect finding]

Maybe...

...it just proves that the unit tests were not quite complete enough. But that's really more of a polemical point of view.

Maybe this is not the right place, but in my opinion the striving for proof using type systems such as Haskell's, while nice academically, is not wholly applicable in the engineering world.

If there is measurable benefit, after cost/benefit analysis, to applying particular technologies such as static type systems of Haskell's ilk, then they should be used.

But in engineering, tolerance limits are often set that do not require the 'bestest', 'safest' solutions; sometimes 'just good enough' will do. E.g., a mission-critical system such as a nuclear reactor might benefit from more proof, but a simple script to check for unreachable URLs might not.

It's not even clear that

It's not even clear that formal methods lead to more reliable systems at all, even disregarding the costs. If you compare the two-step process of first developing a formal specification and then implementing it against developing the implementation directly, it's not clear which leads to better software. There is an implicit assumption that the specification is good, or at least that a specification is more likely to be good than a directly developed executable program. For all but the most mathematical programs I find this highly unlikely. The continuous feedback you get from being able to look at the results of the executable program while you're developing it probably vastly outweighs the advantage that the specification language is not constrained to be executable (there are of course programs where that is very advantageous; for example, specifying a sorting routine is easier than implementing one).

More focus on the whole development process is needed than "we proved it correct according to the spec, so we're done": we should be able to effectively identify which parts of a program benefit from which reliability techniques (verification being one of them) and have those techniques smoothly interoperate across the complete program. For example, I want to be able to specify properties in a single notation usable for both randomized testing and formal verification, so that I can first specify a property and later decide whether the added guarantees of formal proof are worth it on top of automated randomized testing of that property.
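
As an illustration of that last wish, here is how such a property might be written once in Python using the Hypothesis randomized-testing library; the sort under test is just the built-in sorted, standing in for whatever implementation is being checked:

from hypothesis import given, strategies as st

def is_sorted(xs):
    return all(a <= b for a, b in zip(xs, xs[1:]))

# The property is stated once. Here it is checked by randomized testing,
# but the same predicate could later be handed to a verification tool as
# a proof obligation, if the added assurance is judged worth the cost.
@given(st.lists(st.integers()))
def test_sort_output_is_sorted(xs):
    assert is_sorted(sorted(xs))

test_sort_output_is_sorted()  # Hypothesis generates and runs many cases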

That sounds very much like

That sounds very much like the reasons some people like to use dynamic and interactive languages such as Lisp.

Being able to easily test loose concepts independently, or as part of the overall system, is very appealing. It allows for a very iterative development approach, which is nice when a problem domain is ill-defined.

As the system moves closer to the known requirements, extra robustness can be added.

Some people may prefer to do that 'exploration' at compile time; I cannot speak from experience there, because C++ is the only statically typed language I use at the moment (at work), and I prefer CL for personal hacks.

Straw man

You're making a bit of a straw-man argument here: you're opposing a rigid, waterfall specify-then-implement model to a continuous implement-test feedback loop. In practice, the certified software developments I know of also use an implement-prove loop: specifications and software are usually "co-designed", built together from an unclear idea of the goal toward a precise end product; similarly, both can be left unfinished in areas where it was not deemed worth the cost at a given point in time.

I completely agree with the need for smooth integration of all kinds of correctness-improving techniques. There are people working on that; in fact, there is a whole conference series called "Tests and Proofs" (TAP).

I agree, it was a bit of a

I agree, it was a bit of a straw man, especially if you already share a similar view. But other people don't, and find it a crazy idea that using Ruby could be better for correctness than Coq. You can do iterative development of the specification/implementation; what I was trying to say is that you need tools to facilitate this (or, phrased negatively: the presence of formal verification should not inhibit iterative and experimental development). One of those tools is testing/proving of the same specifications. Others are being able to experiment in ways that violate preconditions, to base proofs on unproven assumptions, to have tests where a human is the correctness/error-detection oracle, etc. More generally, defining correctness as a binary property based around a formal spec is too narrow a view: whether software works is a probabilistic and subjective property, and tools should guide the development process as much as possible toward good solutions (quickly and cheaply).

Formal methods, and even

Formal methods, and even more importantly *available* and *usable* formal tools, can help tremendously, while unit tests require a minimum of discipline even to be written correctly (read: usefully, covering what your many code reviews have still overlooked).

And tons of unit tests can be useless, too, if they are redundant with each other or fed with improper inputs. I've seen all sorts of strange animals over the years, including unit tests that didn't even decently cover the nominal cases of the components being tested. The sort of thing that makes you mad when you discover it weeks or months later, after jumping into a project where you had been reassured with an "oh, yeah, we do unit tests of course, duh!"

Yeah. Right.

But if we're speaking of large-scale software, a fortiori with heavy legacies, I would never drop one in favor of the other exclusively (unit tests "versus" static typing/analysis).

Not with Turing-complete languages, anyway. That would just be silly: accepting, eyes wide open, that you're shooting yourself in the foot.

Whoever claims otherwise is daydreaming, IMO. We're only in 2012. Or the person doesn't actually care about the real-world dilemmas the vast majority of programmers are actually busy with (or coping with).

I call for ever more user-friendly and ubiquitous formal tools; I'd almost beg the budget guys to get us those, if needed. But I'll also demand to keep whatever fraction of my unit tests is deemed necessary, and likely for a couple more decades to come at the very minimum, I'm afraid.

Now I wonder how much better

Now I wonder how much better they can get than Haskell with Coq or F* solutions :)

Similar to the other peer

Like the other peer comments, I don't think this study proves very much. It looks at the end result of the programming activity rather than the activity itself. What the study shows is that static types can improve on a mature solution, but it doesn't say anything about the maturation itself; it doesn't say how static types can improve the act of programming. As if program code were simply construction material assembled in a deterministic way by construction-worker programmers, which is not true!

What the study shows is that

What the study shows is that static types can improve on a mature solution

Yes. Crucially, it starts to quantify the improvement (and a bit of the cost). I agree a lot more can be done -- science is incremental :)