Parameterized Unit Tests

Parameterized Unit Tests. Nikolai Tillmann; Wolfram Schulte; Wolfgang Grieskamp

Parameterized unit tests extend the current industry practice of using closed unit tests defined as parameterless methods. Parameterized unit tests separate two concerns: 1) They specify the external behavior of the involved methods for all possible test arguments. 2) Test cases can be re-obtained as traditional closed unit tests by instantiating the parameterized unit tests. Symbolic execution and constraint solving can be used to automatically choose a minimal set of inputs that exercise a parameterized unit test with respect to possible code paths of the implementation. In addition, parameterized unit tests can be used as symbolic summaries which allows symbolic execution to scale for arbitrary abstraction levels. We have developed a prototype tool which computes test cases from parameterized unit tests; it is integrated into the forthcoming Visual Studio Team System product. We report on its first use to test parts of the .NET base class library.

By adding parameters, the authors turn a closed unit test into a universally quantified conditional axiom that must hold for all inputs under specified assumptions. Adding parameters improves the expressiveness of unit tests, at the expense of no longer providing concrete test cases. Symbolic execution is then used to automatically find the concrete parameter values that need to be tested.
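
A hedged sketch in Haskell/QuickCheck terms (my analogy, not the paper's C# notation, and random generation rather than the paper's symbolic execution): the closed test fixes its inputs, while the parameterized version is universally quantified and can be instantiated back into closed tests by picking concrete arguments.

import Test.QuickCheck
import qualified Data.Map as M

-- Closed unit test: one hand-picked input.
closedTest :: Bool
closedTest = M.lookup 3 (M.insert 3 "x" M.empty) == Just "x"

-- Parameterized test: for all keys, values and map contents,
-- inserting and then looking up yields the inserted value.
prop_insertLookup :: Int -> String -> [(Int, String)] -> Bool
prop_insertLookup k v kvs =
  M.lookup k (M.insert k v (M.fromList kvs)) == Just v

main :: IO ()
main = quickCheck prop_insertLookup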

This seems like an interesting approach to unifying unit testing and algebraic specifications.

Abstract interpretation?

In addition, parameterized unit tests can be used as symbolic summaries which allows symbolic execution to scale for arbitrary abstraction levels.

Though not quite the same, it reminds me of Abstract interpretation:

In practice, it is used for constructing semantics-based analysis algorithms for the automatic, static and conservative determination of dynamic properties of infinite-state programs. Such properties of the run-time behaviour of programs are useful for debugging (e.g. type inference), code optimisation (e.g. run time tests elimination), program transformation (e.g. partial evaluation, parallelisation), and even program correctness proofs (e.g. termination proof)

I'll have to look more closely at their similarities and differences.

QuickCheck

Having only read the above abstract I'd say it sounds like a lesser QuickCheck.

Well...

To be fair, this is more than that, because of the symbolic execution and whatnot -- finding a set of inputs for all codepaths is really an interesting feature. Of course, I suspect there are many boundaries in interesting and realistic code that defy any symbolic execution, or where it becomes infeasible.

If it's just parameterized unit tests, then yes, that's an extremely useful feature, but certainly not a novel one. That it isn't well integrated into the standard XUnit frameworks is, IMHO, rather shocking -- so it's worth fixing, but it's not something to be that proud of (on the contrary, the standard XUnit is something to be ashamed of!)

Against abstract unit tests

I haven't read the paper, but it seems to me that this idea ignores the human factors of writing unit tests. The major reason why unit tests are popular in commercial settings, while proofs aren't, is that unit tests are much easier to understand, precisely because they have such a low level of abstraction. Raising the level of abstraction is going in the wrong direction - the unit test will be no easier to understand than the code it tests.

While they say that traditional unit tests can be generated through their scheme, doing this automatically doesn't provide the customer with confidence that the system is working, because there could be a bug in the specification for the generation procedure. At most, generating unit tests provides additional examples that the customer then has to verify by hand. While that might be useful sometimes, to properly verify the output you really should figure out what the expected result should be independently, without being influenced by any suggestions from the program. Otherwise it's too easy to say "yeah, that looks right", without really thinking about it.

QuickCheck : The Test of a New Generation

Traditional unit tests are only as good as the data chosen for the tests; QuickCheck properties are only as good as the generator pumping out the data.

The main difference is that generative testing can cover a far greater range of values for much less programmer effort. If you haven't tried QuickCheck or one of its clones for Erlang, Python, Common Lisp, Perl, etc. you're missing out on a great tool.
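
A small illustration (mine, not from the thread) of the "only as good as the generator" point: the default Int generator essentially never produces minBound, so the property below tends to look fine until the generator is steered toward the boundary values.

import Test.QuickCheck

prop_absNonNegative :: Int -> Bool
prop_absNonNegative x = abs x >= 0   -- false for minBound :: Int

withDefaultGen :: IO ()
withDefaultGen = quickCheck prop_absNonNegative   -- usually "passes"

withBoundaryGen :: IO ()
withBoundaryGen = quickCheck $
  forAll (elements [minBound, -1, 0, 1, maxBound]) prop_absNonNegative
  -- reports the counterexample minBound

main :: IO ()
main = withDefaultGen >> withBoundaryGen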

The "Yeah that looks right" practice is called GuruChecksOutput and is classed as an AntiPattern on Ward's Wiki.

In order to get around that problem (and several others), I added basic test-driven-development support to QuickCheck. Now you can do test-driven-generative-testing and have even more confidence in your code!

I regularly use both HUnit and QuickCheck in my Haskell code. In my experience, the only advantages of traditional unit testing with HUnit are the ability to specify test values ahead of time and the ability to test IO actions without using unsafePerformIO. Even so, both of those abilities could be added to QuickCheck with a few days of work.
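
A rough side-by-side sketch (mine, not Shae's code) of the two styles: HUnit pins down concrete values chosen ahead of time, while QuickCheck lets the generator supply them.

import Test.HUnit
import Test.QuickCheck

-- HUnit: a fixed input and a fixed expected output.
unitTest :: Test
unitTest = TestCase (assertEqual "reverse [1,2,3]" [3,2,1] (reverse [1,2,3 :: Int]))

-- QuickCheck: the same function, quantified over all lists.
prop_reverseInvolutive :: [Int] -> Bool
prop_reverseInvolutive xs = reverse (reverse xs) == xs

main :: IO ()
main = do
  _ <- runTestTT unitTest
  quickCheck prop_reverseInvolutive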

--Shae Erisson - ScannedInAvian.com

What is testing?


Traditional unit tests are only as good as the data chosen for the tests; QuickCheck properties are only as good as the generator pumping out the data.

Good.


The main difference is that generative testing can cover a far greater range of values for much less programmer effort.

Second score. But...

Hence combinatorial explosion. The tests given in the paper are very small; and the reported bug, well, yeah, true, it is one bug. So?


If you haven't tried QuickCheck or one of its clones for Erlang, Python, Common Lisp, Perl, etc. you're missing out on a great tool.

Agreed, they serve a purpose. But they have limited applicability.


The "Yeah that looks right" practice is called GuruChecksOutput and is classed as an AntiPattern on Ward's Wiki.

I think both of you are talking about different things. Often, industrial tests are at least concrete and repeatable (it is not uncommon to have a test procedure description of several pages) and related to the original requirements document. In such settings, I agree with the original poster that it usually is not very sensible to abstract from the tests.


In order to get around that problem (and several others), I added basic test-driven-development support to QuickCheck. Now you can do test-driven-generative-testing and have even more confidence in your code!

Good for you!


I regularly use both HUnit and QuickCheck in my Haskell code. In my experience, the only advantages of traditional unit testing with HUnit are the ability to specify test values ahead of time and the ability to test IO actions without using unsafePerformIO. Even so, both of those abilities could be added to QuickCheck with a few days of work.

Hmpf, it's all a matter of scale. Every program less than, say, 10,000 LoC is small (and in most cases probably doesn't even warrant elaborate software engineering or real unit testing). The paper discussed is nice, and all that, and what I find good is that it leisurely deals with topics, as they say, from both volumes; but the claims are, ahum, mildly exaggerated, and, to me, the real question is: does it scale?

Testing is automated, informal, proof of functionality.

Agreed, they serve a purpose. But they have limited applicability.

How so? I claim that QuickCheck has fewer limits than static unit tests.

Often, industrial tests are at least concrete and repeatable (it is not uncommon to have a test procedure description of several pages) and related to the original requirements document.

That sort of test is called an AcceptanceTest rather than a UnitTest.

Every program less than, say, 10,000 LoC is small (and in most cases probably doesn't even warrant elaborate software engineering or real unit testing).

I completely disagree. In my experience, as soon as there is reusable or repeatedly used code, it's time for test driven development. It's always faster to write TDD code at the very least because the ratchet effect of a unit test suite dramatically increases speed of change.

but the claims are, ahum, mildly exaggerated, and, to me, the real question is: does it scale?

What is exaggerated about the claims?

QuickCheck scales just fine for any purely functional code. I have not yet used QuickCheckST for model equivalence checking, but I hope to do so soon.
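
A hedged sketch (plain QuickCheck, not QuickCheckST) of what model equivalence checking looks like: a faster implementation is compared against an obviously-correct model on random inputs.

import Test.QuickCheck
import Data.List (nub, sort)
import qualified Data.Set as S

-- Implementation under test: de-duplicate and sort via Data.Set.
dedupSort :: [Int] -> [Int]
dedupSort = S.toAscList . S.fromList

-- Model: the naive, obviously-correct version.
model :: [Int] -> [Int]
model = sort . nub

prop_matchesModel :: [Int] -> Bool
prop_matchesModel xs = dedupSort xs == model xs

main :: IO ()
main = quickCheck prop_matchesModel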

Software testing of any flavor is difficult in highly stateful situations like J2EE. The rise of in-container test frameworks is one symptom of that.


--Shae Erisson - ScannedInAvian.com

Sanity Check

Agreed, they serve a purpose. But they have limited applicability.


How so? I claim that QuickCheck has fewer limits than static unit tests.




Good, back it up. I said it has limited applicability; I was not comparing.


Look below: an excellent static unit test, quite concrete, and not something easily done by random test generation:



(As a sanity check for my Python script, I should probably ask Anton to verify whether or not the 1,000,000th Hamming number is 519381797917090766274082018159448243742493816603938969600000000000000000000000000000, assuming you count 1 as the 0th number.)
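
For what it's worth, a sketch of the kind of check being described (the long value quoted above is the poster's to verify; nothing is asserted about it here): the classic lazily merged Hamming stream in Haskell, indexed so that hamming !! 0 == 1.

hamming :: [Integer]
hamming = 1 : merge (map (2 *) hamming)
                    (merge (map (3 *) hamming) (map (5 *) hamming))
  where
    merge xxs@(x:xs) yys@(y:ys)
      | x < y     = x : merge xs yys
      | x > y     = y : merge xxs ys
      | otherwise = x : merge xs ys

-- A concrete, static unit test would then compare, say, hamming !! 1000000
-- against an independently computed value.
main :: IO ()
main = print (take 10 hamming)   -- [1,2,3,4,5,6,8,9,10,12]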


I think both of you are talking about different things.

Often, industrial tests are at least concrete and repeatable (it is not uncommon to have a test procedure description of several pages) and related to the original requirements document.


That sort of test is called an AcceptanceTest rather than a UnitTest.



Yeah well, that's more or less what I said, right? Anyway, in my experience, they just called them, well, "tests" or maybe functional tests (and no customer in sight).

Every program less than, say, 10,000 LoC is small (and in most cases probably doesn't even warrant elaborate software engineering or real unit testing).


I completely disagree. In my experience, as soon as there is reusable or repeatedly used code, it's time for test driven development. It's always faster to write TDD code at the very least because the ratchet effect of a unit test suite dramatically increases speed of change.



Good, I disagree too.


The paper discussed is nice, and all that, and what I find good is that it leisurely deals with topics, as they say, from both volumes;
but the claims are, ahum, mildly exaggerated, and, to me, the real question is: does it scale?


What is exaggerated about the claims?




Well, it ain't new, and it probably doesn't scale (at least they give no proof of that).



QuickCheck scales just fine for any purely functional code. I have not yet used QuickCheckST for model equivalence checking, but I hope to do so soon.



I wasn't discussing QuickCheck; you are. QuickCheck doesn't scale beyond what is tractable, so I think your claim is an overstatement. Good luck.

Why not have compile-time tests?

Unit testing is good, but there is another solution that is much better than unit testing and solves the problem of having bugs. The only 'drawback' is that it requires some compiler additions, but the advantages are so great that they justify the effort.

The problem with programming is that programming answers the 'how' question, but it does not answer the 'what' question. All the assumptions made by the programmer are in the programmer's mind; it is the programmer who is responsible for honouring the assumptions. Unfortunately, a program has too many assumptions for a programmer to keep track of, and therefore bugs are introduced when these assumptions are violated.

The solution for dealing with all the world's programming problems is to annotate the code with the assumptions, and let the compiler check them. By coding the assumptions along with the implementation, the code also answers the 'what' question, and write-once bulletproof software is possible. This solution makes unit testing almost redundant.

Here is an example: a C-like piece of code annotated with assumptions:

void free(void *data : in valid out invalid)
{
   //bla bla free block
}

int main()
{
    char *str = readln();
    free(str);
}

In the above example, the function 'free' assumes that the given parameter is valid (non-null). Therefore, the code shall not compile, since it is not guaranteed by the code that the value passed to 'free' is not NULL! A correct implementation would be:

int main()
{
    char *str = readln();
    if (str != NULL) free(str);
}

Let's assume now that, in some place in the code, the variable 'str' is used:

void print(const char *str : in valid out valid)
{
    //bla bla print string
}

int main()
{
    char *str = readln();
    if (str != NULL) free(str);
    print(str);
}

As we can see, the string 'str' is printed after it is freed. The code above shall not compile, because the 'print' function assumes a valid input, whereas 'str', upon exit from 'free', has status 'invalid'.

This solution can be used for all kinds of values, ensuring full code correctness. A compiler should not accept the code if the assumptions are not honoured.

You may say that this solution will not work because algorithms cannot be proven to terminate, expensive control flow analysis is needed, etc. Well, control flow analysis is not needed! A block of code has assumptions at its start and end, and if these assumptions are not validated at the place where the code is used, then the compiler shall not accept the code. Assumptions are inherited, i.e. they ripple through the code from inner to outer blocks.

Let's see how the above example can be represented as a simple annotated text list, in order to understand how it would work:

char *str = readln();
(1)'str' may be invalid
(2)'str' is valid
free(str);
(3)'str' is invalid
(4)'str' is valid
print(str);

As you can see, there are two clear contradictions in the above piece of code:

  1. Assumptions (1) and (2) contradict each other.
  2. Assumptions (3) and (4) contradict each other.

Problem 1 can be resolved by checking whether 'str' is null. Then the code becomes:

char *str = readln();
(1)'str' may be invalid
if (str != NULL) free(str);
(3)'str' is invalid
(4)'str' is valid
print(str);

Now the first contradiction is gone. The second one can be solved by moving the 'print' before 'free'. The code becomes:

int main()
{
    char *str = readln();
    print(str);
    if (str != NULL) free(str);
}

but there is still a contradiction, because the 'print' function accepts only valid 'str', and NULL is not a valid string. The annotated code becomes:

char *str = readln();
(1)'str' may be invalid
(2)'str' is valid
print(str);
if (str != NULL) free(str);
(3)'str' is invalid

There is a clear contradiction between (1) and (2). Therefore, we should move the 'print' inside 'if' block:

int main()
{
    char *str = readln();
    if (str != NULL)
    {
    	print(str);
    	free(str);
    }
}

In the above code, there is no contradiction, since the 'if' block makes sure both functions 'print' and 'free' accept valid data.

All the above can take place at compile time: the compiler can mechanically check if there is any contradiction between two successive instructions, thus allowing coding without errors.

The compiler does not need to do expensive control flow analysis, since all that matters is for the compiler to check the assumptions made at the edges of each code block. In the above example, the function 'free' may do a whole lot of things, but the end result is that the input block of memory is freed and thus becomes invalid for use after the function returns. Even with variable aliasing, it is possible to apply the same solution, by attaching the assumptions to aliases.
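
A toy sketch (my own, not the poster's exact design) of "check assumptions at the edges of blocks", in Haskell: every statement declares the state it requires and the state it leaves behind for one variable, and the checker simply walks the sequence, with no control flow analysis at all.

data State = Valid | Invalid | MaybeInvalid deriving (Eq, Show)

data Stmt = Stmt { stmtName :: String, requires :: State, ensures :: State }

check :: State -> [Stmt] -> Either String ()
check _ [] = Right ()
check current (s : rest)
  | current == requires s = check (ensures s) rest
  | otherwise             = Left ("contradiction at " ++ stmtName s ++
                                  ": have " ++ show current ++
                                  ", need " ++ show (requires s))

-- The example from the text: readln leaves 'str' possibly invalid,
-- free requires it valid and leaves it invalid, print requires it valid.
example :: Either String ()
example = check MaybeInvalid
  [ Stmt "free(str)"  Valid Invalid
  , Stmt "print(str)" Valid Valid
  ]
-- example == Left "contradiction at free(str): have MaybeInvalid, need Valid"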

The principle behind the above is:

If a program is not 100% certain to be correct, then it is rejected by the compiler.

If the compiler cannot prove that a program is correct (for example when there are no static constraints to check against), then the program should be rejected.

In a programming language that supported this solution, assumptions would be part of the code and be distributed along with the binaries/bytecode, thus 'transmitting' program correctness to applications.

I am posting all this in the context of unit testing: it is boring and cumbersome to write unit tests; it takes a lot of time, and no one guarantees that the tests themselves are correct. A better solution exists that is easily implementable on current computers, does not require the program to run, and does not involve full control flow analysis.

Another hunt for a free lunch...

By coding the assumptions along with the implementation, the code also answers the 'what' question, and write-once bulletproof software is possible. This solution makes unit testing almost redundant.

If you are following a TDD approach, the kinds of assumptions you are talking about are exactly what you should be testing.

If the programmer doesn't understand his assumptions well enough, or is too lazy to write the unit tests for his program, what makes you think he will code up his assumptions in some other form, or, if he does, do so correctly?

If the compiler simply rejects his program (possibly with a cryptic error message), he is likely to just remove the offending assumption, since he probably doesn't understand the problem anyway.

This whole idea seems like yet another attempt to skip over a relatively inexpensive technique that provides a high degree of confidence (TDD), in favour of a very expensive, if not impossible, technique that promises the mirage of certainty for nothing.

And if you don't agree, I have a Halting Oracle I'll sell you for cheap. ;-)

If the programmer doesn't und

If the programmer doesn't understand his assumptions well enough

I don't think that is possible: if the assumptions are not understood, then how is the programmer going to code the program? For example, if I don't understand that I should open a file before reading from it, then how am I gonna do an 'open file' command?

is too lazy to write the unit tests for his program

Unit tests take a great deal more work to write than assumptions. And assumptions are coded at the place where they are thought of, whereas with unit tests one forgets assumptions and must re-think them later. It's quite a bit harder.

possibly with a cryptic error message

Error messages don't have to be cryptic.

he is likely to just remove the offending assumption, since he probably doesn't understand the problem anyway.

Well, the problem is that the programmer does not understand the problem, not that the assumptions mechanism is wrong.

a relatively inexpensive technique

Writing unit tests properly is not inexpensive. I work for a defense application company, and we are required to write unit tests. Writing the unit tests takes more than triple the time it takes to code the actual program. And the unit tests have to be tested, too.

in favour of a very expensive, if not impossible

Expensive in what terms? Why impossible? As I said above, the compiler does not need to do a full control flow analysis for this to work. All it takes is matching conditions at the edges of program blocks, usually functions.

I have a Halting Oracle

The halting problem does not affect this technique: the compiler would not check if a program terminates, but rather if assumptions hold across program boundaries. A program may not terminate due to its input, but that does not mean it is invalid. And the reverse also holds: a program may terminate, but it may be invalid in terms of having subtle bugs that are revealed only when certain input is present.

Hope springs eternal

I don't think that is possible: if the assumptions are not understood, then how is the programmer going to code the program?

You would think so, but there is plenty of evidence that such programmers "succeed" every day. It is very easy to have an assumption and not realize it.

Error messages don't have to be cryptic.

I think it is much harder generating non-cryptic error messages than you might suppose.

Writing unit tests properly is not inexpensive. I work for a defense application company, and we are required to write unit tests. Writing the unit tests takes more than triple the time it takes to code the actual program. And the unit tests have to be tested, too.

We clearly have a different definition of "writing unit tests properly", and by extension what a good program design looks like.

In my view of things, writing tests should be no more difficult (and likely easier) than writing up assumptions in the way that you seem to want to. If it is, then some part of the program or the test is more complex than it needs to be, and this is a sign it needs to be rethought.

Expensive in what terms? Why impossible?

Ultimately, what you want the computer to do for you is to exercise judgement. We know that this is a very hard thing for computers to do, quite likely impossible in the general case. Unless you can prove a mathematical formulation of the kind of certainty you want the computer to give you (an undertaking whose feasibility I doubt), I think it is unlikely that the computer will succeed where the full power of a human intellect (even one so "weak" as the "average" programmer) has failed.

The halting problem does not affect this technique

Not directly, but it is a good example of the problem of finding iron-clad certainty in complex systems.

With the power of human reasoning and some good practices and techniques, a good deal of confidence is possible, and it comes much more cheaply than any deus ex machina solution on the horizon.

It is very easy to have an as

It is very easy to have an assumption and not realize it.

Exactly. By putting the assumption in code, two things might happen:

  1. The assumption is wrong, therefore forcing the programmer to rethink what he is doing.
  2. The assumption is correct, therefore the program is correct.

It's a win-win situation.

I think it is much harder generating non-cryptic error messages than you might suppose.

It is not easy, but not that hard. Let's take C++ templates, for example. Instead of saying:

the instantiation of Foo with T = Bar at file somefile.cpp, line 10, 20 creates the problem of 'Bar' not having member 'cee'

the VC++ compiler says:

file foo.cpp, line 50, col 4: Foo<Bar>::dee(A, B&): unknown member.

The compiler knows the source of the error, but it prefers to say that the error is in the template class instead of its instantiation. It's clearly a fault of the compiler designer.

The VC++ compiler should also hide template parameters with default values: instead of saying

std::map< std::basic_string< char, std::allocator< char> >, some_class>

It could say:

std::map< std::string, some_class>

which looks much better.

In my view of things, writing tests should be no more difficult (and likely easier) than writing up assumptions in the way that you seem to want to.

Let's see. With unit tests, one has to:

  1. write the code.
  2. compile the code.
  3. create a new project for the unit tests.
  4. write the unit tests.
  5. choose a good set of values.
  6. test the unit tests at least once.
  7. try to find the bug in the code.
  8. fix the bug.

With assumptions:

  1. write the code with assumptions.
  2. compile the code.
  3. observe compile errors.
  4. fix the program.

Ultimately, what you want the computer to do for you is to exercise judgement. We know that this is a very hard thing for computers to do, quite likely impossible in the general case.

No, I don't want the computer to exercise judgement. I want the computer to automate my judgement, because I tend to forget things. That's all there is. After all, the assumptions are coded by the programmer, not by the computer.

Unless you can prove a mathematical formulation of the kind of certainty you want the computer to give you

I am not very good at math, but I think logic will do:

For any code block B1 with assumptions A1 and code block B2 with assumptions A2, where A1 contradicts A2, the code sequence 'B1 then B2' is invalid.

I think it's provable, don't you think?

With the power of human reasoning and some good practices and techniques, a good deal of confidence is possible, and it comes much more cheaply than any deus ex machina solution on the horizon.

What if we can automate the tasks, so we don't have to remember everything and let the computer validate it for us?

Auto generated subjects

Too bad Marc didn't write "It's very easy to have an assumption and not realize it."

The mechanism I describe will

The mechanism I describe will not need to test all the cases, just whether the assumptions hold for the current sequence of instructions. In other words, there is no attempt to validate an assumption across the whole program domain; rather, it is checked at the place where the relevant piece of code is used.

For example, let's say we have the following function:

inc(x) = x + 1

with the following assumptions:

  • input: x > 0
  • output: x > 1

The input assumption is a precondition and does not need to be proven. It will be checked when the calling code is compiled.

The output assumption is valid because adding 1 to anything greater than zero yields a value greater than one.

Now let's see another function with an input assumption:

div(x, y) = x / y

with assumptions:

  • input: y != 0

Let's see some applications of the above functions:

  • inc(1): valid because 1 > 0 => true.
  • inc(0): invalid because 0 > 0 => false.
  • inc(-1): invalid because -1 > 0 => false.
  • inc(input()): invalid because input() is not guaranteed to be greater than 0.
  • div(1, 1): valid because 1 != 0 => true.
  • div(1, 0): invalid because 0 != 0 => false.
  • div(1, input()): invalid because input() may be 0.
  • div(1, inc(-1)): invalid because inc(-1) is invalid.
  • div(1, inc(0)): invalid because inc(0) is invalid.
  • div(1, inc(1)): valid because inc(1) is valid and greater than 1 (hence non-zero).
  • inc(div(1, 0)): invalid because div(1, 0) is invalid.
  • inc(div(-1, 1)): invalid because we can't be sure that div(-1, 1) > 0 (note that we don't have to prove the result of div(-1, 1) to be > 0).

I think that the above can be done mechanically by a computer. Maybe I am wrong somewhere, but that error has not registered with me yet :-).
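
A small mechanization of these judgments (my toy model in Haskell, not the poster's notation): each call is checked against the callee's precondition using only what is statically known, never by running the program.

data Known = Exactly Integer | Unknown deriving Show

data Expr = Lit Integer | Input | Inc Expr | Div Expr Expr

analyse :: Expr -> Either String Known
analyse (Lit n) = Right (Exactly n)
analyse Input   = Right Unknown
analyse (Inc e) = do
  k <- analyse e
  case k of
    Exactly n | n > 0 -> Right (Exactly (n + 1))
    _                 -> Left "inc: argument not provably > 0"
analyse (Div x y) = do
  _ <- analyse x
  k <- analyse y
  case k of
    Exactly n | n /= 0 -> Right Unknown
    _                  -> Left "div: divisor not provably /= 0"

-- analyse (Div (Lit 1) (Inc (Lit 1)))    ==> Right Unknown   (valid)
-- analyse (Inc Input)                    ==> Left "inc: ..." (invalid)
-- analyse (Inc (Div (Lit (-1)) (Lit 1))) ==> Left "inc: ..." (invalid)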

Your example sounds not unlik

Your example sounds not unlike the restrictions and pointer type annotations implemented by Cyclone, a variant of C that gets mentioned here every now and then.

However, those annotations are fairly simple: whether a pointer can be null, the bounds of a pointer, the lifetime of an allocation, etc. That can guarantee keeping out a certain class of errors -- here, memory mismanagement -- at the expense of rejecting some perfectly well-behaved programs whose invariants the type system can't express.

Okay, so you have preconditions and postconditions written in arbitrary code -- but then the compiler has to take an arbitrary boolean function (over whatever parameters) and say if it will always return true for any input. Clearly it can't do that; it can make a conservative estimate, but then you're back to throwing out perfectly good programs, and that's also basically the definition of a static type system.

Pierce expressed this idea much more elegantly in the preface to his big book of types, IIRC, but I don't have a copy of it to hand.

Your example sounds not unlik

Your example sounds not unlike the restrictions and pointer type annotations implemented by Cyclone.

Well, lots of people tend to reach the same conclusion, which is really just there waiting to be discovered.

but then the compiler has to take an arbitrary boolean function (over whatever parameters) and say if it will always return true for any input. Clearly it can't do that

No, that is not what I am talking about. I know that, in general, algorithms can't be proven to terminate. I am talking about possibilities: if a possible state is incompatible with another possible state, then these states can't be used together.

For example, if a function returns a value that is within the range [1, 10] and the subsequent function accepts only a value in the range [5, 15], then the code is clearly invalid. The system simply checks that the first function's result is guaranteed to be acceptable to the second; here it is not, so the call is rejected.
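
A minimal sketch of that range check (hypothetical helper, in Haskell): the producer's guaranteed output range must lie inside the consumer's accepted input range, otherwise the composition is rejected.

type Range = (Integer, Integer)   -- inclusive lower and upper bound

fitsInto :: Range -> Range -> Bool
fitsInto (lo, hi) (lo', hi') = lo >= lo' && hi <= hi'

-- fitsInto (1, 10) (5, 15) == False, so the compiler would reject the call.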

but then you're back to throwing out perfectly good programs

A program is not perfect if it allows invalid states. If an invalid state can be reached, then the program should be fixed to not allow invalid states.

and that's also basically the definition of a static type system.

Not exactly. It's beyond that: it extends the type system over program states. In other words, program states become strongly typed, and the compiler is able to match those types, just like the types of values.
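
A hedged sketch (my illustration, not a feature of any existing compiler) of "strongly typed program states" using phantom types in Haskell: the valid/invalid state of the string lives in its type, so passing a freed string to print is a type error.

data ValidS          -- phantom tag: the string may still be used
data InvalidS        -- phantom tag: the string has been freed

newtype Str state = Str String

readln' :: IO (Maybe (Str ValidS))   -- Nothing plays the role of NULL
readln' = do
  s <- getLine
  return (if null s then Nothing else Just (Str s))

printS :: Str ValidS -> IO ()        -- "in valid out valid"
printS (Str s) = putStrLn s

freeS :: Str ValidS -> Str InvalidS  -- "in valid out invalid"
freeS (Str s) = Str s

main :: IO ()
main = do
  mstr <- readln'
  case mstr of
    Nothing  -> return ()            -- the NULL check the text asks for
    Just str -> do
      printS str                     -- fine: str has type Str ValidS
      let _freed = freeS str
      -- printS _freed               -- rejected: Str InvalidS vs Str ValidS
      return ()

-- Caveat: without linear types the original 'str' handle can still be reused
-- after freeS, so this only approximates the poster's scheme.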

Extended static checking

This approach is sometimes known as "Extended Static Checking". See this earlier node on LTU for some links.

There are commercial tools that are used in safety-critical application development, e.g. SPARKAda from Praxis High Integrity Systems and Perfect Developer from Escher Technologies. I haven't used any of these.

Just to clarify about the halting problem and decidability of logical specifications... There are programs that are provably correct and clearly halt, such as:

main(){printf("Hello world\n");exit(0);}

and

add1(int x){return x+1;}

And there are programs that are clearly incorrect:

double sqrt(double x){return -34.12;}

(for the obvious specification) or that provably do not terminate:

main(){while(1)printf("Hello world\n");}

What the proofs of undecidability mean is that there will always be a gray area of programs that are totally correct, but that we can't prove correct, mixed in with programs that are incorrect (which we also can't prove incorrect). That doesn't mean that, given the right language and the right tools, you can't prove your program works (or at least meets some partial correctness criteria of the kinds proposed by the ESC people).

Tony Hoare recently gave a ta

Tony Hoare recently gave a talk on "The verifying compiler", which I was unable to attend. Anyone know anything about his recent work in this area?