archives

Exhaustiveness checks for algorithms that manipulate imperative data structures

Let's start with something that ought not to be controversial, although it probably is: Programming languages exist to express algorithms! Thus, programming languages ought to be judged according to the ease with which they express algorithms, especially the not quite trivial ones. Moreover, algorithms are known to be tricky beasts - you make one slight mistake and the thing does something completely different from what you want. Thus, it is reasonable to emphasize how reliably humans can write correct algorithm implementations as the main criterion for judging languages.

In this regard, one of the greatest inventions in PL history are algebraic data types and pattern matching with automatic exhaustiveness checks. As long as your programs only manipulate purely functional data structures, it is often possible to write them in such a way that the compiler can verify the exhaustiveness of your case analyses. This is a huge win for reliability, because computers are ridiculously better than humans at performing exhaustive case analyses.

Unfortunately, this advantage of pattern matching over if/else chains completely vanishes as soon as you introduce imperative data structures. As the state of an imperative data structure evolves over time, you gain the ability to perform new operations on it, and also lose the ability to perform operations on it that were previously possible, in a way that algebraic data types are “too static” to correctly describe. For example, suppose you are writing a compiler and, for efficiency reasons, you want to perform syntax tree transformations in place whenever possible. In particular, you want to replace variable names with de Bruijn indices in place. So you define your abstract syntax like this:

datatype state = Name of string | Index of int

type var = state ref

datatype term
  = Var of var
  | Apply of term * term
  | ...


However, then it becomes impossible to express invariants such as “all names have been converted to indices”, i.e., “the abstract syntax tree has been converted to an abstract binding tree”. You might want to define a helper function like

fun getIndex v =
  case !v of
      Index n => n
    | Name _ => raise UnboundVariable


This is terrible, because it means that you have failed to convince the type checker that your case analysis is exhaustive, i.e., “no variable is in the Name state by the time you call getIndex”.

For a more serious example, consider the algorithms for computing the strongly connected components of a directed graph due to Tarjan and Gabow. Both algorithms construct an auxiliary stack whose topmost few nodes belong to the strongly connected component of the node currently under consideration. When the strongly connected component has been fully explored, this stack is popped. See here and here, respectively. We know that, at this point, the stack is nonempty, this does not need to be checked at runtime. And yet, in every memory safe language that I know of, the check will be performed, creating a useless unreachable control flow path.

Long ago, when I was a reckless C++ programmer, and thought I could keep every detail in my head, I would have replied “well, then a memory unsafe pop it is!” Nowadays I expect better from programming languages. Are there any programming languages that do a better job of tracking what situations may actually arise when manipulating imperative data structures?

EDIT: I am aware of typestates. They could help with the first example, but I do not see how they help with the second.