## Exhaustiveness checks for algorithms that manipulate imperative data structures

Let's start with something that ought not to be controversial, although it probably is: programming languages exist to express algorithms! Thus, programming languages ought to be judged by the ease with which they express algorithms, especially the not-quite-trivial ones. Moreover, algorithms are known to be tricky beasts: make one slight mistake and the thing does something completely different from what you want. Thus, it is reasonable to take how reliably humans can write correct algorithm implementations as the main criterion for judging languages.

In this regard, one of the greatest inventions in PL history is the combination of algebraic data types and pattern matching with automatic exhaustiveness checks. As long as your programs only manipulate purely functional data structures, it is often possible to write them in such a way that the compiler can verify the exhaustiveness of your case analyses. This is a huge win for reliability, because computers are ridiculously better than humans at performing exhaustive case analyses.

Unfortunately, this advantage of pattern matching over if/else chains completely vanishes as soon as you introduce imperative data structures. As the state of an imperative data structure evolves over time, you gain the ability to perform new operations on it, and also lose the ability to perform operations on it that were previously possible, in a way that algebraic data types are “too static” to correctly describe. For example, suppose you are writing a compiler and, for efficiency reasons, you want to perform syntax tree transformations in place whenever possible. In particular, you want to replace variable names with de Bruijn indices in place. So you define your abstract syntax like this:

```sml
datatype state = Name of string | Index of int
type var = state ref
datatype term
  = Var of var
  | Apply of term * term
  | ...
```

However, then it becomes impossible to express invariants such as “all names have been converted to indices”, i.e., “the abstract syntax tree has been converted to an abstract binding tree”. You might want to define a helper function like

```sml
fun getIndex v =
  case !v of
    Index n => n
  | Name _ => raise UnboundVariable
```

This is terrible, because it means that you have failed to convince the type checker that the `Name` case is unreachable, i.e., that no variable is still in the `Name` state by the time `getIndex` is called.
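For contrast, here is a sketch of the standard purely functional workaround: parameterize the syntax tree over the variable representation, so that "all names have been converted to indices" becomes a property of the type. The names below (`rawTerm`, `coreTerm`, `resolve`, `indexOf`) are mine, not from the post, and binders are elided; the cost is exactly what motivated the ref cells in the first place, namely that the conversion must rebuild the tree instead of updating it in place.

```sml
(* Sketch: parameterize terms over the variable representation.
   rawTerm uses names; coreTerm uses de Bruijn indices. *)
datatype 'v term = Var of 'v | Apply of 'v term * 'v term

type rawTerm  = string term  (* variables are still names *)
type coreTerm = int term     (* variables are de Bruijn indices *)

(* Position of x in the binding environment, innermost binding first. *)
fun indexOf (x : string) (env : string list) : int option =
  let
    fun go _ [] = NONE
      | go i (y :: ys) = if y = x then SOME i else go (i + 1) ys
  in
    go 0 env
  end

(* Consumers of coreTerm need no Name case at all, but resolution
   itself must copy the whole tree. *)
fun resolve (env : string list) (t : rawTerm) : coreTerm =
  case t of
    Var x =>
      (case indexOf x env of
         SOME i => Var i
       | NONE => raise Fail "unbound variable")
  | Apply (f, a) => Apply (resolve env f, resolve env a)
```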

For a more serious example, consider the algorithms for computing the strongly connected components of a directed graph due to Tarjan and Gabow. Both algorithms construct an auxiliary stack whose topmost few nodes belong to the strongly connected component of the node currently under consideration. When the strongly connected component has been fully explored, this stack is popped. See here and here, respectively. We know that, at this point, the stack is nonempty, so its nonemptiness does not need to be checked at runtime. And yet, in every memory-safe language that I know of, the check will be performed, creating a useless, unreachable control-flow path.
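To make the complaint concrete, here is a minimal sketch (with illustrative names, the stack as a mutable list) of the popping step. The compiler forces a branch for the empty stack even though the algorithm's invariant guarantees that branch can never be taken:

```sml
(* Sketch of the SCC popping step; names are illustrative.
   v is the root of the component that has just been fully explored. *)
val stack : int list ref = ref []

fun push (v : int) = stack := v :: !stack

(* Pop nodes up to and including v. The invariant guarantees that v is
   on the stack, yet the nil case must still be written: a runtime
   check guarding a control-flow path that can never be reached. *)
fun popComponent (v : int) : int list =
  case !stack of
    [] => raise Fail "unreachable: v is known to be on the stack"
  | w :: rest =>
      (stack := rest;
       if w = v then [w] else w :: popComponent v)
```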

Long ago, when I was a reckless C++ programmer and thought I could keep every detail in my head, I would have replied "well, then a memory-unsafe pop it is!" Nowadays I expect better from programming languages. Are there any programming languages that do a better job of tracking which situations can actually arise when manipulating imperative data structures?

EDIT: I am aware of typestates. They could help with the first example, but I do not see how they help with the second.


### Dependent Types

Trying to find tricky type-level encodings that automatically check invariants like this doesn't seem like a good idea to me. I'm sure it's possible with GADTs or something, but that is a dark path. Wait for a well-designed language for ordinary programming that incorporates a good theorem prover, one that is relatively easy to use for establishing simple invariants. This should be "hit it with a hammer", not a research project on type systems.

### I do not mind manually

I do not mind manually supplying proofs if that is what it takes. After all, in the case of the strongly connected components algorithms, I already have the proofs that the algorithms are correct.

However, I doubt GADTs (or any other typical GHC extension) will help. The crucial observation is that, given an edge A -> B in the graph, the recursive call dfs(B) will not pop A from the auxiliary stack. At most, it will pop B from the stack, in which case B is the root of a strongly connected component, and A is not reachable from B. This is why we can be sure that the stack is nonempty when it needs to be nonempty, since, in particular, it contains A. Moreover, "until you have popped A" is a valid termination condition for the popping loop. There are far simpler things that GADTs cannot easily express.

As you say, expecting absolutely everything to be verified by a type checker is a dark path. IMO, verification ought to be a collaborative effort between humans and computers. Computers should perform case analyses, because they are better at it than us. On the other hand, we are better at figuring out why algorithms terminate, assuming that no case has been missed in the case analysis. (This boils down to the fact that there is no universal recipe for determining what the right inductive hypothesis to use is.)

---

On a completely unrelated matter, I really enjoyed your paper on the deletion operation in purely functional red-black trees. Inspired by it, I managed to implement red-black tree operations as total functions in Standard ML, i.e., all pattern matching is exhaustive and all control flow paths are legitimately reachable. (It is the success of this kind of thing that I am trying to replicate for imperative data structures!)

Here is the interface of search trees, manipulated using zippers.

Here is the implementation of red-black trees.

Here are the uses of search trees in implementing set and map ADTs.
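Since the links may not survive, here is my guess at the rough shape of such a zipper-based interface; this is a hypothetical sketch of mine, not the linked code, and all names in it are illustrative.

```sml
(* Hypothetical sketch of a zipper-based search tree interface. *)
signature SEARCH_TREE =
sig
  type 'a tree
  type 'a zipper   (* a tree together with a focused position *)

  val empty  : 'a tree
  (* Focus on the element matching the ordering function,
     or on its would-be insertion point. *)
  val find   : ('a -> order) -> 'a tree -> 'a zipper
  (* Rebuild the tree, rebalancing on the way back up. *)
  val insert : 'a -> 'a zipper -> 'a tree
  val delete : 'a zipper -> 'a tree
end
```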

### Constraint Solver Type Checker

The F* language would be suitable for expressing imperative algorithms. It supports flexible refinement types. The cost is that "type checking" invokes a full-fledged SMT solver.

Cornell's SHErrLoc project also translates program behavior into constraints. Although it is focused on diagnosing bugs, its constraint language can express flexible type systems.

It is feasible to translate imperative code annotated with preconditions, postconditions, or assertions into constraints, and then solve for inputs that cause assertion failures. But we do need clear boundaries to limit the scope of the solver.

### Thanks! F* looks

Thanks! F* looks particularly promising, because F* programs can be extracted to C code, which, at least in principle, should not contain unnecessary runtime safety checks for already established preconditions. I will play with this thing for a while.