Generic Discrimination: Sorting and Partitioning Unshared Data in Linear Time

Generic Discrimination: Sorting and Partitioning Unshared Data in Linear Time, Fritz Henglein. ICFP 2008.

We introduce the notion of discrimination as a generalization of both sorting and partitioning and show that worst-case linear-time discrimination functions (discriminators) can be defined generically, by (co-)induction on an expressive language of order denotations. The generic definition yields discriminators that generalize both distributive sorting and multiset discrimination. The generic discriminator can be coded compactly using list comprehensions, with order denotations specified using Generalized Algebraic Data Types (GADTs). A GADT-free combinator formulation of discriminators is also given.

We give some examples of the uses of discriminators, including a new most-significant-digit lexicographic sorting algorithm.

Discriminators generalize binary comparison functions: They operate on n arguments at a time, but do not expose more information than the underlying equivalence, respectively ordering relation on the arguments. We argue that primitive types with equality (such as references in ML) and ordered types (such as the machine integer type), should expose their equality, respectively standard ordering relation, as discriminators: Having only a binary equality test on a type requires Θ(n^2) time to find all the occurrences of an element in a list of length n, for each element in the list, even if the equality test takes only constant time. A discriminator accomplishes this in linear time. Likewise, having only a (constant-time) comparison function requires Θ(n log n) time to sort a list of n elements. A discriminator can do this in linear time.

If you're like me, at some point you probably heard about linear time sorts like radix sort and then dismissed them as an interesting curiosity --- the restriction to sorting numbers seems pretty limiting, after all.

Henglein did not, and in this paper he shows how to write a teeny-tiny library of type-directed combinators that can lift linear time sorts to structured types like lists and trees, and in addition lets you quotient out by order or repetition, if you want. It's terrifically pretty code.

Note: This is a link to the ACM digital library, and so is behind a paywall. I found a link to a tech report by Henglein covering the same material, but I don't exactly what is different.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Source Code?

I'm in the middle of reading this paper, and spent some time trying to get the code to work in ghc with rather poor success.

Do you know of anywhere to find the complete working source code for this, as my Haskell Fu is evidently not sufficient to make the leap?

(Even some hints of what to import might do the trick...)

Re: Source Code ?

The code in this article works fine with ghc (6.8+)

Pragmas / Imports:

{-# LANGUAGE GADTs #-}
import Data.List hiding (sort)
import Data.Array
import Data.Char

The GADT definition looks like this

data Order where
  Char :: Order Char
  Nat :: Int -> Order Int
  ...

Copy disc, part, sort and usort, and you have a working implementation:

main = print $ sort (Inv $ list $ Map toLower Char) (words "Ab cd aaa EF gc aA")

Thanks!

Thanks Benedikt!

That was just the mojo that I needed.

(For those following at home, you also need to copy over list and fromList)

Containers?

Would it be possible to build functional containers on top of techniques like these (as an alternative to hash tables)? In particular, how does this relate to Hinze's "Generalizing Generalized Tries"? The interfaces are all functional even though the underlying primitives are strongly imperative, but I'm not sure there's anyway to transform this stuff into an online setting where elements are added one at a time.

He discusses this question

He discusses this question in the future work section -- it hasn't been done yet, but he seems optimistic. If you squint at the code for the case of pair types, you basically build a list of list of groups, and then flatten it to a list of groups. This really seems to suggest that something trie-ish is going on and is getting hidden by the choice of lists.

Nice but...

I read this the other day after it was mentioned on the types mailing list and wound up pretty disappointed by the end. I was about halfway through the large block of code in the middle when I realized that all that the author was going to present was a setup for radix sorting (with all the good and bad features that come with it) some data structures/types other than the standard int & string examples. Whoo.

I don't really know what I was expecting, but I guess I was looking for more than a beautiful implementation of a not particularly breathtaking idea.

Comparefunction <> Discriminator

Sorry for maybe asking the quite obvious... can someone please clarify in layman terms what the difference between a comparefunction and a discriminator really is? Are those compositable comparefunctions? Any hints to the ideas presented/described in non haskel syntax?

Thanks in advance. Cheerio.