## We're doing it wrong....

Poul-Henning Kamp makes an interesting case for his opinion that CS education in algorithm complexity analysis focuses on a model of computation simplified to the point of being stupid and wrong, and that this leads students to real-world failures in writing performant code.

By taking page swapping into account, he has achieved a one-to-two order-of-magnitude speedup in normal use cases over a "standard, provably optimal" algorithm used throughout computer science. He argues that the model of computer memory as having uniform access time is, and has been for several decades, wrong and misleading, to the detriment of students and standards of performance. He's on about virtual memory and disk swapping in particular, but his points are just as valid in a system with a high-speed memory cache fronting a slower main memory, which is essentially all of them. Here is a link to his article:

You're doing it wrong

This is of interest to language developers because most people aren't actually writing those fundamental algorithms anymore. They're using the libraries that the language developers provide. So let's not provide stupid and wrong, eh?

On the other hand, I don't see how his described application (a server) requires the ordering that a tree structure provides. It isn't in the business of efficiently providing lots of dynamic ordered lists of the documents it serves; it's in the business of efficiently providing lots of served documents. If I were attempting to optimize access to documents, I'd be creating a hash table of pointers to these documents rather than a tree of any description, thus never needing to access more than one page (or maybe two for rehashes in the small fraction of cases where hash bucket overflows have happened) in order to find out where the document is kept.

On the gripping hand, his described tree structure is perfect for collections that have to support a lot of insertion and deletion while remaining ordered, so I can see it as having good use to implement, say, database tables.

## Comment viewing options

### Well-understood

Cost models for cache and I/O efficient algorithms are fairly well-developed and well-understood, see e.g. one of the most recent papers on the subject at this year's POPL [1] and follow the references from there.

Of course, it may well be that this subject hasn't sufficiently trickled down to teaching yet.

### Might still be wrong...

because instead of doing performance analysis he should maybe be doing power analysis

### Or energy analysis.

Or energy analysis.

### He's probably assuming a

He's probably assuming a polynomial reduction between them...

### Eh.

Hasn't somebody confused complexity with latency? Isn't complexity an abstract way to get a grip on the number of operations performed as a function of the input size? Isn't it a tool deliberately kept in the abstract so as not to be tied to the specifics of some fashionable design paradigm --- think looooong term!

I wasn't taught to use a "big O" value as a measure of latency. Were you?

Maybe I'm wrong. I just checked out
http://en.wikipedia.org/wiki/Big_O_notation
and it actually mentions processing time:

big O notation is used to classify algorithms by how they respond (e.g., in their processing time ...

Are students being taught this?

### I hope Kamp meant "provably

I hope Kamp meant "provably optimal" in a tongue-in-cheek way. As you say, big O notation doesn't talk about specific values; it talks about scalability, usually at scales much larger than current hardware.

The best example I've seen of a "provably optimal" yet practically useless algorithm is Hutter Search.

### Cache-oblivious algorithms

Cache-oblivious algorithms have been developing nicely for the past 15 years or so. Relevant to this article are the cache-oblivious b-trees, which show optimal performance at all levels of the memory hierarchy, surpassed only by cache-aware equivalents.

### how do we give this a PL focus?

I have worked on both languages and HTTP caches of exactly that sort. I might be one of the few folks in the intersection, so I'd like to contribute, but you didn't ask a question. I agree with most things you said.

So let's not provide stupid and wrong, eh?

Can you give me an example of a bad PL choice in this vein? The problem is large data structures, to the extent they cause memory pressure on cache line or disk resources. Normally that will be a library and not a PL, though a PL can mitigate or worsen problems by, for example, how it manages concurrency and presents latency as sync or async. Do you mean something like that?

On the other hand, I don't see how his described application (a server) requires the ordering that a tree structure provides.

A hashmap is better if the index fits in real memory. A partially disk-based index is simpler as a tree. So far I have used hashmaps for systems dedicating resources to a cache for indexing. (I'm the nutball who wants every operation constant time when possible.)

I'm not familiar with Varnish, but I have worked on a similar problem. If I rewrote the first sentence of Kamp's second paragraph to describe myself instead, it would go like this:

A dozen years ago, I fell into some interesting company and became the author of a closed-source HTTP accelerator called Pivia, basically an HTTP cache to put in front of slow Web servers.

It was a Squid replacement in several senses, including dynamic content caching, because the server was able to rewrite parts on the fly depending on values in the request header, and to use a pre-compressed gzip version even with the dynamic changes included, without recompression. (I hacked zlib to emit clear-text stubs at dictionary-flush boundaries that could be hot-replaced without changing any compressed content; the cached template described the edits necessary using bytecode for a virtual machine.) The latency budget was sub-millisecond to parse HTTP headers and find the description of how this app was handled, then a few milliseconds to serve a full reply. The tech is still a going concern, but other folks rewrote it all.

Crash: Wait, if you changed zlib, how does a browser decode correctly?
Wil: I didn't change zlib. The format spec allows uncompressed fragments.
Crash: Oh, because that's the fallback strategy when shrinking fails?
Wil: Right, because incompressible content is smallest in the original.
Crash: Then I don't get it: what did you change?
Wil: I added a new API to do it to literals on purpose during deflate.
Crash: That hardly sounds like work. So what was the tricky part?
Wil: Showing where a clear literal lands, to later patch at that offset.
Crash: What about the crc32 checksum in the trailer at the end?
Wil: Oh yeah, you have to regenerate the checksum and patch it too.
Crash: And patch the final size. Each clear literal flushes the dict?
Wil: Only if you don't know if a replacement will have identical size.
Crash: Uh, I guess zlib uses offsets as back references named in a dict?
Wil: Yep. A new size patch breaks back refs if you don't flush the dict.
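The trick in this exchange can be sketched in a few lines. This is a hypothetical reconstruction, not the Pivia code: Python's zlib at compression level 0 emits stored (uncompressed) deflate blocks, so a same-length literal can be patched in place, after which only the trailer checksum needs regenerating. (The raw zlib wrapper shown here ends in Adler-32 where gzip ends in CRC-32, but the principle is identical, and a same-length edit needs no size patch and no dictionary flush.)

```python
import zlib

original = b"Hello, NAME! Welcome back."
template = zlib.compress(original, 0)  # level 0: stored (uncompressed) deflate blocks

# Layout for one small stored block: 2-byte zlib header, 1 block-header byte,
# 2-byte LEN, 2-byte NLEN, then the literal payload, then 4-byte big-endian Adler-32.
payload_off = 7
assert template[payload_off:payload_off + len(original)] == original

# Hot-patch the literal with a same-length replacement: no size patch needed.
edited = original.replace(b"NAME", b"Crsh")
patched = bytearray(template)
patched[payload_off:payload_off + len(original)] = edited

# The trailer checksum no longer matches, so regenerate and patch it too.
patched[-4:] = zlib.adler32(edited).to_bytes(4, "big")

assert zlib.decompress(bytes(patched)) == edited
```

A standard decompressor accepts the patched stream unchanged, because stored blocks are part of the deflate format and the checksum has been made consistent again.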

### I was pointing out interesting rather than asking a question.

I was just pointing out what I thought was an interesting article.

This is more relevant IMO to standard library design and implementation than to language design in general. If you're a Guido van Rossum or somebody who's trying to provide comprehensive "batteries included" libraries with their language, it's worth noting, first, that the library design should avoid anything so implementation-dependent as to forbid cache optimization, and second, that this is a worthwhile optimization to apply to search trees when they are used on large datasets.

That said, I think that if you're going to optimize a tree for cache awareness, you can probably do better than the design presented in this paper and eliminate those two nonbranching nodes in each cache page. Instead, let one of the "leaf" nodes in each cache page have two subnodes. You give up the left-right symmetry of each cache page, but firstly on average you go through fewer cache pages on your journey from the root to the datum and secondly you make the cache-optimized algorithm big-O equivalent to the theoretically-optimal algorithm.

On closer reading I see that he's implementing a heap rather than a tree, and that his purpose is to find in logN time the oldest item so that it can be expired from cache.

I have always been more willing to accept a constant-time heuristic than a log-n optimal, so I would organize cache as a hash table and expire items from cache when a bucket overflow occurs. They would not even usually be the single oldest item, but they would on average be among the 1/M oldest where M is the number of items per hash bucket.
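That heuristic can be sketched in a few lines (hypothetical code, not Varnish): fixed-capacity buckets kept in insertion order, with the bucket's oldest entry expired on overflow, so the victim is on average among the oldest 1/M of the cache.

```python
class BucketCache:
    """Hash cache with M-entry buckets; on overflow, expire the bucket's
    oldest entry, which is on average among the oldest ~1/M of the cache."""

    def __init__(self, nbuckets: int = 256, bucket_size: int = 4):
        # Each bucket is a list of (key, value) pairs, oldest first.
        self.buckets = [[] for _ in range(nbuckets)]
        self.bucket_size = bucket_size

    def put(self, key, value):
        b = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(b):
            if k == key:          # refresh an existing key
                del b[i]
                break
        b.append((key, value))
        if len(b) > self.bucket_size:
            b.pop(0)              # bucket overflow: expire the bucket's oldest entry

    def get(self, key):
        for k, v in self.buckets[hash(key) % len(self.buckets)]:
            if k == key:
                return v
        return None
```

Every operation is constant time in the number of cached items (bounded by the fixed bucket size), at the cost of evicting an approximately-oldest rather than the globally oldest entry.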

I don't see a problem with having a hash map that doesn't fit in main memory. As a result of performing the hash function on the key, you find which hash bucket something ought to be in. You calculate the address of that bucket, and the virtual memory manager loads it directly when you access that address. You read the hash bucket (which may be a few tree-nodes or not) to find the address of the actual item. That makes one page access spent on indexing rather than one per tree level in the usual tree implementation, or one per 3 or 4 tree levels in the cache-optimized tree implementation.
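The bucket-address arithmetic can be sketched with a memory-mapped file standing in for the swappable index (all names and sizes here are illustrative). Each lookup computes the bucket offset and touches exactly one fixed-size slot, letting the VM system page it in on demand; collisions simply overwrite, so this is a lossy cache index.

```python
import mmap
import os
import struct
import tempfile
import zlib

BUCKETS, SLOT = 1024, 64  # fixed-size buckets addressed purely by arithmetic

path = os.path.join(tempfile.mkdtemp(), "index.bin")
with open(path, "wb") as f:
    f.truncate(BUCKETS * SLOT)  # sparse 64 KiB index file

f = open(path, "r+b")
mm = mmap.mmap(f.fileno(), 0)

def put(key: bytes, addr: int) -> None:
    # One arithmetic step to the bucket; the VM pages it in when touched.
    off = (zlib.crc32(key) % BUCKETS) * SLOT
    mm[off:off + SLOT] = struct.pack("<Q56s", addr, key)  # collision: overwrite

def get(key: bytes):
    off = (zlib.crc32(key) % BUCKETS) * SLOT
    addr, stored = struct.unpack("<Q56s", mm[off:off + SLOT])
    return addr if stored.rstrip(b"\0") == key else None

put(b"/docs/report.html", 4096)
assert get(b"/docs/report.html") == 4096
assert get(b"/docs/missing.html") is None
```

Whether the slot is resident or swapped out is invisible to this code; that is the point of letting the virtual memory manager do the I/O.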

Finally, if we want to treat big-O strictly as a complexity notation then we need something else which is specifically a time-of-response estimate. Latency is at least as important as complexity, and complexity gives us no way to talk about the effects of, eg, parallelization gains, communications overhead, and cache behavior -- all three of which are more important to actual performance now than they ever have been and show no signs of abating importance.

Maybe all these things are architecture dependent. But so is every actual computer ever built. Engineers have to study a lot of mathematics, but at some point mechanical engineers also have to study the properties of real available materials if they want to produce usable results; similarly software engineers need to study the properties of actual available computer systems. A computer science education which neglects real, available computing systems is only half the education a software engineer needs.

And maybe this is the division between theoretical computer science as a mathematics discipline and applied computer science as an engineering discipline.

### Complexity perfectly adapts to latency, caches, parallelism...

Finally, if we want to treat big-O strictly as a complexity notation then we need something else which is specifically a time-of-response estimate. Latency is at least as important as complexity, and complexity gives us no way to talk about the effects of, eg, parallelization gains, communications overhead, and cache behavior -- all three of which are more important to actual performance now than they ever have been and show no signs of abating importance.

[...]

And maybe this is the division between theoretical computer science as a mathematics discipline and applied computer science as an engineering discipline.

Sorry, but this is all wrong. Big-O notation is simply a way to look at asymptotic behavior (what happens when linear factors dominate constant start-up cost), and can be applied to reason about latency, parallelism, cache behavior and what not. Links have been given before in the discussion to complexity studies of so-called "cache-oblivious algorithms", that take exactly a complexity approach to designing cache-efficient data structures and algorithms. (The Blelloch and Harper paper quoted above is even more ambitious, as it is about giving an operational semantics to existing functional programs that makes them provably locality-efficient without introducing caching concerns in the code or programming language.)

There are definitely areas where the constant factors matter (and a few cases of algorithms with good asymptotic complexity but dominating constant factors that make them ill-adapted to practical use), and everyone knows that. There has been scientific work on how to study those constant factors, and it is not a revolutionary divorce from complexity theory, but mostly a refinement of those techniques to get more precise approximations. These are used to compute, for example, Worst-Case Execution Time for critical embedded software.

Anyone you hear saying that computer scientists are "doing it all wrong" is sensationalist and egocentric at best, and possibly ignorant as well.

### Exactly, big-O notation is

Exactly, big-O notation is just a notation for studying the asymptotic behavior of functions. In computer science that function is almost always running time as a function of the input size or memory use as a function of the input size. But big-O notation is just a mathematical tool for studying any kind of function. It's used all the time in mathematics and physics. For example a mathematician may write log(1+x) = x + O(x^2).

If you want to be precise, then O(.) is a higher order function:

O : (num -> num) -> Set (num -> num)
O(f) = {g | g : num -> num, lim g(x)/f(x) < ∞}


It takes a function f and returns the set of all functions that do not grow faster than f. The notation is really unfortunate: the first problem is that where the limit is going is implicit. In CS it's usually the limit of n -> ∞, but for the log example the limit is x -> 0. The second problem is that usually the function is written as an expression. The correct notation would be O(fun n -> n^2), but people usually write O(n^2). The third problem is that people implicitly let the whole set stand for an element of the set, which is really convenient but leads to nonsense like n = O(n^2) which should really be written n IN O(n^2), or if you want to be completely correct: fun n -> n IN O(fun n -> n^2). It gets even worse for multiple parameters.
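For reference, the quantifier form of the definition (for the usual CS limit of n -> ∞) makes the implicit limit explicit, and the limsup variant covers oscillating ratios that a plain limit misses:

```latex
g \in O(f) \iff \exists C > 0,\ \exists N,\ \forall n > N :\ |g(n)| \le C\,|f(n)|
\qquad\text{equivalently}\qquad
g \in O(f) \iff \limsup_{n \to \infty} \frac{|g(n)|}{|f(n)|} < \infty
```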

### just discussion, not disagreeing

I suppose standard library design and implementation might cater to large data structure efficiency by offering a more complex suite of options for things like maps, when they might get large and have several subtle aspects loosening or tightening requirements. To developers, a scaling subset of a library might feel disproportionately complex, and complexity strikes a lot of folks as a bad code smell. (Allowing it in design does seem to induce descent into bike-shedding when an anything-goes attitude results.) I like the implicit idea that part of a library should focus on optimization detail.

I haven't read the paper closely, but I did notice a heap was used. That might be a good idea, to use a logN priority queue to pick LRU resources for replacement. But if a data structure is inside one address space, and is not persistent, I much prefer a doubly-linked list with most recently used items moved to the front in constant time, and popping the LRU end for replacement also in constant time.

Under concurrency that needs locks, a mutex protects an LRU list in its entirety, even when built using intrinsic links embedded in member objects within a cache. The tiny size of a critical section is good when a doubly-linked list can be edited this way in constant time, even if you have a cache miss on both the predecessor and successor siblings in the list on each edit.

If a thread-safe hashmap tracks member LRU status this way, the trick is using only one lock order to prevent deadlock: either bucket-then-LRU or LRU-then-bucket, but never mixed. Whichever order you pick, one direction of finding something you might want to remove from a map will be the wrong way, and will require you to release and relock in proper order, in which case the target might be gone after losing a race with another thread. If the LRU lock goes first, this is one reason to use an open/chaining hashmap, so you know which mutex to lock and which bucket to search when you have to go back after locking in the proper order. (One mutex per bucket is overkill; you only need a constant factor more mutexes than threads to suppress lock contention, then let many buckets share one mutex in a stripe.)
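The constant-time LRU discipline is easy to sketch. In this illustration Python's OrderedDict stands in for the intrusive doubly-linked list, and a single mutex covers both the map and the recency list together, which sidesteps the bucket-vs-LRU lock-ordering problem entirely (at the cost of the lock striping a real concurrent cache would want):

```python
from collections import OrderedDict
from threading import Lock

class LRUCache:
    """OrderedDict stands in for the intrusive doubly-linked LRU list:
    move-to-front and pop-oldest are both constant time."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.lock = Lock()  # one lock for map and list together: no ordering issue
        self.items = OrderedDict()

    def get(self, key):
        with self.lock:
            if key not in self.items:
                return None
            self.items.move_to_end(key)  # mark as most recently used
            return self.items[key]

    def put(self, key, value):
        with self.lock:
            if key in self.items:
                self.items.move_to_end(key)
            self.items[key] = value
            if len(self.items) > self.capacity:
                self.items.popitem(last=False)  # evict the LRU end
```

Every operation here is O(1); the heap buys you "exactly the oldest" at O(log n), while this buys "exactly the least recently used" at constant time.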

When the data structure is persistent instead, details in a good answer must cover more things, and might be dominated by specific internal algorithms already used by the code. ("Since I use X here to manage this part in one operation, over here I get Y almost for free, and I can use that to approximate LRU tracking.") Whoever writes the storage system usually gets to pursue their pet theory, as long as it survives criticism. I like direct comparison of alternatives to pick a winner, but that requires being able to write an alternative very fast, in case it loses and gets thrown away (or just relegated to testing only, because swapping alternatives is good for diagnosis).

I have always been more willing to accept a constant-time heuristic than a log-n optimal, so I would organize cache as a hash table and expire items from cache when a bucket overflow occurs. They would not even usually be the single oldest item, but they would on average be among the 1/M oldest where M is the number of items per hash bucket.

I sometimes do that with non-persistent hash tables, in pursuit of simplicity. When I use last-one-wins style hashmaps (and there are several in current code I maintain), I call them look-aside caches. This works well with a weak reference you can just throw away without cost when a newer collision replaces it.

Once a data structure exceeds what can fit in memory, and folks commit to using significant disk space, marketing folks are under pressure to cite the largest possible number they can in product specs. So an index will be under pressure to grow without bound until it competes with actual content indexed for available resources. (Folks will ask you to use the same resource twice, or even more, for both content and index; I wish I was joking.)

I don't see a problem with having a hash map that doesn't fit in main memory.

It sounds okay, I just haven't done it much. It's irritating to think about when planning to allow a lot of dynamic growth, and you wish the result just sounded simple. (When everything ends up being more complex in practice, it helps when the summary sounds simple in overview, when you retell yourself and other folks the story.) If you go with a tree, you can aim to make hierarchy very flat via big internal node fanouts, and then growth scenarios sound uniform and well-behaved. And when I think about making persistent hash table management more and more flexible, it starts to sound more tree-like. I don't have a specific objection to disk-based hash maps; retreat to trees just seems like a coping mechanism when handed ever-growing secondary storage resources one is obligated to use.

Whether a map (either hash table or tree) can be partly disk-based depends on the expected payoff of queries. When an algorithm is speculative, say because you want to see whether you recognize something you might dedup, you can't afford to hit disk on a maybe, which occurs often when granularity is high, unless success is likely. Speculative optimization can lose and make a system go slower than it would unoptimized, so tentative things that are very expensive must be avoided, or else worst-case results can be dismal and not just inferior. Hitting disk only seems worthwhile when you find an entire answer and you win outright as a result of a cache hit.

Latency is at least as important as complexity...

I'm with you there: it'd be nice to see this taught more to students. It seems a little fuzzy, like operations research, and less well-organized than topics more amenable to description via math. An incentive for educators does exist, even though the subject may be hard to present: folks who learn this can find jobs more easily, and being able to manage latency issues in code often pays really well. It would make a program popular while such students are in demand.

### If I were attempting to

If I were attempting to optimize access to documents, I'd be creating a hash table of pointers to these documents rather than a tree of any description, thus never needing to access more than one page (or maybe two for rehashes in the small fraction of cases where hash bucket overflows have happened) in order to find out where the document is kept.

O.K., suppose I'm a student or software engineer who is told to implement a hash table and has to select a hash function. All the introductory-level textbooks recommend algorithms which minimize the likelihood of a collision by distributing keys equally, and don't discuss possible cache-miss issues.

Now I read "we are all doing it wrong" and a number of LtU participants claim that all of this has been well understood for years if not decades, and that the solutions only need to trickle down to the educators. Knuth et al. have to rewrite their books. What exactly should be taught about hash functions instead? Of course, people with an engineering mentality would demand 'measure, measure, measure!', but a little more a priori knowledge would be appropriate.

### a few random hash considerations for folks new to a few ideas

Since it will take me some time to reply to Ray, I'll comment first here because it may be shorter. I guess I should write for folks who know little about hashing, but I'll assume you know everything I say already. Several factors affect what makes a good hashmap choice: 1) whether losing map members is okay (for a cache that need not be exact), 2) whether shadowing is needed (like lexical scoped variables hiding others with the same name), 3) whether access is concurrent (via threads or shared memory), and if so 4) whether locks are required or optional. Maybe I'll write this all up sometime; it would make sense to provide sample code in each case to avoid the sin of being too abstract.

Two basic formats are called open and closed in the literature, referring to whether buckets are open to adding more members (typically in a linked list), or whether buckets are closed to adding more members (typically causing linear probing). I usually contrast those as open/chaining vs closed/probing. Which is better depends on answers to factors noted. A closed/probing map is smaller, touches fewer cache lines per access, and can be done in very cheap lockless manner when members can be lost in a cache. You can even corrupt parts of such a map with random writes without breaking it, if you're careful. (About six years ago, in an old cache incarnation, I switched to such a hashmap format in a 32-bit Linux system with 3GB of real memory, of which 1.5GB was in my map. Wild memory stompers had a 50% chance of corrupting my map, so I got tired of it and arranged that it survived corruption.) Linked list buckets in an open/chaining map can represent shadowing easily, and can be more suited for an exact membership problem, such as when managing in-memory cache state that must not be leaked. It's also better if map entries are also members of an LRU list for constant time reclamation of most stale entries.
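A minimal sketch of the closed/probing, loss-tolerant flavor (hypothetical code, not the system described above): linear probing over one flat slot array, with a bounded probe run and last-one-wins overwrite when the run is exhausted, which is acceptable when the map is only a cache.

```python
class LossyProbingMap:
    """Closed/probing hashmap that may drop members: acceptable for a cache.
    One flat array, no chains, so each lookup touches few cache lines."""

    def __init__(self, nslots: int = 1024, max_probe: int = 8):
        self.slots = [None] * nslots  # each slot holds (key, value) or None
        self.nslots = nslots
        self.max_probe = max_probe

    def _slot(self, key, i):
        return (hash(key) + i) % self.nslots  # linear probing

    def put(self, key, value):
        for i in range(self.max_probe):
            j = self._slot(key, i)
            if self.slots[j] is None or self.slots[j][0] == key:
                self.slots[j] = (key, value)
                return
        # Probe run exhausted: overwrite the home slot, losing its old member.
        self.slots[self._slot(key, 0)] = (key, value)

    def get(self, key):
        for i in range(self.max_probe):
            j = self._slot(key, i)
            if self.slots[j] is None:
                return None  # an empty slot terminates the probe run
            if self.slots[j][0] == key:
                return self.slots[j][1]
        return None
```

An open/chaining variant would replace the flat array with per-bucket linked lists, which supports exact membership and shadowing but costs extra cache-line misses per chain hop.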

If you're interested, I can tell you how to write a thread-safe HTTP cache using an open/chaining hashmap with an LRU list for replacement, if membership must be exact to manage RAM in the cache. A short description isn't a whole lot longer than the last paragraph.

... told to implement a hash-table and has to select a hash function.

Empirical tests to measure collision rates are a good idea, but picking sufficiently representative key sets is also a problem a critic can challenge by asking, "But what does it do when keys look like this instead?"

However, crc32 or crc64 both make very good hashes, despite the fact that their collision resistance is primarily designed to detect bursts of bit errors in communication transmissions. After my last round of empirical tests, I even stopped using prime numbers in hashmap sizes when the hash is crc64, because it made no difference at all, even when I used power-of-two sizes. (Distributed Bloom filter digest management is easier with power-of-two sizes.) If you have no idea what will appear in keys, a CRC is fast and collision-resistant. But if you know a lot about the keys, you can craft a hash function specifically for the expected key population.
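A quick illustration of both points, using Python's zlib.crc32 (the key population below is made up for the demonstration): a power-of-two bucket count reduces the modulo to a mask, and a crude empirical check shows even not-very-random sequential keys spreading evenly.

```python
import zlib
from collections import Counter

def bucket_of(key: bytes, nbuckets: int) -> int:
    # With a well-mixing hash like CRC-32, nbuckets need not be prime;
    # a power of two turns the modulo into a mask.
    assert nbuckets & (nbuckets - 1) == 0, "nbuckets must be a power of two"
    return zlib.crc32(key) & (nbuckets - 1)

# Crude collision-rate check over a highly regular key population.
counts = Counter(bucket_of(b"doc-%d" % i, 256) for i in range(10000))
assert len(counts) == 256                       # every bucket gets used
print("longest bucket:", max(counts.values()))  # mean occupancy is ~39 here
```

This is the "measure, measure, measure" step from the parent comment, done cheaply before committing to a hash function.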

All the introductory-level textbooks recommend algorithms which minimize the likelihood of a collision by distributing keys equally, and don't discuss possible cache-miss issues.

If there's any tendency to cluster, even if actual collisions are rare, it risks causing linear search over a substantial subset of all map keys. A lot of effort should go into ensuring this doesn't happen, and one should collect stats, such as the longest single search ever observed. In a cache where losing members is allowed, because it's just an optimization, it's a good idea to empty slots specifically to terminate long runs of used slots in a closed/probing hashmap. Max capacity must also be enforced. If a map becomes full, you can use the one's complement of a hash to start a search to remove a "random" old map member; this causes exponential decay of old cache entries when nothing else removes them.

In a really big hashmap, typically every single access is a cache line miss. Lists in open/chaining hashmap buckets cause a constant factor more cache line misses, and it's very significant, so closed/probing maps are preferred for performance.