Revisiting Google's MapReduce

Google's MapReduce Programming Model -- Revisited. Ralf Lämmel.

MapReduce is a programming model (and an associated implementation) used at Google for processing large amounts of input data (such as millions of documents) and generating keyed or indexed query results (such as the indexes used for Google's web search). The programming model is stunningly simple, and it allows the programmer to largely abstract from the parallel and distributed execution of the data-processing computations. Parallel execution is promoted by the assumed skeleton of MapReduce computations. Load balancing, network performance and fault tolerance are taken care of by the MapReduce implementation.

We revisit the MapReduce programming model in an attempt to provide a rigorous description of the model. We focus on the key abstraction for MapReduce computations; this abstraction is parameterized by the problem-specific ingredients for data extraction and reduction. We use Haskell as a lightweight specification language to capture the essence of MapReduce computations in a succinct, executable and strongly typed manner. Our study substantiates that clarity, generality and correctness of designs (or the presentations thereof) are easily improved, if modest functional programming skills are put to work.

This is a fun example of type-directed exploration and prototyping, and pokes some gentle fun of Google's (ab)use of standard terminology. It's neat to get a glimpse into the programming process of a very experienced Haskell programmer.

(From the forums.)

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

It's neat to get a glimpse

It's neat to get a glimpse into the programming process of a very experienced Haskell programmer.

Quite. Thanks for posting this.

Enjoyable read

I also really enjoyed reading this - it gives a good overview of applying functional language idioms in an informal sense. I've worked through Hudak's School of Expression, for example, yet I had no idea GHC's Data library existed! Any more articles like this out there?

MapReduce itself was

MapReduce itself was discussed before.

The undefinedness idiom

... is one very simple (and, boy, so cool) trick I had never seen before. Googling for it resulted in Lammel's paper being the first hit, without any seemingly relevant results. Does this go by some other name in other programming language communities? Is it a standard trick on experienced Haskell programmers belts? I'm just amazed I had never seen it before, and that google doesn't seem to have, either.

Not usually named

This isn't really a programming idiom though there are idioms related to it. It may arguably be a deriving-from-specifications idiom. It is not uncommon to stick something like error "not implemented" or something more informative to get compiling code. Usually, though, this is not used to derive the types and implementation, except, of course, when you are doing something akin to deriving-from-specification; rather, it is used simply to work with the partially defined code which may involve seeing what types are inferred for various terms given terms with known or given types.

Updated

This paper was updated on April 10, 2007 (there seems to be a typo in the paper which lists the draft updated as of April 10, 2006). According to Ralf's blog entry (which I can't find currently), this was almost a complete re-write.

Blog post

The blog post is here.