MapReduce

Hi

I was just reading about MapReduce for my final year project work.

I got confused in the middle. What I thought is: "MapReduce deals only with key/value pairs. To fit a problem into MapReduce we should find the key/value pairs."

I want to know whether I'm right or wrong.

I got confused after looking at the explanation on Wikipedia. The following is the content from the Wikipedia article about MapReduce:

"Continuing the previous example, what if one wanted to know the average of the test scores? One could define a reduce function which halved the size of the list by adding an entry in the list to its neighbor, recursively continuing until there is only one (large) entry, and dividing the total sum by the original number of elements to get the average."

Here in the map function we are simply adding up the test scores... we are not using any key/value pair. I'm totally confused.

I might be wrong at any point... please, someone help me out. Am I wrong in my basic understanding of MapReduce itself? I'll be thankful if anyone explains it to me clearly.

Please help me out so I can successfully complete my final year project.

Jaya

Connection Machine

Look at the Connection Machine Lisp language. Map is similar to the alpha operation, and reduce is the beta operation. (The true beta can produce a set of results; the Wikipedia article claims MapReduce produces only one, but it's wrong.) The difference between map and alpha is that map produces new indices as well as values for the intermediate result.

The Wikipedia article doesn't do a very good job of separating what MapReduce accomplishes from its implementation. Basically, you supply a map function and a reduce function. Your map function is applied to each of a bunch of input files, and it can "emit" as many key/value pairs as it wants. Then your reduce function gets called once for each key, and it can loop over the values for that key (that's how it looks to the programmer). So the map function supplies keys to tell the reduce function which items to bring together, just as in the CM beta operation.
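
To make that concrete, here is a toy, single-machine model of that interface in Haskell (my own sketch, not Google's actual API; the names mapReduce and wordCount are made up for illustration). The mapper may emit any number of key/value pairs per input; the pairs are then grouped by key, and the reducer is called once per distinct key:

  import Data.Function (on)
  import Data.List (groupBy, sortOn)

  -- Toy model of the MapReduce interface (nothing distributed here):
  -- 'mapper' may emit any number of (key, value) pairs per input;
  -- 'reducer' is invoked once per distinct key with all its values.
  mapReduce :: Ord k => (a -> [(k, v)]) -> (k -> [v] -> r) -> [a] -> [r]
  mapReduce mapper reducer inputs =
      [ reducer k (map snd grp)
      | grp@((k, _) : _) <- groupBy ((==) `on` fst)
                                    (sortOn fst (concatMap mapper inputs)) ]

  -- The classic word-count example from the Google paper:
  wordCount :: [String] -> [(String, Int)]
  wordCount = mapReduce (\doc -> [ (w, 1) | w <- words doc ])
                        (\w counts -> (w, sum counts))

For instance, wordCount ["to be or", "not to be"] yields [("be",2),("not",1),("or",1),("to",2)]; the grouping step is exactly what brings each key's values together for the reducer.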

It's really worth looking at the examples in the Google paper, even though they don't separate functionality and implementation very well either.

Josh

P.S. There are two LtU threads on the subject referenced in the Wikipedia article.

Google’s MapReduce Programming Model—Revisited

Google’s MapReduce Programming Model—Revisited is a paper by Ralf Lämmel that might help you get a better understanding.

fold and map

Jaya,
Map and reduce don't have to operate on key/value pairs; in fact, the way I learned them, they don't.

Say you have a list of numbers:
[1, 2, 3, 4, 5, 6, 7, 8, 9]
Now you want to add one to each number in this list. Using a for loop, you might do the following:
...
10 var newlist;
11 for(int i = 0; i < mylist.length; i++) {
12     newlist[i] = mylist[i] + 1;
13 }
14 return newlist;

What if you wanted to multiply each element by 10? You would simply change line 12: change '+' to '*' and 1 to 10. Ideally you would like to generalize the solution by creating a skeleton, similar to the code above, but passing in an additional parameter to the function: either (+ 1) or (* 10). Languages that allow you to pass around functions as easily as objects usually let you do that, and this is called a map!
So "map (+2) [1,2,3,4,5,6,7]" adds 2 to each element in the list (this expression was checked in haskell).

Instead of adding or multiplying something to each element of a list, what if you want to either add up or multiply all elements of a list and return a single value?
...
10 var accumulator=0;
11 for(int i = 0; i < mylist.length; i++) {
12     accumulator = accumulator + mylist[i];
13 }
14 return accumulator;

The previous example adds up all the elements of the given list. If you wanted to multiply them all together, you obviously have to change '+' to '*' in line 12, but you ALSO have to change line 10 from '0' to '1'. In this problem, we are not just changing the function applied to the list, we are also changing the 'accumulator.' In other words, if we wanted to create a skeleton out of this code, we would have to pass in TWO arguments: the accumulator and the function. This is called a reduce (sometimes a fold, sometimes even a catamorphism)!
So "foldr (+) 0 [1,2,3,4,5,6,7]" adds up the numbers, whereas "foldr (*) 1 [1,2,3,4,5,6,7]" multiplies the numbers (again haskell).

There you have it, map and reduce. When the two are combined, you can take a list and either do something to each of its elements, or reduce the list to a single value.

It would be a shame if you left it at that.

Instead of thinking of map and reduce as generic skeletons, think of them as follows:
a list [1,2,3,4...] is really several numbers combined through some sort of constructor:
[empty list].append(1).append(2).append(3)...
When you passed in '+' and 0 (or '*' and 1) to reduce, you basically replaced .append with '+' and '[empty list]' with 0 and ended up with the following:
0 + 1 + 2 + 3 + 4 ... (or 1 * 1 * 2 * 3 * 4 ...).

Now we don't have to think in terms of accumulators or loops, just think in terms of replacing list constructors with a function of your choice.
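
In Haskell notation, where a list is built from the constructors (:) and [], this constructor-replacement reading looks like the following small worked example:

  -- [1,2,3] is just sugar for the constructor applications below:
  example :: [Int]
  example = 1 : (2 : (3 : []))

  -- foldr f z replaces every (:) with f and the final [] with z,
  -- so this evaluates as 1 + (2 + (3 + 0)) = 6.
  sumExample :: Int
  sumExample = foldr (+) 0 example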

Instead of passing 'reduce' simple functions, what if you pass in some crazy ones, such as "append(add1(...))" (append composed with a function add1 that adds one) and "[empty list]"? Now watch what happens:
[empty list].append(1).append(2).append(3)...
= replace "append" with "append(add1(...))" (leaving [empty list] as [empty list])
[empty list].append(add1(1)).append(add1(2)).append(add1(3))...
= actually applying these methods returns
[2,3,4,...]

In other words, we used reduce to create a map! You can use similar trickery to create a filter function, which lets you remove items from your list that match certain criteria.
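
Here is that trick written out in Haskell (my own sketch; add1 is just a name for the function that adds one):

  -- Hypothetical helper: the 'add1' used in the example above.
  add1 :: Int -> Int
  add1 = (+ 1)

  -- map via foldr: replace (:) with "transform, then cons".
  mapViaFold :: (a -> b) -> [a] -> [b]
  mapViaFold f = foldr (\x rest -> f x : rest) []

  -- filter via foldr: keep an element only when it passes the test.
  filterViaFold :: (a -> Bool) -> [a] -> [a]
  filterViaFold p = foldr (\x rest -> if p x then x : rest else rest) []

So mapViaFold add1 [1,2,3] gives [2,3,4], and filterViaFold even [1,2,3,4] gives [2,4].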

After you have created reduce (or fold), map, and filter, how much power have you gained? According to papers by Wadler and Grust (among others, I'm sure), you have pretty much created something with the same power as SQL (actually more)!

I'll let the experts correct and expand this feeble, off-the-cuff, attempt at an explanation.

Clarification

Just to be clear, the original question was specifically about Google's MapReduce system, which (despite confusing terminology) is quite a bit different from what FPers would recognize as "map" and "reduce"... The Lämmel paper linked above makes this very clear, and makes the differences precise. I highly recommend it.

Falcon, I'm still confused...

Falcon,

I'm still confused... I can't get a clear idea. Is there any way to have an interactive session, so that I can ask my doubts instantly? Please let me know if there is any possibility of chatting. I'll be thankful if you spend the time to clarify my doubts.

[OT, Meta] #haskell

I'm not sure if falcon is on the #haskell IRC channel, but it would also be a good place to ask (in this particular case and in general).

[edit] And, uh, as falcon says, #ltu. (Which I half forgot, half didn't know about.)

interactive help

Jaya,
Although I'm usually logged on to IRC, I'm not actually in front of the computer that much. Derek is right. In fact, why not go to #ltu (on Freenode)? :) The #haskell folks will help you out, but the #ltu folks will give you an hour-long lecture if you want :)

Also, although I haven't read the paper Matt suggested, I do recall reading somewhere else that Google's use of MapReduce terminology may not be exactly the same as the map/reduce Wikipedia talks about. If there is a conflict between Wikipedia and Lämmel's paper, it's probably best to go with Lämmel.

If your project involves the whole MapReduce system, including the layers below it, search for "bigtable" on Google Video for an hour-long presentation describing the infrastructure under MapReduce. You may also want to say a few words about the scope of your project and which programming language you understand best (so examples can be given in that language).

Falcon, will you be available

Falcon,

Will you be available on Yahoo Messenger?

no messenger access most of the day

You can reach me through email at: shahbazc on gmail.com.
It is better to interact through this forum so others can benefit. I'll introduce you to #ltu where you will find real experts :)

please do introduce me to

Please do introduce me to them...

MapReduce in Cat (explanation and one-liner)

I've written a short article on MapReduce on the Cat wiki which I hope helps alleviate some of the MapReduce confusion.

It also explains an implementation of MapReduce using the one-line Cat program:

  define map_reduce { [map flatten self_join] dip map }

MapReduce in the news again

I've been noticing a lot of blog activity about a recent criticism of MapReduce by two well-known DB researchers, DeWitt and Stonebraker. They criticize MapReduce as being a poor substitute for a relational database. Many bloggers are fairly critical of their criticism!

Googling on either of their names, or on MapReduce, should give you the flavor of this current "blogstorm."

Something interesting I saw is that they are critical of Berkeley starting to teach MapReduce, BigTable, and other Googlish things to freshmen. Personally, I find this intriguing, as it's a way to expose students to FP concepts and show them some important real-world uses. And I suppose they can try implementations in various languages, from Java to Scheme to Haskell to Sawzall. In a time when a lot of FP languages are getting replaced in the undergraduate curriculum with Java and similar things, this exposure to FP approaches is nice to see.

--Tim May

When all you have is a hammer...

Given that Google seems to be successfully maintaining one of the largest data stores in the world--one which is continually updated while being queried by zillions of users a second--I hardly think their architecture needs defending.

DeWitt and Stonebraker's criticism can be found here for those who are interested.

Most of the criticism seems to be of the type "MapReduce is bad because it's not an RDBMS". More specifically, it doesn't have many of the features of a modern RDBMS.

Their points are valid if one is considering MapReduce as a plug-in replacement for an RDBMS...but that's not how Google uses it. And while I'm of the opinion that the relational model is largely a good thing, it isn't appropriate for everything. And search may be one area where it isn't appropriate.

A few specific comments:

* The classic relational model works best when you have a known problem domain, which you can turn into fixed schemas, which you can then normalize (and denormalize) as needed.
* Search generally deals with datasets which are of unknown and varying types.
* Search engine users will not know or care about any canned schemas.
* The comparison strikes me as apples-and-oranges, anyway. They aren't comparing general map/reduce/etc. to the relational model; they are comparing a distributed parallel implementation of map and reduce (a small building block in the overall google kit) to a full-fledged RDBMS, along with all the tools and infrastructure included.
* Finally... <rant>the next time I hear a relational-head compare something to Codasyl, I'll personally go whap him on the behind with a ping-pong paddle. Codasyl-comparisons are so overdone as to be thoroughly content-free--that would be like a LTU reader comparing Ruby to COBOL just because both happen to permit side effects. In both cases, dogma is being substituted for thorough analysis; both sorts of comments represent a vast failure to see the forest for the trees. I've seen better relational advocacy coming from topmind. </rant> I realize the term "relational-head" is somewhat inflammatory, for which I apologize.

Similar counter argument

A similar counter from a Googler, Mark Chu-Carroll, here. There is some reasonable semi-technical analysis, but I like his slightly ranty analogy the most:

RDB people have found the most beautiful, wonderful, perfect hammer in the whole world. It's perfectly balanced - not too heavy, not too light, and swings just right to pound in a nail just right every time. The grip is custom-made, fitted to the shape of the owner's hand, so that they can use it all day without getting any blisters. It's also beautifully decorated - encrusted with gemstones and gold filigree - but only in places that won't detract from how well it works as a hammer. It really is the greatest hammer ever. Relational database guys love their hammer. It's just such a wonderful tool! And when they make something with it, it really comes out great. In fact, they like it so much that they think it's the only tool they need. If you give them a screw, they'll just pound it in like it's a nail. And when you point out to them that dammit, it's a screw, not a nail, they'll say "I know that. But you can't expect me to use a crappy little screwdriver when I have a magnificent hammer like this!"

Battle for student mindshare

There was a reddit comment about this which might have a point:

Take Stonebraker's and DeWitt's article as a defense of teaching RDBMS fundamentals, and an argument against allowing specialized things like MapReduce to take up teaching time and resources.

See the comment for more detail. There's also an interesting comment about Stonebraker's "other angle" on this in one of the replies.

I'll buy that...

...with the caveat that I'd focus more on the relational model first. RDBMSs should get some time, as a highly important instantiation of the relational model (modulo the criticism of Date and others concerning SQL), but the underlying model (predicate calculus, relational algebra/calculus, normalization, and data modelling) should precede any discussion of implementations.

Likewise with concepts such as "map" and "reduce", and other basic algorithmic stuff. MapReduce is an interesting instantiation of that, especially in the arena of very-large-scale systems, but MapReduce itself isn't appropriate for a first course. (Especially as much of it is proprietary, and unless you're running a huge server farm like Google, its scale is highly inappropriate).

Besides, I would think that something like MapReduce, and other large-scale data processing goodies in the Google toolbox, might have application in implementing RDBMSs, especially at a large scale. The more I think about it, the more bizarre the criticism seems to be.

And with that in mind...

...I hereby propose (and promulgate) Coddwin's Law:

As an online discussion on database technology grows longer, the probability of a comparison involving Codasyl approaches one.