Map-reduce-merge: simplified relational data processing on large clusters

Map-reduce-merge: simplified relational data processing on large clusters (freely-accessible slides). Hung-chih Yang, Ali Dasdan, Ruey-Lung Hsiao, D. Stott Parker. 2007 ACM SIGMOD conference.

Map-Reduce is a programming model that enables easy development of scalable parallel applications to process a vast amount of data on large clusters of commodity machines. Through a simple interface with two functions, map and reduce, this model facilitates parallel implementation of many real-world tasks such as data processing jobs for search engines and machine learning.

However, this model does not directly support processing multiple related heterogeneous datasets. While processing relational data is a common need, this limitation causes difficulties and/or inefficiency when Map-Reduce is applied to relational operations like joins.

We improve Map-Reduce into a new model called Map-Reduce-Merge. It adds to Map-Reduce a Merge phase that can efficiently merge data already partitioned and sorted (or hashed) by map and reduce modules. We also demonstrate that this new model can express relational algebra operators as well as implement several join algorithms.

They seem to add a third phase – merge: ((k1, [v1]), (k2, [v2])) → (k3, [v3]) – which combines the outputs of two separate, parallel MapReduce tasks.

This makes it possible to do things like joins and build Cartesian products.
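To make the shape of that merge phase concrete, here is a minimal single-machine sketch in Python. This is not the paper's API: the real system runs mergers over partitioned, sorted reducer outputs across a cluster, and the helper names (map_reduce, merge_join) and the employee/department data here are invented purely for illustration.

    from itertools import groupby

    def map_reduce(records, mapper, reducer):
        """Toy map/reduce: map each record to (key, value) pairs,
        group by key, and reduce each group to (key, [values])."""
        pairs = sorted(kv for rec in records for kv in mapper(rec))
        return [reducer(k, [v for _, v in grp])
                for k, grp in groupby(pairs, key=lambda kv: kv[0])]

    def merge_join(left, right):
        """Merge two key-sorted reduced outputs on equal keys,
        following ((k1, [v1]), (k2, [v2])) -> (k3, [v3])."""
        out, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            (lk, lvs), (rk, rvs) = left[i], right[j]
            if lk < rk:
                i += 1
            elif lk > rk:
                j += 1
            else:
                # Pair up every combination of values sharing the key.
                out.append((lk, [(lv, rv) for lv in lvs for rv in rvs]))
                i += 1
                j += 1
        return out

    # Invented example: total bonuses per department, joined with names.
    emps = [("e1", 1, 100), ("e2", 2, 50), ("e3", 1, 75)]  # (id, dept, bonus)
    depts = [(1, "search"), (2, "ads")]                    # (dept, name)

    left = map_reduce(emps, lambda r: [(r[1], r[2])],
                      lambda k, vs: (k, [sum(vs)]))
    right = map_reduce(depts, lambda r: [(r[0], r[1])],
                       lambda k, vs: (k, vs))
    print(merge_join(left, right))
    # [(1, [(175, 'search')]), (2, [(50, 'ads')])]

The merge step is just a sort-merge join over the two reduced outputs, which works because reduce already hands merge its data grouped and sorted by key -- the property the paper exploits.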

When the only tool you have is a hammer...

Stonebraker and DeWitt's articles (http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html and http://www.databasecolumn.com/2008/01/mapreduce-continued.html) make a lot more sense when applied to this paper's Map-Reduce-Merge model.
I don't think the authors do a good enough job of explaining why I would want to implement traditional RDBMS abstractions on top of the map-reduce model. Don't you just end up with the worst of both worlds?

Relational operations are useful outside the RDBMS context

More specifically, being able to perform operations such as joins on large datasets shouldn't necessarily be tied to an (R)DBMS.

People tend to forget that "relational" and "database" are two orthogonal terms, though their product is highly useful. One can make good use of relational operations in contexts where support for transactions, rollback, redundancy, and write-concurrency (the hallmarks of a production-grade DBMS) isn't necessary--such as analyzing large datasets gathered in a separate phase, something Google does a lot of.

Likewise, you can have databases (transactional semantics, recovery, etc) without using the relational model. Such beasts tend to go by other names these days--the general-purpose DBMS being dominated by relational, as the RDBMS is the model best suited to OLTP applications on generic datasets--but many other distributed applications have DBMS capability under the hood (directory services, for instance).

I don't think anybody is suggesting that MapReduce is a replacement for Oracle or SQL Server; it's not--for many reasons. But I don't think it would be wise for Google to scrap MapReduce where it is deployed and replace it with a traditional RDBMS. It might make Larry Ellison happy, but it wouldn't do Google much good.

Many brilliant software architects work at the Googleplex, after all--if an RDBMS were the best choice for Google's apps, they probably would have deployed one. (Or written their own!)

I don't think the authors

I don't think the authors do a good enough job of explaining why I would want to implement traditional RDBMS abstractions on top of the map-reduce model. Don't you just end up with the worst of both worlds?

I don't think joins or Cartesian products are specific to relational databases. MapReduce is designed for datasets that are too large for existing databases, and Map-Reduce-Merge brings more useful techniques to that environment. So I'd say it's a win.

the authors don't deny that

The authors don't deny that. They're objecting to the hype, not the current specific applications.

-t

why not scan as well?

I'm puzzled why Blelloch's scan operation isn't a standard part of the map-reduce model (see e.g. DOI 10.1109/12.42122). He showed that a surprising number of very powerful things can be done with it. I asked Jeff Dean a couple of years ago, and he said they'd never found a need for it; do others have experience?
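For readers who haven't met it: scan (all-prefix-sums) takes a sequence and an associative operator and returns, at each position, the combination of everything before it. A sequential sketch of the semantics follows -- the parallel version runs in logarithmic depth, and this loop plus the example data are just illustrative.

    def exclusive_scan(xs, op, identity):
        """out[i] = op folded over xs[0..i-1]; exclusive scan semantics."""
        out, acc = [], identity
        for x in xs:
            out.append(acc)
            acc = op(acc, x)
        return out

    # Classic building block: turn per-item output counts into write
    # offsets for a parallel pack/compaction step.
    counts = [3, 1, 0, 2]
    print(exclusive_scan(counts, lambda a, b: a + b, 0))  # [0, 3, 4, 4]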

scan

Scan is pretty useful, but then you'd want additional steps afterwards to handle the massive result of scan. At that point you may as well go to a full-fledged data-parallel language (Google "NESL").

That would be cool.