Map-reduce-merge: simplified relational data processing on large clusters (freely-accessible slides). Hung-chih Yang, Ali Dasdan, Ruey-Lung Hsiao, D. Stott Parker. 2007 ACM SIGMOD conference.
Map-Reduce is a programming model that enables easy development of scalable parallel applications to process a vast amount of data on large clusters of commodity machines. Through a simple interface with two functions, map and reduce, this model facilitates parallel implementation of many real-world tasks such as data processing jobs for search engines and machine learning.
However,this model does not directly support processing multiple related heterogeneous datasets. While processing relational data is a common need, this limitation causes difficulties and/or inefficiency when Map-Reduce is applied on relational operations like joins.
We improve Map-Reduce into a new model called Map-Reduce-Merge. It adds to Map-Reduce a Merge phase that can efficiently merge data already partitioned and sorted (or hashed) by map and reduce modules. We also demonstrate that this new model can express relational algebra operators as well as implement several join algorithms.
They seem to add a third phase – merge: ((k1, [v1]), (k2, [v2])) → (k3, [v3]) – which combines the outputs of two separate, parallel MapReduce tasks.
This makes it possible to do things like joins and build cartesian products.
Recent comments
22 weeks 6 days ago
22 weeks 6 days ago
22 weeks 6 days ago
45 weeks 19 hours ago
49 weeks 2 days ago
50 weeks 6 days ago
50 weeks 6 days ago
1 year 1 week ago
1 year 6 weeks ago
1 year 6 weeks ago