MapReduce

What is MapReduce?

MapReduce is a programming model for processing and generating large data sets. The term "MapReduce" refers to two distinct tasks that programs perform. The first is the "map" job: it takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The second is the "reduce" job: it takes the output of the map as its input and combines those data tuples into a smaller set of tuples.
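As a minimal illustration of the two phases, here is the classic word-count job sketched in plain Python. This is illustrative code, not any particular framework's API; the grouping step between map and reduce (the "shuffle") is written out explicitly.

```python
from collections import defaultdict

documents = ["the quick brown fox", "the lazy dog", "the fox"]

def map_fn(doc):
    # Map: break the input down into (key, value) tuples.
    return [(word, 1) for word in doc.split()]

def reduce_fn(word, counts):
    # Reduce: combine the tuples for one key into a smaller result.
    return (word, sum(counts))

# Shuffle: group the mapped tuples by key.
groups = defaultdict(list)
for doc in documents:
    for word, count in map_fn(doc):
        groups[word].append(count)

result = dict(reduce_fn(w, c) for w, c in groups.items())
print(result["the"])  # 3
```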

Example

You have a file containing two columns: a city, and the temperature recorded in that city on various measurement days.

Toronto, 18
Whitby, 27
New York, 32
Rome, 37
Toronto, 32
Whitby, 20
New York, 33
Rome, 38
Toronto, 22
Whitby, 19
New York, 20
Rome, 31
Toronto, 31
Whitby, 22
New York, 19
Rome, 30


The task is to produce an output that gives the maximum temperature for each city across the entire data file. First, the map function turns each record into a tuple whose key is the city name and whose value is the temperature. After the data has passed through the map function, the output will be the following:

(Toronto, 18) (Whitby, 27) (New York, 32) (Rome, 37) (Toronto, 32) (Whitby, 20) (New York, 33) (Rome, 38) (Toronto, 22) (Whitby, 19) (New York, 20) (Rome, 31) (Toronto, 31) (Whitby, 22) (New York, 19) (Rome, 30)

This output now becomes the input to the reduce function. The reduce function combines the input tuples and outputs a single value for each city, producing the final result set:

(Toronto, 32) (Whitby, 27) (New York, 33) (Rome, 38)
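The whole pipeline above can be sketched in plain Python. This is illustrative single-machine code, not a distributed implementation; the shuffle step that groups the mapped tuples by key is shown explicitly between the two functions.

```python
from collections import defaultdict

# Raw records from the data file: (city, temperature) pairs.
records = [
    ("Toronto", 18), ("Whitby", 27), ("New York", 32), ("Rome", 37),
    ("Toronto", 32), ("Whitby", 20), ("New York", 33), ("Rome", 38),
    ("Toronto", 22), ("Whitby", 19), ("New York", 20), ("Rome", 31),
    ("Toronto", 31), ("Whitby", 22), ("New York", 19), ("Rome", 30),
]

def map_fn(record):
    """Map: emit a (city, temperature) tuple for each input line."""
    city, temp = record
    return (city, temp)

def reduce_fn(city, temps):
    """Reduce: combine all temperatures for one city into the maximum."""
    return (city, max(temps))

# Shuffle: group the mapped tuples by key (city).
groups = defaultdict(list)
for key, value in map(map_fn, records):
    groups[key].append(value)

result = [reduce_fn(city, temps) for city, temps in groups.items()]
print(result)  # [('Toronto', 32), ('Whitby', 27), ('New York', 33), ('Rome', 38)]
```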



Advantages of MapReduce

Using the MapReduce model provides several significant advantages, some of the best known being the following:

1. Fault Tolerance
       The MapReduce implementation uses a pull model for moving data between the map and reduce functions, as opposed to a push model in which the map functions would write directly to the reduce functions. Most MapReduce executions over large data sets encounter at least a few failures, and in a push model the failure of a reduce function would force re-execution of all of the map tasks. With a pull model, if a failure happens during the reduce function, the reducer can retrieve the output of the map function from local node storage rather than re-executing the map function. As the world's data sets grow larger, this fault tolerance feature of MapReduce becomes ever more important.
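A toy simulation can make the difference concrete. This is not the real Google or Hadoop implementation, just an illustration: map output is persisted once (here, a dictionary stands in for each map worker's local disk), so a failed reduce attempt simply re-reads it and the map never runs twice.

```python
map_runs = 0
local_storage = {}  # stands in for a map worker's local disk

def run_map(partition_id, records):
    """Run the map task once and persist its output locally."""
    global map_runs
    map_runs += 1
    local_storage[partition_id] = [(city, t) for city, t in records]

def run_reduce(partition_ids, fail_once=[True]):
    """Pull stored map output; the map tasks are NOT re-executed."""
    pulled = [kv for p in partition_ids for kv in local_storage[p]]
    if fail_once[0]:          # simulate a crash on the first attempt
        fail_once[0] = False
        raise RuntimeError("reduce worker died")
    best = {}
    for city, t in pulled:
        best[city] = max(best.get(city, t), t)
    return best

run_map(0, [("Toronto", 18), ("Toronto", 32)])
try:
    run_reduce([0])
except RuntimeError:
    pass                      # the scheduler retries the reduce task
result = run_reduce([0])
print(map_runs, result)       # 1 {'Toronto': 32}
```

Even though the reduce task failed once, `map_runs` stays at 1; under a push model the crash would have discarded the in-flight data and forced the map to rerun.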




2. Heterogeneous Systems
       In cases where different types of storage systems hold the data, MapReduce can analyze such heterogeneous systems directly. One set of data may be stored in a relational database while another is logged to a file system, or, as the environment evolves, data may migrate to new storage systems. In these cases, end users can extend MapReduce to support a new storage system by defining simple reader and writer implementations that operate on it. A single MapReduce operation can then process and combine data from a variety of storage systems, so MapReduce can be used across almost any database environment.
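A sketch of the reader abstraction described above, under stated assumptions: the class names `RecordReader`, `CsvReader`, and `SQLiteReader` are illustrative, not part of any real MapReduce API. Each reader yields records from a different storage system, and the same MapReduce-style computation consumes all of them.

```python
import csv, io, sqlite3
from abc import ABC, abstractmethod

class RecordReader(ABC):
    @abstractmethod
    def records(self):
        """Yield (city, temperature) records from the underlying storage."""

class CsvReader(RecordReader):
    """Reader for flat-file (CSV) storage."""
    def __init__(self, text):
        self.text = text
    def records(self):
        for city, temp in csv.reader(io.StringIO(self.text)):
            yield city, int(temp)

class SQLiteReader(RecordReader):
    """Reader for relational storage."""
    def __init__(self, conn):
        self.conn = conn
    def records(self):
        yield from self.conn.execute("SELECT city, temp FROM readings")

# One relational source, one flat-file source...
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (city TEXT, temp INTEGER)")
conn.executemany("INSERT INTO readings VALUES (?, ?)",
                 [("Rome", 37), ("Rome", 31)])
readers = [CsvReader("Toronto,18\nToronto,32\n"), SQLiteReader(conn)]

# ...processed by a single max-temperature operation.
best = {}
for reader in readers:
    for city, temp in reader.records():
        best[city] = max(best.get(city, temp), temp)
print(best)  # {'Toronto': 32, 'Rome': 37}
```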





3. Complex Functions
       MapReduce can be used for very complex functions that are too complicated to express easily in a SQL query. Map and reduce functions are often fairly simple and have straightforward SQL equivalents, but they are also capable of complex processing that goes beyond the ability of SQL.
Here are some examples:

  • Extracting the set of outgoing links from a collection of HTML documents and aggregating by target document
  • Stitching together overlapping satellite images to remove seams and to select high-quality imagery for Google Earth
  • Generating a collection of inverted index files using a compression scheme tuned for efficient support of Google search queries
  • Processing all road segments in the world and rendering map tile images that display these segments for Google Maps
  • Fault-tolerant parallel execution of programs written in higher-level languages across a collection of input data
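The first example above (extracting outgoing links and aggregating by target) maps naturally onto the model, as this small sketch shows. The regex-based link extraction is for illustration only; a production job would use a real HTML parser.

```python
import re
from collections import defaultdict

pages = {
    "a.html": '<a href="b.html">B</a> <a href="c.html">C</a>',
    "b.html": '<a href="c.html">C</a>',
}

def map_fn(source, html):
    # Map: for each outgoing link, emit a (target, source) tuple.
    for target in re.findall(r'href="([^"]+)"', html):
        yield target, source

# Shuffle: group sources by target document.
groups = defaultdict(list)
for source, html in pages.items():
    for target, src in map_fn(source, html):
        groups[target].append(src)

# Reduce: for each target, combine the list of pages linking to it.
incoming = {target: sorted(srcs) for target, srcs in groups.items()}
print(incoming)  # {'b.html': ['a.html'], 'c.html': ['a.html', 'b.html']}
```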



Disadvantages of MapReduce

1. Latency
       MapReduce has two tasks that must be performed: the map function and the reduce function, and performing them takes considerable time. Data is distributed and processed over the cluster. The map function takes a set of data and converts it into another set of data in which individual elements are broken down into key/value pairs, and the reduce function takes the output of the map as its input and processes it further. Chaining these stages increases the time needed to complete a job, thereby increasing latency.

2. Programming Model
       MapReduce has been characterized as too "low-level" for analysts used to SQL-like or declarative languages. Developing efficient MapReduce applications requires advanced programming skills and deep understanding of the system architecture. Common data analysis tasks usually include processing of multiple datasets and relational operations, such as joins, which are not trivial to implement in MapReduce. Also, since MapReduce operators are stateless, MapReduce implementations of iterative algorithms require manual management of state and chaining of iterations.
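To see why joins are non-trivial in this model, here is a sketch of the common "reduce-side join" workaround: each record is tagged with its source table, grouped by the join key, and paired inside the reduce function. This is illustrative Python, not a specific framework's API.

```python
from collections import defaultdict

users  = [(1, "Ada"), (2, "Lin")]               # (user_id, name)
orders = [(1, "book"), (1, "pen"), (2, "ink")]  # (user_id, item)

def map_users(row):
    uid, name = row
    return uid, ("user", name)    # tag records with their source table

def map_orders(row):
    uid, item = row
    return uid, ("order", item)

# Shuffle: group tagged records by the join key (user_id).
groups = defaultdict(list)
for key, value in [map_users(r) for r in users] + [map_orders(r) for r in orders]:
    groups[key].append(value)

def reduce_join(uid, values):
    """Pair every user row with every order row sharing the key."""
    names = [v for tag, v in values if tag == "user"]
    items = [v for tag, v in values if tag == "order"]
    return [(uid, n, i) for n in names for i in items]

joined = [row for uid, vals in groups.items() for row in reduce_join(uid, vals)]
print(joined)  # [(1, 'Ada', 'book'), (1, 'Ada', 'pen'), (2, 'Lin', 'ink')]
```

What SQL expresses in one `JOIN` clause takes custom tagging, grouping, and pairing logic here, which is exactly the "low-level" criticism described above.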





Future Trends of MapReduce Applications
1. Next-Generation DNA Sequencing
  • The CloudBurst software, among the first to use Hadoop in the field of bioinformatics, maps next-generation short-read sequencing data to a reference genome for SNP discovery and genotyping.
  • The Crossbow software, which also makes use of Hadoop, performs whole genome resequencing analysis and SNP genotyping from short reads. It is a cloud-computing tool that executes in parallel.
  • Myrna is a cloud computing pipeline for calculating differential gene expression in large RNA-Seq datasets. It can be executed on the cloud using Amazon Elastic MapReduce, on any Hadoop cluster, or even on a stand-alone computer without using Hadoop.
  • The Genome Analysis Toolkit (GATK) is a structured programming framework that simplifies the development of efficient analysis tools for next-generation DNA sequencing by making use of the MapReduce programming model.

2. Other Bioinformatics Domains
  • The MapReduce-like middleware Hydra has been used to support parameter-sweep parallelism and data parallelism in bioinformatics workflows.
  • MapReduce has also been used to implement pattern finding algorithms for analyzing complex prescription compatibility networks.
  • Phylogenetic applications have also been designed by making use of the MapReduce framework.

