MapReduce is a programming model and software framework, invented by Google, for processing large amounts of data in parallel on large clusters of commodity hardware. It divides the work into a set of independent map and reduce tasks.
The Hadoop framework takes care of details such as scheduling tasks, monitoring them, and re-executing any task that fails.
MapReduce always works on key-value pairs and can be implemented in many languages, such as Java and Python.
It is the responsibility of the framework to convert unstructured data into key-value pairs. In MapReduce, the input is a list, and MapReduce transforms that list of input data elements into a list of output data elements.
This transformation is performed by both Map and Reduce. The framework divides the work into smaller parts and executes them on different nodes in the cluster. The Map phase runs first, followed by the Reduce phase.
In a MapReduce program, the map function implements the mapper and the reduce function implements the reducer.
We will see this in the detailed analysis of a MapReduce program.
A split can be thought of as a logical representation of a block, and in MapReduce one mapper processes one split at a time. We have seen in HDFS that the default block size is 64 MB or 128 MB; so if the file size is 1280 MB and the block size is 128 MB, we get 10 splits, and 10 mappers will run for the input file.
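The split arithmetic above can be sketched in a few lines of Python (a simple illustration with the same hypothetical numbers; real split computation in Hadoop also considers file boundaries and min/max split settings):

```python
# Hypothetical numbers from the example above: a 1280 MB file with a
# 128 MB block size yields 10 input splits, hence 10 map tasks.
import math

def num_splits(file_size_mb, block_size_mb):
    # By default, one input split corresponds to one HDFS block.
    return math.ceil(file_size_mb / block_size_mb)

print(num_splits(1280, 128))  # 10 splits -> 10 mappers
```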
The mapper takes a key-value pair as input, where the key is a reference to the input value and the value is the actual data to process. Processing is done by the map function, which is defined by the user; in it, the user can implement custom business logic.
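A user-defined map function can be sketched in plain Python (this simulates the model, not the Hadoop Java API; the word-count logic here is a standard illustrative example, not taken from the article):

```python
# Word-count mapper sketch: the key is the byte offset of the line in
# the split (the "reference"), the value is the line of text itself
# (the actual data). It emits one (word, 1) pair per word.
def map_fn(key, value):
    for word in value.split():
        yield (word.lower(), 1)

print(list(map_fn(0, "Hadoop runs MapReduce")))
# [('hadoop', 1), ('runs', 1), ('mapreduce', 1)]
```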
The reducer's input is the intermediate output produced by the mapper. Before this input reaches the reducer, it is shuffled and sorted. The framework itself shuffles and sorts the intermediate output, so we do not need to write any code for this; the result is then handed to the reducer.
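The shuffle-and-sort step, followed by a simple reduce, can be simulated in plain Python (again a sketch of the model, not the Hadoop API; the intermediate pairs are made-up sample data):

```python
from itertools import groupby
from operator import itemgetter

# Intermediate pairs as a mapper might emit them (hypothetical data).
intermediate = [("be", 1), ("to", 1), ("be", 1), ("or", 1), ("to", 1)]

# Shuffle & sort: the framework groups all values by key before
# calling reduce; here we simulate that with a sort plus groupby.
shuffled = sorted(intermediate, key=itemgetter(0))

def reduce_fn(key, values):
    # Word count: sum all the 1s emitted for this key.
    return (key, sum(values))

output = [reduce_fn(k, (v for _, v in grp))
          for k, grp in groupby(shuffled, key=itemgetter(0))]
print(output)  # [('be', 2), ('or', 1), ('to', 2)]
```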
First, the data is divided into multiple blocks or splits, depending on the block size. Each split is then given to a mapper, which processes one split at a time. The input to the mapper is a set of key-value pairs, and its output is intermediate output, which is stored on local disk.
At the start we have input files where the actual data is stored. These input files are divided into blocks or splits and distributed across the cluster. The splits are further divided into key-value pairs, which are processed by map tasks one record at a time.
So far we have seen jobs with both a mapper and a reducer. In a map-only job there is no reducer: the output of the mapper is the final output and is stored in HDFS. The main advantage is that, since there are no reducers, no data shuffling is needed between mapper and reducer, so a map-only job is more efficient.
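A map-only job can be sketched as a mapper whose output is emitted directly, with no grouping step in between (a plain-Python simulation; the log-filtering logic and sample records are hypothetical):

```python
# Map-only job sketch: each record is transformed independently and
# the mapper's output is the final output (no shuffle, no reduce).
def map_only(records):
    for offset, line in records:
        # Example transformation: keep only lines containing "ERROR".
        if "ERROR" in line:
            yield (offset, line)

logs = [(0, "INFO start"), (11, "ERROR disk full"), (27, "INFO done")]
print(list(map_only(logs)))  # [(11, 'ERROR disk full')]
```

Filtering like this is a typical use case for map-only jobs, because no per-key aggregation is required.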
The output produced by the map task is not written directly to disk; it is first written to memory, taking advantage of buffered writes. Each map task has a circular memory buffer of about 100 MB by default (the size can be tuned by changing the mapreduce.task.io.sort.mb property).
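To tune this buffer, the property can be set in the job configuration, for example in mapred-site.xml (the 256 MB value below is just an illustrative choice, not a recommendation):

```xml
<!-- mapred-site.xml: raise the map-side sort buffer from the
     100 MB default to 256 MB (illustrative value). -->
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>256</value>
</property>
```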