MapReduce – Advanced Concepts:
Map only Job:
So far, every job we have seen has had both a mapper and a reducer. A map-only job has no reducer: the output of the mapper is the final output and is stored in HDFS.
The main advantage is that, with no reducers, no data needs to be shuffled between mapper and reducer. This makes a map-only job more efficient.
Map-only jobs suit only certain problems, such as data parsing, where no aggregation or summation is needed.
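The idea can be sketched in plain Python without a cluster. This is only a simulation under stated assumptions: the log format (timestamp, level, message) and the function names parse_log_line and run_map_only_job are hypothetical and not part of Hadoop. In a real Hadoop job, a map-only run is typically configured by setting the number of reduce tasks to zero, for example with job.setNumReduceTasks(0).

```python
# Simulated map-only job: each input record is parsed independently,
# and the mapper output is the final output (no shuffle, no reduce).

def parse_log_line(line):
    """Mapper: turn one raw line into a (key, value) pair, or None if malformed."""
    # Assumed (hypothetical) format: "<timestamp> <level> <message>"
    parts = line.strip().split(" ", 2)
    if len(parts) != 3:
        return None  # drop lines that do not match the assumed format
    timestamp, level, message = parts
    return (timestamp, {"level": level, "message": message})


def run_map_only_job(lines):
    """Apply the mapper to every record; there is no reduce phase."""
    output = []
    for line in lines:
        record = parse_log_line(line)
        if record is not None:
            output.append(record)
    return output


raw_lines = [
    "2024-01-01T10:00:00 INFO service started",
    "malformed",
    "2024-01-01T10:00:05 ERROR disk full",
]
parsed = run_map_only_job(raw_lines)
for key, value in parsed:
    print(key, value)
```

Because every line is handled on its own, no record ever needs to meet another record with the same key, which is why the shuffle can be skipped.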
Combiner:
A combiner sits between the map and reduce phases. It behaves much like a reducer, and it performs the same kinds of tasks, such as aggregating and filtering.
The main difference between a combiner and a reducer is that the reducer processes data from all the mappers on all the nodes, whereas the combiner processes only the map output produced locally on a single node.
This reduces the amount of intermediate output that leaves each node, and so reduces the movement of key-value pairs between mapper and reducer.
We can define our own combiner function. For example, when a mapper generates many repeated keys, the combiner can aggregate them and reduce the size of the data sent to the reducer.
Internally, a combiner implements the same reduce method as a reducer. A simple Word Count job is a good example for analysing the combiner.
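A minimal Word Count sketch in plain Python shows the effect. It is a simulation, not Hadoop code: the functions mapper, combiner and reducer and the two input splits are assumptions made for illustration, and each split stands in for the output of one mapper.

```python
from collections import Counter, defaultdict


def mapper(line):
    """Emit (word, 1) for every word in the input line."""
    return [(word, 1) for word in line.lower().split()]


def combiner(pairs):
    """Sum the counts per word within ONE mapper's output only."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return list(totals.items())


def reducer(pairs):
    """Sum the counts per word across the output of ALL mappers."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


# Two hypothetical input splits, each handled by its own mapper.
splits = ["the cat and the dog", "the dog and the bird"]
map_outputs = [mapper(s) for s in splits]

# Pairs that would be sent to the reducer without and with a combiner.
without_combiner = [p for out in map_outputs for p in out]
with_combiner = [p for out in map_outputs for p in combiner(out)]

result_plain = reducer(without_combiner)
result_combined = reducer(with_combiner)

print(len(without_combiner), len(with_combiner))  # 10 8
print(result_combined == result_plain)            # True
```

The final counts are identical, but fewer pairs travel to the reducer. This works because addition gives the same total whether it is applied in one step or in stages; an operation without that property (a plain average, for instance) cannot simply reuse the reducer as its combiner.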
Data Locality:
Data locality means moving the code or computation close to the data stored in HDFS. It is usually easy to move a few KB of code to the place where the data is stored.
Moving very large HDFS data to the code, on the other hand, is not practical. Moving the computation to the data instead is called data locality, and it minimizes network congestion. Let us see how this is done.
First, the MapReduce code (the job) is sent to the slave nodes, where it processes the blocks stored on each of them. The mapper operates on the data located on the slaves.
If the data is located on the same node where the mapper is running, the task is referred to as data local. Here the computation is closest to the data.
Data local is the best choice, but it is not always possible because of resource constraints on a busy cluster.
In such situations it is preferred to run the mapper on a different node in the same rack as the data.
In this case, data is moved between different nodes of the same rack. Data can also travel across racks, which is the least preferred scenario.
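The order of preference described above can be sketched as a small Python function. This is an illustration only and not Hadoop's actual scheduler: the node names, the rack map, and the functions locality_level and choose_node are all hypothetical.

```python
# Lower number = more preferred placement for a map task.
PREFERENCE = {"data-local": 0, "rack-local": 1, "off-rack": 2}


def locality_level(task_node, replica_nodes, rack_of):
    """Classify how close task_node is to the nodes holding the block."""
    if task_node in replica_nodes:
        return "data-local"   # the block is on the same node
    replica_racks = {rack_of[n] for n in replica_nodes}
    if rack_of[task_node] in replica_racks:
        return "rack-local"   # different node, same rack
    return "off-rack"         # data must cross racks


def choose_node(free_nodes, replica_nodes, rack_of):
    """Pick the free node with the best locality (first one wins on ties)."""
    return min(
        free_nodes,
        key=lambda n: PREFERENCE[locality_level(n, replica_nodes, rack_of)],
    )


# Hypothetical cluster: two racks with two nodes each.
rack_of = {"node1": "rackA", "node2": "rackA", "node3": "rackB", "node4": "rackB"}
replicas = {"node1"}  # the block is stored on node1 only

print(choose_node(["node3", "node1"], replicas, rack_of))  # node1
print(choose_node(["node2", "node3"], replicas, rack_of))  # node2
print(choose_node(["node3", "node4"], replicas, rack_of))  # node3
```

When node1 is busy, the sketch falls back to node2 in the same rack, and only when the whole rack is busy does it accept a node in another rack.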