MapReduce – Reduce Function
Input
The input to the reduce phase is the intermediate output produced by the mappers. Before it reaches a reducer, this output is shuffled and sorted.
The framework does the shuffling and sorting itself, so we don’t need to write any code for it; the grouped, sorted data is then passed to the reducer. Shuffling can start as soon as the first mapper finishes, but the reduce function itself runs only after all mappers have completed.
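To make the shuffle and sort step concrete, here is a minimal sketch in plain Python, not Hadoop code. It only imitates what the framework does for us: it collects the key/value pairs from every mapper, groups the values by key, and orders the groups by key. The function name shuffle_and_sort and the word-count sample data are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def shuffle_and_sort(mapper_outputs):
    """Group (key, value) pairs from all mappers by key, then sort by key."""
    grouped = defaultdict(list)
    for pairs in mapper_outputs:      # one list of pairs per mapper
        for key, value in pairs:
            grouped[key].append(value)
    # Each item is (key, [values]); keys are unique, so sorting orders by key.
    return sorted(grouped.items())


# Hypothetical intermediate output from two word-count mappers.
mapper_outputs = [
    [("hadoop", 1), ("map", 1), ("hadoop", 1)],
    [("reduce", 1), ("map", 1)],
]

reducer_input = shuffle_and_sort(mapper_outputs)
print(reducer_input)
# [('hadoop', [1, 1]), ('map', [1, 1]), ('reduce', [1])]
```

In a real job this happens across the network between nodes; the sketch only shows the shape of the data the reducer ends up receiving.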
Processing
The input is processed by the reduce function. The reduce function is defined by the user, so this is where we write our own custom business logic.
There are normally fewer reducers than mappers, so the logic written here is usually simple, such as aggregation or summation. For each key, the reducer receives all the values emitted for that key.
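As an illustration of this kind of aggregation, the following plain-Python sketch applies a summing reduce function to one key and its list of values at a time. It is a simplified stand-in for a Hadoop reducer, not the Hadoop API; the name reduce_sum and the sample input are assumptions made for the example.

```python
def reduce_sum(key, values):
    """Hypothetical reduce function: add up all values seen for one key."""
    return key, sum(values)


# Sample reducer input: each key arrives once, together with all of its values.
reducer_input = [("hadoop", [1, 1]), ("map", [1, 1]), ("reduce", [1])]

# The reduce function is called once per key.
final_output = [reduce_sum(key, values) for key, values in reducer_input]
print(final_output)
# [('hadoop', 2), ('map', 2), ('reduce', 1)]
```

Any other custom logic (maximum, average, filtering) would replace the body of reduce_sum while the one-call-per-key pattern stays the same.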
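We can set the number of reducers ourselves (for example with job.setNumReduceTasks(2)), but we cannot set the number of mappers directly, because it is determined by the number of input splits. Each reducer writes a single output file, so if we have 2 reducers then we get 2 output files.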
Output
The output generated by the reducer is the final output, and it is stored on HDFS, not on local disk. Because HDFS replicates its blocks by default, the output is replicated as well.
So only the final output of the reducer is stored on HDFS; the intermediate output of the mapper is written to the local disk of the node that ran the mapper.
If we have multiple reducers, then one part of the intermediate output goes to reducer1 and another part to reducer2; all values for a given key go to the same reducer, and the same intermediate output never goes to both. Distributing the work this way produces the output faster.
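The way intermediate output is split between reducers can be sketched as follows. This is a plain-Python imitation of hash partitioning, assuming a hash of the key taken modulo the number of reducers decides the target reducer. The functions partition and assign_to_reducers are hypothetical, and zlib.crc32 is used only as a deterministic stand-in for the hash a real framework would compute. The printed file names follow the part-r-00000 style purely for illustration.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick a reducer index for a key (crc32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def assign_to_reducers(pairs, num_reducers):
    """Route every (key, value) pair to the reducer chosen for its key."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets


# Hypothetical intermediate output from the mappers.
pairs = [("hadoop", 1), ("map", 1), ("hadoop", 1), ("reduce", 1), ("map", 1)]
buckets = assign_to_reducers(pairs, num_reducers=2)

# Each reducer that received data would write its own output file.
for reducer_id in sorted(buckets):
    print(f"part-r-{reducer_id:05d}:", buckets[reducer_id])
```

Because the reducer index depends only on the key, every pair with the same key lands in the same bucket, which is why no key is processed by two reducers.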
Note: The mapper takes advantage of data locality, whereas the reducer does not, because its input comes from mappers running across the cluster.