MapReduce – Flow:
Now let us summarize both the mapper and the reducer in a complete diagram.
First, the input data is divided into multiple blocks, or splits, depending on the block size. Each split is then given to a mapper, which processes one split at a time.
The input to the mapper is a set of key-value pairs, and its output is intermediate data that is stored on the local disk.
This intermediate output is shuffled and sorted by the framework itself and then passed to the reducer. The Map and Reduce tasks are scheduled using YARN and run on the nodes of the cluster.
If a task fails, it is automatically rescheduled to run on a different node. In the reduce phase, all values corresponding to the same key go to the same reducer. The output generated by the reducer is the final output, which is stored on HDFS.
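The map → shuffle/sort → reduce flow above can be sketched in plain Python. This is a minimal single-process simulation of the classic word-count job, not the actual Hadoop API; the function names and the two-split input are illustrative assumptions.

```python
from collections import defaultdict

# Map phase: each mapper gets one split and emits intermediate (key, value) pairs.
def mapper(split):
    for word in split.split():
        yield (word, 1)

# Shuffle and sort: the framework groups the values for each key and sorts by key,
# so that all values for the same key reach the same reducer.
def shuffle(mapped_pairs):
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

# Reduce phase: each reducer aggregates all values for one key.
def reducer(key, values):
    return (key, sum(values))

# The input is divided into two "splits"; each is processed by its own mapper.
splits = ["hadoop stores data", "hadoop processes data"]
intermediate = [pair for s in splits for pair in mapper(s)]
final_output = [reducer(k, vs) for k, vs in shuffle(intermediate)]
print(final_output)
# → [('data', 2), ('hadoop', 2), ('processes', 1), ('stores', 1)]
```

In a real cluster the mappers and reducers run in parallel on different nodes, and the intermediate pairs travel over the network during the shuffle, but the data flow is exactly this.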
Note: If we have multiple files in a directory, MapReduce reads all the files in that directory. The files are not read sequentially: internally, every file is divided into blocks, and all the blocks are read in a distributed manner.
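The note above can be illustrated with a toy sketch: every file is cut into fixed-size blocks, and the blocks from all files form one pool of independent work units, rather than a sequential file-by-file read. The 8-byte block size and file contents here are made-up assumptions (HDFS defaults to a 128 MB block size).

```python
# Hypothetical tiny block size for illustration; HDFS defaults to 128 MB.
BLOCK_SIZE = 8

files = {
    "file1.txt": "hadoop mapreduce",
    "file2.txt": "yarn hdfs",
}

# Cut every file into blocks; each (filename, block) tuple is an
# independent unit of work that any map task on any node can process.
blocks = [
    (name, data[i:i + BLOCK_SIZE])
    for name, data in files.items()
    for i in range(0, len(data), BLOCK_SIZE)
]
print(blocks)
```

Because each block is self-contained, the framework is free to schedule map tasks for blocks of different files concurrently across the cluster.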