MapReduce – InputFormat:
InputFormat:
At start we have input files where the actual data is stored. These input files are divided into blocks or splits and distributed across the cluster.
These splits are further divided into key-value pairs which are processed by map tasks one record at a time. So now what is Input format, it defines how the input files need to be split.
It takes the files from HDFS and splits the set of input files for the job. Each input split is then assigned to an individual mapper for processing.
We have a method called getSplits, getRecordReader for the given input split.
InputSplit[] getSplits(JobConf job, int numSplits) throws IOException RecordReader<K,V> getRecordReader(InputSplit split, JobConf job, Reporter reporter)throws IOException
Hadoop provides InputFormat class in org.apache.hadoop.mapreduce package which provides details of input splits and RecordReader, here RecordReader is a class which generates key-value pairs from input split.
public abstract class InputFormat<K, V> { public abstract List<InputSplit> getSplits(JobContext context) throws IOException, InterruptedException; public abstract RecordReader<K, V> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedE }
Client can calculate number of splits using getSplits() method. Then he sends that information to the Application Master, which uses their storage location to schedule Map task.
The Map task than passes the split to createRecordReader() method on InputFormat to obtain a RecordReader for that split. We get the key and value from RecordReader and sent to Map function.
The RecordReader class will be as
public abstract class RecordReader<Key, Value> implements Closeable { public abstract void initialize(InputSplit split, TaskAttemptContext context) ; public abstract boolean nextKeyValue() throws IOException, InterruptedException ; public abstract Key getCurrentKey() throws IOException, InterruptedException ; public abstract Value getCurrentValue() throws IOException, InterruptedException ; public abstract float getProgress() throws IOException, InterruptedException ; public abstract close() throws IOException ; }
Note: We can write our own partitioner, recordreader, split, inputformat.
Types of InputFormat are:
Textinputformat:
Default input format which takes text as input format. The key, a LongWritable, is the byte offset within the file of the beginning of the line. The value is the contents of the line, excluding any line terminators.
MultiFileInputFormat:
Suppose we have many small files then we will have many mappers as file size is equal to block size as file size is less. Here 1 mapper process only 1 file.
With the help of MultiFileInputFormat we can combine all that small files into 1 split and assign only 1 mapper for doing the task.
SequenceFileInputFormat:
It is used when we have many map-reduce jobs where output of one map reduce job is given as input to other map-reduce job. It is used when we have sequence files as Inputformat.
Split
Split is a logical representation of a block. Block is a physical division of data whereas split is a logical division of data. Mapper process 1 split at a time.
Split improves performance very much as it divided the large file into small splits and assigns 1 mapper for a split and process all the splits parallel.
Split is just a logical representation which means it won’t occupy 128mb space. Splits are further divided into record i.e. key-value pairs and processed by mapper.
Recordreader
RecordReader is the main class which loads the data from the input file and converts it into key-value pair. It runs until all the input split is converted into key-value pair. We can create our own recordreader.
Map
We have already seen about mapper which receives 1 key-value pair at a time given by RecordReader until our entire split is consumed. In map we can write our own business logic.
Partitioner
Partitioner decides that which key goes to which reducers depending on the hash value generated by the key.
The key having same hash value goes to same reducer. By this we can understand that partition just divided the data into multiple parts.
Shuffling Process
Shuffling is the process of data movement of key-value pairs between mapper and reducer over the network. It starts as soon as one map task is finished. Shuffling depends on network bandwidth so it is a costlier operation.
Sort
Sorting is done automatically by the framework itself. All the keys are sorted by the framework and given to the reducer. Intermediate output will be sorted based on the keys before given to reducer.
Reducer
It receives the sorted output from mapper. We can have one or multiple reducers.
Outputformat
At last, we have output files which are written to HDFS. OutputFormat defines how to write output data to HDFS. The output also will be in the form of key-value pairs.
Like InputFormat, we also have many types of OutputFormat like TextOutputFormat, SequenceFileOutputFormat, NullOutputFormat.
Like RecordReader, we also have RecordWriter which does the actual work of writing, it takes individual key-value pair and writes the data to the output file.