HDFS – Features
All the blocks of a file are replicated across DataNodes; the whole replication process is handled by the NameNode (master). To reduce network bandwidth, HDFS always tries to satisfy a read request from the replica that is closest to the reader, ideally on the same rack. With the help of replication, we achieve high availability, data reliability, and fault tolerance. Data reliability means no data is lost: in any situation there will be no loss of data. Scalability is achieved by adding more nodes to the cluster when required, or by adding more disks to existing nodes.
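The "closest replica" preference above can be sketched as a network-distance rule: same node beats same rack, which beats a remote rack. This is only an illustrative model; the `Replica` type and node/rack names are made up, and the real HDFS client uses its topology scripts to compute distance.

```python
# Hypothetical sketch of rack-aware replica selection: prefer a replica on
# the reader's own node, then its own rack, then any remote replica.
from dataclasses import dataclass

@dataclass
class Replica:
    node: str
    rack: str

def pick_closest_replica(reader_node: str, reader_rack: str, replicas):
    """Return the replica with the lowest 'network distance' to the reader."""
    def distance(r):
        if r.node == reader_node:
            return 0   # same node: local read, no network traffic
        if r.rack == reader_rack:
            return 1   # same rack: stays inside one rack switch
        return 2       # off-rack: crosses the core switches
    return min(replicas, key=distance)

replicas = [Replica("dn3", "rack2"), Replica("dn5", "rack1"), Replica("dn7", "rack3")]
print(pick_closest_replica("dn1", "rack1", replicas).node)  # dn5 (same rack)
```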
Safemode:
When HDFS starts up, the NameNode first enters Safemode. Blocks are not replicated while the NameNode is in Safemode. The NameNode receives Heartbeat and Block report messages from the DataNodes; a Block report contains the block information, such as the number of replicas of each block. Once the NameNode has checked the Block reports and confirmed that enough replicas exist, it exits Safemode and then starts replicating any under-replicated blocks to other DataNodes.
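The exit condition described above can be sketched as a threshold check: after collecting Block reports, the NameNode compares the fraction of sufficiently replicated blocks against a configured percentage (in Hadoop this is `dfs.namenode.safemode.threshold-pct`, default 0.999). The function below is a simplified model, not the real NameNode code.

```python
# Sketch of the Safemode exit check, assuming the behavior described above:
# leave Safemode only once almost all blocks have enough reported replicas.
def can_exit_safemode(block_replica_counts, min_replicas=1, threshold_pct=0.999):
    """block_replica_counts: {block_id: number of replicas reported so far}."""
    if not block_replica_counts:
        return True
    safe = sum(1 for n in block_replica_counts.values() if n >= min_replicas)
    return safe / len(block_replica_counts) >= threshold_pct

reports = {"blk_1": 3, "blk_2": 2, "blk_3": 0}   # blk_3 not reported yet
print(can_exit_safemode(reports))  # False: only 2 of 3 blocks are safe
```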
Heartbeat:
A Heartbeat is a signal sent periodically by each DataNode to the NameNode to indicate that the DataNode (slave node) is alive. Since the master and slave nodes are configured on different servers, the master needs to know whether each slave is alive, and Heartbeat signals serve this purpose. If a slave node is down, it cannot send the signal, and after a specified amount of time the master considers it dead. The number of nodes that are up or down can be seen in the Web Console.
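The "specified amount of time" comes from two Hadoop settings: `dfs.heartbeat.interval` (default 3 s) and `dfs.namenode.heartbeat.recheck-interval` (default 300 s), combined as 2 × recheck + 10 × heartbeat = 630 s, i.e. about 10.5 minutes. A minimal sketch of that dead-node check, under those default values:

```python
# Dead-node detection sketch based on the default Hadoop timeouts.
HEARTBEAT_INTERVAL = 3     # seconds (dfs.heartbeat.interval)
RECHECK_INTERVAL = 300     # seconds (dfs.namenode.heartbeat.recheck-interval)
DEAD_TIMEOUT = 2 * RECHECK_INTERVAL + 10 * HEARTBEAT_INTERVAL  # 630 seconds

def is_dead(last_heartbeat_ts: float, now: float) -> bool:
    """The NameNode marks a DataNode dead once no heartbeat arrives in time."""
    return now - last_heartbeat_ts > DEAD_TIMEOUT

print(is_dead(last_heartbeat_ts=0, now=600))   # False: still within 630 s
print(is_dead(last_heartbeat_ts=0, now=1000))  # True: marked dead
```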
Balancing (Under-Replicated and Over-Replicated):
Let us take an example: suppose one of the slave nodes is down. We have seen that the default block replication factor is 3 copies. If one machine is down, no problem occurs, because the blocks are replicated on other machines; however, those blocks are now under-replicated, with only 2 copies remaining. The master notices the missing Heartbeat, and HDFS automatically balances the cluster by replicating those blocks to other DataNodes, so that we again have 3 copies of each block. Now suppose that, after some time, the slave node that went down comes back up. Its blocks are now over-replicated. HDFS automatically balances this as well, by deleting the excess replicas; which DataNode's copy is deleted depends on the current usage of the blocks by clients.
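The balancing decision above reduces to comparing each block's live replica count with the replication factor. The sketch below only reports how many copies to add or delete per block; all names are illustrative, and the real NameNode logic also weighs rack placement and disk usage when choosing which replica to create or remove.

```python
# Under-/over-replication check: decide per block whether copies must be
# added (a DataNode died) or deleted (a DataNode rejoined with old copies).
def balance_actions(replica_counts, replication_factor=3):
    actions = {}
    for block, count in replica_counts.items():
        if count < replication_factor:
            actions[block] = ("replicate", replication_factor - count)  # under-replicated
        elif count > replication_factor:
            actions[block] = ("delete", count - replication_factor)     # over-replicated
    return actions

# One DataNode died (blk_a lost a copy); later a node rejoined (blk_b gained one):
print(balance_actions({"blk_a": 2, "blk_b": 4, "blk_c": 3}))
# {'blk_a': ('replicate', 1), 'blk_b': ('delete', 1)}
```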
High Availability:
High availability means that data remains accessible even in the worst situations, such as hardware failure, machine failure, or the system going down. If a network link, a node, or some piece of hardware goes down, the data is still served from another path, that is, from another replica.
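Reading "from some other path" can be sketched as a client-side fallback loop: try each replica in turn and move on when a DataNode is unreachable. The `fetch` function and node names below are invented to simulate a failed machine; the real HDFS client does this internally.

```python
# Fallback read sketch: the data stays available as long as one replica is up.
def read_block(replicas, fetch):
    """Try each replica in turn; fetch raises ConnectionError on a dead node."""
    for node in replicas:
        try:
            return fetch(node)
        except ConnectionError:
            continue  # that node or link is down; try the next replica
    raise IOError("block unavailable: all replicas failed")

def fetch(node):
    if node == "dn1":                 # simulate a failed machine
        raise ConnectionError(node)
    return f"data-from-{node}"

print(read_block(["dn1", "dn2", "dn3"], fetch))  # data-from-dn2
```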