HDFS – Rack:
A rack is a group of machines, typically connected to the same rack switch. One rack can hold multiple DataNodes. HDFS uses rack information mainly to reduce network traffic during read and write operations.
When a client reads or writes data, the NameNode prefers DataNodes on the client's own rack or, failing that, on a nearby rack. Communication between DataNodes on the same rack is more efficient because the traffic does not have to leave the rack.
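The idea of a "nearby" DataNode can be made concrete with a small distance function. The following is a minimal sketch, not Hadoop's actual code: it assumes every node is identified by a path such as /rack1/dn1 and counts the hops up to the closest common ancestor and back down. The names distance, closest_replica and the node paths are hypothetical and chosen only for illustration.

```python
from typing import List


def distance(a: str, b: str) -> int:
    """Number of tree hops between two topology paths such as '/rack1/dn1'."""
    pa = a.strip("/").split("/")
    pb = b.strip("/").split("/")
    # Count the leading path components the two nodes share.
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    # Hops from each node up to the closest common ancestor.
    return (len(pa) - common) + (len(pb) - common)


def closest_replica(reader: str, replicas: List[str]) -> str:
    """Pick the replica location with the smallest distance to the reader."""
    return min(replicas, key=lambda r: distance(reader, r))


if __name__ == "__main__":
    print(distance("/rack1/dn1", "/rack1/dn1"))  # same node: 0
    print(distance("/rack1/dn1", "/rack1/dn2"))  # same rack: 2
    print(distance("/rack1/dn1", "/rack2/dn5"))  # different rack: 4
    print(closest_replica("/rack1/dn1", ["/rack2/dn5", "/rack1/dn3", "/rack3/dn9"]))
```

With this measure, a replica on the reader's own rack always wins over a replica on another rack, which is the behaviour described above.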
Blocks are replicated across multiple racks so that no data is lost if one entire rack goes down: another replica is still available on a different rack. For example, if all 30 machines in one rack fail, the processing power of the cluster decreases, but the data remains available from the other racks.
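The effect of a rack failure can be checked with a few lines of code. This is only an illustrative sketch under simple assumptions: each block is described by the list of racks that hold its replicas, and the block IDs, rack names and the function blocks_lost_if_rack_fails are hypothetical.

```python
from typing import Dict, List, Set


def blocks_lost_if_rack_fails(block_racks: Dict[str, List[str]], failed_rack: str) -> Set[str]:
    """Return the blocks whose every replica sits on the failed rack."""
    return {
        block
        for block, racks in block_racks.items()
        if all(rack == failed_rack for rack in racks)
    }


# Hypothetical layout: replication factor 3, one entry per replica.
layout = {
    "blk_1": ["rack1", "rack2", "rack2"],
    "blk_2": ["rack1", "rack3", "rack3"],
    "blk_3": ["rack1", "rack1", "rack1"],  # deliberately NOT rack-aware
}

if __name__ == "__main__":
    # Only the block that ignores rack awareness becomes unavailable.
    print(blocks_lost_if_rack_fails(layout, "rack1"))
    print(blocks_lost_if_rack_fails(layout, "rack2"))
```

Blocks whose replicas span at least two racks survive the loss of any single rack, while a block kept entirely on one rack does not.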
The purpose of the rack-awareness policy is to improve data reliability, data availability and network bandwidth utilization.
The diagram above shows the structure of the cluster.
It has a core switch that connects the three rack switches.
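Rack 1 holds the NameNode and many DataNodes. At least one replica of every block must be placed on a different rack. Suppose the replication factor is 3 and a client writes data on the first DataNode in rack 1. The first replica is stored on that DataNode, but the three replicas cannot all be stored in rack 1. Under the default placement policy of current Hadoop releases, the second replica is written to a DataNode (slave) in another rack and the third replica to a different DataNode in that same remote rack. One rack therefore never holds more than two of the three replicas.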
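A rack-aware placement for replication factor 3 can be sketched as follows. This is a simplified illustration, not the NameNode's real implementation: it assumes the layout documented for current Hadoop releases (first replica on the writer's DataNode, the other two on different DataNodes of a single remote rack), and it picks the remote rack at random. The function place_replicas, the cluster dictionary and all node names are hypothetical. The property that matters for the example above is that the three replicas never all share one rack.

```python
import random
from typing import Dict, List, Optional


def place_replicas(
    cluster: Dict[str, List[str]],
    writer_rack: str,
    writer_node: str,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Choose three replica locations spread over exactly two racks."""
    if rng is None:
        rng = random.Random()
    # Replica 1: the DataNode the client is writing on.
    first = f"/{writer_rack}/{writer_node}"
    # Replicas 2 and 3 need a different rack with at least two DataNodes.
    remote_racks = [
        rack for rack, nodes in cluster.items()
        if rack != writer_rack and len(nodes) >= 2
    ]
    if not remote_racks:
        raise ValueError("no remote rack with two DataNodes available")
    remote = rng.choice(remote_racks)
    second_node, third_node = rng.sample(cluster[remote], 2)
    return [first, f"/{remote}/{second_node}", f"/{remote}/{third_node}"]


if __name__ == "__main__":
    cluster = {
        "rack1": ["dn1", "dn2", "dn3"],
        "rack2": ["dn4", "dn5", "dn6"],
        "rack3": ["dn7", "dn8", "dn9"],
    }
    print(place_replicas(cluster, "rack1", "dn1"))
```

Keeping two replicas on one remote rack means a write crosses the core switch only once, while the block still survives the loss of either rack.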
Network bandwidth within a rack is higher than network bandwidth between racks. Choosing closer DataNodes based on their rack location is called Rack Awareness.