HDFS stands for Hadoop Distributed File System, a widely trusted storage layer designed to run on commodity hardware. It is built to store very large files, often terabytes in size. Commodity hardware here means the inexpensive, everyday systems we use in daily life, with modest RAM and storage.
Some of the features of HDFS are fault tolerance, scalability, reliability, high throughput, and a distributed architecture.
In HDFS, data is written and read in a distributed manner. For storing large files, HDFS provides high aggregate data bandwidth.
The two main design principles of HDFS are storing a small number of large files and a write-once-read-many access model for files.
Let’s look at the concepts in detail in this tutorial.
There are two types of nodes in HDFS. The Name node, also called the Master node, is the daemon that runs on the master machine; an HDFS cluster has a single Name node and multiple Data nodes. The Data node, also called the Slave node, performs the tasks assigned by the Name node and is generally deployed on slave machines; a cluster has many Data nodes.
Files are divided into smaller chunks called blocks. The default block size is 64 MB or 128 MB, and it can be increased. A 64 MB block size means that 64 MB of data can be read at a time. The division into blocks is a physical division of the data. Blocks are stored in a distributed way across the cluster, so we can read, write, and process data in a distributed way. Blocks are replicated across the Data nodes.
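The split of a file into fixed-size blocks can be sketched with a little arithmetic. This is a minimal illustration, not the real HDFS API; a 128 MB block size is assumed here (the actual value comes from the cluster configuration).

```python
# Sketch: how HDFS logically splits a file into fixed-size blocks.
# The block size is configurable; 128 MB is assumed for this example.

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB in bytes

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE):
    """Return (offset, length) pairs, one per block of the file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)  # last block may be smaller
        blocks.append((offset, length))
        offset += length
    return blocks

# A 300 MB file needs three blocks: 128 MB + 128 MB + 44 MB.
blocks = split_into_blocks(300 * 1024 * 1024)
print(len(blocks))                       # 3
print(blocks[-1][1] // (1024 * 1024))    # 44
```

Note that the last block only occupies as much space as the remaining data, which is why HDFS prefers a small number of large files over many tiny ones.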
All the blocks of a file are replicated across Data nodes, and the whole replication process is managed by the master (Name node). To reduce network bandwidth, HDFS always tries to satisfy a read request from a replica that is close to the reader, preferably on the same rack. With replication we achieve high availability, data reliability, and fault tolerance. Data reliability means not losing any data.
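A quick arithmetic sketch of what a replication factor of 3 (the common default) means in practice; the file size and failure counts below are just example numbers:

```python
# Sketch: the cost and benefit of block replication (factor 3 assumed).

def raw_storage_gb(file_size_gb: float, replication: int = 3) -> float:
    """Total disk space one file consumes across the whole cluster."""
    return file_size_gb * replication

def replicas_left(replication: int, failed_nodes: int) -> int:
    """Worst case: every failed node held one replica of the same block."""
    return max(replication - failed_nodes, 0)

print(raw_storage_gb(10))   # 30.0 -> a 10 GB file occupies 30 GB of raw disk
print(replicas_left(3, 2))  # 1    -> the data survives two node failures
```

This is the trade-off replication makes: three times the storage cost buys the ability to lose up to two machines holding a block without losing the block itself.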
Reading is done in parallel (distributed), so it is very efficient. A JVM runs at the client node, and the first class that comes into the picture is the HDFS client. Let us see the reading process step by step: the HDFS client sends an open request to the Hadoop distributed file system.
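The read path described above can be modeled with a toy example: the client asks the Name node for the file's block list, picks a replica for each block, and streams the bytes from the Data nodes. All names and data structures below are illustrative stand-ins, not the real Hadoop API.

```python
# Toy model of the HDFS read path (names are illustrative, not the real API).
# The NameNode holds only metadata; the bytes live on the DataNodes.

namenode_metadata = {                 # file path -> ordered list of block ids
    "/logs/app.log": ["blk_1", "blk_2"],
}
block_locations = {                   # block id -> DataNodes holding a replica
    "blk_1": ["dn1", "dn3"],
    "blk_2": ["dn2", "dn3"],
}
datanode_storage = {                  # DataNode -> {block id: bytes}
    "dn1": {"blk_1": b"hello "},
    "dn2": {"blk_2": b"world"},
    "dn3": {"blk_1": b"hello ", "blk_2": b"world"},
}

def read_file(path: str) -> bytes:
    data = b""
    for blk in namenode_metadata[path]:    # 1. ask the NameNode for the blocks
        dn = block_locations[blk][0]       # 2. pick a (closest) replica
        data += datanode_storage[dn][blk]  # 3. stream the bytes from a DataNode
    return data

print(read_file("/logs/app.log"))  # b'hello world'
```

The key point the toy model shows is that the Name node is only consulted for metadata; the data itself flows directly between the client and the Data nodes.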
The write operation is also done in a distributed way. The Name node only tells the client which machines the blocks should be replicated on; the blocks are then replicated among the Data nodes themselves. Writing is done in parallel, meaning blocks are not written one after another (first block, then second block); they are written concurrently.
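The parallel write described above can be sketched with a thread pool: several blocks are handed off at once rather than one after another. This is a toy simulation under that assumption, not the real write pipeline.

```python
# Toy sketch: a file's blocks are written concurrently, not sequentially.
from concurrent.futures import ThreadPoolExecutor

stored = {}  # stands in for the DataNodes' disks

def write_block(block_id: str, payload: bytes) -> str:
    # In real HDFS each block goes through a replication pipeline of DataNodes.
    stored[block_id] = payload
    return block_id

blocks = {f"blk_{i}": bytes([i]) * 4 for i in range(4)}

with ThreadPoolExecutor(max_workers=4) as pool:
    done = list(pool.map(write_block, blocks.keys(), blocks.values()))

print(sorted(done))  # all four blocks written; completion order is not fixed
```

Because the blocks are independent, their writes can proceed side by side, which is where the write throughput of a distributed file system comes from.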
A rack is a group of machines, and one rack can hold multiple Data nodes. Rack awareness is mainly used to reduce network traffic during read and write operations. When a client reads or writes data, the Name node chooses a Data node on the same rack, or on a nearby rack, for the operation.
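HDFS's default rack-aware placement for three replicas puts the first replica on the writer's own node, the second on a node in a different rack, and the third on another node in that second rack. The sketch below illustrates that policy on a made-up two-rack topology; the node and rack names are hypothetical.

```python
# Sketch of the default rack-aware placement of 3 replicas in HDFS:
# replica 1 on the writer's node, replica 2 on a different rack,
# replica 3 on the same rack as replica 2. Topology is illustrative.
racks = {
    "rack1": ["dn1", "dn2"],
    "rack2": ["dn3", "dn4"],
}

def place_replicas(writer_node: str):
    local_rack = next(r for r, nodes in racks.items() if writer_node in nodes)
    remote_rack = next(r for r in racks if r != local_rack)
    first = writer_node             # replica 1: the writer's own node
    second = racks[remote_rack][0]  # replica 2: a node on a different rack
    third = racks[remote_rack][1]   # replica 3: same rack as replica 2
    return [first, second, third]

print(place_replicas("dn1"))  # ['dn1', 'dn3', 'dn4']
```

This placement balances two goals: surviving the loss of an entire rack (replicas live on two racks) while keeping cross-rack traffic low (only one of the three copies crosses the rack boundary during the write).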
First, to execute HDFS commands, we need to start the HDFS and YARN services. To do that we use start-dfs.sh and start-yarn.sh. Then all the services (daemons), such as the NameNode and DataNodes, are started. We can check the running services using the jps command.