
HDFS – Architecture:

Blocks

As we have seen, HDFS divides files into smaller chunks called blocks. The default block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x and later, and it can be configured to a larger value. A block is the unit of storage and transfer: with a 64 MB block size, each block holds up to 64 MB of a file, and clients read the file block by block. This is a physical division of the data. The blocks are stored across the cluster in a distributed way, so data can be read, written, and processed in parallel. Each block is also replicated across multiple data nodes.
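To make the idea of distributed, replicated blocks concrete, here is a minimal sketch in Python. It is a toy model, not the placement policy HDFS actually uses (the real policy is rack-aware). The function name `place_replicas`, the node names, and the simple round-robin rule are all assumptions made for illustration; the replication factor of 3 matches the usual HDFS default.

```python
def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct data nodes (toy round-robin rule)."""
    if replication > len(datanodes):
        raise ValueError("not enough data nodes for the requested replication")
    placement = {}
    for block_id in range(num_blocks):
        # Start at a different node for each block so blocks spread over the cluster.
        placement[block_id] = [
            datanodes[(block_id + offset) % len(datanodes)]
            for offset in range(replication)
        ]
    return placement


nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical data node names
placement = place_replicas(2, nodes)
for block_id, holders in placement.items():
    print(f"block {block_id} -> {holders}")
```

Because every block lives on several data nodes, losing one node does not lose the block, and different blocks of the same file can be read from different nodes at the same time.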

Figure: HDFS Architecture (i2tutorials.com)

Let us now see how HDFS uses space intelligently. Suppose we have a 129 MB file that needs to be divided into blocks.

With a 128 MB block size, the file is stored as just 2 blocks: one of 128 MB and one of only 1 MB, not a full 128 MB. HDFS does not waste space; the last block occupies only as much storage as the data it holds.
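The arithmetic behind the 129 MB example can be sketched in a few lines of Python. This is only an illustration of the calculation, not HDFS code; the function name `split_into_blocks` is hypothetical, and the 128 MB block size is the value used in the example above.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=128 * MB):
    """Return the size in bytes of each block a file of `file_size` bytes would occupy."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block holds only the leftover bytes, not a full block.
        sizes.append(remainder)
    return sizes


blocks = split_into_blocks(129 * MB)
print([size // MB for size in blocks])  # [128, 1]
```

The same function shows that a file whose size is an exact multiple of the block size gets no extra partial block, for example `split_into_blocks(256 * MB)` yields two full blocks.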

Name node

The name node runs on the master node. It stores all the metadata, such as file names, paths, the number of blocks, block IDs, block locations, and the number of replicas. It keeps this metadata in memory for fast retrieval. This raises a natural question: what happens if the system is rebooted? The metadata is not lost on reboot, because a persistent copy of the file system metadata is also kept on disk. The in-memory copy exists only for fast retrieval: many clients interact with the master at the same time, so a slow master would become a serious performance bottleneck. For this reason, the master is generally given a machine with a large amount of RAM.

The name node uses two main files to store this information. The EditLog records every change made to the file system. The FsImage stores the entire file system namespace, which we discussed earlier. Both files are stored in the name node's local file system.
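The way these two files work together can be sketched as follows: the in-memory namespace is rebuilt by starting from the FsImage and replaying the changes recorded in the EditLog. This is a simplified model under stated assumptions. The real FsImage and EditLog are on-disk files in Hadoop's own formats; here a plain dictionary and a list of tuples stand in for them, and the operation names `create` and `delete` and the function names are hypothetical.

```python
def apply_edit(namespace, edit):
    """Apply one logged change to the in-memory namespace."""
    operation, path, blocks = edit
    if operation == "create":
        namespace[path] = list(blocks)
    elif operation == "delete":
        namespace.pop(path, None)
    else:
        raise ValueError(f"unknown operation: {operation}")


def load_namespace(fsimage, editlog):
    """Rebuild the namespace: start from the snapshot, then replay edits in order."""
    namespace = {path: list(blocks) for path, blocks in fsimage.items()}
    for edit in editlog:
        apply_edit(namespace, edit)
    return namespace


# Toy stand-ins for the two files (hypothetical paths and block IDs).
fsimage = {"/data/a.txt": ["blk_1", "blk_2"]}
editlog = [
    ("create", "/data/b.txt", ["blk_3"]),
    ("delete", "/data/a.txt", []),
]

namespace = load_namespace(fsimage, editlog)
print(namespace)  # {'/data/b.txt': ['blk_3']}
```

Keeping a snapshot plus a log means each change only needs a small append to the log instead of rewriting the whole namespace image.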

Data node

A data node runs on every slave machine in the cluster and stores the actual data. Because the data nodes do the actual work and a cluster needs many of them, we generally use commodity hardware for them. Data nodes need large disk capacity to store and process huge volumes of data.