HDFS (Hadoop Distributed File System) is a widely used distributed storage system designed to hold a moderate number of very large files rather than a huge number of small ones. It replicates data across machines, so files remain available even when hardware fails.
It is the fault-tolerant storage layer for Hadoop and its components, and it supports fast, parallel data access. This HDFS tutorial walks through the core concepts of the file system in detail. As shown in the image above, the blog covers the main HDFS topics: what HDFS is, HDFS nodes, daemons, racks, data storage, HDFS architecture, features, and HDFS operations. After reading it carefully, you should have a solid understanding of the HDFS file system and a better sense of whether this is the right career choice for you.
HDFS follows a master-slave architecture with two types of nodes: the NameNode (master) and the DataNodes (slaves).
There are two HDFS daemons, as shown in the image above: the NameNode and the DataNode.
Whenever a file is copied to HDFS, it is first divided into chunks called blocks. The default block size in HDFS is 128 MB, which can be changed to suit requirements.
These blocks are then stored across the cluster in a distributed manner. With the help of MapReduce, the data can be processed in parallel within the cluster. Each block is replicated on several nodes to provide fault tolerance, reliability, and availability. The image given below illustrates the data storage process in HDFS.
A Hadoop cluster is made up of nodes distributed across different racks. The name node daemon places replicas of each data block on more than one rack to improve fault tolerance, so that the failure of a single rack does not cause data loss. The objective of rack awareness in HDFS is to balance data storage, network bandwidth, and data reliability.
The image gives a detailed picture of the HDFS architecture: a single name node stores the metadata, while any number of data nodes do the actual storage work. Data nodes are arranged in racks, and replicas of data blocks are distributed across racks in the cluster to provide reliability, fault tolerance, and availability. Read and write operations work as follows. When a client wants to write a file, it sends a request to the name node. Once the file has been written and closed, its existing contents cannot be edited in place, because HDFS follows a write-once, read-many model. The client interacts with the name node for every operation, which is why the name node is regarded as the heart of the HDFS framework.
As we know, data is stored in a distributed manner in HDFS, spread across the nodes of the cluster. This makes HDFS well suited to storing large data files and processing them in parallel. MapReduce is the processing framework that takes advantage of this distributed storage; the storage itself is provided by HDFS.
As noted earlier, every file copied to HDFS is first divided into blocks, and the default block size of 128 MB can be changed to suit requirements. The blocks are stored across the cluster in a distributed manner, and MapReduce can then process them in parallel.
For example, a 129 MB file is stored as two blocks: one of 128 MB and one of 1 MB. The last block occupies only as much disk space as the data it holds, so no disk space is wasted.
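The arithmetic behind this example can be sketched in a few lines of Python. This is only an illustration of the block-splitting idea, not HDFS code; the function name split_into_blocks is made up for this sketch, and sizes are in whole megabytes.

```python
BLOCK_SIZE_MB = 128  # default HDFS block size discussed above


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file of this size would occupy."""
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("file size must be non-negative and block size positive")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only what is left over
    return sizes


print(split_into_blocks(129))  # [128, 1]
```

A 256 MB file would give two full blocks, and a 100 MB file would give a single 100 MB block.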
By default, three replicas of each data block are created so that the data stays available when something fails. In other words, data loss is very unlikely in HDFS. The placement of blocks and their replicas is always decided by the name node daemon. For reads, it directs the client to the replica on the data node that can serve the data in the least time.
The duplicate copies of data blocks are called replicas. By default, HDFS keeps three copies of each block, and the framework tries to spread them over more than one rack of the cluster so that a single rack failure does not take out every copy.
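One way to picture rack-aware placement is the toy sketch below. It assumes a replication factor of three and follows the commonly documented default: the first replica on the writer's node, the second on a node in a different rack, and the third on another node in that same remote rack. The topology dictionary, node names, and the place_replicas function are hypothetical, and the choice is made deterministic for readability, whereas a real cluster picks nodes at random and takes load into account.

```python
def place_replicas(topology, writer_node):
    """Pick three data nodes for one block.

    topology: dict mapping rack name -> list of node names (hypothetical layout).
    writer_node: node the client writes from; assumed to be a data node.
    """
    local_rack = next(
        (rack for rack, nodes in topology.items() if writer_node in nodes), None
    )
    if local_rack is None:
        raise ValueError("writer_node is not part of the topology")
    # Choose a different rack that can hold the remaining two replicas.
    remote_rack = next(
        (rack for rack in sorted(topology)
         if rack != local_rack and len(topology[rack]) >= 2),
        None,
    )
    if remote_rack is None:
        raise ValueError("need a second rack with at least two nodes")
    second, third = sorted(topology[remote_rack])[:2]
    return [writer_node, second, third]


topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(place_replicas(topology, "dn1"))  # ['dn1', 'dn3', 'dn4']
```

With this layout, losing all of rack1 still leaves two copies on rack2, and losing all of rack2 still leaves the copy on dn1.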
The concept of a rack was discussed in an earlier section; its main objectives are availability, reliability, and efficient use of network bandwidth.
Distributing data across multiple nodes promotes data availability across the cluster. For example, if some hardware goes down, the data can still be accessed from another node that holds a replica, which ultimately results in high availability.
HDFS is the fault-tolerant storage layer for Hadoop and its components. It runs on commodity hardware with average configurations, where hardware failures are to be expected. To tolerate such faults, HDFS replicates data in multiple places, so the data can be read from another node if a particular data node is not responding. As a result, the chances of data loss are almost negligible, which is why HDFS is used by top MNCs to store their large data files.
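The failover idea can be sketched as follows. This is a simplified illustration, not the real HDFS client: the block id, the node names, and the pick_live_replica function are assumptions made for the example.

```python
def pick_live_replica(block_id, block_locations, live_nodes):
    """Return the first data node that holds the block and is still responding."""
    for node in block_locations[block_id]:
        if node in live_nodes:
            return node
    raise IOError(f"no live replica available for {block_id}")


# Hypothetical cluster state: dn1 has failed, dn3 and dn4 are still up.
block_locations = {"blk_1": ["dn1", "dn3", "dn4"]}
live_nodes = {"dn3", "dn4"}
print(pick_live_replica("blk_1", block_locations, live_nodes))  # dn3
```

The read only fails when every node holding a replica is down at the same time, which is why more replicas mean a lower chance of losing access to the data.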
Scalability here means expanding or shrinking the storage capacity of the cluster. There are two ways to achieve scalability in HDFS:
The first option is vertical scaling, where more disks are added to the existing nodes of the cluster. For this purpose, you need to change the configuration settings and add disks as required. This requires some downtime, but it is very short. If this kind of scaling does not suit you, you can opt for the other option discussed below.
The other popular option is horizontal scaling, where more nodes are added to the cluster instead of more disks. In principle, any number of nodes can be added to the cluster as the project requires. This is a very attractive feature, used by many leading MNCs to manage large data files.
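A rough way to see what scaling buys you is to estimate usable capacity. The sketch below assumes every block is stored three times and ignores space reserved for the operating system and temporary data, so treat the numbers as an upper bound; the function name and the example figures are made up for illustration.

```python
def usable_capacity_tb(node_count, disk_per_node_tb, replication=3):
    """Approximate usable HDFS capacity: raw capacity divided by the replication factor."""
    if replication <= 0:
        raise ValueError("replication must be positive")
    return node_count * disk_per_node_tb / replication


print(usable_capacity_tb(10, 12))  # 40.0 TB usable from 120 TB raw
print(usable_capacity_tb(15, 12))  # 60.0 TB after adding five more nodes
```

Adding disks to the existing nodes raises disk_per_node_tb, while adding nodes raises node_count; either way the usable figure grows in proportion.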
From the discussion so far, it is clear that replication increases both fault tolerance and data availability. The third notable feature is data reliability. Data is replicated three times by default, but it should not be over-replicated, otherwise storage space is wasted. So deleting excess replicas is necessary to store the data reliably and efficiently.
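The bookkeeping behind this can be sketched as a comparison between the number of replicas each block actually has and the target. The code below is a simplified illustration under the assumption of a target of three replicas; the block ids, node names, and the replication_actions function are hypothetical.

```python
def replication_actions(block_replicas, target=3):
    """Compare each block's replica count with the target.

    block_replicas: dict mapping block id -> list of nodes holding a replica.
    Returns a dict of block id -> ("delete" or "copy", number of replicas).
    """
    actions = {}
    for block_id, nodes in block_replicas.items():
        difference = len(nodes) - target
        if difference > 0:
            actions[block_id] = ("delete", difference)  # over-replicated
        elif difference < 0:
            actions[block_id] = ("copy", -difference)   # under-replicated
    return actions


state = {
    "blk_1": ["dn1", "dn2", "dn3", "dn4"],  # one replica too many
    "blk_2": ["dn1", "dn2"],                # one replica missing
    "blk_3": ["dn1", "dn2", "dn3"],         # exactly on target
}
print(replication_actions(state))
# {'blk_1': ('delete', 1), 'blk_2': ('copy', 1)}
```

Blocks that are already on target do not appear in the result, so nothing is done for them.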
Do you know what throughput means here? Throughput is the total amount of work completed per unit of time, and it is a measure of overall system performance. In HDFS, a task is divided and distributed among multiple systems to share the workload. Each system executes its part independently and in parallel. This completes the task in a shorter time span and reduces the overall time for read operations.
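The divide-and-process-in-parallel pattern can be sketched on a single machine. In the example below, worker threads stand in for the separate systems of a cluster, and three short strings stand in for data blocks; both are assumptions made to keep the sketch small, and this is not how MapReduce itself is implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def count_words(block_text):
    """Work done independently on one block."""
    return len(block_text.split())


def parallel_word_count(blocks, workers=4):
    """Process every block in its own task and combine the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_words, blocks))


blocks = [
    "hadoop stores large files",
    "blocks are replicated",
    "tasks run in parallel",
]
print(parallel_word_count(blocks))  # 11
```

Each block is handled without waiting for the others, and only the small partial results are combined at the end, which is the property that lets the work spread across many nodes.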
To read or write any data file, the client first interacts with the name node, which processes the request so that the operation can be completed. You must complete the authentication process before you start reading or writing data.
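The overall flow can be pictured with a toy in-memory model. The ToyCluster class below is purely illustrative and assumes a tiny block size of 4 bytes so the splitting is easy to see; it leaves out authentication, networking, and failure handling, and none of its names come from the Hadoop API.

```python
class ToyCluster:
    """In-memory stand-in for a name node plus data nodes (illustration only)."""

    def __init__(self, node_names, block_size=4, replication=3):
        self.block_size = block_size
        self.replication = replication
        self.data_nodes = {name: {} for name in node_names}  # node -> {block id: bytes}
        self.metadata = {}  # name node view: path -> [(block id, [nodes])]

    def write(self, path, data):
        if path in self.metadata:
            raise FileExistsError(path)  # existing files are not rewritten in place
        nodes = list(self.data_nodes)
        copies = min(self.replication, len(nodes))
        entries = []
        for offset in range(0, len(data), self.block_size):
            index = offset // self.block_size
            block_id = f"{path}#{index}"
            # The name node's role: decide which data nodes receive this block.
            targets = [nodes[(index + k) % len(nodes)] for k in range(copies)]
            for node in targets:
                self.data_nodes[node][block_id] = data[offset:offset + self.block_size]
            entries.append((block_id, targets))
        self.metadata[path] = entries

    def read(self, path):
        # Ask the name node for block locations, then fetch each block from a data node.
        chunks = []
        for block_id, locations in self.metadata[path]:
            chunks.append(self.data_nodes[locations[0]][block_id])
        return b"".join(chunks)


cluster = ToyCluster(["dn1", "dn2", "dn3", "dn4"])
cluster.write("/demo.txt", b"hello hdfs")
print(cluster.read("/demo.txt"))            # b'hello hdfs'
print(len(cluster.metadata["/demo.txt"]))   # 3 blocks of up to 4 bytes each
```

Note that the metadata dictionary only records where the blocks are; the bytes themselves live on the data nodes, which mirrors the split between the name node and the data nodes.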
In this blog, we have discussed Hadoop HDFS in detail. This tutorial is suitable for beginners and should give you a clear idea of the framework. If you would like to learn more about HDFS, you can start with the Hadoop certification program at JanBaskTraining.