
How to Install Hadoop and Set Up a Hadoop Cluster?

Hadoop is supported by the Linux platform and its different flavors, so it is necessary to set up Linux first for the Hadoop environment. If you have any other operating system, you can use VirtualBox software and run Linux inside a virtual machine. Hadoop is generally installed in two modes: single node and multi-node.

Hadoop Installation Modes

  • Single Node Hadoop Installation
  • Multi-Node Hadoop Installation

A single node cluster means there is only a single DataNode running, with the NameNode, ResourceManager, DataNode, and NodeManager all set up on one machine. It is generally used for study and testing purposes. In a small environment like this, a single node installation lets you check the whole workflow sequentially, which is much simpler than in a large environment where data is distributed across hundreds of machines.

In the case of a multi-node cluster, there are multiple DataNodes running, and each DataNode runs on a different machine. These clusters are suitable for big organizations that process voluminous data every day: they have to deal with petabytes of data distributed across hundreds of machines, so they prefer multi-node clusters for such large environments.

Prerequisites for Hadoop Installation

  • VirtualBox: It is used to run the operating system on which Hadoop will be installed.
  • Operating System: Hadoop can be installed on Linux-based operating systems as well as on Macintosh and Windows.
  • Java: You should install the Java packages (JDK) on your system.
  • Hadoop: Obviously, you need the Hadoop package to start the installation.

Hadoop Installation on Linux CentOS

  • Download the Hadoop package and extract the tar file. Command: wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz Command: tar -xvf hadoop-2.7.3.tar.gz

  • Now add the Hadoop and Java paths to the bash file (.bashrc). Open the bash file and add the paths as shown below. Command: vi .bashrc
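
Here is a minimal sketch of the entries to add, assuming Hadoop was extracted into your home directory and the JDK is installed under /usr/lib/jvm; adjust both paths to match your actual install locations:

export HADOOP_HOME=$HOME/hadoop-2.7.3              # assumed extraction path
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_101         # assumed JDK path; point this at your JDK
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$JAVA_HOME/bin
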
  • Save the bash file and close it. To apply these changes to the current terminal, execute the source command. Command: source .bashrc
  • To make sure that Hadoop and Java are installed properly and can be accessed through the terminal, check their versions as given below. Command: java -version Command: hadoop version
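
If both are on the PATH, you should see output similar to the following (the exact build numbers will differ):

java version "1.8.0_101"
Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
...
Hadoop 2.7.3
...
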
  • This is the time to edit the Hadoop configuration files. For this purpose, change into the configuration directory first and then list all the Hadoop configuration files. Command: cd hadoop-2.7.3/etc/hadoop/ Command: ls The complete list of Hadoop configuration files is located in the hadoop-2.7.3/etc/hadoop directory, and the files can be displayed using the “ls” command.
  • Open the core-site.xml file and edit it properly. This file informs the Hadoop daemon where the NameNode runs in the cluster. It also contains Hadoop core configuration settings, like the I/O settings common to MapReduce and HDFS. To open the core-site.xml file, you can use the following command: Command: vi core-site.xml

Here are the steps to configure core-site.xml file:


<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
  • Open the hdfs-site.xml file and edit it properly. This file contains the configuration settings of the HDFS daemons, like the NameNode and DataNodes. It includes the block size and replication factor of HDFS. To open the hdfs-site.xml file, you can use the following command: Command: vi hdfs-site.xml

Here are the steps to configure hdfs-site.xml file:


<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>
  • Open the mapred-site.xml file and edit it properly. This file contains the configuration settings of the MapReduce application, like the number of CPU cores, reducer processes, the size of the mapper, etc. If the mapred-site.xml file is not available, you should create it from the mapred-site.xml template. Here are the commands to create the mapred-site.xml file and open it. Command: cp mapred-site.xml.template mapred-site.xml Command: vi mapred-site.xml

Here are the steps to configure mapred-site.xml file:


<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
  • Open the yarn-site.xml file and edit it properly. This file contains the configuration settings of the NodeManager and ResourceManager, like memory management, application size, programs, algorithms, etc. To open the yarn-site.xml file, you can use the following command: Command: vi yarn-site.xml

Here are the steps to configure yarn-site.xml file:


<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
  • In the next step, edit the hadoop-env.sh file and add the Java path as given below. This file contains the environment variables used by the scripts that run Hadoop, like the Java home path. To open the hadoop-env.sh file, you can use the following command: Command: vi hadoop-env.sh
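
The line to add looks like the following; the JDK path shown is an assumption, so replace it with the location of your actual Java installation:

export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_101   # assumed path; replace with your JDK location
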
  • Once you have edited and configured all the files successfully, go to the home directory and format the NameNode. Command: cd Command: cd hadoop-2.7.3 Command: bin/hadoop namenode -format

This command formats HDFS through the NameNode. You should never format a running file system; otherwise you will lose all the data stored on it.

  • Once you have formatted the NameNode, go to the Hadoop sbin directory and start all the daemons. For this purpose, first change the directory, then start all the daemons together with a single command, or start each one individually. Here are the commands for your reference. Command: cd hadoop-2.7.3/sbin Command: ./start-all.sh
  • With this command, all daemons are started together. Now let us learn how to start each daemon individually. Here is the command to start the NameNode daemon in Hadoop. Command: ./hadoop-daemon.sh start namenode
  • Here is the command to start the Data Node daemon in Hadoop. Command: ./hadoop-daemon.sh start datanode
  • Next is the ResourceManager, which manages all cluster resources and distributed applications on YARN. Here is the command to start the ResourceManager daemon in Hadoop. Command: ./yarn-daemon.sh start resourcemanager
  • The NodeManager is the agent responsible for managing containers, monitoring resource usage, and reporting it to the ResourceManager. Here is the command to start the NodeManager daemon in Hadoop. Command: ./yarn-daemon.sh start nodemanager
  • Job History Server is responsible for servicing all job history related requests from clients. Here is the command to start the Job History Server daemon in Hadoop. Command: ./mr-jobhistory-daemon.sh start historyserver
  • To make sure that all Hadoop services are running properly, and to spot any errors, run the following command: Command: jps
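
If every daemon came up, the output should look roughly like this (the process IDs will differ on your machine):

2345 NameNode
2456 DataNode
2567 SecondaryNameNode
2678 ResourceManager
2789 NodeManager
2890 JobHistoryServer
2901 Jps
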
  • In the last step, open a browser and go to localhost:50070/dfshealth.html to check the NameNode interface.

Congratulations, you have successfully installed a single node Hadoop cluster in one go. In the next section, let us learn how to install Hadoop on a multi-node cluster.

Hadoop Installation – Setting up a multi-node Hadoop cluster

A multi-node cluster contains two or more DataNodes in a distributed Hadoop environment. It is what organizations practically use to analyze and process petabytes of data. With a Hadoop certification, you may learn Hadoop environment setup and installation practically. Here, we need two machines, Master and Slave, with a DataNode running on both. Let us start with the multi-node cluster setup in Hadoop.


What are the Prerequisites?

  • Operating System: Hadoop can be installed on Linux-based operating systems as well as on Macintosh and Windows.
  • Java: You should install the Java packages (JDK) on your system.
  • Hadoop: Obviously, you need the Hadoop package to start the installation.
  • SSH: This is a network protocol for securely operating network services over an unsecured network; Hadoop uses it for communication between the nodes.

Here are steps for Hadoop installation over a multi-node cluster:

  • Here we are using two machines, Master and Slave. You may check their IP addresses with the ifconfig command. We are using 192.168.56.102 as the IP address of the master machine and 192.168.56.103 as the IP address of the slave machine.
  • In the second step, you should disable the firewall restrictions. Here are the commands for a successful firewall configuration: Command: service iptables stop Command: sudo chkconfig iptables off
  • Now open the hosts file and add the master and data nodes with their respective IP addresses, as shown below. Here is the command for the same. Command: sudo nano /etc/hosts
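
For the IP addresses used in this walkthrough, the /etc/hosts entries would look like this:

192.168.56.102 master
192.168.56.103 slave
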

Add the same entries to the hosts file on both machines, the master and the slave.

  • This is the time to restart the sshd service with the following command. Command: service sshd restart
  • Create an SSH key on the master node. You should press Enter if it asks for the file name in which to save the key. Command: ssh-keygen -t rsa -P ""
  • Append the generated SSH key to the master node's authorized keys. You can use the following command: Command: cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
  • Copy the generated SSH key to the slave node's authorized keys. You can use the following command: Command: ssh-copy-id -i $HOME/.ssh/id_rsa.pub Janbask@slave
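
Once the key has been copied, you can confirm that passwordless login works from the master; the slave hostname comes from the /etc/hosts entries added earlier. Command: ssh slave If it logs in without prompting for a password, the SSH setup is complete.
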
  • Download the Java package and save it to the home directory. Extract the Java tar file. Command: tar -xvf jdk-8u101-linux-i586.tar.gz
  • Download the Hadoop package and extract the Hadoop tar file. Command: wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz Command: tar -xvf hadoop-2.7.3.tar.gz
  • Now add the Hadoop and Java paths to the bash file (.bashrc). Open the bash file and add the paths as shown in the single-node section above. Command: vi .bashrc
  • Save the bash file and close it. To apply these changes to the current terminal, execute the source command. Command: source .bashrc
  • To make sure that Hadoop and Java are installed properly and can be accessed through the terminal, check their versions as given below. Command: java -version Command: hadoop version
  • Now create the masters file and edit the masters and slaves files as follows; example contents for both files are shown after these commands. Command: sudo gedit masters
  • Edit the slaves file on the master machine as follows: Command: sudo gedit /home/JanBask/hadoop-2.7.3/etc/hadoop/slaves
  • Edit the slaves file on the slave machine as follows: Command: sudo gedit /home/JanBask/hadoop-2.7.3/etc/hadoop/slaves
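
As a sketch, assuming the hostnames defined in /etc/hosts and a DataNode running on both machines, the masters file would contain the single line:

master

and the slaves file would list both data nodes:

master
slave
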
  • Now edit the core-site.xml file on the slave and master machines as follows: Command: sudo gedit /home/JanBask/hadoop-2.7.3/etc/hadoop/core-site.xml

Here are the steps to configure core-site.xml file in Hadoop.


<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
</property>
</configuration>
  • Edit the hdfs-site.xml file on the master machine as follows: Command: sudo gedit /home/JanBask/hadoop-2.7.3/etc/hadoop/hdfs-site.xml

Here are the steps to configure hdfs-site.xml file in Hadoop.


<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/JanBask/hadoop-2.7.3/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/JanBask/hadoop-2.7.3/datanode</value>
</property>
</configuration>
  • Edit the hdfs-site.xml file on the slave machine as follows:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/JanBask/hadoop-2.7.3/datanode</value>
</property>
</configuration>
  • Now copy the mapred-site template and edit the mapred-site.xml file on the master and slave machines as follows: Command: cp mapred-site.xml.template mapred-site.xml Command: sudo gedit /home/JanBask/hadoop-2.7.3/etc/hadoop/mapred-site.xml

Here are the steps to configure the mapred-site file in Hadoop.


<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
  • Edit the yarn-site.xml file on the master and slave machines as follows: Command: sudo gedit /home/JanBask/hadoop-2.7.3/etc/hadoop/yarn-site.xml

Here are the steps to configure the yarn-site.xml file in Hadoop.
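
As a sketch, this mirrors the single-node yarn-site.xml shown earlier; the yarn.resourcemanager.hostname property pointing to master is an assumption for this two-machine setup:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<!-- assumption: the ResourceManager runs on the master host -->
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
</property>
</configuration>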

  • In the next step, format the NameNode on the master machine only. Command: hadoop namenode -format
  • Start all the daemons together from the master machine as follows. Command: ./sbin/start-all.sh
  • Check whether all the daemons are running successfully on the master and slave machines, as shown below: Command: jps
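
With a DataNode and NodeManager running on both machines, the jps output would typically look like this (process IDs will differ):

On the master machine:
2345 NameNode
2456 SecondaryNameNode
2567 ResourceManager
2678 DataNode
2789 NodeManager
2890 Jps

On the slave machine:
3123 DataNode
3234 NodeManager
3345 Jps
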

In the end, go to master:50070/dfshealth.html on your master machine; this will open the NameNode interface. Now check the number of live nodes: if it is two, you have successfully set up a multi-node cluster. If it is not two, you have surely missed one of the steps mentioned in this blog. But don't panic; go back and verify each step one by one. Focus more on the file configurations because they are a little complex. If you find an issue, fix the problem and move ahead.

Here we have focused on two data nodes only, to explain the process in simple steps. You can add more nodes as per your requirements. I would recommend practicing with two nodes initially; you can increase the count later.

Final Words:

I hope this blog helps you install Hadoop and set up single node and multi-node clusters successfully. If you are still facing any problem, you should take help from mentors. You can join the online Hadoop certification program at JanBask Training and learn all the practical aspects of the framework.

We have a huge network of satisfied learners spread across the globe. The JanBask Hadoop or Big Data training will help you to become an expert in HDFS, Flume, HBase, Hive, Pig, Oozie, Yarn, MapReduce, etc. Have a question for us? If yes, please put your query in the comment section, we will get back to you.
