
How to Install Hadoop and Set Up a Hadoop Cluster?

Hadoop is supported by the Linux platform and its different flavors, so it is necessary to set up Linux first for the Hadoop environment. If you have any other operating system, you can use VirtualBox software and run Linux inside a virtual machine. Hadoop is generally installed in two modes: single node and multi-node.

Hadoop Installation Modes

  • Single Node Hadoop Installation
  • Multi-Node Hadoop Installation

A single node cluster means there is only a single DataNode running, with the NameNode, ResourceManager, DataNode, and NodeManager all set up on one machine. It is generally used for study and testing purposes. In a small environment like this, a single node installation lets you check the whole workflow sequentially, which is much simpler than in a large environment where data is distributed across hundreds of machines.

In the case of a multi-node cluster, there are multiple DataNodes running, and each DataNode runs on a different machine. These clusters are suitable for big organizations that process voluminous data every day: they have to deal with petabytes of data distributed across hundreds of machines, so they prefer multi-node clusters for such large environments.

Prerequisites for Hadoop Installation

  • VirtualBox: It is used to run the operating system on which Hadoop will be installed.
  • Operating System: Hadoop can be installed on Linux-based operating systems as well as on Macintosh and Windows.
  • Java: You should install the Java packages (JDK) on your system.
  • Hadoop: Obviously, you need the Hadoop package to start the installation.

Hadoop Installation on Linux CentOS

  • Download the Hadoop package and extract the tar file. Command: wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz Command: tar -xvf hadoop-2.7.3.tar.gz

  • Now add the Hadoop and Java paths to the bash file (.bashrc). Open the bash file and add the paths as shown below. Command: vi .bashrc
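
Here is a minimal sketch of the entries to add, assuming Hadoop was extracted into your home directory and the JDK is installed under /usr/lib/jvm; adjust both paths to match your actual install locations:

export HADOOP_HOME=$HOME/hadoop-2.7.3              # assumed extraction path
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_101         # assumed JDK path; point this at your JDK
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$JAVA_HOME/bin
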
  • Save the bash file and close it. To apply these changes to the current terminal, execute the source command. Command: source .bashrc
  • To make sure that Hadoop and Java are installed properly and can be accessed through the terminal, check their versions as given below. Command: java -version Command: hadoop version
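
If both are on the PATH, you should see output similar to the following (the exact build numbers will differ):

java version "1.8.0_101"
Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
...
Hadoop 2.7.3
...
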
  • This is the time to edit the Hadoop configuration files. For this purpose, change into the configuration directory first and then list all the Hadoop configuration files. Command: cd hadoop-2.7.3/etc/hadoop/ Command: ls The complete list of Hadoop configuration files is located in the hadoop-2.7.3/etc/hadoop directory, and the files can be displayed using the “ls” command.
  • Open the core-site.xml file and edit it properly. This file informs the Hadoop daemon where the NameNode runs in the cluster. It also contains Hadoop core configuration settings, like the I/O settings common to MapReduce and HDFS. To open the core-site.xml file, you can use the following command: Command: vi core-site.xml

Here are the steps to configure core-site.xml file:


<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
  • Open the hdfs-site.xml file and edit it properly. This file contains the configuration settings of the HDFS daemons, like the NameNode and DataNodes. It includes the block size and replication factor of HDFS. To open the hdfs-site.xml file, you can use the following command: Command: vi hdfs-site.xml

Here are the steps to configure hdfs-site.xml file:


<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>
  • Open the mapred-site.xml file and edit it properly. This file contains the configuration settings of the MapReduce application, like the number of CPU cores, reducer processes, the size of the mapper, etc. If the mapred-site.xml file is not available, you should create it from the mapred-site.xml template. Here are the commands to create the mapred-site.xml file and open it. Command: cp mapred-site.xml.template mapred-site.xml Command: vi mapred-site.xml

Here are the steps to configure mapred-site.xml file:


<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
  • Open the yarn-site.xml file and edit it properly. This file contains the configuration settings of the NodeManager and ResourceManager, like memory management, application size, programs, algorithms, etc. To open the yarn-site.xml file, you can use the following command: Command: vi yarn-site.xml

Here are the steps to configure yarn-site.xml file:


<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
  • In the next step, edit the hadoop-env.sh file and add the Java path as given below. This file contains the environment variables used by the scripts that run Hadoop, like the Java home path. To open the hadoop-env.sh file, you can use the following command: Command: vi hadoop-env.sh
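
The line to add looks like the following; the JDK path shown is an assumption, so replace it with the location of your actual Java installation:

export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_101   # assumed path; replace with your JDK location
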
  • Once you have edited and configured all the files successfully, go to the home directory and format the NameNode. Command: cd Command: cd hadoop-2.7.3 Command: bin/hadoop namenode -format

This command formats HDFS through the NameNode. You should never format a running file system; otherwise you will lose all the data stored on it.

  • Once you have formatted the NameNode, go to the Hadoop sbin directory and start all the daemons. For this purpose, first change the directory, then start all the daemons together with a single command, or start each one individually. Here are the commands for your reference. Command: cd hadoop-2.7.3/sbin Command: ./start-all.sh
  • With this command, all daemons are started together. Now let us learn how to start each daemon individually. Here is the command to start the NameNode daemon in Hadoop. Command: ./hadoop-daemon.sh start namenode
  • Here is the command to start the Data Node daemon in Hadoop. Command: ./hadoop-daemon.sh start datanode
  • Next is the ResourceManager, which manages all cluster resources and distributed applications on YARN. Here is the command to start the ResourceManager daemon in Hadoop. Command: ./yarn-daemon.sh start resourcemanager
  • The NodeManager is the agent responsible for managing containers, monitoring resource usage, and reporting it to the ResourceManager. Here is the command to start the NodeManager daemon in Hadoop. Command: ./yarn-daemon.sh start nodemanager
  • Job History Server is responsible for servicing all job history related requests from clients. Here is the command to start the Job History Server daemon in Hadoop. Command: ./mr-jobhistory-daemon.sh start historyserver
  • To make sure that all Hadoop services are running properly, and to spot any errors, run the following command: Command: jps
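
If every daemon came up, the output should look roughly like this (the process IDs will differ on your machine):

2345 NameNode
2456 DataNode
2567 SecondaryNameNode
2678 ResourceManager
2789 NodeManager
2890 JobHistoryServer
2901 Jps
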
  • In the last step, open a browser and go to localhost:50070/dfshealth.html to check the NameNode interface.

Congratulations, you have successfully installed a single node Hadoop cluster in one go. In the next section, let us learn how to install Hadoop on a multi-node cluster.

Hadoop Installation – Setting up a multi-node Hadoop cluster

A multi-node cluster contains two or more DataNodes in a distributed Hadoop environment. It is what organizations practically use to analyze and process petabytes of data. With a Hadoop certification, you may learn Hadoop environment setup and installation practically. Here, we need two machines, Master and Slave, with a DataNode running on both. Let us start with the multi-node cluster setup in Hadoop.


What are the Prerequisites?

  • Operating System: Hadoop can be installed on Linux-based operating systems as well as on Macintosh and Windows.
  • Java: You should install the Java packages (JDK) on your system.
  • Hadoop: Obviously, you need the Hadoop package to start the installation.
  • SSH: This is a network protocol for securely operating network services over an unsecured network; Hadoop uses it for communication between the nodes.

Here are steps for Hadoop installation over a multi-node cluster:

  • Here we are using two machines, Master and Slave. You may check their IP addresses with the ifconfig command. We are using 192.168.56.102 as the IP address of the master machine and 192.168.56.103 as the IP address of the slave machine.
  • In the second step, you should disable the firewall restrictions. Here are the commands for a successful firewall configuration: Command: service iptables stop Command: sudo chkconfig iptables off
  • Now open the hosts file and add the master and data nodes with their respective IP addresses, as shown below. Here is the command for the same. Command: sudo nano /etc/hosts
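
For the IP addresses used in this walkthrough, the /etc/hosts entries would look like this:

192.168.56.102 master
192.168.56.103 slave
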

Add the same entries to the hosts file on both machines, the master and the slave.

  • This is the time to restart the sshd service with the following command. Command: service sshd restart
  • Create an SSH key on the master node. You should press Enter if it asks for the file name in which to save the key. Command: ssh-keygen -t rsa -P ""
  • Append the generated SSH key to the master node's authorized keys. You can use the following command: Command: cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
  • Copy the generated SSH key to the slave node's authorized keys. You can use the following command: Command: ssh-copy-id -i $HOME/.ssh/id_rsa.pub Janbask@slave
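
Once the key has been copied, you can confirm that passwordless login works from the master; the slave hostname comes from the /etc/hosts entries added earlier. Command: ssh slave If it logs in without prompting for a password, the SSH setup is complete.
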
  • Download the Java package and save it to the home directory. Extract the Java tar file. Command: tar -xvf jdk-8u101-linux-i586.tar.gz
  • Download the Hadoop package and extract the Hadoop tar file. Command: wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz Command: tar -xvf hadoop-2.7.3.tar.gz
  • Now add the Hadoop and Java paths to the bash file (.bashrc). Open the bash file and add the paths as shown in the single-node section above. Command: vi .bashrc
  • Save the bash file and close it. To apply these changes to the current terminal, execute the source command. Command: source .bashrc
  • To make sure that Hadoop and Java are installed properly and can be accessed through the terminal, check their versions as given below. Command: java -version Command: hadoop version
  • Now create the masters file and edit the masters and slaves files as follows; example contents for both files are shown after these commands. Command: sudo gedit masters
  • Edit the slaves file on the master machine as follows: Command: sudo gedit /home/JanBask/hadoop-2.7.3/etc/hadoop/slaves
  • Edit the slaves file on the slave machine as follows: Command: sudo gedit /home/JanBask/hadoop-2.7.3/etc/hadoop/slaves
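
As a sketch, assuming the hostnames defined in /etc/hosts and a DataNode running on both machines, the masters file would contain the single line:

master

and the slaves file would list both data nodes:

master
slave
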
  • Now edit the core-site.xml file on the slave and master machines as follows: Command: sudo gedit /home/JanBask/hadoop-2.7.3/etc/hadoop/core-site.xml

Here are the steps to configure core-site.xml file in Hadoop.


<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
</property>
</configuration>
  • Edit the hdfs-site.xml file on the master machine as follows: Command: sudo gedit /home/JanBask/hadoop-2.7.3/etc/hadoop/hdfs-site.xml

Here are the steps to configure hdfs-site.xml file in Hadoop.


<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/JanBask/hadoop-2.7.3/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/JanBask/hadoop-2.7.3/datanode</value>
</property>
</configuration>
  • Edit the hdfs-site.xml file on the slave machine as follows:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/JanBask/hadoop-2.7.3/datanode</value>
</property>
</configuration>
  • Now copy the mapred-site template and edit the mapred-site.xml file on the master and slave machines as follows: Command: cp mapred-site.xml.template mapred-site.xml Command: sudo gedit /home/JanBask/hadoop-2.7.3/etc/hadoop/mapred-site.xml

Here are the steps to configure the mapred-site file in Hadoop.


<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
  • Edit the yarn-site.xml file on the master and slave machines as follows: Command: sudo gedit /home/JanBask/hadoop-2.7.3/etc/hadoop/yarn-site.xml

Here are the steps to configure the yarn-site.xml file in Hadoop.
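
As a sketch, this mirrors the single-node yarn-site.xml shown earlier; the yarn.resourcemanager.hostname property pointing to master is an assumption for this two-machine setup:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<!-- assumption: the ResourceManager runs on the master host -->
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
</property>
</configuration>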

  • In the next step, format the NameNode on the master machine only. Command: hadoop namenode -format
  • Start all the daemons together from the master machine as follows. Command: ./sbin/start-all.sh
  • Check whether all the daemons are running successfully on the master and slave machines, as shown below: Command: jps
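
With a DataNode and NodeManager running on both machines, the jps output would typically look like this (process IDs will differ):

On the master machine:
2345 NameNode
2456 SecondaryNameNode
2567 ResourceManager
2678 DataNode
2789 NodeManager
2890 Jps

On the slave machine:
3123 DataNode
3234 NodeManager
3345 Jps
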

In the end, go to master:50070/dfshealth.html on your master machine; this will open the NameNode interface. Now check the number of live nodes: if it is two, you have successfully set up a multi-node cluster. If it is not two, you have surely missed one of the steps mentioned in this blog. But don't panic; go back and verify each step one by one. Focus more on the file configurations because they are a little complex. If you find an issue, fix the problem and move ahead.

Here we have focused on two data nodes only, to explain the process in simple steps. You can add more nodes as per your requirements. I would recommend practicing with two nodes initially; you can increase the count later.

Final Words:

I hope this blog helps you install Hadoop and set up single node and multi-node clusters successfully. If you are still facing any problem, you should take help from mentors. You can join the online Hadoop certification program at JanBask Training and learn all the practical aspects of the framework.

We have a huge network of satisfied learners spread across the globe. The JanBask Hadoop or Big Data training will help you to become an expert in HDFS, Flume, HBase, Hive, Pig, Oozie, Yarn, MapReduce, etc. Have a question for us? If yes, please put your query in the comment section, we will get back to you.
