
How to Install Hadoop and Set Up a Hadoop Cluster?

Hadoop runs on the Linux platform and its various flavors, so you need to set up Linux first for the Hadoop environment. If you have any other operating system, you can use VirtualBox software and run Linux inside it. Hadoop is generally installed in one of two modes: single node and multi-node.

Hadoop Installation Modes

  • Single Node Hadoop Installation
  • Multi-Node Hadoop Installation

A single node cluster means there is only a single DataNode running, with the NameNode, ResourceManager, DataNode, and NodeManager all set up on one machine. It is generally used for study and testing purposes. A single node installation lets you check the whole workflow on one machine, whereas in large environments the data is distributed across hundreds of machines.

In a multi-node cluster, multiple DataNodes run, each on a different machine. These clusters suit large organizations that process voluminous data every day; they deal with petabytes of data distributed across hundreds of machines, so they prefer multi-node clusters for large environments.

Prerequisites for Hadoop Installation

  • VirtualBox: It is used to run a Linux operating system for Hadoop if your host machine runs something else.
  • Operating System: Hadoop can be installed on Linux-based, macOS, or Windows operating systems.
  • Java: You should install the Java package (JDK) on your system.
  • Hadoop: Obviously, you need the Hadoop package to start the installation.

Hadoop Installation on Linux CentOS

  • Download the Hadoop package and extract the tar file. Command: wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz Command: tar -xvf hadoop-2.7.3.tar.gz
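The next step adds Hadoop and Java to your environment. As a sketch, the lines appended to .bashrc usually look like the following; the exact directory names are assumptions and depend on where you extracted the archives:

```shell
# Sketch of .bashrc additions; adjust paths to your extraction locations.
export HADOOP_HOME="$HOME/hadoop-2.7.3"
export JAVA_HOME="$HOME/jdk1.8.0_101"   # assumed JDK directory name
export PATH="$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin"
```

After running `source .bashrc`, both `hadoop` and `java` resolve from any directory.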

  • Now add the Hadoop and Java paths to the bash file (.bashrc). Open the bash file and add both paths as shown below. Command: vi .bashrc
  • Save the bash file and close it. To apply the changes to the current terminal, execute the source command. Command: source .bashrc
  • To make sure that Hadoop and Java are installed properly and can be accessed through the terminal, run the version commands given below. Command: java -version Command: hadoop version
  • Now it is time to edit the Hadoop configuration files. First change into the configuration directory, then list all the Hadoop configuration files. Command: cd hadoop-2.7.3/etc/hadoop/ Command: ls The complete list of Hadoop configuration files is located in the hadoop-2.7.3/etc/hadoop directory and can be displayed using the "ls" command.
  • Open the core-site.xml file and edit it properly. This file tells the Hadoop daemons where the NameNode runs in the cluster. It also contains Hadoop core configuration settings, such as I/O settings common to MapReduce and HDFS. To open the core-site.xml file, use the following command: Command: vi core-site.xml

Here is the configuration for the core-site.xml file:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
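A quick way to confirm the property landed in the file is to grep for it. The snippet below writes the minimal core-site.xml from above into a scratch directory purely for illustration; on your machine you would grep the real file under hadoop-2.7.3/etc/hadoop:

```shell
# Illustration only: write the minimal core-site.xml shown above to a
# scratch directory, then grep it the way you would grep the real file.
conf_dir="$(mktemp -d)"
cat > "$conf_dir/core-site.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
EOF
# On a real install: grep -A1 'fs.default.name' hadoop-2.7.3/etc/hadoop/core-site.xml
grep -A1 'fs.default.name' "$conf_dir/core-site.xml"
```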
  • Open the hdfs-site.xml file and edit this file properly. This file contains the configuration settings of HDFS daemons like NameNode, DataNode, etc. It includes the block size and replication factor of HDFS. To open the hdfs-site.xml file, you can use the following command: Command: vi hdfs-site.xml

Here is the configuration for the hdfs-site.xml file:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>
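The dfs.replication value controls how many copies of each HDFS block are kept; with a single DataNode only one copy is possible, hence the value 1. As a quick illustration of what that means for storage (128 MB is the default block size in Hadoop 2.x):

```shell
# Storage for a 1 GB file with 128 MB blocks and replication factor 1.
file_mb=1024; block_mb=128; replication=1
blocks=$(( (file_mb + block_mb - 1) / block_mb ))   # round up
echo "$blocks blocks, $(( blocks * block_mb * replication )) MB stored"
# prints: 8 blocks, 1024 MB stored
```

With the multi-node replication factor of 2 used later in this guide, the same file would occupy 2048 MB across the cluster.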
  • Open the mapred-site.xml file and edit it properly. This file contains the configuration settings for MapReduce applications, such as CPU cores, the reducer process, or the size of the mapper. If the mapred-site.xml file is not available, you should create it from the mapred-site.xml template first. Here are the commands to create the mapred-site.xml file and open it. Command: cp mapred-site.xml.template mapred-site.xml Command: vi mapred-site.xml

Here is the configuration for the mapred-site.xml file:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
  • Open the yarn-site.xml file and edit it properly. This file contains the configuration settings for the NodeManager and ResourceManager, such as memory management, application size, programs, and algorithms. To open the yarn-site.xml file, use the following command: Command: vi yarn-site.xml

Here is the configuration for the yarn-site.xml file:

<?xml version="1.0"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
  • In the next step, edit the hadoop-env.sh file and add the Java path as given below. This file contains the environment variables used by the scripts that run Hadoop, such as the Java home path. To open the hadoop-env.sh file, use the following command: Command: vi hadoop-env.sh
  • Once you have edited and configured all the files successfully, go to the home directory and format the NameNode. Command: cd Command: cd hadoop-2.7.3 Command: bin/hadoop namenode -format

This command formats HDFS via the NameNode. You should never format a running file system; otherwise, you will lose all the data stored in it.

  • Once you have formatted the NameNode, go to the Hadoop sbin directory and start all the daemons. First change the directory, then start all the daemons together with a single command, or start them individually. Here are the commands for your reference. Command: cd hadoop-2.7.3/sbin Command: ./start-all.sh
  • With this, all daemons are started together using a single command. Now let us learn how to start each daemon individually. Here is the command to start the NameNode daemon in Hadoop. Command: ./hadoop-daemon.sh start namenode
  • Here is the command to start the DataNode daemon in Hadoop. Command: ./hadoop-daemon.sh start datanode
  • Next is the ResourceManager, which manages all cluster resources and the distributed applications running on YARN. Here is the command to start the ResourceManager daemon in Hadoop. Command: ./yarn-daemon.sh start resourcemanager
  • The NodeManager is the agent responsible for managing containers, monitoring resource usage, and reporting it to the ResourceManager. Here is the command to start the NodeManager daemon in Hadoop. Command: ./yarn-daemon.sh start nodemanager
  • The Job History Server is responsible for serving all job-history-related requests from clients. Here is the command to start the Job History Server daemon in Hadoop. Command: ./mr-jobhistory-daemon.sh start historyserver
  • To make sure that all Hadoop services are running properly, run the following command: Command: jps
  • In the last step, open a browser and go to localhost:50070/dfshealth.html to check the NameNode interface.
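The jps check above can be wrapped in a small helper that names any missing daemon explicitly. This is only a sketch, not part of the standard Hadoop tooling; check_daemons is a hypothetical helper you would call as check_daemons "$(jps)":

```shell
# Hypothetical helper: verify that every expected daemon appears in `jps` output.
check_daemons() {
    for d in NameNode DataNode ResourceManager NodeManager JobHistoryServer; do
        case "$1" in
            *"$d"*) ;;                         # daemon found, keep checking
            *) echo "missing: $d"; return 1 ;; # report the first missing one
        esac
    done
    echo "all daemons running"
}

# Demo with a sample listing (on a live cluster: check_daemons "$(jps)"):
check_daemons "2101 NameNode
2245 DataNode
2398 ResourceManager
2476 NodeManager
2554 JobHistoryServer
2620 Jps"
# prints: all daemons running
```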

Congratulations, you have successfully installed a single node Hadoop cluster in one go. In the next section, let us learn how to install Hadoop on a multi-node cluster.

Hadoop Installation – Setting up a multi-node Hadoop cluster

A multi-node cluster contains two or more DataNodes in a distributed Hadoop environment. It is used in practice by organizations to analyze and process petabytes of data. With a Hadoop certification, you can learn Hadoop environment setup and installation hands-on. Here, we need two machines, a master and a slave; a DataNode runs on both machines. Let us start with the multi-node cluster setup in Hadoop.


What are the Prerequisites?

  • Operating System: Hadoop can be installed on Linux-based, macOS, or Windows operating systems.
  • Java: You should install the Java package (JDK) on your system.
  • Hadoop: Obviously, you need the Hadoop package to start the installation.
  • SSH: This is a network protocol for operating network services securely over an unsecured network.

Here are steps for Hadoop installation over a multi-node cluster:

  • Here we are using two machines, a master and a slave. You can check their IP addresses with the ifconfig command. In this guide, the IP address of the master machine is 192.168.56.102 and the IP address of the slave machine is 192.168.56.103.
  • In the second step, disable the firewall restrictions. Here are the commands for the firewall configuration: Command: service iptables stop Command: sudo chkconfig iptables off
  • Now open the hosts file and add the master and slave nodes with their respective IP addresses. Here is the command for it. Command: sudo nano /etc/hosts

Add the same entries to the hosts file on both the master and the slave machine.
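The entries added to the hosts file are the same on both machines. With the IP addresses used in this guide they look like the sketch below; the hostnames master and slave are the names assumed throughout this setup:

```shell
# Host entries for the two machines in this guide (IPs from the steps above).
hosts_entries="192.168.56.102 master
192.168.56.103 slave"
# On the real machines, append them with root privileges:
#   printf '%s\n' "$hosts_entries" | sudo tee -a /etc/hosts
printf '%s\n' "$hosts_entries"
```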

  • Now restart the sshd service with the following command. Command: service sshd restart
  • Create an SSH key on the master node. Press Enter when it asks for the file name to save the key. Command: ssh-keygen -t rsa -P ""
  • Copy the generated SSH key to the master node's authorized keys. You can use the following command to copy the key. Command: cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
  • Copy the generated SSH key to the slave node's authorized keys. You can use the following command to copy the key. Command: ssh-copy-id -i $HOME/.ssh/id_rsa.pub [email protected]
  • Download the Java package and save it to the home directory, then extract the Java tar file. Command: tar -xvf jdk-8u101-linux-i1586.tar.gz
  • Download the Hadoop package and extract the Hadoop tar file. Command: wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz Command: tar -xvf hadoop-2.7.3.tar.gz
  • Now add the Hadoop and Java paths to the bash file (.bashrc). Open the bash file and add both paths as shown below. Command: vi .bashrc
  • Save the bash file and close it. To apply the changes to the current terminal, execute the source command. Command: source .bashrc
  • To make sure that Hadoop and Java are installed properly and can be accessed through the terminal, run the version commands given below. Command: java -version Command: hadoop version
  • Now create the masters file and edit both the masters and slaves files as follows: Command: sudo gedit masters
  • Edit the slaves file on the master machine as follows: Command: sudo gedit /home/JanBask/hadoop-2.7.3/etc/hadoop/slaves
  • Edit the slaves file on the slave machine as follows: Command: sudo gedit /home/JanBask/hadoop-2.7.3/etc/hadoop/slaves
  • Now edit the core-site.xml file on both the slave and master machines as follows: Command: sudo gedit /home/JanBask/hadoop-2.7.3/etc/hadoop/core-site.xml

Here is the configuration for the core-site.xml file in Hadoop.

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
</property>
</configuration>
  • Edit the hdfs-site.xml file on the master machine as follows: Command: sudo gedit /home/JanBask/hadoop-2.7.3/etc/hadoop/hdfs-site.xml

Here is the configuration for the hdfs-site.xml file in Hadoop.

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/JanBask/hadoop-2.7.3/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/JanBask/hadoop-2.7.3/datanode</value>
</property>
</configuration>
  • Edit the hdfs-site.xml file on the slave machine as follows:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/JanBask/hadoop-2.7.3/datanode</value>
</property>
</configuration>
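The dfs.namenode.name.dir and dfs.datanode.data.dir directories referenced above must exist and be writable before the NameNode is formatted. A sketch of creating them, assuming the hadoop-2.7.3 tree sits in the Hadoop user's home directory as in the configuration above:

```shell
# Create the HDFS storage directories referenced in hdfs-site.xml.
# On the master create both; on a slave only the datanode directory is needed.
mkdir -p "$HOME/hadoop-2.7.3/namenode" "$HOME/hadoop-2.7.3/datanode"
ls -d "$HOME/hadoop-2.7.3/namenode" "$HOME/hadoop-2.7.3/datanode"
```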
  • Now copy the mapred-site template and edit the mapred-site.xml file on the master and slave machines as follows: Command: cp mapred-site.xml.template mapred-site.xml Command: sudo gedit /home/JanBask/hadoop-2.7.3/etc/hadoop/mapred-site.xml

Here is the configuration for the mapred-site.xml file in Hadoop.

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
  • Edit the yarn-site.xml file on the master and slave machines as follows: Command: sudo gedit /home/JanBask/hadoop-2.7.3/etc/hadoop/yarn-site.xml

Here is the configuration for the yarn-site.xml file in Hadoop (the same settings as in the single node setup):

<?xml version="1.0"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
  • In the next step, format the NameNode on the master machine only. Command: hadoop namenode -format
  • Start all the daemons together on the master machine as follows. Command: ./sbin/start-all.sh
  • Check whether all the daemons are running successfully on the master and slave machines as follows: Command: jps

In the end, go to master:50070/dfshealth.html on your master machine to open the NameNode interface. Now check the number of live nodes; if it is two or more, you have successfully set up the multi-node cluster. If it is not, you have probably missed one of the steps mentioned in this blog. Don't panic; go back and verify each step one by one, focusing on the file configurations, because they are a little complex. If you find an issue, fix it and move ahead.

Here we have focused on only two data nodes to explain the process in simple steps. You can add more nodes as per your requirements. I would recommend starting with two nodes initially; you can increase the count later.

Final Words:

I hope this blog helps you install Hadoop and set up single node and multi-node clusters successfully. If you still face any problems, you should take help from mentors. You can join the online Hadoop certification program at JanBask Training and learn all practical aspects of the framework.

We have a huge network of satisfied learners spread across the globe. The JanBask Hadoop or Big Data training will help you become an expert in HDFS, Flume, HBase, Hive, Pig, Oozie, YARN, MapReduce, etc. Have a question for us? Please put your query in the comment section, and we will get back to you.


    Janbask Training

    JanBask Training is a leading global online training provider through live sessions. The live classes provide a blended approach of hands-on experience along with theoretical knowledge, driven by certified professionals.

