An Introduction and Differences Between YARN and MapReduce

Difference Between YARN and MapReduce

Hadoop developers are familiar with two terms: YARN and MapReduce. Newcomers may take them for the same thing, but the two concepts are quite different. YARN is an architecture for managing and allocating resources across a cluster, whereas MapReduce is a programming model.

This article explains both concepts in detail and then briefly compares them. In short, YARN acts as a general-purpose resource scheduler that does not know what an application does, while MapReduce defines what should be done with the resources it is given.

An Introduction to YARN

YARN was introduced in Hadoop 2.0 to separate resource management from the data processing components. It offers a more general platform than the distributed processing layer of earlier Hadoop versions. Compared with those versions, YARN provides:

  • A Resource Manager instead of a cluster manager,
  • A dedicated, short-lived Application Master instead of a Job Tracker,
  • A Node Manager instead of a Task Tracker,
  • A distributed application instead of a MapReduce job.

The YARN architecture is organized as follows:

[Figure: YARN architecture]

A global Resource Manager runs as the master daemon. It tracks the live nodes and the resources available on the cluster and manages how those resources are allocated, in a shared, secure, multi-tenant manner. When an application is submitted, a lightweight Application Master coordinates its execution. Its job is to run and monitor the application's tasks and to restart tasks that fail or run slowly. The tasks themselves run on the cluster nodes under the control of the Node Manager.

The Node Manager is a more efficient successor to the Task Tracker: instead of fixed slots, it works with dynamically created resource containers. The size of a container can vary from one application to another and depends on factors such as memory, CPU, and network I/O. MapReduce itself now runs on top of YARN (as MRv2).

Application Running Process in YARN

[Figure: running an application in YARN]

An application runs on YARN in the following order:

  • The client asks the Resource Manager to run an Application Master.
  • On receiving the request, the Resource Manager finds a Node Manager that can launch the Application Master in a container. Once the requested work is complete, the result is returned to the client.
  • If it needs more resources, the Application Master can request additional containers from the Resource Manager.
  • Finally, the distributed computation, such as a MapReduce job, runs in those containers.
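
The order above can be illustrated with a small toy simulation. This is only a sketch, not the YARN API: the `NodeManager` and `ResourceManager` classes, the `run_application` function, and the memory sizes are all hypothetical stand-ins, and memory is the only resource modelled.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Hypothetical stand-in for a Node Manager: tracks free memory on one node."""
    name: str
    free_mb: int
    containers: list = field(default_factory=list)

    def launch(self, label: str, mb: int) -> str:
        # Carve a container of the requested size out of this node's free memory.
        self.free_mb -= mb
        container_id = f"{self.name}/{label}"
        self.containers.append(container_id)
        return container_id


class ResourceManager:
    """Hypothetical stand-in for the Resource Manager: picks a node with enough room."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, label: str, mb: int) -> str:
        # Use the first node that still has enough free memory.
        for node in self.nodes:
            if node.free_mb >= mb:
                return node.launch(label, mb)
        raise RuntimeError("no node has enough free memory")


def run_application(rm: ResourceManager, tasks: int) -> list:
    # Steps 1-2: the client asks the RM, which starts the Application Master in a container.
    log = [rm.allocate("app-master", 512)]
    # Step 3: further containers are requested as the application needs them.
    for i in range(tasks):
        log.append(rm.allocate(f"task-{i}", 1024))
    # Step 4: the actual computation would now run inside those containers.
    return log


nodes = [NodeManager("node1", 2048), NodeManager("node2", 2048)]
placements = run_application(ResourceManager(nodes), tasks=3)
print(placements)
```

Because containers are sized per request rather than fixed in advance, the first node fills up after the Application Master and one task, and the remaining tasks land on the second node.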

Life Span of a YARN application

  • The lifespan of a YARN application can range from a few seconds to a few months.
  • One application per job, as in MapReduce.
  • One application per workflow:
    • Containers can be reused between jobs
    • Intermediate data can be cached between jobs
    • Tez and Spark are examples
  • A long-running application shared among many users:
    • It may act as a coordinator
    • A long-running master can launch other applications
    • Apache Impala runs a proxy application in this way, which reduces the overhead of starting an Application Master

Introduction to MapReduce

The MapReduce framework is used to write applications that process large amounts of structured and unstructured data stored in HDFS. It is mainly used for batch processing, which may involve terabytes or petabytes of data. MapReduce offers the following benefits:

  • Simple to use: Developers can write MapReduce applications in languages such as Java, C, C++, or Python, which makes it easy for them to run MapReduce jobs.
  • Scalable: MapReduce can process petabytes of data stored on an HDFS cluster.
  • Fast: Problems that would otherwise take days to solve can be solved with MapReduce in hours or minutes.
  • Easy to recover: If the copy of the data on one machine becomes unavailable because of a failure, MapReduce can take the data from another machine that holds a copy with the same key/value pairs and use it to complete the sub-task. The JobTracker keeps track of these failures.
  • Minimal data movement: MapReduce moves the computation to the physical nodes where the data resides in HDFS rather than moving the data. This reduces network I/O and significantly increases Hadoop's processing speed.

MapReduce is the core processing component of the Hadoop framework. It allows huge amounts of data to be processed in a parallel, distributed way. It consists of the following tasks and phases:

  • MapReduce has two tasks: Map and Reduce.
  • The reduce phase runs after the map phase has completed.
  • In the map phase, data blocks are read and processed to produce key-value pairs as intermediate output.
  • The output of the map phase becomes the input of the reducer.
  • A reducer can receive input from more than one mapper.
  • The reducer then aggregates the intermediate tuples and generates key-value pairs as the final output.
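
The phases listed above can be sketched in plain Python with the classic word-count example. This runs in a single process and does not use Hadoop; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are illustrative only, and each input string stands in for one data block.

```python
from collections import defaultdict


def map_phase(block: str):
    # Map: read one block of input and emit (word, 1) pairs as intermediate output.
    return [(word.lower(), 1) for word in block.split()]


def shuffle(pairs):
    # Group intermediate values by key so each reducer sees all values for its key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    # Reduce: aggregate the values for one key into a final (key, total) pair.
    return key, sum(values)


def word_count(blocks):
    intermediate = []
    for block in blocks:  # conceptually, one mapper per block
        intermediate.extend(map_phase(block))
    grouped = shuffle(intermediate)  # reducers receive output from several mappers
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = word_count(["yarn runs mapreduce", "mapreduce runs on yarn"])
print(counts)
```

The grouping step in the middle is what lets a reducer receive input from more than one mapper: both blocks contribute a pair for "yarn", and the reducer for that key sums them.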

Advantages of MapReduce

MapReduce has the following advantages that you should know –

1). Parallel Processing: In MapReduce, a job is divided among multiple nodes, which process their parts simultaneously. It is a divide-and-conquer approach: the data is split across several machines and processed in parallel, which reduces the processing time drastically.
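
As a rough illustration of the divide-and-conquer idea, the sketch below splits a list into chunks, processes each chunk in its own worker, and combines the partial results. It uses threads from Python's standard library as a stand-in for cluster nodes, so it shows the structure of the approach rather than a real speed-up; the function names and the sum-of-squares workload are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def split(records, parts):
    # Divide the input into roughly equal chunks, one per worker.
    size = max(1, -(-len(records) // parts))  # ceiling division, at least 1
    return [records[i:i + size] for i in range(0, len(records), size)]


def process_chunk(chunk):
    # Each worker handles its own chunk independently of the others.
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(records, workers=4):
    chunks = split(records, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process_chunk, chunks))
    # Combine the partial results into the final answer.
    return sum(partials)


total = parallel_sum_of_squares(list(range(1, 101)))
print(total)
```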

2). Locality of Data: Instead of moving the data to the processing, MapReduce moves the processing to each node that holds the data. Because data volumes are now so large, moving data from one place to another is difficult, so this approach is considered the more beneficial one.

It offers the following advantages:

  • Moving the processing from one node to another is much more cost-effective than moving the data.
  • Processing time is reduced drastically because more than one node takes part in the processing.
  • No single node is overburdened, because many nodes share the work.
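
The locality principle can be sketched as a simple placement rule: prefer a node that already stores the block, and fall back to another node only when none of the holders has capacity. This is a simplified illustration, not Hadoop's scheduler; the `assign_tasks` function, the block and node names, and the notion of a per-node task count are all hypothetical.

```python
def assign_tasks(block_locations, node_slots):
    """Assign each block to a node, preferring a node that already stores it.

    block_locations: dict of block id -> list of nodes holding a replica.
    node_slots: dict of node -> number of tasks it can still accept.
    """
    slots = dict(node_slots)  # work on a copy so the caller's dict is untouched
    plan = {}
    for block, replicas in block_locations.items():
        local = [n for n in replicas if slots.get(n, 0) > 0]
        if local:
            # Data-local: pick the replica holder with the most free capacity.
            node = max(local, key=lambda n: slots[n])
            plan[block] = (node, "local")
        else:
            # Fall back to any free node; the block must travel over the network.
            free = [n for n, s in slots.items() if s > 0]
            if not free:
                raise RuntimeError("no free capacity left")
            node = max(free, key=lambda n: slots[n])
            plan[block] = (node, "remote")
        slots[node] -= 1
    return plan


blocks = {"b1": ["n1", "n2"], "b2": ["n1"], "b3": ["n1"]}
plan = assign_tasks(blocks, {"n1": 2, "n2": 1, "n3": 1})
print(plan)
```

In this run the first two blocks are processed where they are stored, and only the third has to be read remotely because its only holder is already full.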

Difference Between YARN and MapReduce

Having discussed YARN and MapReduce, let's look at the differences between them.

Classic MapReduce (MRv1) uses the following components to process a task:

  1. Job Tracker
  2. Task Tracker
  3. Slot

YARN uses the following components to process a task:

  1. Resource Manager
  2. Timeline Server
  3. Application Master
  4. Node Manager
  5. Container

These are the components used to process a task or job in MapReduce and in YARN. Although the two are completely separate concepts, each brings its own advantages to data processing.

Scalability, availability, utilization, and multitenancy are other factors on which the two systems can be compared. YARN is purely a resource manager, whereas MapReduce is the model that distributes a data processing job and carries it through to completion. A MapReduce job uses a set of resources to do its work. In classic MapReduce, allocating those resources was part of the MapReduce framework itself; with YARN it is handled separately.

Final Words:

Today, Hadoop is a large platform used by many organizations to process huge amounts of data. MapReduce and YARN are just two of the concepts involved in that processing.

Hadoop developers gain many advantages from this platform: its approach to processing and its ability to handle huge amounts of data keep the overall architecture simple.

Hadoop data processing involves many steps, and YARN and MapReduce make it faster and more efficient through parallel and distributed processing.


    JanBask Training
