

What Is The Working Philosophy Behind Hadoop MapReduce?

In this blog we discuss the basics of Hadoop MapReduce and its core functionality, and then focus on the working methodology of each component of Hadoop MapReduce.

Introduction to Hadoop MapReduce

Apache Hadoop MapReduce is a framework for processing large data sets in parallel across a Hadoop cluster. Data analysis uses two stages: the Map process and the Reduce process. The job configuration supplies the Map and Reduce analysis functions, and the Hadoop framework provides the scheduling, distribution, and parallelization facilities.

A job is the top-level unit of work in MapReduce. A job usually has a Map and a Reduce stage, though the Reduce stage can be omitted.

During the Map stage, the input data is divided into input splits for analysis by Map tasks running in parallel across the Hadoop cluster. By default, the MapReduce framework reads its input data from the Hadoop Distributed File System (HDFS). The Reduce stage uses the results from the Map stage as input to a set of parallel Reduce tasks, which consolidate the data into the final results. Although the Reduce stage depends on output from the Map stage, Map and Reduce processing are not strictly sequential: Reduce tasks can begin copying intermediate data as soon as any Map task finishes, so it is not necessary for all Map tasks to complete before any Reduce task starts.


MapReduce operates on key-value pairs. Conceptually, a MapReduce job takes a set of input key-value pairs and produces a set of output key-value pairs by passing the data through the Map and Reduce functions. The Map tasks produce an intermediate set of key-value pairs that the Reduce tasks use as input.
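
As a rough illustration of this key-value flow, here is a minimal word-count style Mapper written against the standard org.apache.hadoop.mapreduce API. The class and variable names are illustrative, not something defined in this article: it consumes (byte offset, line) input pairs and emits the intermediate (word, 1) pairs that the Reduce tasks will later consume.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative Mapper: reads (byte offset, line of text) pairs and emits an
// intermediate (word, 1) pair for every word in the line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // intermediate key-value pair for the Reduce stage
        }
    }
}
```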

MapReduce Job Cycle

When the client submits a MapReduce job to Hadoop:

  • The local Job Client prepares the job for submission and hands it off to the Job Tracker.
  • The Job Tracker schedules the job and distributes the Map work among the Task Trackers for parallel processing.
  • Each Task Tracker spawns a Map Task. The Job Tracker receives progress information from the Task Trackers.
  • As Map results become available, the Job Tracker distributes the Reduce work among the Task Trackers for parallel processing.
  • Each Task Tracker spawns a Reduce Task to perform the work. The Job Tracker receives progress information from the Task Trackers.

Job Client

The main role of the Job Client is to prepare the job for execution. When you submit a MapReduce job to Hadoop, the local Job Client does the following (a minimal driver sketch follows this list):

  • Validates the job configuration.
  • Generates the input splits, which determine how Hadoop partitions the Map input data.
  • Copies the job resources (the job JAR file, input splits, and configuration) to a shared location, such as an HDFS directory, where they are accessible to the Job Tracker and the Task Trackers.
  • Submits the job to the Job Tracker.
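
A minimal driver sketch of these Job Client steps, assuming the illustrative WordCountMapper and WordCountReducer classes sketched elsewhere in this post; the Job API calls below correspond directly to the configuration, split generation, resource copying, and submission steps listed above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative driver: when submitted, the local Job Client validates this
// configuration, computes the input splits, copies the job resources to a
// shared location (e.g. HDFS), and hands the job to the Job Tracker.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");    // job configuration
        job.setJarByClass(WordCountDriver.class);         // job JAR shipped to the cluster

        job.setMapperClass(WordCountMapper.class);        // job-supplied Map function
        job.setReducerClass(WordCountReducer.class);      // job-supplied Reduce function
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // data the input splits are generated from
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // final results destination, e.g. HDFS

        // waitForCompletion() submits the job and polls for progress until it finishes.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```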

Job Tracker

The Job Tracker is responsible for scheduling jobs, dividing a job into Map and Reduce tasks, distributing those tasks among the worker nodes, recovering from task failures, and tracking the job status. When preparing to run a job, the Job Tracker:

  • Fetches the input splits from the shared location where the Job Client placed them.
  • Creates a Map task for each split.

Assigns each Map task to a Task Tracker. The Job Tracker monitors the health of the Task Trackers and the progress of the job. As Map tasks complete and their results become available, the Job Tracker:

  • Creates Reduce tasks, up to the maximum allowed by the job configuration (see the snippet below).
  • Assigns each partition of the Map results to a Reduce task.
  • Assigns each Reduce task to a Task Tracker.

A job is complete when all Map and Reduce tasks have finished successfully, or, if there is no Reduce step, when all Map tasks have finished.
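
The "maximum allowed by the job configuration" is simply the Reduce task count declared by the driver. A one-line sketch, assuming the same Job object as in the driver example above:

```java
// Assumption: 'job' is the org.apache.hadoop.mapreduce.Job built in the driver sketch.
job.setNumReduceTasks(10);   // the job configuration requests 10 Reduce tasks
```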

Task Tracker

A Task Tracker manages the tasks of one worker node and reports status to the Job Tracker. The Task Tracker usually runs on its associated worker node, but it is not required to be on the same host. When the Job Tracker assigns a Map or Reduce task to a Task Tracker, the Task Tracker:

  • Fetches the job resources locally.
  • Spawns a child JVM on the worker node to execute the Map or Reduce task (its options can be tuned through the job configuration, as sketched below).
  • Reports status back to the Job Tracker.
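
A hedged illustration of the child-JVM step: in classic MRv1 the options of the spawned task JVM come from the job configuration, for example through the mapred.child.java.opts property (the 512 MB heap below is just an example value).

```java
// Assumption: 'conf' is the org.apache.hadoop.conf.Configuration used in the driver sketch.
// Each Task Tracker launches its assigned task in a separate child JVM whose
// options (here a 512 MB heap) are taken from the job configuration.
conf.set("mapred.child.java.opts", "-Xmx512m");
```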

Map Task

The Hadoop MapReduce framework creates a Map task to process each input split. A Map task performs the following activities:

  • Uses the InputFormat to fetch the input data locally and generate input key-value pairs.
  • Applies the job-supplied Map function to each input key-value pair.
  • Performs local sorting and aggregation of the results.
  • If the job includes a Combiner, runs the Combiner for further aggregation (see the sketch below).
  • Stores the results locally, in memory and on the local file system.
  • Communicates progress and status to the Task Tracker.

When a Map task notifies the Task Tracker of its completion, the Task Tracker notifies the Job Tracker, which then makes the results available to the Reduce tasks.
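
When the Map output can be aggregated safely on the Map side (as in word count, where partial sums can be added later), a Combiner can be registered so that each Map task pre-aggregates its local results before they are shuffled to the reducers. A minimal sketch, assuming the Job from the driver example and reusing the word-count reducer as the Combiner:

```java
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Assumption: 'job' is the Job from the driver sketch above.
job.setInputFormatClass(TextInputFormat.class);   // InputFormat that turns each split into (offset, line) pairs
job.setCombinerClass(WordCountReducer.class);     // per-Map local aggregation before the shuffle
```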

Reduce Task

The Reduce stage aggregates the results from the Map stage into the final results. The final result set is usually smaller than the input set, but this is application dependent. The reduction is performed by parallel Reduce tasks and is generally carried out in three phases: copy, sort, and reduce. A Reduce task does the following (a minimal reducer sketch follows this list):

  • Fetches the job resources locally.
  • Enters the copy phase to fetch local copies of all the assigned Map results from the worker nodes where the Map tasks ran.
  • When the copy phase completes, executes the sort phase to merge the copied results into a single sorted set of (key, value-list) pairs.
  • When the sort phase completes, executes the reduce phase, invoking the job-supplied Reduce function on each (key, value-list) pair.
  • Saves the final results to the output destination, such as HDFS.
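
A minimal reducer sketch to match the mapper above: after the copy and sort phases, each reduce() call receives one key together with the list of values produced for it by all the Map tasks, and writes the final (word, total) pair to the job output (e.g. HDFS). The class name WordCountReducer is illustrative.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative Reducer: sums the per-word counts emitted by the Map tasks.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        total.set(sum);
        context.write(key, total);   // final result written to the job output, e.g. HDFS
    }
}
```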

Conclusion

The present era is all about managing data and extracting value from it. Data is growing at a massive rate, and handling it requires specialized tools. Hadoop has the capability to manage such Big Data, and MapReduce can be considered the core of the Hadoop system because it enables Hadoop to process data in a highly resilient, efficient manner.

