In this blog we will cover the basics of Hadoop MapReduce and its core functionality, and then look at how each component of Hadoop MapReduce works.
Apache Hadoop MapReduce is a framework for processing large data sets in parallel across a Hadoop cluster. Data analysis happens in two stages: Map and Reduce. The job configuration supplies the Map and Reduce analysis functions, and the Hadoop framework provides the scheduling, distribution, and parallelization facilities.
A job is the top-level unit of work in MapReduce. A job usually has both a Map and a Reduce stage, though the Reduce stage can be omitted.
During the Map stage, the input data is divided into input splits for analysis by Map tasks running in parallel across the Hadoop cluster. By default, the MapReduce framework reads its input data from the Hadoop Distributed File System (HDFS). The Reduce stage uses the results of the Map stage as input to a set of parallel Reduce tasks, which consolidate the data into the final results. Although the Reduce stage depends on output from the Map stage, Map and Reduce processing is not strictly sequential: Reduce tasks can begin as soon as any Map task finishes, so it is not necessary for all Map tasks to complete before any Reduce task starts.
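To make input splitting concrete, here is a minimal in-memory Python sketch (illustrative only, not the Hadoop API) that divides input lines into fixed-size splits and runs a toy Map task over each split in parallel:

```python
from concurrent.futures import ThreadPoolExecutor

def make_splits(lines, split_size):
    """Divide the input into fixed-size splits, one per Map task."""
    return [lines[i:i + split_size] for i in range(0, len(lines), split_size)]

def map_task(split):
    """A toy Map task: emit a (word, 1) pair for every word in the split."""
    return [(word, 1) for line in split for word in line.split()]

lines = ["hadoop mapreduce", "hadoop hdfs", "mapreduce job"]
splits = make_splits(lines, split_size=2)

# Each split is handed to an independent Map task; the pool runs them in parallel.
with ThreadPoolExecutor() as pool:
    intermediate = list(pool.map(map_task, splits))
# intermediate holds one list of (word, 1) pairs per split
```

In real Hadoop the splits are computed from HDFS block boundaries and the Map tasks run on different cluster nodes; the sketch only shows the shape of the data flow.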
MapReduce operates on key-value pairs. Conceptually, a MapReduce job takes a set of input key-value pairs and produces a set of output key-value pairs by passing the data through the Map and Reduce functions. The Map tasks produce an intermediate set of key-value pairs that the Reduce tasks consume as input.
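The flow of key-value pairs can be sketched with a minimal in-memory word count (a stand-in for what the framework does, not the actual Hadoop API):

```python
from collections import defaultdict

def map_fn(_, line):
    # Map: take one input key-value pair, emit intermediate (word, 1) pairs.
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce: combine all values for one intermediate key into a final pair.
    return word, sum(counts)

records = [(0, "hadoop mapreduce"), (1, "hadoop hdfs")]

# Shuffle: group intermediate values by key, as the framework does between stages.
grouped = defaultdict(list)
for key, line in records:
    for word, one in map_fn(key, line):
        grouped[word].append(one)

result = dict(reduce_fn(w, c) for w, c in grouped.items())
# result == {"hadoop": 2, "mapreduce": 1, "hdfs": 1}
```

The important point is the contract: Map emits intermediate pairs, the framework groups them by key, and Reduce sees each key exactly once together with all of its values.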
When the user submits a MapReduce job to Hadoop, the following components come into play:
The main role of the Job Client is to prepare the job for execution. When you submit a MapReduce job to Hadoop, the local Job Client validates the job configuration, computes the input splits, copies the job resources (the job JAR, configuration, and split information) to a shared location such as HDFS, and submits the job to the Job Tracker.
The Job Tracker is responsible for scheduling jobs, dividing a job into Map and Reduce tasks, distributing those tasks among the worker nodes, recovering from task failures, and tracking the job status. When preparing to run a job, the Job Tracker fetches the input splits computed by the Job Client, creates one Map task for each split, and creates the number of Reduce tasks defined by the job configuration.
The Job Tracker then assigns each Map task to a Task Tracker, monitoring the health of the Task Trackers and the progress of the job. Once a Map task completes and its results become available, the Job Tracker schedules the Reduce tasks that will consume those results.
A job is complete when all of its Map and Reduce tasks have finished successfully, or, if there is no Reduce stage, when no Map tasks remain in the queue.
A Task Tracker manages the tasks of one worker node and reports status to the Job Tracker. The Task Tracker usually runs on the worker node itself, although it is not required to be on the same host. When the Job Tracker assigns a Map or Reduce task to a Task Tracker, the Task Tracker fetches the job resources to the local node, launches a child JVM to execute the task, and reports progress back to the Job Tracker through periodic heartbeats.
The Hadoop MapReduce framework creates one Map task to process each input split. A Map task reads its input split as key-value pairs, invokes the map function once for each pair, partitions the intermediate output by key (one partition per Reduce task), sorts each partition, and writes the result to local disk for the Reduce tasks to fetch.
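One of these activities, routing each intermediate key to a Reduce task, is worth illustrating. The idea is the same as Hadoop's default hash partitioner, sketched here in plain Python (the function names and the deterministic hash are illustrative, not the real API):

```python
def stable_hash(key):
    # A simple deterministic hash (FNV-1a) so the example is reproducible;
    # Python's built-in hash() for strings varies between runs.
    h = 2166136261
    for b in key.encode("utf-8"):
        h = ((h ^ b) * 16777619) & 0xFFFFFFFF
    return h

def partition(key, num_reducers):
    """Route an intermediate key to a Reduce task; equal keys always go together."""
    return stable_hash(key) % num_reducers

# Intermediate (word, 1) pairs from a Map task, bucketed per Reduce task.
pairs = [("hadoop", 1), ("hdfs", 1), ("hadoop", 1)]
buckets = {r: [] for r in range(2)}
for key, value in pairs:
    buckets[partition(key, 2)].append((key, value))
```

Because partitioning depends only on the key, every occurrence of the same word lands in the same bucket, which is what guarantees that a single Reduce task sees all values for a given key.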
When a Map task notifies the Task Tracker of completion, the Task Tracker notifies the Job Tracker. The Job Tracker then makes the results available to the Reduce tasks.
The Reduce stage consolidates the results of the Map stage into the final results. Typically the final result set is smaller than the input set, but this is application dependent. The reduction is carried out by parallel Reduce tasks and generally proceeds in three phases: copy, sort, and merge. A Reduce task copies its partition of the intermediate output from each Map task, merges and sorts the fetched data by key, and then invokes the reduce function once per key, writing the final output to HDFS.
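The reduce-side phases above can be sketched in a few lines of Python (an in-memory stand-in for what Hadoop does across the network and on disk; the data and names are illustrative):

```python
from itertools import groupby
from operator import itemgetter

# Copy: this reducer fetches its partition of intermediate output
# from every completed Map task (here, two map outputs).
map_outputs = [
    [("hadoop", 1), ("mapreduce", 1)],
    [("hadoop", 1), ("hdfs", 1)],
]
copied = [pair for output in map_outputs for pair in output]

# Sort/merge: order all fetched pairs by key so equal keys become adjacent.
copied.sort(key=itemgetter(0))

# Reduce: invoke the reduce function once per key with all of its values.
final = {key: sum(v for _, v in group)
         for key, group in groupby(copied, key=itemgetter(0))}
# final == {"hadoop": 2, "hdfs": 1, "mapreduce": 1}
```

The sort step is what allows the reduce step to be a single pass: once equal keys are adjacent, each key and its full list of values can be handed to the reduce function in order.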
The present era is all about managing data and exploiting it. Data is growing at a massive rate, and handling it requires specialized tools. Hadoop has the capability to manage such Big Data, and MapReduce can be considered the core of the Hadoop system, since it enables Hadoop to process data in a highly resilient, efficient manner.