
Key Features & Components Of Spark Architecture

Spark has emerged as one of the most powerful and in-demand big data frameworks across major industries worldwide. It has become a robust complement to Hadoop, the original technology of choice for big data, thanks to its accessibility, power, and ability to handle big data challenges. Spark now has a user base of more than 225,000 members, with code contributions from over 500 people at 200 different companies. It has become the preferred framework of mainstream players such as Alibaba, Amazon, eBay, Yahoo, Tencent, and Baidu.

Rajiv Bhat, senior vice president of data science and marketplace at InMobi, said, “Spark is beautiful. With Hadoop, it would take us six-seven months to develop a machine learning model. Now, we can do about four models a day.” Spark is an open-source framework for real-time processing and has become one of the most active projects of the Apache Software Foundation. It is at present the undisputed market leader in the realm of big data processing, which makes becoming a Spark certified professional a career move with great scope.

Need for Spark

A huge amount of data is produced on the internet every day, and it has to be processed within seconds. This is made possible by the Spark real-time processing framework. Real-time processing has become an essential part of our lives, spanning industries such as banking (fraud detection), governmental surveillance systems, healthcare automation, and stock market predictions. Let us look at the domains which employ real-time data analytics in a major way:

  • Government: The biggest governmental use of real-time analytics lies in national security, as almost all nations must keep a live track of updates from both the military and police for any threat to national security.
  • Healthcare: Real-time analytics is very useful for checking the medical history of critical patients and for keeping track of blood and organ transplant availability. This is vital in extreme medical emergencies, where a delay of seconds can cost a life.
  • Banking: The world transacts its money through banking, so it is important to ensure that transactions are free from fraud, which real-time analytics helps detect as it happens.
  • Stock Exchanges: Stockbrokers use real-time analytics to predict stock movements. Many businesses also adapt their business models based on real-time analytics of the market demand for their brand.

Key Differences between Spark and Hadoop

The most common question businesses ask is why Spark was needed when Hadoop was already there. The answer lies in the distinction between batch and real-time processing: the former processes blocks of data that have been stored over a certain period, while the latter processes data as it arrives. Hadoop's MapReduce framework was a pathbreaking big data technology from 2005 until 2014, when Spark was introduced. Spark's main selling proposition was real-time speed, being up to 100 times faster than Hadoop's MapReduce framework. Thus, Hadoop works on the principle of batch processing of data stored over some time, while Spark enables real-time processing and solves critical use cases; even for batch processing, Spark is found to be up to 100 times faster.


Spark Features

Apache Spark is an open-source cluster computing framework used for real-time data processing. It has a thriving open-source community and is among the most ambitious projects of the Apache Software Foundation. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. It builds on Hadoop's MapReduce model and extends it to many more types of computation. Some of the salient features of Spark are:

  • Speed: As mentioned above, Spark can be up to 100 times faster than batch processing with Hadoop MapReduce. This is made possible by controlled partitioning: Spark manages data by means of partitions, which distribute data processing in parallel with minimal network traffic.
  • Machine Learning: Spark's machine learning component, MLlib, is robust in terms of data processing, as it eliminates the need to use separate tools for processing and for machine learning. It gives data engineers and data scientists a powerful, unified engine that is both fast and easy to use.
  • Polyglot: Spark provides high-level APIs in Java, Scala, R, and Python, so you can write Spark code in any of the four. It also offers shells in Scala and Python; the former is launched from the installation directory with ./bin/spark-shell and the latter with ./bin/pyspark.
  • Real-Time Computation: Spark offers real-time computation with low latency because of its in-memory processing. It is designed for huge scalability: documented users of Spark run production clusters of thousands of nodes, and it supports several computational models.
  • Lazy Evaluation: Spark puts off evaluation until it becomes strictly necessary, and this is one of the main factors contributing to its speed. Spark handles transformations by adding them to a DAG, or Directed Acyclic Graph, of computation; only when the driver requests data does the DAG actually get executed.
  • Integration with Hadoop: Spark is highly compatible with Hadoop, which is a gift for all the big data engineers who began their careers with Hadoop. Although Spark can serve as a replacement for Hadoop's MapReduce functions, it can also run on top of a Hadoop cluster, using YARN for resource scheduling.
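The lazy evaluation idea above can be sketched in a few lines of plain Python (this is a toy illustration of the concept, not Spark code): the transformations are only recorded, and nothing runs until a result is actually requested.

```python
# Toy illustration (plain Python, not Spark itself): lazy evaluation defers
# work until a result is requested, the way Spark defers transformations
# until an action triggers the DAG.

def lazy_map(func, data):
    # Record the transformation as a generator; nothing runs yet.
    return (func(x) for x in data)

def lazy_filter(pred, data):
    return (x for x in data if pred(x))

log = []

def traced_double(x):
    log.append(x)          # side effect lets us observe *when* work happens
    return x * 2

numbers = range(1, 6)
pipeline = lazy_filter(lambda x: x > 4, lazy_map(traced_double, numbers))

assert log == []           # no computation has happened yet ("lazy")

result = list(pipeline)    # the "action": forces the whole pipeline to run
assert result == [6, 8, 10]
assert log == [1, 2, 3, 4, 5]
```

Until `list(pipeline)` is called, no element is ever doubled; the pipeline exists only as a recorded plan, which is exactly what makes deferred execution cheap to set up and easy to optimize.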

Spark Architecture: Abstractions and Daemons

Spark has a well-defined, layered architecture in which all components and layers are loosely coupled and integrated with various extensions and libraries. The architecture rests on two primary abstractions:

  • Resilient Distributed Datasets (RDD): These are collections of data items divided into partitions, which can be stored in memory on the worker nodes of the Spark cluster. In terms of datasets, Spark supports two types of RDDs: Hadoop datasets, created from files stored on HDFS, and parallelized collections, based on existing Scala collections. RDDs support two types of operations: transformations and actions.
  • Directed Acyclic Graph (DAG): When every node is an RDD and every edge is a transformation over data, the sequence of computations performed on that data forms a DAG. The DAG eliminates the multi-stage execution model of Hadoop MapReduce and thereby provides performance enhancements over Hadoop. Directed signifies that a transformation changes a data partition's state from A to B, while acyclic means that transformations cannot loop back to an older partition.
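The split between transformations and actions can be made concrete with a small sketch (a toy class in plain Python, not the real RDD API): transformations append to a recorded lineage and return a new dataset object, while actions walk the lineage and actually compute.

```python
# Toy sketch (plain Python, not the real RDD API): transformations build
# a lineage of pending operations; actions walk the lineage and compute.

class ToyRDD:
    def __init__(self, data, lineage=None):
        self._data = data
        self._lineage = lineage or []        # recorded, not yet executed

    # --- transformations: return a new ToyRDD, run nothing ---
    def map(self, func):
        return ToyRDD(self._data, self._lineage + [("map", func)])

    def filter(self, pred):
        return ToyRDD(self._data, self._lineage + [("filter", pred)])

    # --- actions: execute the whole lineage and return a value ---
    def collect(self):
        items = list(self._data)
        for op, fn in self._lineage:
            if op == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

    def count(self):
        return len(self.collect())

rdd = ToyRDD(range(10))
evens_squared = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
assert evens_squared.collect() == [0, 4, 16, 36, 64]
assert evens_squared.count() == 5
```

Because the lineage is just data, it can be inspected, optimized, and (in real Spark) replayed to recompute a lost partition, which is where the "resilient" in RDD comes from.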

The Spark Architecture has two main daemons along with a cluster manager. It is basically a master/slave architecture. The two daemons are:

  • Master Daemon: It handles the Master/Driver Process
  • Worker Daemon: It handles the Slave Process

One Spark cluster has a single master and any number of slaves/workers. The driver and the executors run as individual Java processes, and users can run them on separate machines (a vertical cluster), in a mixed machine configuration, or together on the same horizontal Spark cluster.


Spark Architecture: Roles of Driver, Executor and Cluster Manager


1). Driver: The driver program is the central point that runs the main() function of the application. It is also the entry point of the Spark shells (Scala, Python, and R), and the SparkContext is created here. Components of the driver are:

  • DAGScheduler
  • TaskScheduler
  • BackendScheduler
  • BlockManager

All these are responsible for translating the user's Spark code into actual Spark jobs executed on the cluster. Roles of the driver are:

  • The driver program, running on the master node of the Spark cluster, schedules the execution of the job and negotiates with the cluster manager.
  • It translates the RDDs into the execution graph and divides that graph into more than one stage.
  • It also stores metadata about the RDDs and their partitions.
  • The driver converts a user application into smaller tasks, which are then executed by the executors, i.e., the worker processes that work on individual tasks.
  • It also exposes information about the running Spark application through a Web UI at port 4040.
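One of the driver's jobs above, dividing the execution graph into stages, can be sketched roughly (an analogy in plain Python, not Spark's actual DAGScheduler code): narrow operations that need no data movement are pipelined into one stage, and each wide (shuffle) operation starts a new stage. The operation names are illustrative.

```python
# Rough sketch (an analogy, not Spark's actual DAGScheduler) of how a driver
# might split a linear chain of operations into stages: narrow operations
# are pipelined together, and each wide (shuffle) operation opens a new stage.

NARROW = {"map", "filter", "union"}            # no data movement required
WIDE = {"groupByKey", "reduceByKey", "join"}   # shuffle boundary

def split_into_stages(operations):
    stages, current = [], []
    for op in operations:
        if op in WIDE and current:
            stages.append(current)   # close the stage before the shuffle
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages

plan = ["map", "filter", "reduceByKey", "map", "join", "filter"]
stages = split_into_stages(plan)
assert stages == [["map", "filter"], ["reduceByKey", "map"], ["join", "filter"]]
```

Fewer shuffle boundaries means fewer stages, which is why pipelining narrow transformations is such an important optimization.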

2). Executor: An executor is a distributed agent mainly responsible for executing tasks. Every Spark application gets its own executor processes, which run for the complete lifetime of the Spark application; this is known as the “Static Allocation of Executors.” Users can also choose dynamic allocation of executors, in which executors are added and removed to match the overall workload. Roles of the executor are:

  • It performs the data processing.
  • It reads data from and writes data to external sources.
  • It stores computation results in memory, on disk (HDD), or in cache.
  • It interacts with the storage systems.

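The static-versus-dynamic allocation choice mentioned above is controlled through Spark configuration properties. A sketch of what such a configuration fragment might look like in spark-defaults.conf is shown below; the property names come from Spark's standard configuration, while the values are illustrative placeholders, not recommendations.

```properties
# Fixed size per executor (applies in both allocation modes)
spark.executor.memory                 4g
spark.executor.cores                  2

# Dynamic allocation: let Spark add/remove executors with the workload
spark.dynamicAllocation.enabled       true
spark.dynamicAllocation.minExecutors  2
spark.dynamicAllocation.maxExecutors  20

# Dynamic allocation relies on the external shuffle service
spark.shuffle.service.enabled         true
```

With dynamic allocation off, the application keeps its initially requested executors for its whole lifetime, i.e., the static allocation described above.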
3). Cluster Manager: The choice of a cluster manager for a Spark application depends entirely on its goals, because different cluster managers provide different sets of scheduling capabilities; the standalone cluster manager is the easiest to use while developing a new Spark application. There are three types of cluster managers a Spark application can leverage for allocating and deallocating physical resources (such as CPU and memory) to client Spark jobs: the Spark standalone cluster manager, Apache Mesos, and Hadoop YARN.

Run-Time Architecture of Spark Application

When a client submits Spark application code, the driver implicitly converts it into a logical DAG. The driver then performs optimizations, such as pipelining transformations, and converts the logical DAG into a physical execution plan with a set of stages. From this physical execution plan it creates small physical execution units called tasks, which are bundled together and sent to the Spark cluster.

The driver then interacts with the cluster manager and negotiates for resources. The cluster manager launches the executors on the slave nodes, and the driver sends tasks to the executors based on data placement. The executors register with the driver program before execution begins, and the driver monitors them as the application runs; future tasks are likewise scheduled by the driver based on data placement. When the driver program's main() method exits, or when its stop() method is called, it terminates all the executors and releases them from the cluster manager.
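The run-time flow above, a driver splitting a job into per-partition tasks and farming them out to executors, can be mimicked loosely in plain Python with a thread pool (this is only an analogy, not Spark code; the partitioning and the `task` function are made up for illustration).

```python
# Loose analogy in plain Python (not Spark code): a "driver" breaks a job
# into independent per-partition tasks, hands them to a pool of "executors",
# and combines the partial results at the end.

from concurrent.futures import ThreadPoolExecutor

def task(partition):
    # One unit of work over one partition of the data.
    return sum(x * x for x in partition)

data = list(range(1, 101))
partitions = [data[i:i + 25] for i in range(0, len(data), 25)]  # 4 partitions

with ThreadPoolExecutor(max_workers=4) as pool:    # the "executor" pool
    partial_results = list(pool.map(task, partitions))

total = sum(partial_results)                        # "driver" combines results
assert total == sum(x * x for x in range(1, 101))   # 338350
```

The key property mirrored here is that each task touches only its own partition, so the work units are independent and can run on any available executor.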

Conclusion

Thus, Spark's architecture underlines its ease of use, accessibility, and ability to handle big data tasks. It has come to dominate Hadoop mainly because of its speed, and it finds usage in many industries. It takes Hadoop MapReduce to a completely new level with fewer shuffles in the processing of data. In-memory data storage and real-time processing enhance the system's efficiency by up to 100x, and lazy evaluation also contributes to its speed. It is wise to upskill in Spark, as it holds great career potential. For more insights on Spark and related technologies, please visit JanBasktraining.com.


    Janbask Training

    JanBask Training is a leading Global Online Training Provider through Live Sessions. The Live classes provide a blended approach of hands on experience along with theoretical knowledge which is driven by certified professionals.

