
Key Features & Components Of Spark Architecture

Spark has emerged as one of the most powerful and in-demand big data frameworks across major industries worldwide. It has become a robust complement to Hadoop, the original technology of choice for big data, thanks to its accessibility, power, and ability to handle big data challenges. Spark now has a user base of more than 225,000 members, with code contributions from over 500 people at 200 different companies. It has become the preferred framework of mainstream players such as Alibaba, Amazon, eBay, Yahoo, Tencent, and Baidu.

Rajiv Bhat, senior vice president of data science and marketplace at InMobi, said, “Spark is beautiful. With Hadoop, it would take us six-seven months to develop a machine learning model. Now, we can do about four models a day.” Spark is an open-source framework for real-time processing and has become one of the most active projects of the Apache Software Foundation. It is at present the undisputed market leader in the realm of big data processing, which makes becoming a Spark certified professional a career move with great scope.

Need for Spark

A huge amount of data is produced on the internet every day, and it has to be processed within seconds. This is made possible by the Spark real-time processing framework. Real-time processing has become an essential part of our lives, spanning industries such as banking (fraud detection), governmental surveillance systems, healthcare automation, and stock market predictions. Let us look at the domains which employ real-time data analytics in a major way:

  • Government: The biggest governmental use of real-time analytics lies in national security, as almost all nations must keep a live track of updates from both the military and police for any threat to national security.
  • Healthcare: Real-time analytics is very useful for checking the medical history of critical patients and for keeping track of blood and organ transplant availability. This is vital in extreme medical emergencies, where a delay of seconds can cost a life.
  • Banking: The world transacts its money through banking, so it is important to ensure that transactions are free from fraud, which real-time analytics helps detect as it happens.
  • Stock Exchanges: Stockbrokers use real-time analytics to predict stock movements. Many businesses also adapt their business models based on real-time analytics of the market demand for their brand.

Key Differences between Spark and Hadoop

The most common question businesses ask is why Spark was needed when Hadoop was already there. The answer lies in the distinction between batch and real-time processing: the former processes blocks of data that have been stored over a certain period, while the latter processes data as it arrives. Hadoop's MapReduce framework was a pathbreaking big data technology from 2005 until 2014, when Spark was introduced. Spark's main selling proposition was real-time speed, being up to 100 times faster than Hadoop's MapReduce framework. Thus, Hadoop works on the principle of batch processing of data stored over some time, while Spark enables real-time processing and solves critical use cases; even for batch processing, Spark is found to be up to 100 times faster.


Spark Features

Apache Spark is an open-source cluster computing framework used for real-time data processing. It has a thriving open-source community and is among the most ambitious projects of the Apache Software Foundation. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. It builds on Hadoop's MapReduce model and extends it to many more types of computation. Some of the salient features of Spark are:

  • Speed: As mentioned above, Spark can be up to 100 times faster than batch processing with Hadoop MapReduce. This is made possible by controlled partitioning: Spark manages data by means of partitions, which distribute data processing in parallel with minimal network traffic.
  • Machine Learning: Spark's machine learning component, MLlib, is robust in terms of data processing, as it eliminates the need to use separate tools for processing and for machine learning. It gives data engineers and data scientists a powerful, unified engine that is both fast and easy to use.
  • Polyglot: Spark provides high-level APIs in Java, Scala, R, and Python, so you can write Spark code in any of the four. It also offers shells in Scala and Python; the former is launched from the installation directory with ./bin/spark-shell and the latter with ./bin/pyspark.
  • Real-Time Computation: Spark offers real-time computation with low latency because of its in-memory processing. It is designed for huge scalability: documented users of Spark run production clusters of thousands of nodes, and it supports several computational models.
  • Lazy Evaluation: Spark puts off evaluation until it becomes strictly necessary, and this is one of the main factors contributing to its speed. Spark handles transformations by adding them to a DAG, or Directed Acyclic Graph, of computation; only when the driver requests data does the DAG actually get executed.
  • Integration with Hadoop: Spark is highly compatible with Hadoop, which is a gift for all the big data engineers who began their careers with Hadoop. Although Spark can serve as a replacement for Hadoop's MapReduce functions, it can also run on top of a Hadoop cluster, using YARN for resource scheduling.
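The lazy evaluation idea above can be sketched in a few lines of plain Python (this is a toy illustration of the concept, not Spark code): the transformations are only recorded, and nothing runs until a result is actually requested.

```python
# Toy illustration (plain Python, not Spark itself): lazy evaluation defers
# work until a result is requested, the way Spark defers transformations
# until an action triggers the DAG.

def lazy_map(func, data):
    # Record the transformation as a generator; nothing runs yet.
    return (func(x) for x in data)

def lazy_filter(pred, data):
    return (x for x in data if pred(x))

log = []

def traced_double(x):
    log.append(x)          # side effect lets us observe *when* work happens
    return x * 2

numbers = range(1, 6)
pipeline = lazy_filter(lambda x: x > 4, lazy_map(traced_double, numbers))

assert log == []           # no computation has happened yet ("lazy")

result = list(pipeline)    # the "action": forces the whole pipeline to run
assert result == [6, 8, 10]
assert log == [1, 2, 3, 4, 5]
```

Until `list(pipeline)` is called, no element is ever doubled; the pipeline exists only as a recorded plan, which is exactly what makes deferred execution cheap to set up and easy to optimize.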

Spark Architecture: Abstractions and Daemons

Spark has a well-defined, layered architecture in which all components and layers are loosely coupled and integrated with various extensions and libraries. The architecture rests on two primary abstractions:

  • Resilient Distributed Datasets (RDD): These are collections of data items divided into partitions, which can be stored in memory on the worker nodes of the Spark cluster. In terms of datasets, Spark supports two types of RDDs: Hadoop datasets, created from files stored on HDFS, and parallelized collections, based on existing Scala collections. RDDs support two types of operations: transformations and actions.
  • Directed Acyclic Graph (DAG): When every node is an RDD and every edge is a transformation over data, the sequence of computations performed on that data forms a DAG. The DAG eliminates the multi-stage execution model of Hadoop MapReduce and thereby provides performance enhancements over Hadoop. Directed signifies that a transformation changes a data partition's state from A to B, while acyclic means that transformations cannot loop back to an older partition.
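The split between transformations and actions can be made concrete with a small sketch (a toy class in plain Python, not the real RDD API): transformations append to a recorded lineage and return a new dataset object, while actions walk the lineage and actually compute.

```python
# Toy sketch (plain Python, not the real RDD API): transformations build
# a lineage of pending operations; actions walk the lineage and compute.

class ToyRDD:
    def __init__(self, data, lineage=None):
        self._data = data
        self._lineage = lineage or []        # recorded, not yet executed

    # --- transformations: return a new ToyRDD, run nothing ---
    def map(self, func):
        return ToyRDD(self._data, self._lineage + [("map", func)])

    def filter(self, pred):
        return ToyRDD(self._data, self._lineage + [("filter", pred)])

    # --- actions: execute the whole lineage and return a value ---
    def collect(self):
        items = list(self._data)
        for op, fn in self._lineage:
            if op == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

    def count(self):
        return len(self.collect())

rdd = ToyRDD(range(10))
evens_squared = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
assert evens_squared.collect() == [0, 4, 16, 36, 64]
assert evens_squared.count() == 5
```

Because the lineage is just data, it can be inspected, optimized, and (in real Spark) replayed to recompute a lost partition, which is where the "resilient" in RDD comes from.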

The Spark Architecture has two main daemons along with a cluster manager. It is basically a master/slave architecture. The two daemons are:

  • Master Daemon: It handles the Master/Driver Process
  • Worker Daemon: It handles the Slave Process

One Spark cluster has a single master and any number of slaves/workers. The driver and the executors run as individual Java processes, and users can run them on separate machines (a vertical cluster), in a mixed machine configuration, or together on the same horizontal Spark cluster.


Spark Architecture: Roles of Driver, Executor and Cluster Manager


1). Driver: The driver program is the central point that runs the main() function of the application. It is also the entry point of the Spark shells (Scala, Python, and R), and the SparkContext is created here. Components of the driver are:

  • DAGScheduler
  • TaskScheduler
  • BackendScheduler
  • BlockManager

All these are responsible for translating the user's Spark code into actual Spark jobs executed on the cluster. Roles of the driver are:

  • The driver program, running on the master node of the Spark cluster, schedules the execution of the job and negotiates with the cluster manager.
  • It translates the RDDs into the execution graph and divides that graph into more than one stage.
  • It also stores metadata about the RDDs and their partitions.
  • The driver converts a user application into smaller tasks, which are then executed by the executors, i.e., the worker processes that work on individual tasks.
  • It also exposes information about the running Spark application through a Web UI at port 4040.
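One of the driver's jobs above, dividing the execution graph into stages, can be sketched roughly (an analogy in plain Python, not Spark's actual DAGScheduler code): narrow operations that need no data movement are pipelined into one stage, and each wide (shuffle) operation starts a new stage. The operation names are illustrative.

```python
# Rough sketch (an analogy, not Spark's actual DAGScheduler) of how a driver
# might split a linear chain of operations into stages: narrow operations
# are pipelined together, and each wide (shuffle) operation opens a new stage.

NARROW = {"map", "filter", "union"}            # no data movement required
WIDE = {"groupByKey", "reduceByKey", "join"}   # shuffle boundary

def split_into_stages(operations):
    stages, current = [], []
    for op in operations:
        if op in WIDE and current:
            stages.append(current)   # close the stage before the shuffle
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages

plan = ["map", "filter", "reduceByKey", "map", "join", "filter"]
stages = split_into_stages(plan)
assert stages == [["map", "filter"], ["reduceByKey", "map"], ["join", "filter"]]
```

Fewer shuffle boundaries means fewer stages, which is why pipelining narrow transformations is such an important optimization.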

2). Executor: An executor is a distributed agent mainly responsible for executing tasks. Every Spark application gets its own executor processes, which run for the complete lifetime of the Spark application; this is known as the “Static Allocation of Executors.” Users can also choose dynamic allocation of executors, in which executors are added and removed to match the overall workload. Roles of the executor are:

  • It performs the data processing.
  • It reads data from and writes data to external sources.
  • It stores computation results in memory, on disk (HDD), or in cache.
  • It interacts with the storage systems.

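The static-versus-dynamic allocation choice mentioned above is controlled through Spark configuration properties. A sketch of what such a configuration fragment might look like in spark-defaults.conf is shown below; the property names come from Spark's standard configuration, while the values are illustrative placeholders, not recommendations.

```properties
# Fixed size per executor (applies in both allocation modes)
spark.executor.memory                 4g
spark.executor.cores                  2

# Dynamic allocation: let Spark add/remove executors with the workload
spark.dynamicAllocation.enabled       true
spark.dynamicAllocation.minExecutors  2
spark.dynamicAllocation.maxExecutors  20

# Dynamic allocation relies on the external shuffle service
spark.shuffle.service.enabled         true
```

With dynamic allocation off, the application keeps its initially requested executors for its whole lifetime, i.e., the static allocation described above.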
3). Cluster Manager: The choice of a cluster manager for a Spark application depends entirely on its goals, because different cluster managers provide different sets of scheduling capabilities; the standalone cluster manager is the easiest to use while developing a new Spark application. There are three types of cluster managers a Spark application can leverage for allocating and deallocating physical resources (such as CPU and memory) to client Spark jobs: the Spark standalone cluster manager, Apache Mesos, and Hadoop YARN.

Run-Time Architecture of Spark Application

When a client submits Spark application code, the driver implicitly converts it into a logical DAG. The driver then performs optimizations, such as pipelining transformations, and converts the logical DAG into a physical execution plan with a set of stages. From this physical execution plan it creates small physical execution units called tasks, which are bundled together and sent to the Spark cluster.

The driver then interacts with the cluster manager and negotiates for resources. The cluster manager launches the executors on the slave nodes, and the driver sends tasks to the executors based on data placement. The executors register with the driver program before execution begins, and the driver monitors them as the application runs; future tasks are likewise scheduled by the driver based on data placement. When the driver program's main() method exits, or when its stop() method is called, it terminates all the executors and releases them from the cluster manager.
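The run-time flow above, a driver splitting a job into per-partition tasks and farming them out to executors, can be mimicked loosely in plain Python with a thread pool (this is only an analogy, not Spark code; the partitioning and the `task` function are made up for illustration).

```python
# Loose analogy in plain Python (not Spark code): a "driver" breaks a job
# into independent per-partition tasks, hands them to a pool of "executors",
# and combines the partial results at the end.

from concurrent.futures import ThreadPoolExecutor

def task(partition):
    # One unit of work over one partition of the data.
    return sum(x * x for x in partition)

data = list(range(1, 101))
partitions = [data[i:i + 25] for i in range(0, len(data), 25)]  # 4 partitions

with ThreadPoolExecutor(max_workers=4) as pool:    # the "executor" pool
    partial_results = list(pool.map(task, partitions))

total = sum(partial_results)                        # "driver" combines results
assert total == sum(x * x for x in range(1, 101))   # 338350
```

The key property mirrored here is that each task touches only its own partition, so the work units are independent and can run on any available executor.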

Conclusion

Thus, Spark's architecture underlines its ease of use, accessibility, and ability to handle big data tasks. It has come to dominate Hadoop mainly because of its speed, and it finds usage in many industries. It takes Hadoop MapReduce to a completely new level with fewer shuffles in the processing of data. In-memory data storage and real-time processing enhance the system's efficiency by up to 100x, and lazy evaluation also contributes to its speed. It is wise to upskill in Spark, as it holds great career potential. For more insights on Spark and related technologies, please visit JanBasktraining.com.


    Janbask Training

    JanBask Training is a leading Global Online Training Provider through Live Sessions. The Live classes provide a blended approach of hands on experience along with theoretical knowledge which is driven by certified professionals.

