An Introduction to Apache Spark and Spark SQL

Spark SQL is a popular data processing tool among Big Data professionals. Both structured and semi-structured data can be processed easily with Spark SQL. Here, structured data is data that has a proper schema, such as Hive tables, JSON, Cassandra, or Parquet data with a pre-defined set of fields and records, while semi-structured data may not necessarily have a schema. Today Hadoop is used extensively by industry to analyze data; it relies on the MapReduce technique to provide a scalable, flexible, and cost-effective computing model without compromising the speed of data processing. Apache Spark was introduced to speed up the computation process of the Hadoop software.

It is often believed that Spark is a modified version of Hadoop, but this is not true: Spark has its own cluster management and computation engine. Spark can use Hadoop in two ways, one for storage and the other for processing, but since it manages clusters and computation itself, in practice it uses Hadoop mainly to store data. This article provides an introduction to Spark and discusses it in detail.

Evolution and Introduction to Spark

Apache Spark was developed in 2009 by Matei Zaharia at UC Berkeley's AMPLab. It was initially released as open-source technology under the BSD license, was donated to the Apache Software Foundation in 2013, and has since become one of the foundation's top projects. Apache Spark is a lightning-fast computing technology that extends the MapReduce model to handle demanding computations easily, including stream processing and interactive queries. One of its most advantageous features is in-memory cluster computing, which increases processing speed to a great extent. Spark also covers a wide range of workloads, including batch applications, iterative algorithms, streaming, and interactive queries.

Architecture of Apache Spark

Apache Spark can be built with Hadoop components, and there are three ways to deploy it (a short code sketch follows the three modes):

Hadoop YARN: Spark runs on YARN, which requires no pre-installation or root access. This integrates Spark into the Hadoop ecosystem or Hadoop stack and lets other components run on top of the stack.

Standalone: In standalone mode, Spark sits on top of HDFS (Hadoop Distributed File System), with space allocated for HDFS explicitly. Here, Spark and MapReduce jobs run side by side.

Spark in MapReduce (SIMR): SIMR is used to launch Spark jobs in addition to standalone deployment. With SIMR, a user can start Spark and use its shell without administrative access.
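The deployment mode is selected through the master URL passed when a Spark application starts. Here is a minimal PySpark sketch; the host name and port in the standalone URL are placeholders, not values taken from this article:

```python
from pyspark.sql import SparkSession

# Pick a master URL for the deployment mode described above:
#   "yarn"               - run on a Hadoop YARN cluster
#   "spark://host:7077"  - run against a standalone Spark cluster (placeholder host)
#   "local[*]"           - run locally on all available cores (handy for testing)
spark = (
    SparkSession.builder
    .appName("deployment-mode-demo")
    .master("local[*]")  # swap in "yarn" or "spark://host:7077" as needed
    .getOrCreate()
)

print(spark.sparkContext.master)
spark.stop()
```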

Apache Spark Components

Apache Spark has the following components:

Apache Spark Core: Spark Core is the underlying execution engine of the Spark platform, upon which all other functionality is built. It provides in-memory computing, which is one of Spark's biggest advantages (see the RDD sketch after this list).

Spark SQL: A component on top of Spark Core that introduces a new data abstraction called SchemaRDD, which provides support for structured and semi-structured data.

Spark Streaming: Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD transformations on those mini-batches.

MLlib: A distributed machine learning framework on top of Spark. MLlib is reported to be nine times faster than the Hadoop disk-based version of Apache Mahout.

GraphX: A distributed framework for processing graphs. It provides an API for expressing graph computations, known as the Pregel abstraction API, along with an optimized runtime for this abstraction.
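To make Spark Core's RDD abstraction concrete, here is a minimal PySpark sketch; the numbers are purely illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").master("local[*]").getOrCreate()
sc = spark.sparkContext

# Build an RDD from an in-memory collection and cache it, so repeated
# actions reuse the in-memory copy instead of recomputing it.
numbers = sc.parallelize(range(1, 101)).cache()

evens = numbers.filter(lambda n: n % 2 == 0)  # transformation (lazy)
total = evens.map(lambda n: n * n).sum()      # action (triggers computation)

print(total)
spark.stop()
```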

Features of Spark SQL

Spark SQL is Spark's module for processing structured data. It provides a programming abstraction called the DataFrame and can act as a distributed SQL query engine. Spark SQL has the following features:

Integrated: Spark programs and SQL queries mix seamlessly. Spark SQL lets the user query structured data inside Spark programs as a distributed dataset, known as an RDD in Spark, with dedicated APIs for Java, Scala, and Python. Thanks to these APIs, SQL queries can easily run alongside complex analytic algorithms (see the sketch after this list).

Hive Compatibility: Unmodified Hive queries can be run on existing warehouses. Spark SQL reuses the Hive frontend and metastore, which makes it fully compatible with existing Hive data, queries, and UDFs.

Scalability: Spark SQL takes advantage of the RDD model, supporting mid-query fault tolerance and scaling to large jobs, so a different engine is not needed for historical data.

Unified Data Access: Data can be loaded and queried from a variety of sources. SchemaRDDs provide a single interface for working efficiently with structured data, including Apache Hive tables, Parquet files, and JSON files.

Standard Connectivity: You can connect through JDBC or ODBC; Spark SQL includes a server mode with standard JDBC and ODBC connectivity.
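Here is a minimal PySpark sketch of the integration described above, using a small in-memory dataset; the names and values are purely illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").master("local[*]").getOrCreate()

# Build a DataFrame from in-memory rows and expose it to SQL as a temp view.
people = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Cara", 29)],
    ["name", "age"],
)
people.createOrReplaceTempView("people")

# Mix a SQL query with the DataFrame API in the same program.
adults = spark.sql("SELECT name, age FROM people WHERE age >= 30")
adults.orderBy("age").show()

spark.stop()
```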

Architecture of Spark SQL

Spark SQL has the following architecture, which includes three layers: Language API, Schema RDD, and Data Sources.

The three layers of Spark SQL have the following functions:

Language API: Spark SQL is compatible with many languages, such as Python, Java, Scala, and HiveQL.

Schema RDD: Spark Core is designed around a special data structure called the RDD. Spark SQL works on schemas, tables, records, and fields, so its Schema RDD can be used like a temporary table; this Schema RDD is called a DataFrame.

Data Sources: The usual data sources for Apache Spark are text files, Avro files, and so on, but the data sources for Spark SQL are different. They include JSON documents, Parquet files, Hive tables, the Cassandra database, and others (a reader sketch follows this list).
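As a minimal sketch of the Data Sources layer in PySpark, the snippet below writes a small DataFrame in two formats and reads it back; the paths are illustrative temp locations, not files referenced by this article:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("datasources-demo").master("local[*]").getOrCreate()

# Create a small DataFrame and persist it in two of the formats
# Spark SQL reads natively.
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
df.write.mode("overwrite").json("/tmp/people_json")
df.write.mode("overwrite").parquet("/tmp/people_parquet")

# Each reader returns a DataFrame with the same API, whatever the source.
spark.read.json("/tmp/people_json").show()
spark.read.parquet("/tmp/people_parquet").show()

spark.stop()
```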

Data Frames of Spark SQL

A DataFrame is a distributed collection of data organized into named columns. It is conceptually like a table in a relational database, but with richer optimization under the hood. A Spark DataFrame can be constructed from various data sources, such as external databases, structured data files, and existing RDDs. Spark DataFrames have the following features (a construction sketch follows the list):

  • DataFrames can process data ranging from kilobytes on a single node to petabytes on a large cluster.
  • DataFrames support various data formats, such as CSV, Avro, and Elasticsearch, as well as various storage systems, including HDFS, Hive tables, and MySQL.
  • The Spark SQL Catalyst optimizer handles query optimization and code generation.
  • DataFrames integrate easily with Big Data tools through Spark Core.
  • They provide APIs for the Java, Python, R, and Scala languages.
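The following is a minimal sketch of constructing a DataFrame from an existing RDD, one of the sources mentioned above; the rows and column names are illustrative:

```python
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("dataframe-demo").master("local[*]").getOrCreate()
sc = spark.sparkContext

# Start from a plain RDD of Row objects, then convert it into a
# DataFrame with named columns.
rows = sc.parallelize([
    Row(product="book", price=12.5),
    Row(product="pen", price=1.2),
    Row(product="book", price=8.0),
])
df = spark.createDataFrame(rows)

df.printSchema()
df.groupBy("product").sum("price").show()

spark.stop()
```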

Conclusion

Nowadays, Hadoop is used by a large number of organizations. Since data processing is the key to success in this era of online business, developers need efficient tools to process data. Apache Spark speeds up data processing in a distributed environment, which is why its popularity keeps growing. With its many features, Apache Spark is one of the preferred tools for performing SQL operations using DataFrames.
