How to Compare Hive, Spark, Impala and Presto?

Introduction

Spark, Hive, Impala and Presto are all SQL-based engines, and many Hadoop users are unsure which of them to select for querying and managing their data. Impala is developed and shipped by Cloudera. Presto is an open-source distributed SQL query engine designed to run SQL queries over data sets of up to petabyte scale; it was created at Facebook.

Spark SQL is a distributed in-memory computation engine with high in-memory processing power. Hive was introduced as a query engine by Facebook and later became an Apache project. It made the job of data engineers easier by letting them write ETL jobs on structured data in a SQL-like language. Before the launch of Spark, Hive was considered one of the leading and fastest query engines on Hadoop.

Spark now supports Hive as well, so Hive tables can also be accessed through Spark. Impala is likewise a SQL query engine built on top of Hadoop. Impala queries are not translated into MapReduce jobs; instead, they are executed natively.

This was a brief introduction to Hive, Spark, Impala and Presto. In the next section, we will look at a functional description of each of these SQL query engines, and after that we will cover the differences between them in terms of their pros, cons and recommended usage.

Difference Between Hive, Spark, Impala and Presto

Having introduced Presto, Hive, Impala and Spark, let us look at the functional properties of each engine in turn.

What is Hive?

Apache Hive is data warehouse software that facilitates querying and managing large datasets residing in distributed storage. It is built on top of Apache Hadoop and uses the Hadoop Distributed File System (HDFS) as its backend storage. Hive makes the following tasks easier:

  • Ad-hoc queries
  • Data encapsulation
  • Analysis of huge datasets

Hive Characteristics

  • In Hive, database tables are created first and then data is loaded into these tables
  • Hive is designed for managing and querying structured data stored in tables
  • MapReduce lacks usability and optimization features, but Hive provides them. Its query optimization lets queries execute efficiently
  • Hive's SQL-inspired language reduces the complexity of MapReduce programming and reuses familiar database concepts such as rows, columns and schemas
  • Hive uses a directory structure to partition data, which improves performance for queries that touch only some partitions
  • Most interaction with Hive takes place through the CLI (command line interface), and HQL (Hive Query Language) is used to query the database
  • Hive supports several file formats, including TEXTFILE, SEQUENCEFILE, RCFILE and ORC
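
The directory-based partitioning mentioned in the list above can be pictured with a small plain-Python sketch. It assumes the common `key=value` directory naming for partitions; the table root `/warehouse/sales` and the helper names `partition_path` and `prune` are hypothetical and exist only for illustration, they are not part of Hive.

```python
from pathlib import PurePosixPath


def partition_path(table_root, **partition_values):
    """Build a partition directory such as root/year=2024/month=9."""
    path = PurePosixPath(table_root)
    for key, value in partition_values.items():
        path = path / f"{key}={value}"
    return str(path)


def prune(paths, **wanted):
    """Keep only the directories whose key=value segments match the filter."""
    selected = []
    for p in paths:
        segments = dict(
            seg.split("=", 1) for seg in PurePosixPath(p).parts if "=" in seg
        )
        if all(segments.get(k) == str(v) for k, v in wanted.items()):
            selected.append(p)
    return selected


# Hypothetical table with three partitions.
ALL_PARTITIONS = [
    partition_path("/warehouse/sales", year=2023, month=12),
    partition_path("/warehouse/sales", year=2024, month=8),
    partition_path("/warehouse/sales", year=2024, month=9),
]
```

A query that filters on `year = 2024` only needs to read the directories returned by `prune(ALL_PARTITIONS, year=2024)`, which is why partitioning can cut the amount of data scanned.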

Three core parts of Hive

  • Hive Clients
  • Hive Services
  • Hive Storage and Computing

Apache Hive Architecture

Hive communicates with different applications through different drivers: Java-based applications use JDBC drivers, and other applications use ODBC drivers. These clients and drivers in turn communicate with the Hive server, which is part of the Hive services, and client queries are resolved through those services.

Here the CLI (command line interface) acts as a Hive service for data definition language (DDL) operations. Requests from the different applications are processed by the driver and forwarded to the metastore and the file system for further processing.

Hive services such as the job client, the file system and the metastore communicate with Hive storage and are used to perform the following operations:

  • Metadata about the tables created in Hive is stored in the Hive metastore database
  • Data and query results are loaded into tables, which are stored in the Hadoop cluster on HDFS

Hive runs in either local mode or MapReduce mode. If the data is small, or Hadoop is running in pseudo-distributed mode on a single node, local mode is used because it processes small datasets faster. For large amounts of data spread across multiple nodes, MapReduce mode is used because it gives better performance.
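
The choice between the two modes can be sketched as a simple size check. Hive has an automatic local-mode setting (`hive.exec.mode.local.auto`), but the function below is only an illustration of the idea: the name `choose_execution_mode` and the threshold values (128 MB of input, 4 input files, at most one reducer) are assumptions made for this sketch, not values taken from this article.

```python
def choose_execution_mode(input_bytes, input_files, reducers,
                          max_bytes=128 * 1024 * 1024, max_files=4):
    """Return "local" for small jobs and "mapreduce" otherwise.

    The thresholds are illustrative assumptions, not authoritative defaults.
    """
    is_small = (
        input_bytes <= max_bytes
        and input_files <= max_files
        and reducers <= 1
    )
    return "local" if is_small else "mapreduce"
```

For example, a 10 MB job over two files would be routed to local mode, while a multi-gigabyte job over hundreds of files would go to MapReduce mode.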

What is Impala?

Impala is an open-source, massively parallel processing (MPP) SQL engine launched by Cloudera in 2012. It queries data stored in clusters of computers running Apache Hadoop.

Hadoop programmers can run SQL queries on Impala efficiently because it does not move or transform data before processing it. The engine is easy to adopt: Impala uses the same data formats, metadata, file security and resource management frameworks as MapReduce.

It keeps the familiar qualities of Hadoop and also supports multi-user environments. Two of Impala's most useful qualities are listed below:

1). Columnar storage

2). Tree architecture
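
To see why column-oriented storage helps analytical queries, the following plain-Python sketch contrasts a row layout with a column layout. It is only an analogy under simple assumptions: the sample data and the helper names `to_columns`, `total_from_rows` and `total_from_columns` are hypothetical and do not reflect Impala's internal data structures.

```python
SAMPLE_ROWS = [
    {"id": 1, "city": "Pune", "amount": 10.0},
    {"id": 2, "city": "Oslo", "amount": 4.5},
    {"id": 3, "city": "Lima", "amount": 7.0},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_from_rows(rows, name):
    """Row layout: every full record is visited to read one field."""
    return sum(row[name] for row in rows)


def total_from_columns(columns, name):
    """Column layout: only the requested column is read."""
    return sum(columns[name])
```

Both functions return the same total, but the columnar version touches only the `amount` values, which is the property that makes column storage attractive for aggregations over a few columns of a wide table.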

Some of the popular features of Impala:

  • Support for Apache HBase storage and HDFS (Hadoop Distributed File System)
  • Support for Hadoop security through Kerberos authentication
  • It uses the same metadata, SQL syntax and ODBC driver as Apache Hive
  • It reads Hadoop file formats such as RCFile, Parquet, LZO and SequenceFile
  • Role-based authorization with Apache Sentry

Within two years of its launch, Impala became one of the leading SQL engines on Hadoop. Amazon Web Services and MapR have both added support for Impala.

What is Spark?

Apache Spark is one of the most popular SQL engines, but it is also a general-purpose data processing engine. On top of the core Spark data processing engine there are additional libraries for graph computation, machine learning and stream processing, and these libraries can be used together in one application. Spark supports application development in languages such as Scala, Java, Python and R.

Driver Program Application 

A Spark application runs as a set of independent processes that are coordinated by the SparkSession object in the driver program. The cluster or resource manager assigns tasks to workers, one task per partition. A task applies its unit of work to the dataset in its partition and outputs a new partition dataset. Final results are either saved to disk or sent back to the driver application.
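
The flow described above (a driver splits the data, tasks transform one partition each, and results return to the driver) can be imitated in plain Python. This is an analogy only and does not use the Spark API: the thread pool stands in for the workers, and the names `split_into_partitions`, `run_task` and `driver` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Cut the dataset into at most num_partitions contiguous chunks."""
    size = max(1, -(-len(data) // num_partitions))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def run_task(partition):
    """One task: apply the unit of work to a partition, yielding a new one."""
    return [x * x for x in partition]


def driver(data, num_partitions=3):
    """Coordinate the tasks and gather the results back in the driver."""
    partitions = split_into_partitions(data, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as workers:
        new_partitions = list(workers.map(run_task, partitions))
    # Flatten the new partitions into the final result.
    return [x for part in new_partitions for x in part]
```

Calling `driver([1, 2, 3, 4, 5, 6, 7])` squares each element partition by partition and returns the combined list in the original order.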

Spark can handle petabytes of data and process it in a distributed manner across a cluster of thousands of cooperating physical or virtual servers. Spark is used for a variety of applications, such as:

  • Stream Processing
  • Machine Learning
  • Interactive Analytics
  • Data Integration

Many users choose Spark for its speed, simplicity and support. Its capabilities are exposed through a rich set of APIs designed for interacting with data quickly and easily. The Apache Spark community is large and supportive, so answers to questions are usually quick to find.

What is Presto?

Presto is an open-source distributed SQL query engine for running interactive analytical queries. It can handle data sources of any size, from gigabytes to petabytes. Presto was created at Facebook for interactive analytics: it was designed to approach the speed of commercial data warehouses while scaling to the size of organizations like Facebook.

Presto runs on a cluster of machines. A Presto setup includes a coordinator and multiple workers. Clients submit queries to the coordinator, which analyzes each query and creates its execution plan. The processing is then distributed among the workers.

Hive Metastore

When working with terabytes or petabytes of data, users typically need several tools to interact with Hadoop and HDFS. Presto is an alternative to tools such as Hive and Pig, which query HDFS through pipelines of MapReduce jobs. Presto is not limited to HDFS: it can also operate over other kinds of data sources, such as Cassandra and traditional relational databases.

Features of Presto

  • It can query data where it lives, whether that is Hive, Cassandra, relational databases or proprietary data stores
  • A single query can combine data from multiple data sources
  • Its response times are fast, so queries can be resolved quickly without an expensive commercial solution
  • It uses vectorized columnar processing
  • It has pipelined execution
  • Its architecture is simple and extensible
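
The idea of one query combining data from several sources can be illustrated with Python's built-in `sqlite3` module. This is only an analogy, not Presto itself: two separate in-memory SQLite databases stand in for two data sources, and the schema names `main` and `crm`, the tables and the sample rows are all hypothetical.

```python
import sqlite3


def federated_totals():
    """Join tables that live in two separate databases with a single query."""
    conn = sqlite3.connect(":memory:")
    # Attach a second, independent in-memory database as the "crm" source.
    conn.execute("ATTACH DATABASE ':memory:' AS crm")
    conn.execute("CREATE TABLE main.orders (customer_id INTEGER, amount REAL)")
    conn.execute("CREATE TABLE crm.customers (id INTEGER, name TEXT)")
    conn.executemany(
        "INSERT INTO main.orders VALUES (?, ?)",
        [(1, 20.0), (1, 5.0), (2, 7.5)],
    )
    conn.executemany(
        "INSERT INTO crm.customers VALUES (?, ?)",
        [(1, "Ada"), (2, "Linus")],
    )
    rows = conn.execute(
        """
        SELECT c.name, SUM(o.amount)
        FROM main.orders AS o
        JOIN crm.customers AS c ON c.id = o.customer_id
        GROUP BY c.name
        ORDER BY c.name
        """
    ).fetchall()
    conn.close()
    return rows
```

In Presto the same pattern is expressed by qualifying each table with the data source it comes from, so the join can span, for instance, a Hive table and a Cassandra table.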

Facebook uses Presto every day to run queries over petabytes of data, spanning several internal data stores. Presto also supports pluggable connectors that provide data for queries. It supports connectors such as:

  • TPC-H
  • Cassandra
  • Hadoop/Hive

Presto is used in production at companies such as Facebook, Teradata and Airbnb. It supports standard ANSI SQL, which makes it easy for data analysts and developers to use. Presto is written in Java, but it is designed to avoid typical Java issues related to memory allocation and garbage collection. It also has a Hadoop-friendly connector architecture.

Pros and Cons of Impala, Spark, Presto & Hive

1). Cloudera Impala

As already discussed, Impala is a massively parallel processing engine; it is written in C++. It is shipped by MapR, Oracle, Amazon and Cloudera. Impala has the following pros and cons:

Pros and Cons of Impala

Impala Pros

1). Real-time query execution on data stored in Hadoop clusters
2). The absence of MapReduce makes it faster than Hive
3). It supports enterprise installation backed by Cloudera
4). It uses HiveQL and SQL-92, so it is easy for data analysts and RDBMS professionals to use

Impala Cons

1). Impala supports only the RCFile, Parquet, Avro and SequenceFile file formats
2). It supports only Cloudera's CDH, AWS and MapR platforms

2). Apache Hive

Apache Hive is an open-source query engine written in Java that is used for analyzing, summarizing and querying data stored in the Hadoop file system. It was initially developed by Facebook and later became an open-source project.

Pros and Cons of Hive

Hive Pros

1). It is a stable query engine
2). Hive is an open-source engine with a vast community
3). It uses the SQL-like HiveQL language, which is easy for RDBMS professionals to understand
4). It supports the ORC, Text File, RCFile, Avro and Parquet file formats

Hive Cons

1). Hive uses MapReduce for query execution, which makes it relatively slow compared to Cloudera Impala, Spark or Presto
2). It can process only structured data, so it is not recommended for unstructured data

3). Apache Spark

Spark is a cluster computing framework that can be used with Hadoop. It is written in Scala and was introduced at UC Berkeley. Apache Spark is bundled with Spark SQL, Spark Streaming, MLlib and GraphX, which lets it work as a complete framework on Hadoop.

Pros and Cons of Spark

Spark Pros

1). Spark is a fast query execution engine that can execute batch queries as well. It is reported to be 10-100 times faster than Hive with MapReduce
2). Spark is fully compatible with Hive data, queries and UDFs (User Defined Functions)
3). Spark APIs are available in various languages such as Java, Python and Scala, so application programmers can write code easily
4). Apache Spark has larger community support than Presto

Spark Cons

1). Spark requires a lot of RAM, which increases the cost of using it
2). Spark is still under heavy development, so it cannot yet be considered a stable engine

4). Presto

Presto is also an open-source, massively parallel processing system. It was developed by Facebook to execute SQL queries on data stored in Hadoop, so that queries run at high speed irrespective of the volume, velocity and variety of the data involved. Presto is currently backed by Teradata, and Airbnb, Netflix, Uber and Dropbox use it for query execution.

Pros and Cons of Presto

Presto Pros

1). Presto supports the ORC, Parquet and RCFile formats, so it is considered a good query engine that also eliminates the need for data transformation
2). Presto works well with Amazon S3 queries and storage. It can query data from any data source in seconds, even at petabyte scale
3). The open-source Presto community provides good support, which also indicates that plenty of users are using Presto
4). Enterprise support for Presto is provided by Teradata, which is itself a big data marketing and analytics applications company

Presto Cons

1). Deploying Presto is not advisable unless you are experienced and confident in your ability to implement it, or you decide to work with Teradata for debugging and support
2). Because it does not have its own storage layer, insert and write queries on HDFS are not supported

Recommended Usage

When selecting one of these query engines, consider the following points:

Impala can be the best choice for interactive BI-like workloads. Impala queries have the lowest latency, so choose Impala when reducing query latency is the priority, especially for concurrent executions.

Hive can also serve multiuser workloads, but its main strength is ETL and batch processing, and that is where you should choose it. Hive can reduce the time required for query processing, but not enough to make it a suitable choice for BI.

With Spark SQL, users can selectively use SQL constructs when writing queries for Spark pipelines. The reason to choose Spark is that Spark SQL reuses the Hive metastore and frontend and is fully compatible with existing Hive queries, data and UDFs. A cost-based query optimizer, code generation and columnar storage increase Spark's query execution speed.

Presto leads in BI-type queries, whereas Spark is mainly used for performance-intensive queries. Support for concurrent query workloads is critical for BI, and Presto performs well there. Choose Presto for concurrent query execution under increased workloads.

The right SQL engine depends entirely on your requirements. We have listed some of the commonly used and beneficial features of each engine here; whether you choose Presto, Spark, Hive or Impala depends on your technical specifications and the features you need.
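
The recommendations above can be condensed into a small lookup, shown here as a rough sketch. The workload labels and the function name `recommend_engine` are hypothetical names chosen for this illustration; the mapping simply restates the guidance in this section and is not a substitute for benchmarking on your own data.

```python
# Workload label -> engine suggested in the section above.
RECOMMENDATIONS = {
    "interactive_bi": "Impala",
    "etl_batch": "Hive",
    "sql_in_spark_pipelines": "Spark SQL",
    "concurrent_bi_queries": "Presto",
}


def recommend_engine(workload):
    """Return the suggested engine, or raise ValueError for unknown labels."""
    try:
        return RECOMMENDATIONS[workload]
    except KeyError:
        known = ", ".join(sorted(RECOMMENDATIONS))
        raise ValueError(
            f"unknown workload {workload!r}; expected one of: {known}"
        ) from None
```

For example, `recommend_engine("etl_batch")` returns `"Hive"`, matching the advice to use Hive for ETL and batch processing.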

Conclusion

If you are not sure which SQL query engine to select, go through the detailed comparison above. The specific properties and features listed for each engine should make it easier to choose the appropriate one. Hive, as a MapReduce-based engine, suits batch workloads where slower processing is acceptable, while for fast query processing you can choose either Impala or Spark. Spark in particular is favored by many users for its performance.



    JanBask Training

    A dynamic, highly professional, and a global online training course provider committed to propelling the next generation of technology learners with a whole new way of training experience.

