International Womens Day : Flat 30% off on live classes + 2 free self-paced courses - SCHEDULE CALL

Select Course
Blog
Corporate Training

+1 202 599 3842

(4.8/5 ) | 1.5K+ Ratings

- Hadoop Blogs -

Your Complete Guide to Apache Hive Data Models

Apache Hive is basically an open source data warehouse system which can handle, query, or analyze large datasets and process structured/non-structured data in Hadoop. Apache Hive is built on Hadoop big data platform.

This article discusses the most important data source of HIVE which is Hive tables. Hive has made the task of writing complex Map-Reduce jobs easier and it is written in HiveQL language. HiveQL is very much similar to Structured Query Language.

Apache Hive has a number of beneficial features and a few of them are listed below:

The user can perform data querying, summarization, and analysis in an easier manner.
As it supports the concepts of external tables so the user need not store data actually in HDFS to process it.
It can fit even in Hadoop’s low-level interface perfectly.
Data partitioning is also supported by Apache Hive and therefore the performance is improved up to great extent.
The logical plans are optimized by rule-based optimizer.
Hive is scalable, familiar and extensible.
The user need not possess any experience of programming language, only knowledge of SQL is enough.
Structured data can be easily processed in Hive just like in SQL.
Ad-hoc queries can also be processed easily in Hive.

A Detailed Discussion on Apache Hive Data Models

As you know that what is Hive now, here is the time to discuss Hive data model. Following diagram shows the Hive data model:

Imagecourtesy:data-flair.com

Hive, which is an open source data warehouse and built on the top of Hadoop can analyze and store even large datasets, stored in Hadoop files. Apache Hive can store data in following three forms:

Table
Partition
Bucket

Further, we will discuss all three briefly for your reference and why it is important for data storage.

Tables

Apache Hive tables are similar to relational database tables. Tables of Hive consist of data and are their layout is described with the help of associated metadata. Filter, join, project and union operations can be performed on these tables.

Usually, in Hadoop, the data are stored in HDFS but Hive stores metadata in a relational database rather than HDFS. Two types of tables exist is Hive:

Managed or Internal Tables
External Tables.

Managed or Internal Tables

Read: Hadoop HDFS Commands Cheat Sheet

Managed Tables of Hive are also called internal tables and are the default tables. If without specifying the type user develop this table, then it will be of an internal type.

All managed tables are created or stored in HDFS and the data of the tables are created or stored in the /user/hive/warehouse directory of HDFS.

If one deletes the table then both the table data and metadata are deleted from HDFS. Following syntax is used to create any managed table:

Create table tablename(Var String, Var Int) row format delimited fields terminated by ‘;’;

Through above command, we can create a managed table. Toload the data in the table created by above command we will have to use the following command:

Load data local in ‘path of the file’ into table tablename;

While if you want to see or check the data of any particular managed table then the following syntax can be used for that:

Hadoop dfs –ls hdfs://localhost:9000/user/hive/warehouse/tablename

In order to delete the table, you can use the following command -

Drop table tablename

While creating any managed or internal table the delimiters ‘;’ should be used to terminate any field. The location of the table can be overwritten by using LOCATION command. Here in Hive tables, the data can be loaded from two locations one is HDFS and other is a local file system.

The user can also create the virtual tables through views and they are faster than creating actual tables. Views can be used just as actual tables and be used in other queries as well. Indexing can also be done for table data to speed up query processing.

Read: Chief Elements Of A Professional Hadoop Resume

The columns which are commonly used in querying the table data should be indexed. Indexing columns can make the query processing faster. Dropping a table can delete the data from HDFS.

External Table

External tables are used for external use means when the table data resides outside Hive then these tables are used. These tables are used generally when you want to delete metadata from the table and keep the table data as it is. While deletion only table schema gets deleted. Following are the syntaxe used for external tables: To create the table, we use the following command:

Create external table tablename(Var String, Var Int) row format delimited fields terminated by ‘;’;

To check the table data following command is used:

Load data local inpath ‘path of the file’ into table tablename;

To see or check the table data following command can be used for external tables:

Hadoop dfs –ls hdfs://localhost:9000/user/hive/warehouse/tablename

To delete or drop the table following command can be used:

Drop table tablename; When to use Internal or External Tables?

The user can use any of the tables as per the requirement, but there are a few considerations to choose the type of table. Internal Tables must be used when:

Temporary data is to be used
When no external data is there and Hive will use only internal data
You do not want table data after deletion

External Tables must be used when:

Read: Difference Between Apache Hadoop and Spark Framework

The data is also used externally means if the data files are processed and used by any program that does not lock the files
There is no control over data settings like ‘dirs accessing’ and you have another program to perform these operations
You are creating a fresh table not dependent on an existing table
You can access table schema and create the table along with pointing its data location.

Partition

Table of Hive is organized in partitions by grouping same types of data together based on any column or partition key. Every table has a partition key for identification. Partitions can speed up query and slicing process. Following syntax is used to create a partition:

Create Table tablename(var String,var Int) Partitioned BY (partition1 String, partition2 String);

Partitioning increases the speed of querying by reducing its latency as it only scans relevant data rather than scanning full data.

Buckets

Tables or partitions can again be sub-divided into buckets in Hive for that you will have to use the hash function. Following syntax is used to create table buckets:

Create Table tablename Partitioned BY (partition1 data_type, partition2 data_type ) Clustered BY (column1, column2) Sorted BY(column_name Asc:Desc,___) Into num_buckets Buckets;

Hive buckets are just files in the table directory which may be partitioned or un-partitioned. You can even choose n buckets to partition the data.

Final Words:

From the above discussion, this is clear that Hive data can be categorized into three types on the granular level one is a table, other is a partition and the last is a bucket. The user can use these data models as per his needs and requirements. Hive is a popular database tool for Hadoop developers and has many features apart from internal and external tables.

The concept of tables used in Hive has increased its productivity and the developers can not only use these tables as per their convenience even this concept has increased query processing up to great extent.

Hive can be used even by non-programmers as it has a simple syntax for creating and accessing the database. So, if you are planning to learn Hadoop or Hive then join JanBask’s Training and certification program right away to accelerate your learning.

Read: Key Features & Components Of Spark Architecture

FaceBook

Twitter

JanBask Training Team

The JanBask Training Team includes certified professionals and expert writers dedicated to helping learners navigate their career journeys in QA, Cybersecurity, Salesforce, and more. Each article is carefully researched and reviewed to ensure quality and relevance.

Comments

Hadoop Course
Upcoming Batches

Jul

Mon - Fri

6 Weeks

Jul

Mon - Fri

6 Weeks

Aug

Mon - Fri

6 Weeks

Aug

Mon - Fri

6 Weeks

View Detail

Trending Courses

Cyber Security

Introduction to cybersecurity
Cryptography and Secure Communication
Cloud Computing Architectural Framework
Security Architectures and Models

Upcoming Class

20 days 02 Aug 2025

View Details

Introduction and Software Testing
Software Test Life Cycle
Automation Testing and API Testing
Selenium framework development using Testing

Upcoming Class

13 days 26 Jul 2025

View Details

Salesforce

Salesforce Configuration Introduction
Security & Automation Process
Sales & Service Cloud
Apex Programming, SOQL & SOSL

Upcoming Class

1 day 14 Jul 2025

View Details

Business Analyst

BA & Stakeholders Overview
BPMN, Requirement Elicitation
BA Tools & Design Documents
Enterprise Analysis, Agile & Scrum

Upcoming Class

12 days 25 Jul 2025

View Details

MS SQL Server

Introduction & Database Query
Programming, Indexes & System Functions
SSIS Package Development Procedures
SSRS Report Design

Upcoming Class

12 days 25 Jul 2025

View Details

Data Science

Data Science Introduction
Hadoop and Spark Overview
Python & Intro to R Programming
Machine Learning

Upcoming Class

5 days 18 Jul 2025

View Details

DevOps

Intro to DevOps
GIT and Maven
Jenkins & Ansible
Docker and Cloud Computing

Upcoming Class

6 days 19 Jul 2025

View Details

Hadoop

Architecture, HDFS & MapReduce
Unix Shell & Apache Pig Installation
HIVE Installation & User-Defined Functions
SQOOP & Hbase Installation

Upcoming Class

5 days 18 Jul 2025

View Details

Python

Features of Python
Python Editors and IDEs
Data types and Variables
Python File Operation

Upcoming Class

1 day 14 Jul 2025

View Details

Artificial Intelligence

Components of AI
Categories of Machine Learning
Recurrent Neural Networks
Recurrent Neural Networks

Upcoming Class

5 days 18 Jul 2025

View Details

Machine Learning

Introduction to Machine Learning & Python
Machine Learning: Supervised Learning
Machine Learning: Unsupervised Learning

Upcoming Class

12 days 25 Jul 2025

View Details

Tableau

Introduction to Tableau Desktop
Data Transformation Methods
Configuring tableau server
Integration with R & Hadoop

Upcoming Class

5 days 18 Jul 2025

View Details

Browse Categories

What Is The Hadoop Cluster? How Does It Work?

Feb 07, 2024 eye-dark

318.3k

An Introduction to Apache Spark and Spark SQL

Feb 05, 2024 eye-dark

261.8k

Top 20 Big Data Hadoop Interview Questions and Answers 2018

Apr 20, 2017 eye-dark

121.4k

Search Posts

Reset

What Is The Hadoop Cluster? How Does It Work? 318.3k

An Introduction to Apache Spark and Spark SQL 261.8k

Top 20 Big Data Hadoop Interview Questions and Answers 2018 121.4k

Apache Flink Tutorial Guide for Beginner 7k

Top 20 Apache Solr Interview Questions & Answers for Freshers and Experienced 646.9k

Hadoop Course
Upcoming Batches

Jul

Mon - Fri

6 Weeks

Jul

Mon - Fri

6 Weeks

Aug

Mon - Fri

6 Weeks

Aug

Mon - Fri

6 Weeks

View Detail

Receive Latest Materials and Offers on Hadoop Course

By submitting my contact details, I agree Privacy Policy ... and I consent to receiving SMS/call/email, including marketing and promotional SMS. Read More

Scroll

Your Complete Guide to Apache Hive Data Models

JanBask Training Team

Comments

Trending Courses

Browse Categories

Related Posts