How to Perform Data Wrangling in Python?

In the era of artificial intelligence and analytics, data has taken the driver's seat in almost every industry. From Zomato to Google, companies use data to improve their business models while catering to their clientele in a customer-specific way.

In this blog, we will first talk about data and then work through a dataset in several steps: checking for inconsistent values, finding and removing duplicates, and finally creating labels for a supervised learning model. Above all, this blog is about manipulating data using Python.

Table of contents

Intro to data
Data Wrangling using Pandas
Checking for inconsistent values in the dataset
Finding the duplicate values
Creating labels for the data points
Conclusion

Data – The lifeblood of the analytics industry in the 21st century

Data is a collection of facts and figures. In its raw form, however, data is usually not suitable for further processing. In other words, the data most often has to be changed into another, more suitable format. The process of changing and mapping the data into a more suitable form is called data wrangling, also known as data munging.

Almost anybody who deals with data in their daily life will agree that data is dirty and cannot be used directly to infer anything meaningful. Thus, it becomes essential to clean it, i.e. to perform data wrangling, which is an essential part of data science. In this example, we present the basics of data wrangling using pandas in Python with the forest fires in Brazil dataset, which is available at kaggle.com.

Data Wrangling using Pandas:

Pandas is a software library for Python that was originally released in 2008 by Wes McKinney. It is widely used for data analysis and manipulation. In this blog, data munging is performed on a spreadsheet in CSV format. A pandas data frame is a 2-D data structure that holds heterogeneous data in rows and columns, somewhat like a table. It can therefore be manipulated easily in Python by row and by column.

This blog uses the forest fires in Brazil dataset, which is available on Kaggle and open for use.
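To make the idea of a 2-D, heterogeneous structure concrete, here is a minimal sketch. The toy values and the column names `year`, `state` and `number` are assumptions chosen for illustration; they are not taken from the actual dataset.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame(
    {
        "year": [1998, 1998, 1999],
        "state": ["Acre", "Bahia", "Acre"],
        "number": [0.0, 12.0, 3.0],
    }
)

print(df.shape)             # (number of rows, number of columns)
print(df.dtypes)            # each column keeps its own data type
print(df.loc[1, "state"])   # one cell, addressed by row label and column name

# Rows can be selected with a boolean condition on a column.
rows_1998 = df[df["year"] == 1998]
print(rows_1998)
```

Each column holds a single type, while the frame as a whole mixes integers, strings and floats, which is what makes it behave like a table.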

Checking for inconsistent values in the dataset:

Any dataset can have missing values, which are usually represented by NaN in place of the value. It is important to mention here that NaN, which stands for "not a number", represents a null value.

Let us query our dataset for null values. The dataset is encoded in cp1252, whereas pandas assumes utf-8 by default, so the encoding is passed explicitly.

 
>>import pandas as pd
>>x = pd.read_csv('../qwe.csv', encoding='cp1252')
>>pd.isnull(x).any()

As the output of the above query shows (Figure 1), there are no missing values in this dataset. If there were, the affected rows could be dropped, or the missing values could be estimated by interpolation, extrapolation or another statistical measure.
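Since the dataset used here has no missing values, the options mentioned above can be sketched on a small made-up frame instead. The data below is hypothetical, and the choice between dropping, mean filling and linear interpolation depends on the problem at hand.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in the "number" column.
df = pd.DataFrame(
    {"year": [1998, 1999, 2000, 2001], "number": [10.0, np.nan, 30.0, 80.0]}
)

has_missing = df.isnull().any()   # per-column flag: True if any value is NaN

dropped = df.dropna()             # option 1: drop rows that contain NaN

# Option 2: fill NaN with a statistic, here the column mean.
filled = df.fillna({"number": df["number"].mean()})

# Option 3: linear interpolation between the neighbouring values.
interpolated = df["number"].interpolate()

print(has_missing)
print(filled)
print(interpolated)
```

Mean filling uses all observed values of the column, whereas interpolation only uses the neighbours of the gap, so the two generally give different estimates.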

Finding the duplicate values

Huge datasets may contain duplicate values, which can introduce bias into the final output. Thus, it is important to remove the duplicates. Before dropping duplicate values from this dataset, however, we need to know whether it contains any.

To do this, first find the count of values in each column, which can be done with the describe() function as shown in Fig. 2:

 
>> x.describe()

Figure 2: Describe

This summarizes most of the statistical features of the dataset under consideration. Now, nunique() can be used to find the number of unique values in each column of the dataset.

 
>>x.nunique()

Figure 3: Number of unique values
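The comparison between the row count and the per-column unique counts can be tried on a small frame. This is only a sketch on hypothetical data; the column names are assumptions.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame(
    {
        "year": [1998, 1998, 1999, 1999],
        "state": ["Acre", "Bahia", "Acre", "Bahia"],
        "number": [0.0, 12.0, 3.0, 12.0],
    }
)

summary = df.describe()        # count, mean, std, min, quartiles, max (numeric columns)
unique_counts = df.nunique()   # number of distinct values per column
row_count = len(df)

# A column with fewer unique values than rows contains repeated values.
has_repeats = unique_counts < row_count

print(summary)
print(unique_counts)
print(has_repeats)
```

Repeated values within a single column do not by themselves mean that whole rows are duplicated, which is why a row-level check is still needed.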

As is evident from Figure 3, the dataset has a total of 6454 values, but no column reports that many unique values. Thus, some data points might be repeated in the dataset. To inspect the rows that share a year, sorted year-wise, we can do the following:

 
x[x.duplicated(subset='year', keep=False)].sort_values('year')

Figure 4: Duplicate results
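The behaviour of duplicated() with and without a subset can be illustrated on a small frame. The data below is hypothetical and only meant to show the difference between a repeated year and a fully repeated row.

```python
import pandas as pd

# Hypothetical toy data: rows 1 and 2 are identical, row 0 and row 3 are not.
df = pd.DataFrame(
    {"year": [1998, 1999, 1999, 2000], "number": [5.0, 7.0, 7.0, 1.0]}
)

# keep=False marks every member of a repeated group, not only the later copies.
year_repeats = df[df.duplicated(subset="year", keep=False)].sort_values("year")

# Without subset, a row counts as a duplicate only if all columns match.
full_row_dupes = df.duplicated().sum()

# drop_duplicates() keeps the first occurrence of each fully repeated row.
deduplicated = df.drop_duplicates()

print(year_repeats)
print(full_row_dupes)
print(deduplicated)
```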

As can be seen in Fig. 4, every row in the dataset is unique because the data is presented in a mapped form. Thus, there are no duplicate rows in this dataset. For demonstration, however, the instances where a year is repeated will be removed from the dataset.

 
repeat_x = x.groupby(by='year').size().sort_values(ascending=False)
filtered_x = repeat_x[repeat_x > 2].to_frame().reset_index()
filtered_x = x[~x.year.isin(filtered_x.year)]

These commands remove the rows whose year occurs more than twice. The condition can be made more specific by combining it with further conditions using conjunctions. The negation operator '~' must be used; otherwise, the rows whose year occurs more than twice would be returned instead of removed. The last step in the above process is known as data filtration and is used to remove or add data points as required.
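The effect of the negation and of a conjunction can be seen on a small frame. This is a sketch on hypothetical data; the extra condition on `number` is an assumption added only to illustrate combining conditions.

```python
import pandas as pd

# Hypothetical toy data: 1998 occurs three times, 1999 twice, 2000 once.
df = pd.DataFrame(
    {
        "year": [1998, 1998, 1998, 1999, 1999, 2000],
        "number": [1.0, 0.0, 4.0, 2.0, 0.0, 9.0],
    }
)

counts = df.groupby("year").size()          # number of rows per year
frequent_years = counts[counts > 2].index   # years that occur more than twice

# With ~ the rows of the frequent years are removed ...
kept = df[~df["year"].isin(frequent_years)]
# ... without ~ exactly those rows are returned instead.
returned = df[df["year"].isin(frequent_years)]

# Conjunction: combine conditions with &, wrapping each comparison in parentheses.
kept_with_fire = df[~df["year"].isin(frequent_years) & (df["number"] > 0)]

print(kept)
print(returned)
print(kept_with_fire)
```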

Creating Labels for the data points:

The basic purpose of data wrangling is to give the dataset a usable form, which can then be used for further analysis or to implement machine learning models. Thus, this step forms the most important part of data wrangling. Before this step, the parameters and shape of the data should be checked. In this step, the dataset is mapped for a binary supervised learning model, i.e. every data point is mapped to 1 if a forest fire has taken place, or to 0 otherwise.

 
>>mapping = []  # create an empty list for the labels
>>for i in range(len(x)):
>>    if x.loc[i, "number"] != 0:
>>        mapping.append(1)  # a fire was reported
>>    else:
>>        mapping.append(0)  # no fire was reported
>>x['label'] = mapping
>>x.to_csv(r'd:\amazon_mapped.csv', index=False, header=True)

To map the data, a list named mapping is created. The number column in the original data contains the number of times fire has occurred in that area. To make a binary classifier, we restrict ourselves to whether a fire has occurred or not. To do this, we first append the corresponding entry to the mapping list for each row and then assign this list to the data frame as a new column. Once the column has been added to the data frame, we can export the result to the desired location, as done in the last line of the commands.
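As an alternative to the explicit loop, the same labels can be produced with a vectorized comparison. This is a minimal sketch on hypothetical data, not the approach used above; it assumes the column is called `number`, as in the loop.

```python
import pandas as pd

# Hypothetical toy data standing in for the "number" column of the dataset.
df = pd.DataFrame({"number": [0.0, 12.0, 3.0, 0.0]})

# Vectorized: compare the whole column at once, then convert True/False to 1/0.
df["label"] = (df["number"] != 0).astype(int)

# Loop-style equivalent, shown only to confirm both give the same labels.
loop_labels = [1 if n != 0 else 0 for n in df["number"]]

print(df)
```

The vectorized form avoids row-by-row indexing with loc, so it does not depend on the index being a plain 0 to n-1 range.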

Conclusion:

In this blog, the basics of data wrangling in Python using pandas have been discussed, and the dataset has been labelled for training a binary classifier. This Python example depicts the basic steps and can be extended for more complex use in the domain of data science. In other words, a basic data wrangling project can be done using this. Please leave your queries and comments in the comment section.
