07 Sep
Pandas is a Python package and data manipulation tool developed by Wes McKinney. It is built on top of the NumPy package, and its main data structure is the DataFrame. Pandas provides fast, flexible data structures that make working with relational and labeled data easy and intuitive. It supplies the fundamental high-level building blocks for practical, real-world data analysis in Python, and it aims to be one of the most powerful and flexible open-source data analysis and manipulation tools available in any language. Its popularity has grown over time: it is now estimated to have 5 to 10 million users and is a must-use tool in the Python data science toolkit.
Pandas is a BSD-licensed Python library. Python with pandas is used across a wide array of academic and commercial fields, including finance, economics, statistics, and analytics. In this tutorial, we will learn different features of Python pandas and their practical applications. Pandas is sometimes called the “SQL of Python” because it helps manage two-dimensional data tables in Python.
The pandas library is built on top of NumPy and uses much of NumPy's functionality, so we recommend going through our NumPy tutorial before proceeding with this one.
The source code is currently hosted on GitHub at: https://github.com/pandas-dev/pandas
Below are the commands to install using conda and pip
# conda
conda install pandas
# or PyPI
pip install pandas
Below are a few of the main features of pandas:
There are two types of data structures in pandas: Series and DataFrames.
1). Series: A Series is a one-dimensional array-like object. It can contain any data type, e.g. integers, floats, strings, Python objects, and so on. It can be thought of as two arrays: one holding the index (labels) and the other holding the actual data.
We can create a sample Series object by instantiating a pandas Series with a list, as in the following example.
import pandas as pd
S = pd.Series([11, 28, 72, 3, 5, 8])
S
The above Python code returned the following result:
0 11
1 28
2 72
3 3
4 5
5 8
dtype: int64
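The "two arrays" picture of a Series can be made concrete by supplying a custom index. This is a minimal sketch; the labels and values are made up for illustration only.

```python
import pandas as pd

# Hypothetical labels and values, chosen only for illustration
prices = pd.Series([11, 28, 72], index=["apple", "banana", "cherry"])

labels = list(prices.index)   # the index (label) array
values = prices.tolist()      # the underlying data as plain Python ints

print(labels)             # ['apple', 'banana', 'cherry']
print(values)             # [11, 28, 72]
print(prices["banana"])   # look up a value by its label: 28
```

When no index is given, as in the example above, pandas falls back to the integer positions 0, 1, 2, and so on.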
2). DataFrame: A pandas DataFrame is a two-dimensional data structure, basically like a relational table with rows and columns. The columns have names and the rows have indexes.
The idea of a DataFrame is based on spreadsheets: its data structure looks just like one. A DataFrame has both a row index and a column index, and it contains an ordered collection of columns, similar to an Excel sheet. Different columns of a DataFrame can have different types; for example, the first column may consist of strings while the second consists of boolean values, and so on.
Pandas has an array of functions and methods that calculate descriptive statistics on DataFrame columns. Most of them, like sum() and mean(), are aggregations that produce a smaller result, but some of them, like cumsum(), produce an object of the same size. An axis argument can be provided to these methods, just like ndarray.{sum, std, ...}, but the axis can be specified by name or by integer.
DataFrame − “index” (axis=0, default), “columns” (axis=1)
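The difference between an aggregation and a same-size result, and the two ways of naming an axis, can be sketched on a tiny DataFrame. The frame `df_small` and its values are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical data, small enough to check by hand
df_small = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

col_totals = df_small.sum(axis="index")     # same as axis=0: one value per column
row_totals = df_small.sum(axis="columns")   # same as axis=1: one value per row
running = df_small.cumsum()                 # cumulative sum: same shape as df_small

print(col_totals.tolist())     # [6, 60]
print(row_totals.tolist())     # [11, 22, 33]
print(running["a"].tolist())   # [1, 3, 6]
```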
DataFrame creation: A DataFrame can be created with pandas as shown below.
import pandas as pd
import numpy as np
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}
#Create a DataFrame
df = pd.DataFrame(d)
print(df)
Output (first four rows):
      Name  Age  Rating
0      Tom   25    4.23
1    James   26    3.24
2    Ricky   25    3.98
3      Vin   23    2.56
sum()
This function returns the sum of the values for the requested axis. By default, axis is index (axis=0). Below is a sample example to calculate the sum of the numeric field of data-frame.
import pandas as pd
import numpy as np
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}
#Create a DataFrame
df = pd.DataFrame(d)
# Row-wise sum (axis=1) over the numeric columns only
print(df.sum(axis=1, numeric_only=True))
Output:
0 29.23
1 29.24
2 28.98
3 25.56
4 33.20
5 33.60
6 26.80
7 37.78
8 42.98
9 34.80
10 55.10
mean()
It returns the average value of the numerical columns.
std()
It returns the standard deviation of the numerical columns.
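As a minimal sketch of both methods, the example below uses a small hypothetical DataFrame (`scores`, not part of the tutorial data) so the results can be checked by hand. Note that pandas computes the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

# Hypothetical numeric data for illustration
scores = pd.DataFrame({"Age": [25, 26, 25, 23],
                       "Rating": [4.0, 3.0, 5.0, 4.0]})

means = scores.mean()   # one mean per column
stds = scores.std()     # sample standard deviation (ddof=1) per column

print(means["Age"])     # 24.75
print(means["Rating"])  # 4.0
print(stds["Rating"])   # square root of 2/3, roughly 0.816
```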
Summarizing Data
The describe() function computes summary statistics for the DataFrame columns. It reports the count, mean, standard deviation, minimum, maximum, and quartiles (from which the IQR can be derived). By default it excludes string columns and summarizes only the numeric ones.
import pandas as pd
import numpy as np
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}
#Create a DataFrame
df = pd.DataFrame(d)
print(df.describe())
Its output is as follows −
Age Rating
count 12.000000 12.000000
mean 31.833333 3.743333
std 9.232682 0.661628
min 23.000000 2.560000
25% 25.000000 3.230000
50% 29.500000 3.790000
75% 35.500000 4.132500
max 51.000000 4.800000
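String columns are skipped by default, but describe() also accepts an include argument. The sketch below, on a small hypothetical DataFrame named `people`, compares the default summary with include="all", which adds count, unique, top, and freq rows for non-numeric columns.

```python
import pandas as pd

# Hypothetical data with one string column and one numeric column
people = pd.DataFrame({"Name": ["Tom", "James", "Tom"],
                       "Age": [25, 26, 25]})

numeric_summary = people.describe()            # numeric columns only
full_summary = people.describe(include="all")  # every column

print(list(numeric_summary.columns))       # ['Age']
print(full_summary.loc["unique", "Name"])  # 2 distinct names
print(full_summary.loc["top", "Name"])     # most frequent name: Tom
```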
At a high level, there are three important methods for applying functions: pipe(), apply(), and applymap(). Which one to use depends on whether we want to apply an operation to an entire DataFrame, row- or column-wise, or element-wise.
Custom functions can be applied to a whole DataFrame by passing the function, followed by any additional arguments it needs, to pipe(). The DataFrame itself is passed as the first argument, so the operation is performed on the whole DataFrame.
This adder function adds the two values passed as parameters and returns the sum.
def adder(ele1, ele2):
    return ele1 + ele2
Let’s apply the custom function at the DataFrame level.
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
df.pipe(adder, 2)
Let’s see the full program −
import pandas as pd
import numpy as np
def adder(ele1, ele2):
    return ele1 + ele2
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
# pipe() calls adder(df, 2), adding 2 to every element
print(df.pipe(adder, 2))
Its output is as follows −
col1 col2 col3
0 2.176704 2.219691 1.509360
1 2.222378 2.422167 3.953921
2 2.241096 1.135424 2.696432
3 2.355763 0.376672 1.182570
4 2.308743 2.714767 2.130288
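Because pipe() returns whatever the function returns, several DataFrame-level steps can be chained. The following is a sketch only; `add_constant` and `scale` are hypothetical helper functions written for this illustration.

```python
import pandas as pd

def add_constant(frame, amount):
    # Hypothetical helper: add a constant to every element
    return frame + amount

def scale(frame, factor):
    # Hypothetical helper: multiply every element by a factor
    return frame * factor

df_pipe = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

# Equivalent to scale(add_constant(df_pipe, 2), factor=10)
result = df_pipe.pipe(add_constant, 2).pipe(scale, factor=10)

print(result["col1"].tolist())   # [30, 40]
print(result["col2"].tolist())   # [50, 60]
```

The chained form reads top to bottom in the order the steps run, which is the main reason to prefer it over nested function calls.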
Row or Column Wise Function Application
With the apply() method, arbitrary functions can be applied along the axes (row- or column-wise) of a DataFrame. If no axis is specified, the function is applied to each column by default.
Example 1
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
# Apply np.mean to each column (the default, axis=0)
print(df.apply(np.mean))
Output −
col1 -0.288022
col2 1.044839
col3 -0.187009
dtype: float64
If the axis parameter is set to 1, the operation is performed row-wise, as shown below.
Example 2
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
# Apply np.mean to each row
print(df.apply(np.mean, axis=1))
Output
A Series of five row means, indexed 0 to 4. The values differ on every run because the data is random.
Example 3
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
# Range (max minus min) of each column
print(df.apply(lambda x: x.max() - x.min()))
Output:
A Series with one non-negative range per column (col1, col2, col3). The values differ on every run because the data is random.
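Since the examples above use random data, their results change on every run. The same column-wise and row-wise apply() calls can be checked by hand on fixed values. This is a sketch; `df_fixed` is a hypothetical DataFrame used only here.

```python
import pandas as pd

# Hypothetical fixed data so the results are reproducible
df_fixed = pd.DataFrame({"col1": [1, 4, 9], "col2": [2, 2, 8]})

# Default axis=0: the lambda receives each column as a Series
col_range = df_fixed.apply(lambda x: x.max() - x.min())

# axis=1: the lambda receives each row as a Series
row_range = df_fixed.apply(lambda x: x.max() - x.min(), axis=1)

print(col_range.tolist())   # [8, 6]
print(row_range.tolist())   # [1, 2, 1]
```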
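The applymap() method applies a function that accepts and returns a scalar to every element of a DataFrame; the map() method does the same for a single Series. In pandas 2.1 and later, applymap() is deprecated in favor of DataFrame.map().
Example
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
# My custom function, applied to every element of col1.
# map() returns a new Series; df itself is left unchanged.
df['col1'].map(lambda x: x*100)
print(df.apply(np.mean))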
Output:
col1 0.480742
col2 0.454185
col3 0.266563
dtype: float64
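To apply a scalar function to every element of a whole DataFrame rather than one column, something like the following sketch can be used. It assumes only that DataFrame.map is available on pandas 2.1 or later and falls back to applymap() on older versions; `df_elem` is a hypothetical DataFrame.

```python
import pandas as pd

# Hypothetical fixed data so the result can be checked by hand
df_elem = pd.DataFrame({"col1": [1.0, 2.0], "col2": [3.0, 4.0]})

# DataFrame.map replaced applymap in pandas 2.1; pick whichever exists
elementwise = df_elem.map if hasattr(pd.DataFrame, "map") else df_elem.applymap

scaled = elementwise(lambda x: x * 100)   # called once per element

print(scaled.values.tolist())    # [[100.0, 300.0], [200.0, 400.0]]
print(df_elem.values.tolist())   # original is unchanged
```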
Iteration means extracting each item of a collection, such as a list or array, one after another. Because a DataFrame consists of rows and columns, iterating over it directly behaves like iterating over a dictionary: it yields the column labels.
We can iterate over a pandas DataFrame in two ways: over rows and over columns.
Iterating over rows: There are three built-in iteration functions: iterrows() and itertuples() iterate over rows, while iteritems() iterates over columns. Note that iteritems() was removed in pandas 2.0; use items() instead.
Iteration over rows using iterrows(): The iterrows() function yields each row as a pair of index and Series.
# importing pandas as pd
import pandas as pd
# dictionary of lists (named data to avoid shadowing the built-in dict)
data = {'name':["aparna", "pankaj", "sudhir", "Geeku"],
        'degree': ["MBA", "BCA", "M.Tech", "MBA"],
        'score':[90, 40, 80, 98]}
# creating a dataframe from a dictionary
df = pd.DataFrame(data)
# iterating over rows using iterrows() function
for i, j in df.iterrows():
    print(i, j)
    print()
Iteration using iteritems():
The second method is the iteritems() function (renamed items() in current pandas). Despite the grouping above, it iterates over columns rather than rows, yielding key-value pairs with the column label as the key and the column values as a Series.
# importing pandas as pd
import pandas as pd
# dictionary of lists (named data to avoid shadowing the built-in dict)
data = {'name':["aparna", "pankaj", "sudhir", "Geeku"],
        'degree': ["MBA", "BCA", "M.Tech", "MBA"],
        'score':[90, 40, 80, 98]}
# creating a dataframe from a dictionary
df = pd.DataFrame(data)
# using items() (formerly iteritems()) to retrieve each column
for key, value in df.items():
    print(key, value)
    print()
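The third function named above, itertuples(), is not demonstrated in the tutorial. A minimal sketch follows, using a hypothetical two-row DataFrame called `students`: each row comes back as a named tuple, so fields are read as attributes.

```python
import pandas as pd

# Hypothetical data modelled on the example above
students = pd.DataFrame({"name": ["aparna", "pankaj"],
                         "degree": ["MBA", "BCA"],
                         "score": [90, 40]})

summaries = []
# index=False leaves the row index out of each tuple
for row in students.itertuples(index=False):
    summaries.append(f"{row.name}: {row.score}")

print(summaries)   # ['aparna: 90', 'pankaj: 40']
```

Compared with iterrows(), this avoids building a Series for every row, which usually makes it the faster choice for plain row-by-row loops.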
Iterating over Columns:
To iterate over columns, we create a list of the DataFrame's column labels and then iterate over that list.
# creating a list of dataframe columns
columns = list(df)
for i in columns:
    # printing the third element of the column
    print(df[i][2])
Output:
sudhir
M.Tech
80
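Pandas has a sort_values() function to sort a DataFrame by a particular column in ascending or descending order. It differs from Python's built-in sorted() function, which cannot sort a DataFrame by a selected column.
DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
Parameters: by names the column or columns to sort by; ascending sets the sort order; inplace=True modifies the DataFrame instead of returning a new one; na_position ('first' or 'last') controls where NaN values are placed.
# importing pandas package
import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv")
# display
data
# sorting data frame by Salary, with NaN values placed first
data.sort_values("Salary", axis = 0, ascending = True,
                 inplace = True, na_position ='first')
Because na_position is 'first', the rows with NaN salaries appear at the top, followed by the remaining rows sorted by Salary in ascending order.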
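The nba.csv file is not included here, so the same behaviour can be sketched on a small hypothetical DataFrame (`players`, with made-up names and salaries). Without inplace=True, sort_values() returns a new sorted DataFrame and leaves the original untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing salary
players = pd.DataFrame({"Name": ["A", "B", "C"],
                        "Salary": [300.0, np.nan, 100.0]})

nan_first = players.sort_values("Salary", ascending=True, na_position="first")
nan_last = players.sort_values("Salary", ascending=False)  # na_position defaults to 'last'

print(nan_first["Name"].tolist())   # ['B', 'C', 'A']
print(nan_last["Name"].tolist())    # ['A', 'C', 'B']
print(players["Name"].tolist())     # original order kept: ['A', 'B', 'C']
```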
The pandas indexing operator "[ ]" and attribute operator "." provide quick access to pandas data structures across a wide range of use cases.
Pandas also supports multi-axis indexing, including the label-based .loc and the integer-based .iloc described below.
.loc
.loc provides purely label-based indexing. When slicing, both the start and the stop bound are included. Integers are valid labels, but they refer to the label and not the position.
.loc accepts a single label, a list of labels, a slice of labels, or a boolean array.
.loc takes two single/list/range selectors separated by ','. The first selects rows and the second selects columns.
Example1
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4),
index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])
#select all rows of dataset for a particular column
print(df.loc[:,'A'])
Output
a 0.391548
b -0.070649
c -0.317212
d -2.162406
e 2.202797
f 0.613709
g 1.050559
h 1.122680
Name: A, dtype: float64
Example 2
# import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4),
index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])
# Select all rows for more than one columns
print(df.loc[:,['A','C']])
Its output is as follows −
A C
a 0.391548 0.745623
b -0.070649 1.620406
c -0.317212 1.448365
d -2.162406 -0.873557
e 2.202797 0.528067
f 0.613709 0.286414
g 1.050559 0.216526
h 1.122680 -1.621420
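Label slicing and boolean selection with .loc are easier to verify on fixed values. The sketch below uses a hypothetical DataFrame `df_loc`; note that the slice "b":"c" includes both end labels.

```python
import pandas as pd

# Hypothetical fixed data with string row labels
df_loc = pd.DataFrame({"A": [1, 2, 3, 4], "B": [10, 20, 30, 40]},
                      index=["a", "b", "c", "d"])

sliced = df_loc.loc["b":"c", "A"]             # label slice: both b and c included
single = df_loc.loc["a", "B"]                 # one row label, one column label
filtered = df_loc.loc[df_loc["A"] > 2, ["B"]]  # boolean mask for rows, list for columns

print(sliced.tolist())        # [2, 3]
print(single)                 # 10
print(list(filtered.index))   # ['c', 'd']
```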
.iloc
.iloc provides purely integer-based indexing. As in Python and NumPy, positions are 0-based. It accepts an integer, a list of integers, or a range (slice) of values.
Example 1
# importing the pandas and numpy library
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(12, 4), columns = ['A', 'B', 'C', 'D'])
# select the first four rows
print(df.iloc[:4])
Its output is as follows −
A B C D
0 0.699435 0.256239 -1.270702 -0.645195
1 -0.685354 0.890791 -0.813012 0.631615
2 -0.783192 -0.531378 0.025070 0.230806
3 0.539042 -1.284314 0.826977 -0.026251
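Like .loc, .iloc can take a row selector and a column selector together. A minimal sketch on a hypothetical DataFrame `df_iloc` follows; unlike label slices, positional slices exclude the stop position.

```python
import pandas as pd

# Hypothetical fixed data
df_iloc = pd.DataFrame({"A": [1, 2, 3, 4],
                        "B": [10, 20, 30, 40],
                        "C": [5, 6, 7, 8]})

first_two = df_iloc.iloc[:2]          # rows at positions 0 and 1
block = df_iloc.iloc[1:3, [0, 2]]     # rows 1-2, columns at positions 0 and 2
cell = df_iloc.iloc[3, 1]             # row 3, column 1

print(len(first_two))          # 2
print(block.values.tolist())   # [[2, 6], [3, 7]]
print(cell)                    # 40
```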
Missing data occurs when no information is provided for one or more cells in a DataFrame. It is a significant problem in real-life scenarios because it can affect model behavior. Missing values are also referred to as NA (Not Available) values in pandas.
In pandas, missing data is represented by two values: None and NaN.
Pandas treats None and NaN as essentially interchangeable for indicating missing or null values. It provides several useful functions for finding, removing, and replacing null values in a DataFrame, such as isnull(), notnull(), dropna(), and fillna().
Example1
# importing pandas package
import pandas as pd
# making data frame from csv file
data = pd.read_csv("employees.csv")
# creating bool series True for NaN values
bool_series = pd.isnull(data["Gender"])
# filtering data
# displaying data only with Gender = NaN
data[bool_series]
Example2:
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# dictionary of lists (named scores to avoid shadowing the built-in dict)
scores = {'First Score':[100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score':[np.nan, 40, 80, 98]}
# creating a dataframe using dictionary
df = pd.DataFrame(scores)
# using notnull() function: True where a value is present, False where it is missing
df.notnull()
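The examples above only find missing values. Removing and replacing them can be sketched with dropna() and fillna() on the same score data; the choice of 0 as the fill value is an assumption made for illustration, not a recommendation.

```python
import numpy as np
import pandas as pd

score_frame = pd.DataFrame({"First Score": [100, 90, np.nan, 95],
                            "Second Score": [30, 45, 56, np.nan],
                            "Third Score": [np.nan, 40, 80, 98]})

missing_per_column = score_frame.isnull().sum()   # count of NaN values in each column
complete_rows = score_frame.dropna()              # keep only rows with no NaN
filled = score_frame.fillna(0)                    # replace every NaN with 0 (illustrative)

print(missing_per_column.tolist())   # [1, 1, 1]
print(list(complete_rows.index))     # [1]
print(filled.isnull().sum().sum())   # 0
```

Both dropna() and fillna() return new DataFrames here, so `score_frame` still contains its NaN values afterwards.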
A Series has a set of string-processing methods that operate on each of its elements. These methods are accessed through the str attribute and generally have names matching the equivalent built-in string methods.
Splitting and Replacing Strings: Python's built-in str.split() returns a list of strings after splitting a given string by the specified separator, but it can only be applied to an individual string. The pandas str.split() method can be applied to a whole Series. To replace data, we use str.replace(), which works like Python's .replace() method but operates on every element of a Series.
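A minimal sketch of both methods on a hypothetical Series of names is shown below. Passing regex=False to str.replace() treats the pattern as a literal string.

```python
import pandas as pd

# Hypothetical names for illustration
names = pd.Series(["Jai Kumar", "Princi Singh"])

parts = names.str.split(" ")        # each element becomes a list of words
first_names = parts.str.get(0)      # first item of each list
replaced = names.str.replace("Singh", "S.", regex=False)

print(parts.tolist())         # [['Jai', 'Kumar'], ['Princi', 'Singh']]
print(first_names.tolist())   # ['Jai', 'Princi']
print(replaced.tolist())      # ['Jai Kumar', 'Princi S.']
```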
String Concatenation
We can use str.cat() to concatenate strings. It concatenates the strings of another Series onto the caller Series, element by element. The values of the two Series can differ, but both Series must have the same length.
Example1
# importing pandas module
import pandas as pd
# Define a dictionary containing employee data
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],
'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
# making copy of address column
new = df["Address"].copy()
# concatenating address with name column
# overwriting name column
df["Name"]= df["Name"].str.cat(new, sep =", ")
# display
print(df)
Example2:
# importing pandas module
import pandas as pd
# importing csv from link
data = pd.read_csv("nba.csv")
# making copy of team column
new = data["Team"].copy()
# concatenating team with name column
# overwriting name column
data["Name"]= data["Name"].str.cat(new, sep =", ")
# display
data
FUNCTION | DESCRIPTION |
str.lower() | This is to convert a string’s characters to lowercase |
str.upper() | This is to convert a string’s characters to uppercase |
str.find() | This is used to search for a substring in each string present in a series |
str.rfind() | This is used to search for a substring in each string present in a series from the Right side |
str.findall() | This is used to find all occurrences of a pattern or regular expression in each string in a series |
str.isalpha() | This is used to check if all characters in each string in series are alphabetic(a-z/A-Z) |
str.isdecimal() | This method is used to check whether all characters in a string are decimal |
str.title() | This method is used to capitalize the first letter of every word in a string |
str.len() | This method returns a count of the number of characters in a string |
str.replace() | This method replaces a substring within a string with another value that the user provides |
str.contains() | This method tests if pattern or regex is contained within a string of a Series or Index |
str.extract() | Extract groups from the first match of regular expression pattern. |
str.startswith() | This tests if the start of each string element matches a pattern |
str.endswith() | This tests if the end of each string element matches a pattern |
str.isdigit() | This is used to check if all characters in each string in series are digits |
str.lstrip() | This removes whitespace from the left side (beginning) of a string |
str.rstrip() | This removes whitespace from the right side (end) of a string |
str.strip() | This removes leading and trailing whitespace from a string |
str.split() | This splits a string value, based on the occurrence of a user-specified value |
str.join() | This method is used to join all elements in the list present in a series with passed delimiter |
str.cat() | This method is used to concatenate strings to the passed caller series of string. |
str.repeat() | This method is used to repeat each string in a series a given number of times |
str.get() | This method is used to get the element at the passed position |
str.partition() | This method splits the string only at the first occurrence unlike str.split() |
str.rpartition() | This method splits the string only once, at the last occurrence of the separator. It works similarly to str.partition() |
str.pad() | This method is used to add padding (whitespaces or other characters) to every string element in a series |
str.swapcase() | This method is used to swap case of each string in a series |
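A few of the methods from the table can be chained through the str attribute. The sketch below uses a hypothetical Series of city names with inconsistent case and whitespace.

```python
import pandas as pd

# Hypothetical messy strings for illustration
cities = pd.Series(["  nagpur ", "Kanpur", "allahabad"])

cleaned = cities.str.strip().str.title()          # trim whitespace, then capitalize
lengths = cleaned.str.len()                       # characters per string
has_pur = cleaned.str.contains("pur", regex=False)
starts_k = cleaned.str.startswith("K")

print(cleaned.tolist())    # ['Nagpur', 'Kanpur', 'Allahabad']
print(lengths.tolist())    # [6, 6, 9]
print(has_pur.tolist())    # [True, True, False]
print(starts_k.tolist())   # [False, True, False]
```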
Python with pandas is used in a wide range of academic and commercial domains, including finance, retail, statistics, and analytics. Pandas is a great library for tasks ranging from importing data to analyzing it and deriving insightful results. Alongside it, packages like NumPy and matplotlib make most data analysis and data visualization tasks easy and handy.