

R Programming for Data Science: Tutorial Guide for beginners

Introduction to R for Data Science

R is a free programming language used by millions of people around the world for data analysis and statistics, and it is an important part of data science. It is a top choice for many data scientists and statisticians. But what makes R so popular? Why and how should you use R for data science?

The R programming language, developed by Ross Ihaka and Robert Gentleman in 1993, is widely used for applications related to data science. R supports an extensive suite of statistical methods, inference techniques, machine learning algorithms, time series analysis, data analytics, and graphical plots, to name a few. These features make it a great language for data exploration and investigation.

Before we go further into programming in R for data science, let's first discuss what data science is and what a data scientist does.

What is data science?

Data science is the study of data. It involves developing methods for recording, storing, and analyzing data in order to extract useful information effectively. The main aim of data science is to gain in-depth knowledge from any type of structured or unstructured data.

What is a data scientist?

A data scientist is someone who has the technical skills to solve complex problems and the curiosity to explore which problems need to be solved. The main goal of data scientists is to analyze, process, and model data, then interpret the outcomes to create actionable plans for companies and other organizations.

Getting started with R for Data Science

R can be downloaded from CRAN, the Comprehensive R Archive Network. CRAN comprises a set of mirror servers distributed around the world and is used to distribute R and R packages. A new major version of R comes out once a year, and there are 2 to 3 minor versions each year. RStudio provides an integrated development environment, or IDE, for R programming.

How to install R Packages?

R's functionality is delivered through packages. There are now over 10000 R packages on CRAN.

Packages contain R functions, data, and compiled code in a well-defined format. The directory containing the packages is called the library. R ships with a standard set of packages. Non-standard packages are available for download and installation. Examples of R packages include arules, ggplot2, caret, shiny etc. Packages can be installed with the install.packages() function as shown below.

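install.packages("<package_name>")

This command causes R to download the package from CRAN (replace <package_name> with the name of the package you want). Note that R is case-sensitive, so the function name must be written in lowercase. Once you have a package installed, you can make its contents available in your current R session by using the library command:

library("<package_name>")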
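The install-then-load pattern above can be wrapped in a small helper so that a script only installs a package when it is missing. This is a minimal sketch: ensure_package is a hypothetical helper name, and the CRAN mirror URL is an assumption you may want to change.

```r
# Hypothetical helper (not part of base R): install a package only if it is
# missing, then attach it to the current session.
ensure_package <- function(pkg, repos = "https://cloud.r-project.org") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    # Assumed mirror; any CRAN mirror would do.
    install.packages(pkg, repos = repos)
  }
  # character.only = TRUE lets us pass the package name as a string.
  library(pkg, character.only = TRUE)
  invisible(pkg)
}

# "stats" ships with R, so this call never needs to download anything.
ensure_package("stats")
```

Calling ensure_package("ggplot2") would work the same way, downloading the package on first use only.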

Features of R Programming for Data Science

R has many useful features that make it a great option for anyone in data science and related fields. Some of these features include:

  • R offers strong support for creating and using statistical models.
  • R has powerful tools for creating clear and attractive data visualizations.
  • R is also widely used for ETL (Extract, Transform, Load) tasks and can connect to many data sources, such as SQL databases and spreadsheets.
  • R has many packages that make it easier to clean and prepare data for analysis.
  • With R, data scientists can use machine learning algorithms to predict future events.
  • R can connect to NoSQL databases and analyze unstructured data.

What is Programming in R for Data Science?

Let us take a look at a simple program in R that prints "Hello world!". This can be accomplished either from the command line in the R interpreter or via an R script. Let us look at both mechanisms.

Hello World in R – from the R command prompt:

> print("Hello world!")

[1] "Hello world!"

Creating a HelloWorld.R script in R:

helloStr <- "Hello world!"

print(helloStr)

The script can be executed using Rscript HelloWorld.R. It will print:

[1] "Hello world!"

What are different BuiltIn DataSets in R?

R comes with a large number of built-in datasets. These can be used as demo data for understanding R packages and functions. Data sets in package 'datasets' include:

AirPassengers Monthly Airline Passenger Numbers 1949-1960
BJsales Sales Data with Leading Indicator
BJsales.lead (BJsales) Sales Data with Leading Indicator
BOD Biochemical Oxygen Demand
CO2 Carbon Dioxide Uptake in Grass Plants
ChickWeight Weight versus age of chicks on different diets
DNase Elisa assay of DNase
EuStockMarkets Daily Closing Prices of Major European Stock Indices, 1991-1998
Formaldehyde Determination of Formaldehyde
HairEyeColor Hair and Eye Color of Statistics Students
Harman23.cor Harman Example 2.3
Harman74.cor Harman Example 7.4
Indometh Pharmacokinetics of Indomethacin
InsectSprays Effectiveness of Insect Sprays
JohnsonJohnson Quarterly Earnings per Johnson & Johnson Share
LakeHuron Level of Lake Huron 1875-1972
LifeCycleSavings Intercountry Life-Cycle Savings Data
Loblolly Growth of Loblolly pine trees
Nile Flow of the River Nile
Orange Growth of Orange Trees
OrchardSprays Potency of Orchard Sprays
PlantGrowth Results from an Experiment on Plant Growth
Puromycin Reaction Velocity of an Enzymatic Reaction
Seatbelts Road Casualties in Great Britain 1969-84
Theoph  Pharmacokinetics of Theophylline
Titanic Survival of passengers on the Titanic
ToothGrowth The impact of Vitamin C on Tooth Growth in Guinea Pigs
UCBAdmissions Student Admissions at UC Berkeley
UKDriverDeaths Road Casualties in Great Britain 1969-84
UKgas UK Quarterly Gas Consumption
USAccDeaths Accidental Deaths in the US 1973-1978
USArrests Violent Crime Rates by US State
USJudgeRatings Ratings of State Judges in the US Superior Court
USPersonalExpenditure Personal Expenditure Data
UScitiesD Distances Between European Cities and Between US Cities
VADeaths Death Rates in Virginia (1940)
WWWusage Internet Usage per Minute
WorldPhones The World's Telephones
ability.cov Ability and Intelligence Tests
airmiles Passenger Miles on Commercial US Airlines, 1937-1960
airquality New York Air Quality Measurements
anscombe Anscombe's Quartet of 'Identical' Simple Linear Regressions
attenu The Joyner-Boore Attenuation Data
attitude The Chatterjee-Price Attitude Data
austres Quarterly Time Series of the Number of Australian Residents
beaver1 (beavers) Body Temperature Series of Two Beavers
beaver2 (beavers) Body Temperature Series of Two Beavers
cars Speed and Stopping Distances of Cars
chickwts Chicken Weights versus Feed Type
co2 Mauna Loa Atmospheric CO2 Concentration
crimtab Student's 3000 Criminals Data
discoveries Yearly Numbers of Important Discoveries
esoph Smoking, Alcohol and (O)esophageal Cancer
euro Conversion Rates of Euro Currencies
euro.cross (euro) Conversion Rates of Euro Currencies
eurodist Distances Between European Cities and Between US Cities
faithful Old Faithful Geyser Data
fdeaths (UKLungDeaths) Monthly Deaths from Lung Diseases in the UK
freeny Freeny's Revenue Data
freeny.x (freeny) Freeny's Revenue Data
freeny.y (freeny) Freeny's Revenue Data
infert Infertility after Spontaneous and Induced Abortion
iris Edgar Anderson's Iris Data
iris3 Edgar Anderson's Iris Data
islands Areas of the World's Major Landmasses
ldeaths (UKLungDeaths) Monthly Deaths from Lung Diseases in the UK
lh Luteinizing Hormone in Blood Samples
longley Longley's Economic Regression Data
lynx Annual Canadian Lynx trappings 1821-1934
mdeaths (UKLungDeaths) Monthly Deaths from Lung Diseases in the UK
morley Michelson Speed of Light Data
mtcars Motor Trend Car Road Tests
nhtemp Average Yearly Temperatures in New Haven
nottem Average Monthly Temperatures at Nottingham, 1920-1939
npk Classical N, P, K Factorial Experiment
occupationalStatus Occupational Status of Fathers and their Sons
precip Annual Precipitation in US Cities
presidents Quarterly Approval Ratings of US Presidents
pressure Vapor Pressure of Mercury as a Function of Temperature
quakes Locations of Earthquakes off Fiji
randu Random Numbers from Congruential Generator RANDU
rivers Lengths of Major North American Rivers
rock Measurements on Petroleum Rock Samples
sleep Student's Sleep Data
stack.loss (stackloss) Brownlee's Stack Loss Plant Data
stack.x (stackloss) Brownlee's Stack Loss Plant Data
stackloss Brownlee's Stack Loss Plant Data
state.abb (state) US State Facts and Figures
state.area (state) US State Facts and Figures
state.center (state) US State Facts and Figures
state.division (state) US State Facts and Figures
state.name (state) US State Facts and Figures
state.region (state) US State Facts and Figures
state.x77 (state) US State Facts and Figures
sunspot.month Monthly Sunspot Data, from 1749 to "Present"
sunspot.year Yearly Sunspot Data, 1700-1988
sunspots Monthly Sunspot Numbers, 1749-1983
swiss Swiss Fertility and Socioeconomic Indicators (1888) Data
treering Yearly Treering Data, -6000-1979
trees Width, Height and Volume for Cherry Trees
uspop Populations Recorded by the US Census
volcano Topographic Information on Auckland's Maunga Whau Volcano
warpbreaks The Number of Yarn Breaks during Weaving
women Average Heights and Weights for American Women

We can view the contents of any of these datasets using the following commands (replace <dataset_name> and <n> with the dataset and the number of rows you want):

# Load a dataset by name
data(<dataset_name>)
# Print the first n rows
head(<dataset_name>, <n>)

For example, let's explore the iris dataset:

data(iris)
head(iris, 3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
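Beyond head(), a few other base R functions are handy for a first look at a built-in dataset. The sketch below uses only base R; the variable names available and first_rows are just illustrative.

```r
# List the names of the datasets shipped in the 'datasets' package.
available <- data(package = "datasets")$results[, "Item"]
head(available)

# Load iris and inspect its shape and structure.
data(iris)
dim(iris)                    # number of rows and columns
str(iris)                    # column types and a preview of the values
summary(iris$Sepal.Length)   # five-number summary plus the mean

# Keep the first three rows for a quick look.
first_rows <- head(iris, 3)
first_rows
```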

What is Data ingestion in R?

R is used for data processing and analysis tasks. The first step in this activity is importing data. Base R provides several functions for this purpose. Let us explore some of these for a better understanding:

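read.table : Used for importing tabular data delimited by whitespace (tabs or spaces) by default; a different delimiter can be set with the sep argument. For example, given a file data.txt with the following contents: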

1 2 3
4 5 6
7 8 9
A b c
D e f

  > df <- read.table("data.txt", header = FALSE)
  > df

  V1  V2  V3
1 1 2 3
2 4 5 6
3 7 8 9
4 A b c
5 D e f

read.csv : Used for importing CSV files with a comma (,) delimiter, e.g.,

1,2,3
4,5,6
7,8,9
A,b,c
D,e,f

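read.csv2 : Used for importing CSV files with a semicolon (;) delimiter and a comma as the decimal mark.

read.delim : Used for importing delimited files. The default delimiter is a tab, and any other delimiter can be supplied through the sep argument.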

Functions from other packages are available for importing data in specific formats, e.g.,

  • readxl::read_excel for reading Excel files
  • rjson::fromJSON for reading JSON data
  • XML::xmlTreeParse for XML data
  • XML::readHTMLTable for reading HTML table data
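To make the differences between the base R readers concrete, here is a small self-contained sketch. It writes tiny temporary files and reads them back; the file contents and the column names x, y and z are made up for illustration.

```r
# Whitespace-delimited file, no header: read.table names columns V1, V2, V3.
txt_path <- tempfile(fileext = ".txt")
writeLines(c("1 2 3", "4 5 6", "7 8 9"), txt_path)
df_txt <- read.table(txt_path, header = FALSE)

# Comma-delimited file with a header row.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("x,y,z", "1,2,3", "4,5,6"), csv_path)
df_csv <- read.csv(csv_path)

# Semicolon-delimited file; read.csv2 treats the comma as the decimal mark.
csv2_path <- tempfile(fileext = ".csv")
writeLines(c("x;y", "1,5;2,5"), csv2_path)
df_csv2 <- read.csv2(csv2_path)

# Tab-delimited file, the default for read.delim.
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("x\ty", "1\t2"), tsv_path)
df_tsv <- read.delim(tsv_path)

print(df_txt)
print(df_csv)
print(df_csv2)
print(df_tsv)
```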

What is Data Preparation and Cleansing in R?

It is often said that about eighty percent of the effort in data analysis is spent on cleaning and preparing data. This is not just the first step; it may need to be repeated many times over the course of an analysis.

In tidy data:

  • Each variable forms a column.
  • Each observation forms a row.
  • Each type of observation unit forms a table.

Tidy data represents a standard way of structuring a dataset. Real-world datasets are often not available in tidy format:

  • Column headers may not be variable names.
  • Multiple variables might be stored in one column.
  • Variables might be stored in rows.
  • Unrelated observations might be stored in the same table.
  • A single observational unit might be stored across multiple tables.

The tidyr package helps convert data into tidy format. It provides three main functions for tidying up messy data:

  • gather()
  • separate()
  • spread()

gather() takes multiple columns, and organizes them into key-value pairs.

For example, consider the dataset below, which represents the scores in 2 tests (a and b) for 3 individuals named Amar, Akbar and Anthony:


library(tidyr)

messy <- data.frame(
  name = c("Amar", "Akbar", "Anthony"),
  a = c(56, 91, 88),
  b = c(72, 64, 60)
)
messy
#>      name  a  b
#> 1    Amar 56 72
#> 2   Akbar 91 64
#> 3 Anthony 88 60

This dataset is not in tidy format, because the variable "test" is spread across the column headers a and b instead of having its own column. To make it tidy, the data must be represented with the columns name, test and score.

Let us see how we can use tidyr package to convert the existing dataset into tidy form.


messy %>%
  gather(test, score, a:b)
#>      name   test     score
#> 1   Amar    a        56
#> 2   Akbar   a        91
#> 3   Anthony a        88
#> 4   Amar    b        72
#> 5   Akbar   b        64
#> 6   Anthony b        60

Here we used the pipe operator %>%. The pipe operator allows you to pipe the output from one function to the input of another function. In our case the messy dataframe is piped as input to the gather function.

Similarly, the separate() function allows us to split two variables that are clumped together in one column.

spread() takes two columns (a key-value pair) and spreads them into multiple columns, making the data wider.
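The three tidyr functions can be tried together on the messy data frame from the example above. This is a sketch under a few assumptions: the clumped data frame and its "_" separator are made up for illustration, and the last call assumes tidyr 1.0.0 or later, where pivot_longer() is the newer counterpart of gather().

```r
library(tidyr)

messy <- data.frame(
  name = c("Amar", "Akbar", "Anthony"),
  a = c(56, 91, 88),
  b = c(72, 64, 60)
)

# gather(): columns a and b become key-value pairs (test, score).
long <- gather(messy, test, score, a:b)

# spread(): the inverse step, turning the key-value pairs back into columns.
wide <- spread(long, test, score)

# separate(): split one clumped column into two variables.
# 'clumped' is an illustrative data frame, not from the original example.
clumped <- data.frame(id = c("Amar_a", "Akbar_b"), score = c(56, 64))
separated <- separate(clumped, id, into = c("name", "test"), sep = "_")

# Newer equivalent of gather(), assuming tidyr >= 1.0.0.
long2 <- pivot_longer(messy, cols = a:b, names_to = "test", values_to = "score")

print(long)
print(wide)
print(separated)
```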

The tidied dataset can then be transformed as per the requirement of analysis.

R provides several packages for data transformation. Let us look at one of these – dplyr.

Below are some of the functions which are useful for this purpose:

  • filter : Pick observations by their values
  • arrange : Reordering the rows
  • select : Pick variables by their names
  • mutate : Create new variables in terms of functions of existing variables
  • summarise : Create a single summary value from multiple given values
  • group_by() : grouping operations in the “split-apply-combine” concept

For example, let us find all the entries in the iris dataset with Species equal to 'virginica' and Sepal.Width greater than 3:

> library(dplyr)

> filter(iris,Species=="virginica",Sepal.Width>3)

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

1 6.3 3.3 6 2.5 virginica
2 7.2 3.6 6.1 2.5 virginica
3 6.5 3.2 5.1 2.0 virginica
4 6.4 3.2 5.3 2.3 virginica
5 7.7 3.8 6.7 2.2 virginica
6 6.9 3.2 5.7 2.3 virginica
7 6.7 3.3 5.7 2.1 virginica
8 7.2 3.2 6 1.8 virginica
9 7.9 3.8 6.4 2.0 virginica
10 6.3 3.4 5.6 2.4 virginica
11 6.4 3.1 5.5 1.8 virginica
12 6.9 3.1 5.4 2.1 virginica
13 6.7 3.1 5.6 2.4 virginica
14 6.9 3.1 5.1 2.3 virginica
15 6.8 3.2 5.9 2.3 virginica
16 6.7 3.3 5.7 2.5 virginica
17 6.2 3.4 5.4 2.3 virginica
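The other dplyr verbs listed above chain naturally with the pipe operator. The following sketch reuses the iris dataset; the derived column Petal.Ratio and the summary column names are illustrative choices, not part of the dataset.

```r
library(dplyr)

# filter(): same query as above, kept in a variable this time.
wide_sepals <- filter(iris, Species == "virginica", Sepal.Width > 3)

# select(): keep only a few columns by name.
sepal_only <- select(iris, Species, Sepal.Length, Sepal.Width)

# mutate() + group_by() + summarise() + arrange():
# add a derived column, then compute one summary row per species.
species_summary <- iris %>%
  mutate(Petal.Ratio = Petal.Length / Petal.Width) %>%
  group_by(Species) %>%
  summarise(
    mean_sepal_length = mean(Sepal.Length),
    mean_petal_ratio = mean(Petal.Ratio),
    n = n()
  ) %>%
  arrange(desc(mean_sepal_length))

print(nrow(wide_sepals))
print(species_summary)
```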

What is Data Modeling in R?

A model provides a simple low-dimensional summary of a given dataset. R provides inbuilt functions that make fitting statistical models very simple.

The function to fit linear models is called lm. It is very useful for regression analysis of a dataset. The generic syntax is as follows:

> fitted_model <- lm(formula, data = data.frame)

For example:

> fitted_model <- lm(y ~ x1 + x2 + x3, data = production)

will fit a multiple regression model of the dependent variable y on the independent variables x1, x2 and x3.

Generalized linear models extend linear models to distributions such as gaussian, binomial, poisson, inverse gaussian, gamma and quasi-likelihood models.

> fitted_model <- glm(formula, family = family.generator, data = data.frame)

For example:

> fitted_model <- glm(y ~ x1 + x2, family = gaussian, data = production)

or

> fitted_model <- glm(y ~ x, family = binomial(link = probit), data = mydata)

or

> fitted_model <- glm(y ~ x1 + x2 - 1, family = quasi(link = inverse, variance = constant), data = testdata)

The model parameters can be viewed by calling

> summary(fitted_model)

Example:
## 
## Call:
## glm(formula = formula, family = "binomial", data = mydata)
##
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6456  -0.5858  -0.2609  -0.0651   3.1982  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                0.07882    0.21726   0.363  0.71675    
## age                        0.41119    0.01857  22.146  
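For a version you can run as-is, the sketch below fits both kinds of model on the built-in mtcars dataset. The choice of variables (mpg, wt, hp, am) is only for illustration and is not taken from the examples above.

```r
# Linear model: fuel efficiency as a function of weight and horsepower.
fit_lm <- lm(mpg ~ wt + hp, data = mtcars)
print(coef(fit_lm))

# Generalized linear model: logistic regression of transmission type
# (am: 0 = automatic, 1 = manual) on weight.
fit_glm <- glm(am ~ wt, family = binomial(link = "logit"), data = mtcars)
print(summary(fit_glm))

# Predicted probability of a manual transmission for a car with wt = 3
# (weight is recorded in units of 1000 lbs).
p_manual <- predict(fit_glm, newdata = data.frame(wt = 3), type = "response")
print(p_manual)
```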

Besides these, R also provides support for other kinds of models, such as:

  • Classification and regression models – caret package
  • Mixed models – nlme package
  • Robust regression – MASS package (reduces the influence of outliers)
  • Additive models – acepack package
  • Tree models – rpart and tree packages

What is Data Visualization in R?

Data visualization is an important aid in data analysis and decision making. ggplot2 is a data visualization package for R. It is an implementation of the Grammar of Graphics (gg), a general scheme for data visualization that breaks graphs up into components such as scales and layers. In contrast to base R graphics with the plot function, ggplot2 allows the user to add, remove or alter components in a plot at a high level of abstraction.

ggplot(dat, aes(year, lifeExp)) + geom_point()

This will create a graph between year and life expectancy data from the dataset dat and depict it using geometric points on the graph.

Different types of plots can be created by making use of additional graphing primitives such as geom_line(), geom_boxplot(), geom_smooth() etc.

qplot is a convenient wrapper on top of ggplot2 for creating a number of different types of plots.

The generic syntax for qplot is:

qplot(x, y, ..., data, facets = NULL, margins = FALSE, geom = "auto",
      xlim = c(NA, NA), ylim = c(NA, NA), log = "", main = NULL, xlab = NULL,
      ylab = NULL, asp = NA, stat = NULL, position = NULL)

The arguments are:

  • x, y, ... : Aesthetics passed into each layer.
  • data : Data frame to use (optional). If not specified, will create one.
  • facets : Faceting formula to use.
  • margins : See facet_grid: display marginal facets?
  • geom : Character vector specifying geom(s) to draw. Default: "point" if both x and y are specified, "histogram" if only x is specified.
  • xlim, ylim : X and y axis limits.
  • log : Variables to log transform ("x", "y", or "xy").
  • main, xlab, ylab : Character vector/expression giving plot title, x axis label, and y axis label.
  • asp : The y/x aspect ratio.
  • stat, position : DEPRECATED.

e.g.: qplot(mpg, wt, data = mtcars)


f <- function() {
   a <- 1:10
   b <- a ^ 3
   qplot(a, b)
}
f()

This will plot the values of a (1 to 10) on the x-axis against b = a^3 on the y-axis, with each (x, y) pair represented by a point.
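The same kind of plot can be built layer by layer with ggplot(), which shows how the grammar-of-graphics components combine. This sketch uses the built-in mtcars dataset; the axis labels and title are illustrative. Recent ggplot2 releases mark qplot() as deprecated, so the layered form is likely the safer one to learn.

```r
library(ggplot2)

# Scatter plot of weight against fuel efficiency with a fitted straight line.
scatter <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +                            # one point per car
  geom_smooth(method = "lm", se = FALSE) +  # linear trend, no confidence band
  labs(title = "Fuel efficiency by weight",
       x = "Weight (1000 lbs)", y = "Miles per gallon")

# Box plot of fuel efficiency grouped by number of cylinders.
boxes <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot() +
  labs(x = "Cylinders", y = "Miles per gallon")

# Printing a ggplot object draws it on the active graphics device.
print(scatter)
print(boxes)
```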

Most Common R Libraries for Data Science

R has many packages and libraries to help with different tasks in Data Science. Here are some of the best add-on packages for R recommended by RStudio:

1. Database Packages

  • DBI: Connects R with different databases.
  • RMySQL and RSQLite: Load and read data from databases like MySQL and SQLite.
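As a rough illustration of how DBI and RSQLite fit together, the sketch below creates a throwaway in-memory SQLite database, writes the built-in mtcars data into it, and queries it with SQL. It assumes both packages are installed; the table name and query are made up for the example.

```r
library(DBI)

# Open an in-memory SQLite database (nothing is written to disk).
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Copy a data frame into a database table.
dbWriteTable(con, "mtcars", mtcars)

# Run a SQL query and get the result back as a data frame.
avg_mpg <- dbGetQuery(
  con,
  "SELECT cyl, AVG(mpg) AS avg_mpg FROM mtcars GROUP BY cyl ORDER BY cyl"
)
print(avg_mpg)

# Always close the connection when finished.
dbDisconnect(con)
```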

2. Visualization Packages

  • ggplot2: Creates beautiful and easy-to-understand plots and graphs.
  • ggmap: Downloads maps from Google Maps and adds them to your plots.
  • shiny: Helps you build web apps.

3. Data Manipulation and Analysis Packages

  • dplyr: Makes it easy to summarize and rearrange data.
  • stringr: Provides simple tools to work with text.
  • lubridate: Helps you work with dates and times in your data.
  • DataExplorer: Useful for exploring data, creating new features, and making reports.

4. Machine Learning and Deep Learning Packages

  • randomForest and caret: Train models to predict outcomes.
  • deepnet: Tools for deep learning. You can also use popular frameworks like Keras and TensorFlow with R.
  • devtools: Helps you create your own R packages.

Conclusion:

R was developed from the ground up for data analysis and data interpretation. As is rightly said, data represents power in the new economy, but we need appropriate tools to harness the power inherent in raw data. R programming for data science provides us with these tools. With an ever-growing user community and an expanding package list covering all facets of data science, R is a language of choice for data science. This post provides a brief introduction to R and its capabilities so that readers can get started quickly and go on to explore the powerful features available for data modelling and interpretation.



    JanBask Training
