
**Introduction: R for Data Science**

Data science can be defined as the discipline of taking raw data as input and extracting knowledge and insights from it.

The R programming language, developed by Ross Ihaka and Robert Gentleman in 1993, is widely used for data science applications. R supports an extensive suite of statistical methods, inference techniques, machine learning algorithms, time series analysis, data analytics, and graphical plotting, to name a few. These features make it a great language for data exploration and investigation.

__Getting started with R__

R can be downloaded from CRAN ( https://cloud.r-project.org ), the **C**omprehensive **R** **A**rchive **N**etwork. CRAN comprises a set of mirror servers distributed around the world and is used to distribute R and R packages. A new major version of R comes out once a year, with two to three minor versions in between. RStudio provides an integrated development environment (IDE) for R programming; it can be downloaded from http://www.rstudio.com/download.

__R Packages__

Much of R's functionality is delivered through packages. There are now over 10,000 R packages on CRAN.

Packages bundle R functions, data, and compiled code in a well-defined format. The directory containing installed packages is called the library. R ships with a standard set of packages; additional packages are available for download and installation. Examples of R packages include arules, ggplot2, caret, and shiny. Packages can be installed with the install.packages() function as shown below.

`install.packages("<package-name>")`

This command causes R to download the package from CRAN. Once you have a package installed, you can make its contents available to use in your current R session by using the library command:

`library("<package name>")`
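In practice it is common to guard the install so it only runs when the package is missing. A minimal sketch (MASS, which ships with most R installations, is used here purely as an example name):

```r
pkg <- "MASS"   # example package name; substitute any CRAN package

# Install only if the package is not already present, then load it.
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}
library(pkg, character.only = TRUE)
```

`character.only = TRUE` tells library() to treat `pkg` as a variable holding the package name rather than as the name itself.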

__R Programming__

Let us take a look at a simple R program which prints “Hello World”. This can be accomplished either from the command line in the R interpreter or via an R script. Let us look at both mechanisms.

Hello World in R – from the R command prompt:

`> print("Hello world!")`

`[1] "Hello world!"`

Creating a HelloWorld.R script in R:

`helloStr <- "Hello world!"`


`print(helloStr)`

The script can be executed using `Rscript HelloWorld.R`. It will print:

`[1] "Hello world!"`

__Built-in Datasets in R__

R comes with a large number of built-in datasets. These can be used as demo data for understanding R packages and functions. Datasets in the package 'datasets' include:

AirPassengers | Monthly Airline Passenger Numbers 1949-1960 |

BJsales | Sales Data with Leading Indicator |

BJsales.lead (BJsales) | Sales Data with Leading Indicator |

BOD | Biochemical Oxygen Demand |

CO2 | Carbon Dioxide Uptake in Grass Plants |

ChickWeight | Weight versus age of chicks on different diets |

DNase | Elisa assay of DNase |

EuStockMarkets | Daily Closing Prices of Major European Stock Indices, 1991-1998 |

Formaldehyde | Determination of Formaldehyde |

HairEyeColor | Hair and Eye Color of Statistics Students |

Harman23.cor | Harman Example 2.3 |

Harman74.cor | Harman Example 7.4 |

Indometh | Pharmacokinetics of Indomethacin |

InsectSprays | Effectiveness of Insect Sprays |

JohnsonJohnson | Quarterly Earnings per Johnson & Johnson Share |

LakeHuron | Level of Lake Huron 1875-1972 |

LifeCycleSavings | Intercountry Life-Cycle Savings Data |

Loblolly | Growth of Loblolly pine trees |

Nile | Flow of the River Nile |

Orange | Growth of Orange Trees |

OrchardSprays | Potency of Orchard Sprays |

PlantGrowth | Results from an Experiment on Plant Growth |

Puromycin | Reaction Velocity of an Enzymatic Reaction |

Seatbelts | Road Casualties in Great Britain 1969-84 |

Theoph | Pharmacokinetics of Theophylline |

Titanic | Survival of passengers on the Titanic |

ToothGrowth | The impact of Vitamin C on Tooth Growth in Guinea Pigs |

UCBAdmissions | Student Admissions at UC Berkeley |

UKDriverDeaths | Road Casualties in Great Britain 1969-84 |

UKgas | UK Quarterly Gas Consumption |

USAccDeaths | Accidental Deaths in the US 1973-1978 |

USArrests | Violent Crime Rates by US State |

USJudgeRatings | Ratings of State Judges in the US Superior Court |

USPersonalExpenditure | Personal Expenditure Data |

UScitiesD | Distances Between European Cities and Between US Cities |

VADeaths | Death Rates in Virginia (1940) |

WWWusage | Internet Usage per Minute |

WorldPhones | The World's Telephones |

ability.cov | Ability and Intelligence Tests |

airmiles | Passenger Miles on Commercial US Airlines, 1937-1960 |

airquality | New York Air Quality Measurements |

anscombe | Anscombe's Quartet of 'Identical' Simple Linear Regressions |

attenu | The Joyner-Boore Attenuation Data |

attitude | The Chatterjee-Price Attitude Data |

austres | Quarterly Time Series of the Number of Australian Residents |

beaver1 (beavers) | Body Temperature Series of Two Beavers |

beaver2 (beavers) | Body Temperature Series of Two Beavers |

cars | Speed and Stopping Distances of Cars |

chickwts | Chicken Weights versus Feed Type |

co2 | Mauna Loa Atmospheric CO2 Concentration |

crimtab | Student's 3000 Criminals Data |

discoveries | Yearly Numbers of Important Discoveries |

esoph | Smoking, Alcohol and (O)esophageal Cancer |

euro | Conversion Rates of Euro Currencies |

euro.cross (euro) | Conversion Rates of Euro Currencies |

eurodist | Distances Between European Cities and Between US Cities |

faithful | Old Faithful Geyser Data |

fdeaths (UKLungDeaths) | Monthly Deaths from Lung Diseases in the UK |

freeny | Freeny's Revenue Data |

freeny.x (freeny) | Freeny's Revenue Data |

freeny.y (freeny) | Freeny's Revenue Data |

infert | Infertility after Spontaneous and Induced Abortion |

iris | Edgar Anderson's Iris Data |

iris3 | Edgar Anderson's Iris Data |

islands | Areas of the World's Major Landmasses |

ldeaths (UKLungDeaths) | Monthly Deaths from Lung Diseases in the UK |

lh | Luteinizing Hormone in Blood Samples |

longley | Longley's Economic Regression Data |

lynx | Annual Canadian Lynx trappings 1821-1934 |

mdeaths (UKLungDeaths) | Monthly Deaths from Lung Diseases in the UK |

morley | Michelson Speed of Light Data |

mtcars | Motor Trend Car Road Tests |

nhtemp | Average Yearly Temperatures in New Haven |

nottem | Average Monthly Temperatures at Nottingham, 1920-1939 |

npk | Classical N, P, K Factorial Experiment |

occupationalStatus | Occupational Status of Fathers and their Sons |

precip | Annual Precipitation in US Cities |

presidents | Quarterly Approval Ratings of US Presidents |

pressure | Vapor Pressure of Mercury as a Function of Temperature |

quakes | Locations of Earthquakes off Fiji |

randu | Random Numbers from Congruential Generator RANDU |

rivers | Lengths of Major North American Rivers |

rock | Measurements on Petroleum Rock Samples |

sleep | Student's Sleep Data |

stack.loss (stackloss) | Brownlee's Stack Loss Plant Data |

stack.x (stackloss) | Brownlee's Stack Loss Plant Data |

stackloss | Brownlee's Stack Loss Plant Data |

state.abb (state) | US State Facts and Figures |

state.area (state) | US State Facts and Figures |

state.center (state) | US State Facts and Figures |

state.division (state) | US State Facts and Figures |

state.name (state) | US State Facts and Figures |

state.region (state) | US State Facts and Figures |

state.x77 (state) | US State Facts and Figures |

sunspot.month | Monthly Sunspot Data, from 1749 to "Present" |

sunspot.year | Yearly Sunspot Data, 1700-1988 |

sunspots | Monthly Sunspot Numbers, 1749-1983 |

swiss | Swiss Fertility and Socioeconomic Indicators (1888) Data |

treering | Yearly Treering Data, -6000-1979 |

trees | Width, Height and Volume for Cherry Trees |

uspop | Populations Recorded by the US Census |

volcano | Topographic Information on Auckland's Maunga Whau Volcano |

warpbreaks | The Number of Yarn Breaks during Weaving |

women | Average Heights and Weights for American Women |

We can view the contents of any of these datasets using the following commands:

`# Load the dataset`

`data(<dataset name>)`

`# Print the first n rows`

`head(<dataset name>, <n>)`

For example, let's explore the iris dataset:

`data(iris)`

`head(iris, 3)`

| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |

1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |

2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |

3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
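Beyond head(), a few other base functions give a quick feel for any of these datasets; a sketch using iris:

```r
data(iris)                   # load the built-in dataset
dim(iris)                    # number of rows and columns: 150 5
str(iris)                    # structure: column types and sample values
summary(iris$Sepal.Length)   # five-number summary plus the mean
levels(iris$Species)         # "setosa" "versicolor" "virginica"
```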

__Data ingestion in R__

R is used for data processing and analysis tasks, and the first step in any such activity is importing data. Base R provides several functions for this purpose. Let us explore some of them:

**read.table** : Used for importing whitespace-delimited (e.g., space- or tab-separated) tabular data.

e.g.,

1 | 2 | 3 |

4 | 5 | 6 |

7 | 8 | 9 |

A | b | c |

D | e | f |

`> df <- read.table("data.txt", header = FALSE)`

`> df`

| V1 | V2 | V3 |

1 | 1 | 2 | 3 |

2 | 4 | 5 | 6 |

3 | 7 | 8 | 9 |

4 | A | b | c |

5 | D | e | f |

**read.csv :** Used for importing a CSV file with a comma (,) delimiter.

e.g.,

1 | 2 | 3 |

4 | 5 | 6 |

7 | 8 | 9 |

A | b | c |

D | e | f |

**read.csv2 :** Used for importing a CSV file with a semicolon (;) delimiter (and a comma as the decimal mark).

**read.delim :** Used for importing a tab-delimited file; an arbitrary delimiter can be supplied via the sep argument.
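A small round trip makes the delimiter variants concrete. The sketch below writes a semicolon-separated file to a temporary path (the file contents are made up for illustration) and reads it back two ways:

```r
# Write a tiny semicolon-separated file to a temporary location.
path <- tempfile(fileext = ".csv")
writeLines(c("x;y", "1;a", "2;b"), path)

df1 <- read.csv2(path)              # semicolon delimiter (comma decimal mark)
df2 <- read.delim(path, sep = ";")  # same result via the generic reader
df1
#>   x y
#> 1 1 a
#> 2 2 b
```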

Functions from other packages are available for importing data in specific formats, e.g.,


- **readxl**: `read_excel` for reading Excel files
- **rjson**: `fromJSON` for reading JSON data
- **XML**: `xmlTreeParse` for XML data and `readHTMLTable` for HTML table data (pages are typically fetched with **RCurl**)

__Data Preparation and Cleansing__

It is often said that eighty percent of data analysis is spent on cleaning and preparing data. Cleaning is not just a first step; it may need to be repeated many times over the course of an analysis.

In **tidy data**:

- Each variable forms a column.
- Each observation forms a row.
- Each type of observation unit forms a table.

Tidy data is a standard way of structuring a dataset. Real-world datasets, however, often violate this standard:

- Column headers may not be variable names.
- Multiple variables might be stored in one column.
- Variables might be stored in rows.
- Unrelated observations might be stored in the same table.
- A single observational unit might be stored across multiple tables.

R provides the tidyr package for converting data into tidy format. tidyr provides three main functions for tidying up messy data:

- gather(),
- separate(), and
- spread().

gather() takes multiple columns and collapses them into key-value pairs.

For example, consider the dataset below, which holds the scores of three individuals (Amar, Akbar, and Anthony) on two tests (a and b):

`library(tidyr)`
`messy <- data.frame(`
`  name = c("Amar", "Akbar", "Anthony"),`
`  a = c(56, 91, 88),`
`  b = c(72, 64, 60)`
`)`
`messy`
`#>      name  a  b`
`#> 1    Amar 56 72`
`#> 2   Akbar 91 64`
`#> 3 Anthony 88 60`

But this dataset is currently not in tidy format (variables must correspond to columns). To tidy it, the data must be represented with the columns name, test, and score.

Let us see how we can use the tidyr package to convert the existing dataset into tidy form.

`messy %>% gather(test, score, a:b)`
`#>      name test score`
`#> 1    Amar    a    56`
`#> 2   Akbar    a    91`
`#> 3 Anthony    a    88`
`#> 4    Amar    b    72`
`#> 5   Akbar    b    64`
`#> 6 Anthony    b    60`

Here we used the pipe operator %>%, which passes the output of one expression as input to the next; in our case, the messy data frame is piped into the gather function.

Similarly, the separate function allows us to split apart two variables that are clumped together in one column.

spread() takes two columns (a key-value pair) and spreads them into multiple columns, making the data wider.
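To make separate() and spread() concrete, here is a small sketch (both data frames are made up for illustration):

```r
library(tidyr)

# separate(): split a column that clumps two variables together.
clumped <- data.frame(rate = c("12/100", "30/200"))
tidied <- separate(clumped, rate, into = c("cases", "population"), sep = "/")
tidied
#>   cases population
#> 1    12        100
#> 2    30        200

# spread(): the inverse of gather() -- key-value pairs back to wide columns.
long <- data.frame(name = c("Amar", "Amar", "Akbar", "Akbar"),
                   test = c("a", "b", "a", "b"),
                   score = c(56, 72, 91, 64))
wide <- spread(long, test, score)   # one row per name, columns a and b
```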

The tidied dataset can then be transformed as per the requirement of analysis.

R provides several packages for data transformation. Let us look at one of these – dplyr.

Below are some of the functions which are useful for this purpose:

- filter : Pick observations by their values
- arrange : Reorder the rows
- select : Pick variables by their names
- mutate : Create new variables as functions of existing variables
- summarise : Collapse multiple values into a single summary value
- group_by : Perform grouped operations in the “split-apply-combine” pattern

For example, let us determine all the entries in the iris dataset with Species equal to ‘virginica’ and Sepal.Width > 3:

> library(dplyr)

`> filter(iris, Species == "virginica", Sepal.Width > 3)`


| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |

1 | 6.3 | 3.3 | 6.0 | 2.5 | virginica |

2 | 7.2 | 3.6 | 6.1 | 2.5 | virginica |

3 | 6.5 | 3.2 | 5.1 | 2.0 | virginica |

4 | 6.4 | 3.2 | 5.3 | 2.3 | virginica |

5 | 7.7 | 3.8 | 6.7 | 2.2 | virginica |

6 | 6.9 | 3.2 | 5.7 | 2.3 | virginica |

7 | 6.7 | 3.3 | 5.7 | 2.1 | virginica |

8 | 7.2 | 3.2 | 6.0 | 1.8 | virginica |

9 | 7.9 | 3.8 | 6.4 | 2.0 | virginica |

10 | 6.3 | 3.4 | 5.6 | 2.4 | virginica |

11 | 6.4 | 3.1 | 5.5 | 1.8 | virginica |

12 | 6.9 | 3.1 | 5.4 | 2.1 | virginica |

13 | 6.7 | 3.1 | 5.6 | 2.4 | virginica |

14 | 6.9 | 3.1 | 5.1 | 2.3 | virginica |

15 | 6.8 | 3.2 | 5.9 | 2.3 | virginica |

16 | 6.7 | 3.3 | 5.7 | 2.5 | virginica |

17 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
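The same verbs compose naturally via the pipe operator; for example, a grouped summary of iris (a sketch of the “split-apply-combine” pattern):

```r
library(dplyr)

# Filter, split by species, then combine into one summary row per group.
iris %>%
  filter(Sepal.Width > 3) %>%
  group_by(Species) %>%
  summarise(n = n(), mean_petal = mean(Petal.Length)) %>%
  arrange(desc(mean_petal))
```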

__Data Modeling__

A model provides a simple, low-dimensional summary of a given dataset. R provides built-in functions that make fitting statistical models very simple.

The function to fit linear models is called lm. It is very useful for regression analysis of datasets. The generic syntax is as follows:

`> fitted_model <- lm(formula, data = data.frame)`

For example:

`> fitted_model <- lm(y ~ x1 + x2 + x3, data = production)`

will fit a multiple regression model of the dependent variable y on the independent variables x1, x2, and x3.

Generalized linear models extend linear models to families such as gaussian, binomial, poisson, inverse gaussian, gamma, and quasi-likelihood models:

`> fitted_model <- glm(formula, family = family.generator, data = data.frame)`

For example:

`> fitted_model <- glm(y ~ x1 + x2, family = gaussian, data = production)`

or

`> fitted_model <- glm(y ~ x, family = binomial(link = probit), data = mydata)`

or

`> fitted_model <- glm(y ~ x1 + x2 - 1, family = quasi(link = inverse, variance = constant), data = testdata)`

The fitted model parameters can be inspected by calling:

`> summary(fitted_model)`

Example output from a binomial GLM:

`## Call:`
`## glm(formula = formula, family = "binomial", data = mydata)`
`##`
`## Deviance Residuals:`
`##     Min      1Q  Median      3Q     Max`
`## -2.6456 -0.5858 -0.2609 -0.0651  3.1982`
`##`
`## Coefficients:`
`##                            Estimate Std. Error z value Pr(>|z|)`
`## (Intercept)                 0.07882    0.21726   0.363  0.71675`
`## age                         0.41119    0.01857  22.146  < 2e-16 ***`
`## workclassLocal-gov         -0.64018    0.09396  -6.813 9.54e-12 ***`
`## workclassPrivate           -0.53542    0.07886  -6.789 1.13e-11 ***`
`## workclassSelf-emp-inc      -0.07733    0.10350  -0.747  0.45499`
`## workclassSelf-emp-not-inc  -1.09052    0.09140 -11.931  < 2e-16 ***`
`## workclassState-gov         -0.80562    0.10617  -7.588 3.25e-14 ***`
`## workclassWithout-pay       -1.09765    0.86787  -1.265  0.20596`
`## educationCommunity         -0.44436    0.08267  -5.375 7.66e-08 ***`
`## educationHighGrad          -0.67613    0.11827  -5.717 1.08e-08 ***`
`## educationMaster             0.35651    0.06780   5.258 1.46e-07 ***`
`## educationPhD                0.46995    0.15772   2.980  0.00289 **`
`## educationdropout           -1.04974    0.21280  -4.933 8.10e-07 ***`
`## educational.num             0.56908    0.07063   8.057 7.84e-16 ***`
`## marital.statusNot_married  -2.50346    0.05113 -48.966  < 2e-16 ***`
`## marital.statusSeparated    -2.16177    0.05425 -39.846  < 2e-16 ***`
`## marital.statusWidow        -2.22707    0.12522 -17.785  < 2e-16 ***`
`## raceAsian-Pac-Islander      0.08359    0.20344   0.411  0.68117`
`## raceBlack                   0.07188    0.19330   0.372  0.71001`
`## raceOther                   0.01370    0.27695   0.049  0.96054`
`## raceWhite                   0.34830    0.18441   1.889  0.05894 .`
`## genderMale                  0.08596    0.04289   2.004  0.04506 *`
`## hours.per.week              0.41942    0.01748  23.998  < 2e-16 ***`
`## ---`
`## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1`
`##`
`## (Dispersion parameter for binomial family taken to be 1)`
`##`
`## Null deviance: 40601 on 36428 degrees of freedom`
`## Residual deviance: 27041 on 36406 degrees of freedom`
`## AIC: 27087`
`##`
`## Number of Fisher Scoring iterations: 6`
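As a concrete, runnable illustration on a built-in dataset (cars: stopping distance versus speed), a simple linear model can be fitted and inspected:

```r
# Fit stopping distance as a linear function of speed (built-in cars data).
fitted_model <- lm(dist ~ speed, data = cars)

coef(fitted_model)                # intercept and slope
summary(fitted_model)$r.squared   # proportion of variance explained (about 0.65)

# Predict the stopping distance at a new speed value.
predict(fitted_model, newdata = data.frame(speed = 21))
```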

Besides these, R also provides support for other model classes, such as:

- Classification and regression models – caret package
- Mixed models – nlme package
- Robust regression – MASS package (downweights outliers)
- Additive models – acepack package
- Tree models – rpart and tree packages

__Data Visualization__

Data visualization is an important aid in data analysis and decision making. ggplot2 is a data visualization package for R that implements the Grammar of Graphics, a general scheme for data visualization which breaks graphs into components such as scales and layers. In contrast to base R graphics using the plot function, ggplot2 allows the user to add, remove, or alter components in a plot at a high level of abstraction.

`ggplot(dat, aes(year, lifeExp)) + geom_point()`

This will create a plot of lifeExp against year from the dataset dat, depicting each observation as a geometric point.

Different types of plots can be created by making use of additional graphing primitives such as `geom_line(), geom_boxplot(), geom_smooth()`, etc.
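For instance, layers can be combined on the built-in mtcars data (a sketch; ggplot2 must be installed):

```r
library(ggplot2)

# Scatterplot of fuel economy vs. weight (built-in mtcars data) with a
# linear trend line, coloured by cylinder count.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(colour = factor(cyl))) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")

print(p)   # render the plot
```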

qplot is a convenient wrapper on top of ggplot2 for creating a number of different types of plots.

The generic syntax for qplot is:

`qplot(x, y, ..., data, facets = NULL, margins = FALSE, geom = "auto", xlim = c(NA, NA), ylim = c(NA, NA), log = "", main = NULL, xlab = NULL, ylab = NULL, asp = NA, stat = NULL, position = NULL)`

where,

x, y, ... | Aesthetics passed into each layer |

data | Data frame to use (optional). If not specified, will create one. |

facets | faceting formula to use. |

margins | See facet_grid: display marginal facets? |

geom | Character vector specifying geom(s) to draw. Default: "point" if both x and y are specified, "histogram" if only x is specified. |

xlim, ylim | X and y axis limits |

log | variables to log transform ("x", "y", or "xy") |

main, xlab, ylab | Character vector/expression giving plot title, x axis label, and y axis label. |

asp | The y/x aspect ratio |

stat, position | DEPRECATED. |

e.g.,

`qplot(mpg, wt, data = mtcars)`

`f <- function() {`
`  a <- 1:10`
`  b <- a ^ 3`
`  qplot(a, b)`
`}`
`f()`

This will plot a = 1 to 10 on the x-axis against b = a^3 on the y-axis, with each (x, y) pair represented by a point.

__Conclusion:__

R as a language was developed from the ground up for data analysis and data interpretation. As is rightly said, data represents power in the new economy, but we need appropriate tools to harness the power inherent in raw data, and R provides us with that power. With an ever-growing user community and an expanding package list covering all facets of data science, R is a language of choice for data science. This post provides a brief introduction to R and its capabilities so that readers can get started quickly and begin exploring further all the powerful features available for data modelling and interpretation.


