Today's Offer - Data Analytics Certification Training - Enroll at Flat 10% Off.

- Data Science Blogs -

How to import Data into R using Excel, CSV, Text and XML

Importing Data into R

R packages provide some sample data to learn the tools of data science but in real-time you have to work on your custom data for real-world applications. The objective of this tutorial is to explain how to read plain-text rectangular files into R. We will mainly focus on data import packages, but many of the principles will translate to other forms of data. We’ll explore some concepts of packages that are useful for other types of data.

Importing external data in R environment means that we can read data from external files, write data to external files, and can access those files from outside the R environment. File formats like csv, xml, xlsx, json, and web data can be imported into the R environment to read the data and perform data analysis, data manipulations and after data analysis data in R can be exported to external files in the same file formats. 

Importing TXT/CSV Files to R

Using Base Functions

The methods given below are the base functions of R for importing different types of external files, so no separate packages are required to install to use these functions.

Each of the functions come with default arguments which separates them from each other. These arguments are:-

  • header: logical Boolean value. If the value is TRUE, then function assumes that your file has a header row. If headers are not given in excel file, then give value as FALSE.
  • fill: logical value. For TRUE value, unequal length rows will be added with blank fields implicitly.
  • sep: the field separator character. For example, “\t” is used for a tab-delimited file and a comma is used for comma-separated files.
  • dec: the character used for decimal points.
  • stringsAsFactor This should be set to TRUE if you want your text data to be converted to factors while importing.

We also need to give filename with a complete path or complete URL of the file is on the web.

Reading a Local File:

To import a file located in your local machine, follow the below steps: -

  • Set your working directory to point to the folder path containing your file with the command setwd ("") and then provide the file name in the function.
  • Use choose() within the import function. This will ask to select a file from your local machine.

Using the readr Package

The functions in this package (more or less) are used in a similar way as of the base functions. The main thing is that readr package functions work almost 10 times much faster than the base functions. So in the case of bulk files, importing this package will give very good speed in importing data.

  • delim: separates character values in the data file.
  • col_names: can be either TRUE (default value), FALSE or a character vector specifying column names. If it is TRUE, then the first row of the file will be returned as column names.

Reading CSV Files

CSV file is a kind of text file in which the values in columns are separated by a comma.

First, we have to set our working directory with the setwd() function to the directory where the file is located locally.

For example:

setwd("C:\Users\sarvendra.singh\Desktop\Janbask\Blogs\New blogs_july19\Random forest in R”)

Then to read values from CSV, there is a built-inunction read.csv() that export the data from the file as a data frame.

For example:


read.data <- read.csv("file1.csv")
print(read.data)

Analyzing a CSV File


#For printing number of columns
print(ncol(read.data))
Output:
[1] 5

#For printing  number of rows
print(nrow(read.data))
Output:
[1] 8

Writing to a CSV File

Writing values of pandas dataframe to an  external CSV file, we use the write.csv() function.


per.sal <- subset(read.data, empsalary >= "30000" & empsalary <= "40000")
# Writing data into a new CSV file
write.csv(per.sal,"output.csv")
new.data <- read.csv("output.csv")
print(new.data)

Reading XML Files

XML (Extensible Markup Language) file contains both data and file format using the ASCII text. It is a little similar to an HTML file that contains markup tags, but the tags in an XML file describe the meaning of the data contained in the file rather than the structure of the page.

To load data of XML file in R, we have to install the XML package, as described below-


install.packages("XML")
To read XML files, we use the in-built function xmlParse().
For example:
#To load required xml package to read XML files
library("XML")
#To load other required packages
library("methods")
#To give the input file name to the function
newfile <- xmlParse(file = "samplefile.xml")
print(newfile)

Converting an XML to a Data Frame

To perform data analysis effectively after importing data in R, we convert the data in an XML file to a Data Frame. After converting, we can perform data manipulation and other operations as performed in a data frame.

Read: How to Become a Successful Data Scientist?

For example:


library("XML")
library("methods")
#To convert the data in xml file to a data frame
xmlRdataframe <- xmlToDataFrame("samplefile.xml")
print(xmlRdataframe)

Output:

  ID NAME SALARY STARTDATE DEPT
1 1 Sam 32000 01/01/2001 HR
2 2 Rob 36000 09/03/2006 IT
3 3 Max 42000 01/05/2011 Sales
4 4 Ivar 50000 25/01/2001 Tech
5 5 Robert 25000 13/07/2015 Sales
6 6 Leon 57000 05/01/2000 IT
7 7 Samuel 45000 27/03/2003 Operations
8 8 Jack 24000 06/01/2016 Sales

Reading JSON Files

To import JSON files data into R, we first need to install or load the rjson package. JSON (JavaScript Object Notation) file is generally used for data exchange between a web application and a server. They are text-based human-readable files and editable by a normal text editor.

install.packages("rjson")

Now to read json files, we use the in-built function from JSON() which stores the data as a list.

For example:


#To load rjson package
library("rjson")
#To give the file name to the function
newfile <- fromJSON(file = "samplefile1.json")
#To print the file
print(newfile)

Output:
$ID

[1] "1" "2" "3" "4" "5" "6" "7" "8"

$Name

[1] "Sam"          "Rob"   "Max"  "Robert" "Ivar"   "Leon"   "Samuel" "Ivar"

$Salary

[1] "32000" "27000" "35000" "25000" "37000" "41000" "36000" "51000"

$StartDate

[1] "1/1/2001"  "9/3/2003"  "1/5/2004"  "14/11/2007" "13/7/2015" "4/3/2007"
[7] "27/3/2013"  "25/7/2000"

$Dept
[1] "IT"               "HR"                   "Tech"               "HR"                   "Sales"                "HR"
[7] "Operations" "IT"

Converting a JSON File to a Data Frame

To convert JSON file to a Data Frame, we use the as.data.frame() function.

For example:


library("rjson")
newfile <- fromJSON(file = "samplefile1.json")
#To convert a JSON file to a data frame
jsonRdataframe <- as.data.frame(newfile)
print(jsonRdataframe)

Output:

  ID NAME SALARY STARTDATE DEPT
1 1 Sam 32000 01/01/2001 IT
2 2 Rob 27000 09/03/2003 HR
3 3 Max 35000 01/05/2004 Tech
4 4 Ivar 25000 14/11/2007 HR
5 5 Robert 37000 13/07/2015 Sales
6 6 Leon 41000 04/03/2007 HR
7 7 Samuel 36000 27/03/2013 Operations
8 8 Jack 51000 25/07/2000 IT

Reading Excel Files

Microsoft Excel is a very popular file format that stores data in xls and xlsx format. We can read data from excel to R and write data from R back to Excel file using the readxl package in R.

To install the readxl package, run the following command


install.packages("readxl")
read_excel() function is used to import file into R and  stores it as a data frame.
newfile <- read_excel("sheet1.xlsx)
print(newfile)

Output:

  ID NAME DEPT SALARY AGE
1 1 SAM SALES 32000 35
2 2 ROB HR 36000 23
3 3 MAC IT 37000 40
4 4 IVAR IT 25000 37
5 5 MAX R&D 30000 22
6 6 ROBERT HR 27000 32
7 7 SAMUEL FINANCE 50000 41
8 8 RAGNAR SALES 45000 29

Reading HTML Tables

Reading and retrieving HTML tables from the web, we use the XML and RCurl packages in R programming.

Below scripts are used to install XML and Rcurl libraries.


install.packages("XML")
install.packages("RCurl")
For load the packages, use the following command:
library("XML")
library("RCurl")

For example, we will fetch the ‘Fiscal annual growth table from a url using the readHTMLTable() function which stores it as a Data Frame.


#To fetch a table from any website paste the url
url <- "https://en.wikipedia.org/wiki/Fiscal_Annuak_Growth#Ranking"
tabs <- getURL(url)
#To fetch the first table, if the webpage has more than one table, we use which = 1
tabs <- readHTMLTable(tabs,which = 1, stringsAsFactors = F)
head(tabs)

Output:

  V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13
1 Classification Jurisdiction 2019 2018 2017 2016 2015 2014 2013 2012 2011 2010 2009
2 Very High New Zealand 1 1 1 2 2 3 3 3 3 2 2
3 Very Low Singapore 2 2 2 1 1 1 1 1 1 1 1
4 Medium Denmark 3 3 3 3 4 5 5 5 6 6 5
5 Outstanding Hong Kong 4 5 4 4 3 2 2 2 2 3 4
6 Very Easy South Korea 5 4 5 5 5 7 8 8 16 19 23

str() function to analyze the structure of the data frame.

str(tabs)

Read: Deep Learning Interview Questions and Answers

Output:


'data.frame':  191 obs. of  16 variables:
$ V1 : chr  "Classification" "Very Easy" "Very Easy" "Very Easy" ...
$ V2 : chr  "Jurisdiction" "New Zealand" "Singapore" "Denmark" ...
$ V3 : chr  "2019" "1" "2" "3" ...
$ V4 : chr  "2018" "1" "2" "3" ...
$ V5 : chr  "2017" "1" "2" "3" ...
$ V6 : chr  "2016" "2" "1" "3" ...
$ V7 : chr  "2015" "2" "1" "4" ...
$ V8 : chr  "2014" "3" "1" "5" ...
$ V9 : chr  "2013" "3" "1" "5" ...
$ V10: chr  "2012" "3" "1" "5" ...
$ V11: chr  "2011" "3" "1" "6" ...
$ V12: chr  "2010" "2" "1" "6" ...
$ V13: chr  "2009" "2" "1" "5" ...
$ V14: chr  "2008" "2" "1" "5" ...
$ V15: chr  "2007" "2" "1" "7" ...
$ V16: chr  "2006" "1" "2" "8" ...

Reading from SPSS File

foreign package is required to install to read SPSS files in R. ”read” function is used to read and SPS files in R."to.data.frame" option is used to return a data frame from reading spss file.


library(foreign)  # load the foreign package
help(read.spss)  # documentation
mydata = read.spss("myfile", to.data.frame=TRUE)

Reading from Minitab File

To read  Minitab Portable Worksheet format in R, function read.mtp can be used from  the foreign package. It returns a list of components in the Minitab worksheet.


library(foreign)    # load the foreign package
help(read.mtp)     # documentation
mydata = read.mtp("mydata.mtp")  # read from .mtp file

Connecting to MYSQL Server from R

In Real-world applications data is mostly located in external databases, not in CSV files. So it is must to know how to make a connection and access data in R from SQL/NO SQL Databases.

Connecting R to MySQL is easy with the RMySQL package. First,  install the package and load the library.


install.packages("RMySQL")
library(RMySQL)

Connecting to MySQL Database:

Once the RMySQL library is installed create a database connection object.credentials like username, password needs to be provided as arguments in dbconnect function.


mysqldb = dbConnect(MySQL(), user='user', password='password', dbname='database_name', host='host')

Listing Tables and columns:

After the connection has been established we list the tables and fields in the database we connected to.

dbListTables(mysqldb)

A list of the tables will be returned in our connection.

dbListFields(mysqldb, 'some_table')

This will return a list of the fields in some_table. 

Executing Queries:

With dbSendQuery function, we can run queries on tables.

dbSendQuery(mydb, 'drop table if exists some_table, some_other_table')

Creating Tables:

We can create tables in the database using R data frames.

dbWriteTable(mydb, name='table_name', value=data.frame.name)

 Retrieving data from MySQL:

To retrieve data from the database we need to save a results set object.

Read: R Programming Language Interview Questions & Answers

rs = dbSendQuery(mydb, "select * from some_table")

Query results will remain on the MySQL server. To access the results in R, we need to use the fetch function.

data = fetch(rs, n=-1)

fetch command will save the results of the query in a data frame object. The n in the function specifies the number of records to retrieve, using n=-1 retrieves all pending records.

Connecting R to MongoDB

MongoDB connection with R is quite straightforward. The Package required to install for this connection is RMongo package. New features of  MongoDB which are implemented in the last few years have not been included in the community R drivers.


require(RMongo)
mongo <- mongoDbConnect('kenblog', 'localhost', 27017)

In the mongoDbConnect method, we need to give the name of the database, server name, and port number to which we want to connect.

Now to retrieve records, we create a query. For this example, let’s get only the examdata from our scores collection. We can use the dbGetQuery method for this which takes a connection object, the collection name, and the query.

examQuery <- dbGetQuery(mongo, 'scores', "{'type': 'exam'}")

This loads in all the records from our scores collection of type exam. Let’s take the values of our exam scores and create a vector from them.

exam_scores <- examQuery[c('score')]

Now we can get a simple summary of our data with command  summary(exam_scores):


score
Min.   : 60.00
1st Qu.: 72.00
Median : 79.00
Mean   : 79.45
3rd Qu.: 86.00
Max.   :100.00

PostgreSQL Connection with R

If we are working with large datasets that may consume lots of memory of our system, its better to have the data loaded in own server with an SQL/PostgreSQL database on it, where you can query the data in smaller digestible chunks

Let’s learn how to connect the PostgreSQL database with R. We need to install  RPostgreSQL package  for  this approach.

To connect, we need to enter the following commands in R:


# install.packages("RPostgreSQL")
require("RPostgreSQL")
# create a connection
# save the password that we can "hide" it as best as we can by collapsing it
pw <- {
"new_user_password"
}
# loads the PostgreSQL driver
drv <- dbDriver("PostgreSQL")
# creates a connection to the postgres database
# note that "con" will be used later in each connection to the database
con <- dbConnect(drv, dbname = "postgres",
host = "localhost", port = 5432,
user = "openpg", password = pw)
rm(pw) # removes the password
# check for the cartable
dbExistsTable(con, "cartable")


Conclusion

Reading data from different sources for analysis and exporting the results to some other system for report writing can be frustrating tasks because this can only take a lot of time compared to statistical analysis. Below is the summarised table having library and corresponding functions name to import a particular file in R.

Library Objective Function
base Export csv write.csv()
xlsx Export excel write.xlsx()
haven Export spss write_sav()
haven Export sas write_sas()
haven Export stata write_dta()

 

base Export R save()
googledrive Upload Google Drive drive_upload()
googledrive Open in Google Drive drive_browse()
googledrive Retrieve file ID drive_get(as_id())
googledrive Dowload from Google Drive download_google()
googledrive Remove file from Google Drive drive_rm()
rdrop2 Authentification drop_auth()
rdrop2 Create a folder drop_create()
rdrop2 Upload to Dropbox drop_upload()
rdrop2 Read csv from Dropbox drop_read_csv
rdrop2 Delete file from Dropbox drop_delete()


    Janbask Training

    JanBask Training is a leading Global Online Training Provider through Live Sessions. The Live classes provide a blended approach of hands on experience along with theoretical knowledge which is driven by certified professionals.


Trending Courses

AWS

  • AWS & Fundamentals of Linux
  • Amazon Simple Storage Service
  • Elastic Compute Cloud
  • Databases Overview & Amazon Route 53

Upcoming Class

2 days 14 Nov 2019

DevOps

  • Intro to DevOps
  • GIT and Maven
  • Jenkins & Ansible
  • Docker and Cloud Computing

Upcoming Class

3 days 15 Nov 2019

Data Science

  • Data Science Introduction
  • Hadoop and Spark Overview
  • Python & Intro to R Programming
  • Machine Learning

Upcoming Class

3 days 15 Nov 2019

Hadoop

  • Architecture, HDFS & MapReduce
  • Unix Shell & Apache Pig Installation
  • HIVE Installation & User-Defined Functions
  • SQOOP & Hbase Installation

Upcoming Class

4 days 16 Nov 2019

Salesforce

  • Salesforce Configuration Introduction
  • Security & Automation Process
  • Sales & Service Cloud
  • Apex Programming, SOQL & SOSL

Upcoming Class

2 days 14 Nov 2019

QA

  • Introduction and Software Testing
  • Software Test Life Cycle
  • Automation Testing and API Testing
  • Selenium framework development using Testing

Upcoming Class

-0 day 12 Nov 2019

Business Analyst

  • BA & Stakeholders Overview
  • BPMN, Requirement Elicitation
  • BA Tools & Design Documents
  • Enterprise Analysis, Agile & Scrum

Upcoming Class

3 days 15 Nov 2019

SQL Server

  • Introduction & Database Query
  • Programming, Indexes & System Functions
  • SSIS Package Development Procedures
  • SSRS Report Design

Upcoming Class

7 days 19 Nov 2019

Comments

Search Posts

Reset

Receive Latest Materials and Offers on Data Science Course

Interviews