
Practical guide to implement Random Forest in R with example

Random Forest In R

Before buying an expensive item such as a car or a house, or before investing in the share market, we usually seek advice from several people rather than purchasing on impulse. We collect suggestions from different people we know and then choose by weighing the positives and negatives each of them points out. The reason is that a single person's review can be biased by his own interests and past experiences; by asking many people we mitigate the bias of any one individual. One person may have a strong aversion to a product because of a bad experience with it, while several others may strongly favor the same product because their experiences were positive.

This concept is called 'Ensembling' in analytics. Ensembling is a technique in which many models are trained on a training dataset and their outputs are combined by some rule to produce the final output.

Decision trees have one serious drawback: they are prone to overfitting. If a decision tree is grown very deep, it will learn every possible relationship in the data, including noise. Overfitting can be mitigated with a technique called pruning, which reduces the size of a decision tree by removing parts of the tree that contribute little to correct classification. Even with pruning, the result is often not up to the mark. The primary reason is that the algorithm makes a locally optimal choice at each split without regard to whether that choice is best for the overall tree, so a bad split near the root can produce a poor model that post-hoc pruning cannot compensate for.
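To make pruning concrete, here is a minimal sketch using the rpart package (which ships with R as a recommended package) on the built-in iris data; the cp values here are illustrative, not taken from the article:

```r
library(rpart)

# Grow a deliberately deep tree (cp = 0 disables the built-in complexity stop)
deep_tree <- rpart(Species ~ ., data = iris,
                   control = rpart.control(cp = 0, minsplit = 2))

# Prune it back: subtrees that improve fit by less than cp = 0.05 are removed
pruned_tree <- prune(deep_tree, cp = 0.05)

# The pruned tree has fewer nodes than the fully grown one
c(deep = nrow(deep_tree$frame), pruned = nrow(pruned_tree$frame))
```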

Need for Random Forests

Decision trees are very popular because the way they make decisions reflects how humans make decisions: they check the available options at each stage of a split and select the best one. The same analogy also suggests how decision trees can be improved.

Some TV game shows give contestants an "Audience poll" lifeline: when clueless, the contestant can ask the audience to vote on the question. The answer given by the majority of independent people has a higher chance of being correct, because:

  • People have different experiences and will, therefore, draw upon different “data” to answer the question.
  • People have different learning curves and preferences and will, therefore, draw upon different “variables” to make their choices at each stage in their decision process.

Based on this comparison with human decision-making, it seems reasonable to build many decision trees by randomizing along two dimensions:

  • Different subsets of training data
  • Randomly selecting different subsets of columns for tree splitting

The final prediction is obtained by aggregating over all trees: the majority vote (mode) of the trees' predicted classes for classification problems, and the average of the trees' predictions for regression problems. This is how the random forest algorithm works.
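A toy base-R sketch of the aggregation step (the vote and prediction values are made up for illustration; the randomForest package aggregates the same way, voting for classification and averaging for regression):

```r
# Classification: three hypothetical trees vote; the mode (majority) wins
votes <- c("Yes", "No", "Yes")
majority <- names(which.max(table(votes)))
majority   # "Yes"

# Regression: the trees' numeric predictions are averaged
preds <- c(3.1, 2.9, 3.4)
mean(preds)
```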

These two strategies reduce overfitting by averaging the response over trees built from different samples of the dataset, and by decreasing the chance that a small set of strong predictors dominates the splits. But everything has a price: model interpretability is reduced and computational complexity increases.

Mechanics of the Algorithm

Without going into the mathematical details, let's understand how the above points are implemented in the algorithm.

The main feature of the algorithm is that each tree is built on a different dataset. This is achieved by a statistical method called bootstrap aggregating (bagging).

Imagine a dataset of size N. From this dataset we create a sample of size n (n <= N) by selecting n data points randomly with replacement. “Randomly” signifies that every data point in the dataset has an equal probability for selection and “with replacement” means that a particular data point can appear more than once in the subset.

Since each bootstrap sample is created by sampling with replacement, some data points will not be selected at all. On average, each sample contains about two-thirds of the distinct data points; the remaining one-third, called the out-of-bag (OOB) points, are never seen by the tree trained on that sample. These OOB points give us a built-in way to estimate the model's error during model building.
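The two-thirds / one-third split can be verified with a quick base-R simulation (the dataset size N here is arbitrary):

```r
set.seed(42)
N <- 10000
boot_idx <- sample(N, size = N, replace = TRUE)  # one bootstrap sample of row indices

in_bag_fraction <- length(unique(boot_idx)) / N  # distinct points actually drawn
in_bag_fraction        # about 0.632, i.e. roughly two-thirds
1 - in_bag_fraction    # about 0.368 out-of-bag, roughly one-third
```

The theoretical in-bag fraction is 1 - (1 - 1/N)^N, which approaches 1 - 1/e ≈ 0.632 for large N.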

Using subsets of predictor variables

Bootstrap aggregating (bagging) reduces overfitting to a certain extent, but it does not eliminate it completely. The reason is that certain strong input predictors dominate the tree splits and overshadow weak predictors. These predictors drive the early splits of each decision tree and eventually influence the structure and size of every tree in the forest. This creates correlation between the trees in the random forest: because the same predictors drive the splits, the trees produce similar classifications.

The random forest has a solution for this: at each split it considers only a random subset of the predictors, so each split can be different. Strong predictors therefore cannot overshadow the other fields at every split, and we get a more diverse forest.
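A base-R sketch of this per-split predictor sampling, using the Titanic column names that appear later in the article; the default shown, floor(sqrt(p)), is what the randomForest package uses for classification:

```r
set.seed(1)
predictors <- c("pclass", "sex", "age", "sibsp", "parch", "fare", "embarked")
mtry <- floor(sqrt(length(predictors)))   # classification default: floor(sqrt(7)) = 2

# At every split, a fresh random subset of the predictors is considered
candidates <- sample(predictors, mtry)
candidates
```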

Random Forest Case Study In R

We will proceed as follows to train the Random Forest:

  • Import the data
  • Train the model
  • Tune the Random Forest model
  • Visualize the model
  • Evaluate the model
  • Visualize the result

Import the data: We will use the Titanic dataset for our case study of the random forest model. You can import the dataset directly from the internet.
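The article does not give the download URL, so as a self-contained stand-in you can expand R's built-in Titanic contingency table into one row per passenger (note its columns differ from the Kaggle-style CSV whose columns, such as survived and sex, appear in the code below):

```r
# Built-in Titanic data is a 4-way contingency table; expand it to passenger level
titanic_tab <- as.data.frame(Titanic)
titanic <- titanic_tab[rep(seq_len(nrow(titanic_tab)), titanic_tab$Freq),
                       c("Class", "Sex", "Age", "Survived")]
nrow(titanic)   # 2201 passengers
```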


Train the model

The random forest has some parameters that can be changed to improve the generalization of the prediction. You will use the function randomForest() to train the model; the randomForest package must be installed to use it.

A random forest model can be built using all predictors, with the target variable as a categorical outcome. In this case study the model is fitted both with the train() function from the caret package and with the randomForest() function from the randomForest package.

Tuning RF Model

Tuning the parameters of a model is cumbersome work. There can be many permutations and combinations of a set of hyperparameters, and trying all of them is a time- and memory-consuming task. A better approach is to let an algorithm decide on a good set of parameters. There are two common methods for tuning:

  • Random Search
  • Grid Search 

Grid Search

In this tutorial we will cover both methods, and we will train the model using a grid search. Grid search is simple: the model is trained for every combination given in the parameter list.

For example, if the number of trees can be 10, 20, or 30 and mtry (the number of candidate predictors drawn at each split) can be 1, 2, 3, 4, or 5, then 3 × 5 = 15 models will be created.
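The size of such a grid can be checked with expand.grid, the same base-R function used later in the tutorial to build tuneGrid:

```r
# Every row is one hyperparameter combination, hence one model to train
grid <- expand.grid(ntree = c(10, 20, 30), mtry = 1:5)
nrow(grid)      # 15 models, one per combination
head(grid, 3)
```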

The drawback of grid search is the large amount of time and the number of experiments it requires. To overcome this we can use random search.

Random Search

Random search does not evaluate all combinations of hyperparameters. Instead, it randomly selects a combination at every iteration. The advantage is lower computational cost, lower memory cost, and less time required.
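A minimal base-R sketch of the idea behind random search (the parameter ranges are illustrative): instead of enumerating the full grid, draw a handful of random combinations:

```r
set.seed(123)
n_iter <- 5   # evaluate only 5 random combinations instead of the full grid

random_combos <- data.frame(
  mtry  = sample(1:10, n_iter, replace = TRUE),
  ntree = sample(c(250, 500, 750, 1000), n_iter, replace = TRUE)
)
random_combos   # each row would be one model to train and score
```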

We will proceed as follows:

  • Set the control parameters
  • Evaluate the model with the default settings
  • Find the best value of mtry
  • Find the best value of maxnodes
  • Find the best value of ntree
  • Evaluate the model on the test dataset

Before you begin exploring the parameters, you need to install two libraries:

  • caret: R library for machine learning
  • e1071: R library of machine learning utilities that caret relies on for some computations

Default Setting

The trainControl() function controls the cross-validation folds. You can first run the model with the default parameters and check the accuracy score.


trainControl(method = "cv", number = n, search = "grid")
arguments
- method = "cv": The method used to resample the dataset.
- number = n: Number of folds to create.
- search = "grid": Use the grid search method. For random search, use search = "random".
Note: You can refer to the vignette to see the other arguments of the function.
# Define the control
trControl <- trainControl(method = "cv",
                          number = 10,
                          search = "grid")

You will use the caret library to evaluate your model. The library has a function called train() that can fit almost any machine learning algorithm; put differently, you can use this same function to train other algorithms.

The basic syntax is:


train(formula, df, method = "rf", metric = "Accuracy", trControl = trainControl(), tuneGrid = NULL)
arguments
- formula: Define the formula of the algorithm.
- method: Define which model to train; the caret documentation lists all the models that can be trained.
- metric = "Accuracy": Define how to select the optimal model.
- trControl = trainControl(): Define the control parameters.
- tuneGrid = NULL: A data frame of parameter combinations to try; NULL lets caret build a default grid.

Let's try to build the model with the default values.


set.seed(1234)
# Run the model
rf_default <- train(survived ~ .,
                    data = data_train,
                    method = "rf",
                    metric = "Accuracy",
                    trControl = trControl)
# Print the results
print(rf_default)

Code Explanation

  • trainControl(method = "cv", number = 10, search = "grid"): Evaluate the model with a grid search and 10-fold cross-validation.
  • train(...): Train a random forest model.

Output:


The algorithm uses 500 trees and tested three different values of mtry: 2, 6, and 10. The final value used for the model was mtry = 2, with an accuracy of 0.78. Let's try to get a higher score.

Step 2) Find the best mtry

Let's test the model with values of mtry from 1 to 10.


set.seed(1234)
tuneGrid <- expand.grid(.mtry = c(1:10))
rf_mtry <- train(survived ~ .,
                 data = data_train,
                 method = "rf",
                 metric = "Accuracy",
                 tuneGrid = tuneGrid,
                 trControl = trControl,
                 importance = TRUE,
                 nodesize = 14,
                 ntree = 300)
print(rf_mtry)

Code Explanation: tuneGrid <- expand.grid(.mtry = c(1:10)): Construct a grid with mtry values from 1 to 10.

The final value used for the model was mtry = 4.

Output:


## Random Forest
## 836 samples
##   7 predictor
##   2 classes: 'No', 'Yes'
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 753, 752, 753, 752, 752, 752, ...
## Resampling results across tuning parameters:
##   mtry  Accuracy   Kappa
##    1    0.7572576  0.4647368
##    2    0.7979346  0.5662364
##    3    0.8075158  0.5884815
##    4    0.8110729  0.5970664
##    5    0.8074727  0.5900030
##    6    0.8099111  0.5949342
##    7    0.8050918  0.5866415
##    8    0.8050918  0.5855399
##    9    0.8050631  0.5855035
##   10    0.7978916  0.5707336
## The final model was built using mtry = 4.

The best value of mtry is stored in:

rf_mtry$bestTune$mtry

You can store it and use it when you need to tune the other parameters.

max(rf_mtry$results$Accuracy)

Output:


## [1] 0.8110729
best_mtry <- rf_mtry$bestTune$mtry
best_mtry

Output:


## [1] 4

Step 3) Search the best maxnodes

Let's loop over different values of maxnodes to evaluate them. Below we will:

  • Create a list to store the results
  • Create a variable with the best value of the parameter mtry
  • Write the loop
  • Store the value of each maxnodes run
  • Summarize the results

store_maxnode <- list()
tuneGrid <- expand.grid(.mtry = best_mtry)
for (maxnodes in c(5:15)) {
  set.seed(1234)
  rf_maxnode <- train(survived ~ .,
                      data = data_train,
                      method = "rf",
                      metric = "Accuracy",
                      tuneGrid = tuneGrid,
                      trControl = trControl,
                      importance = TRUE,
                      nodesize = 14,
                      maxnodes = maxnodes,
                      ntree = 300)
  current_iteration <- toString(maxnodes)
  store_maxnode[[current_iteration]] <- rf_maxnode
}
results_mtry <- resamples(store_maxnode)
summary(results_mtry)

Output:


## Call:
## summary.resamples(object = results_mtry)
## Models: 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
## Number of resamples: 10
## Accuracy
##        Min.   1st Qu.   Median            Mean   3rd Qu.            Max. NA's
## 5  0.6785714 0.7529762 0.7903758 0.7799771 0.8168388 0.8433735    0
## 6  0.6904762 0.7648810 0.7784710 0.7811962 0.8125000 0.8313253    0
## 7  0.6904762 0.7619048 0.7738095 0.7788009 0.8102410 0.8333333    0
## 8  0.6904762 0.7627295 0.7844234 0.7847820 0.8184524 0.8433735    0
## 9  0.7261905 0.7747418 0.8083764 0.7955250 0.8258749 0.8333333    0
## 10 0.6904762 0.7837780 0.7904475 0.7895869 0.8214286 0.8433735   0
## 11 0.7023810 0.7791523 0.8024240 0.7943775 0.8184524 0.8433735   0
## 12 0.7380952 0.7910929 0.8144005 0.8051205 0.8288511 0.8452381   0
## 13 0.7142857 0.8005952 0.8192771 0.8075158 0.8403614 0.8452381   0
## 14 0.7380952 0.7941050 0.8203528 0.8098967 0.8403614 0.8452381   0
## 15 0.7142857 0.8000215 0.8203528 0.8075301 0.8378873 0.8554217   0
##
## Kappa
##        Min.   1st Qu.   Median            Mean   3rd Qu.            Max. NA's
## 5  0.3297872 0.4640436 0.5459706 0.5270773 0.6068751 0.6717371    0
## 6  0.3576471 0.4981484 0.5248805 0.5366310 0.6031287 0.6480921    0
## 7  0.3576471 0.4927448 0.5192771 0.5297159 0.5996437 0.6508314    0
## 8  0.3576471 0.4848320 0.5408159 0.5427127 0.6200253 0.6717371    0
## 9  0.4236277 0.5074421 0.5859472 0.5601687 0.6228626 0.6480921    0
## 10 0.3576471 0.5255698 0.5527057 0.5497490 0.6204819 0.6717371   0
## 11 0.3794326 0.5235007 0.5783191 0.5600467 0.6126720 0.6717371   0
## 12 0.4460432 0.5480930 0.5999072 0.5808134 0.6296780 0.6717371   0
## 13 0.4014252 0.5725752 0.6087279 0.5875305 0.6576219 0.6678832   0
## 14 0.4460432 0.5585005 0.6117973 0.5911995 0.6590982 0.6717371   0
## 15 0.4014252 0.5689401 0.6117973 0.5867010 0.6507194 0.6955990   0

The largest values of maxnodes (14 and 15) give the highest accuracy here, so you can try even higher values to see if you can get a better score.


store_maxnode <- list()
tuneGrid <- expand.grid(.mtry = best_mtry)
for (maxnodes in c(20:30)) {
  set.seed(1234)
  rf_maxnode <- train(survived ~ .,
                      data = data_train,
                      method = "rf",
                      metric = "Accuracy",
                      tuneGrid = tuneGrid,
                      trControl = trControl,
                      importance = TRUE,
                      nodesize = 14,
                      maxnodes = maxnodes,
                      ntree = 300)
  key <- toString(maxnodes)
  store_maxnode[[key]] <- rf_maxnode
}
results_node <- resamples(store_maxnode)
summary(results_node)

Output:


##
## Call:
## summary.resamples(object = results_node)
##
## Models: 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30
## Number of resamples: 10
##
## Accuracy
##        Min.   1st Qu.   Median            Mean   3rd Qu.            Max. NA's
## 20 0.7142857 0.7821644 0.8144005 0.8075301 0.8447719 0.8571429   0
## 21 0.7142857 0.8000215 0.8144005 0.8075014 0.8403614 0.8571429   0
## 22 0.7023810 0.7941050 0.8263769 0.8099254 0.8328313 0.8690476   0
## 23 0.7023810 0.7941050 0.8263769 0.8111302 0.8447719 0.8571429   0
## 24 0.7142857 0.7946429 0.8313253 0.8135112 0.8417599 0.8690476   0
## 25 0.7142857 0.7916667 0.8313253 0.8099398 0.8408635 0.8690476   0
## 26 0.7142857 0.7941050 0.8203528 0.8123207 0.8528758 0.8571429   0
## 27 0.7023810 0.8060456 0.8313253 0.8135112 0.8333333 0.8690476   0
## 28 0.7261905 0.7941050 0.8203528 0.8111015 0.8328313 0.8690476   0
## 29 0.7142857 0.7910929 0.8313253 0.8087063 0.8333333 0.8571429   0
## 30 0.6785714 0.7910929 0.8263769 0.8063253 0.8403614 0.8690476   0
##
## Kappa
##        Min.   1st Qu.   Median            Mean   3rd Qu.            Max. NA's
## 20 0.3956835 0.5316120 0.5961830 0.5854366 0.6661120 0.6955990   0
## 21 0.3956835 0.5699332 0.5960343 0.5853247 0.6590982 0.6919315   0
## 22 0.3735084 0.5560661 0.6221836 0.5914492 0.6422128 0.7189781   0
## 23 0.3735084 0.5594228 0.6228827 0.5939786 0.6657372 0.6955990   0
## 24 0.3956835 0.5600352 0.6337821 0.5992188 0.6604703 0.7189781   0
## 25 0.3956835 0.5530760 0.6354875 0.5912239 0.6554912 0.7189781   0
## 26 0.3956835 0.5589331 0.6136074 0.5969142 0.6822128 0.6955990   0
## 27 0.3735084 0.5852459 0.6368425 0.5998148 0.6426088 0.7189781   0
## 28 0.4290780 0.5589331 0.6154905 0.5946859 0.6356141 0.7189781   0
## 29 0.4070588 0.5534173 0.6337821 0.5901173 0.6423101 0.6919315   0
## 30 0.3297872 0.5534173 0.6202632 0.5843432 0.6590982 0.7189781   0

We can see that accuracy peaks around maxnodes = 24, so we will use that value in what follows.

Step 4) Search the best ntrees

After tuning mtry and maxnodes, let's now tune the number of trees. The method for tuning ntree is the same as for maxnodes.


store_maxtrees <- list()
for (ntree in c(250, 300, 350, 400, 450, 500, 550, 600, 800, 1000, 2000)) {
  set.seed(5678)
  rf_maxtrees <- train(survived ~ .,
                       data = data_train,
                       method = "rf",
                       metric = "Accuracy",
                       tuneGrid = tuneGrid,
                       trControl = trControl,
                       importance = TRUE,
                       nodesize = 14,
                       maxnodes = 24,
                       ntree = ntree)
  key <- toString(ntree)
  store_maxtrees[[key]] <- rf_maxtrees
}
results_tree <- resamples(store_maxtrees)
summary(results_tree)

Output:


##
## Call:
## summary.resamples(object = results_tree)
##
## Models: 250, 300, 350, 400, 450, 500, 550, 600, 800, 1000, 2000
## Number of resamples: 10
##
## Accuracy
##           Min.   1st Qu. Median      Mean   3rd Qu.         Max. NA's
## 250  0.7380952 0.7976190 0.8083764 0.8087010 0.8292683 0.8674699         0
## 300  0.7500000 0.7886905 0.8024240 0.8027199 0.8203397 0.8452381         0
## 350  0.7500000 0.7886905 0.8024240 0.8027056 0.8277623 0.8452381         0
## 400  0.7500000 0.7886905 0.8083764 0.8051009 0.8292683 0.8452381         0
## 450  0.7500000 0.7886905 0.8024240 0.8039104 0.8292683 0.8452381         0
## 500  0.7619048 0.7886905 0.8024240 0.8062914 0.8292683 0.8571429         0
## 550  0.7619048 0.7886905 0.8083764 0.8099062 0.8323171 0.8571429         0
## 600  0.7619048 0.7886905 0.8083764 0.8099205 0.8323171 0.8674699         0
## 800  0.7619048 0.7976190 0.8083764 0.8110820 0.8292683 0.8674699         0
## 1000 0.7619048 0.7976190 0.8121510 0.8086723 0.8303571 0.8452381        0
## 2000 0.7619048 0.7886905 0.8121510 0.8086723 0.8333333 0.8452381        0
##
## Kappa
##           Min.   1st Qu. Median      Mean   3rd Qu.         Max. NA's
## 250  0.4061697 0.5667400 0.5836013 0.5856103 0.6335363 0.7196807         0
## 300  0.4302326 0.5449376 0.5780349 0.5723307 0.6130767 0.6710843         0
## 350  0.4302326 0.5449376 0.5780349 0.5723185 0.6291592 0.6710843         0
## 400  0.4302326 0.5482030 0.5836013 0.5774782 0.6335363 0.6710843         0
## 450  0.4302326 0.5449376 0.5780349 0.5750587 0.6335363 0.6710843         0
## 500  0.4601542 0.5449376 0.5780349 0.5804340 0.6335363 0.6949153         0
## 550  0.4601542 0.5482030 0.5857118 0.5884507 0.6396872 0.6949153         0
## 600  0.4601542 0.5482030 0.5857118 0.5884374 0.6396872 0.7196807         0
## 800  0.4601542 0.5667400 0.5836013 0.5910088 0.6335363 0.7196807         0
## 1000 0.4601542 0.5667400 0.5961590 0.5857446 0.6343666 0.6678832        0
## 2000 0.4601542 0.5482030 0.5961590 0.5862151 0.6440678 0.6656337        0

We have tuned all the important parameters. Now we can train the random forest with the following values:

  • ntree = 800: 800 trees will be trained
  • mtry = 4: 4 candidate predictors are drawn for each split
  • maxnodes = 24: each tree is limited to a maximum of 24 terminal nodes (leaves)

fit_rf <- train(survived ~ .,
                data_train,
                method = "rf",
                metric = "Accuracy",
                tuneGrid = tuneGrid,
                trControl = trControl,
                importance = TRUE,
                nodesize = 14,
                ntree = 800,
                maxnodes = 24)

Step 5) Model Evaluation: the caret library in R has a function to make predictions.


predict(model, newdata = df)
arguments
- model: The model trained before.
- newdata: The dataset on which to make predictions.

prediction <- predict(fit_rf, data_test)

You can use the predictions to compute the confusion matrix and see the accuracy score:
confusionMatrix(prediction, data_test$survived)

Output:


## Confusion Matrix and Statistics
##
##           Reference
## Prediction  No Yes
##        No  110  32
##        Yes  11  56
##
##                Accuracy : 0.7943
##                  95% CI : (0.733, 0.8469)
##     No Information Rate : 0.5789
##     P-Value [Acc > NIR] : 3.959e-11
##
##                   Kappa : 0.5638
##  Mcnemar's Test P-Value : 0.002289
##
##             Sensitivity : 0.9091
##             Specificity : 0.6364
##          Pos Pred Value : 0.7746
##          Neg Pred Value : 0.8358
##              Prevalence : 0.5789
##          Detection Rate : 0.5263
##    Detection Prevalence : 0.6794
##       Balanced Accuracy : 0.7727
##
##        'Positive' Class : No
##

We got an accuracy of 0.7943 (about 79.4 percent), which is higher than the default model's 0.78.

Step 6) Visualize Result

Now let's look at feature importance with the function varImp(). In the variable importance plot, the most relevant features appear to be sex and age. More important features tend to be used near the root of a tree, while less important features often appear close to the leaves.


varImpPlot(fit_rf)
varImp(fit_rf)
## rf variable importance
##
##              Importance
## sexmale         100.000
## age              28.014
## pclassMiddle     27.016
## fare             21.557
## pclassUpper      16.324
## sibsp            11.246
## parch             5.522
## embarkedC         4.908
## embarkedQ         1.420
## embarkedS         0.000

Conclusion

Machine learning algorithms like random forests and neural networks are known for high accuracy and performance, but they are black boxes: it is hard to see how they work internally, so interpreting their results is a real challenge. It is fine not to know every statistical detail of the algorithm, but knowing how to tune a random forest is of utmost importance, and tuning it is still relatively easy compared to many other algorithms.

In spite of being a black box, the random forest is a highly popular ensembling technique because of its accuracy. It is sometimes even called a panacea among machine learning algorithms: it is said that if you cannot decide which classification algorithm to use, you can pick a random forest with your eyes closed.

