A Practical guide to implementing Random Forest in R with example



Introduction

You have probably heard of Random Forest, whether in R or in Python. This article explains how the algorithm works and how to implement and tune it in R.

We will work through a Random Forest example in R to make the concept concrete.

Random Forest In R

When we buy an expensive item such as a car or a home, or invest in the share market, we prefer to take advice from several people. We rarely walk into a shop and pick an item at random. Instead, we collect suggestions from different people we know and then choose the best option by weighing the positives and negatives each of them reports. The reason is that one person's review can be biased by his interests and past experiences; by asking multiple people we try to mitigate the bias of any individual. One person may have a strong aversion to a specific product because of a bad experience with it, while several others may strongly favor the same product because their experience was positive.

This concept is called ‘Ensembling’ in Analytics. Ensembling is a technique in which many models are trained on a training dataset and their outputs are combined by some rule to get the final output.

Decision trees have one serious drawback: they are prone to overfitting. If a decision tree is grown very deep, it learns every relationship in the training data, including noise. Overfitting can be mitigated with a technique called pruning, which reduces the size of a decision tree by removing the parts that contribute little to correct classification. Even with pruning, the result is often not up to the mark. The primary reason is that the algorithm makes a locally optimal choice at each split without regard to whether that choice is best for the tree as a whole. A bad split at an early stage can therefore result in a poor model, and pruning after the fact cannot compensate for it.

Need for Random Forests

Decision trees are very popular because the way they make decisions reflects how humans make decisions: they check the options at each stage of the tree and select the best one. The same analogy suggests how decision trees can be improved.

Some TV quiz shows give the contestant an option (“Audience poll”) to ask the audience to vote on a question when he is clueless. The reason is that the answer given by the majority of independent people has a better chance of being correct.

  • People have different experiences and will, therefore, draw upon different “data” to answer the question.
  • People have different learning curves and preferences and will, therefore, draw upon different “variables” to make their choices at each stage in their decision process.

Based on this comparison with human thinking, it seems reasonable to build many decision trees, each one using:

  • Different subsets of training data
  • Different randomly selected subsets of columns for each tree split

The final prediction is drawn from all trees: the majority vote (the mode of the predicted classes) for classification problems and the average of the tree outputs for regression problems. This is how the random forest algorithm works.
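
The aggregation step can be sketched in a few lines of base R. This is only an illustration: the tree outputs below are made-up values, not results from a fitted model, and the regression part uses the mean of the tree outputs, which is what the randomForest package computes.

```r
# Hypothetical predictions from five trees for three observations
# (rows = observations, columns = trees); values are made up for illustration.
tree_votes <- matrix(
  c("Yes", "No",  "Yes", "Yes", "No",
    "No",  "No",  "Yes", "No",  "No",
    "Yes", "Yes", "Yes", "No",  "Yes"),
  nrow = 3, byrow = TRUE
)

# Classification: the forest predicts the most frequent class in each row.
majority_vote <- function(votes) {
  counts <- table(votes)
  names(counts)[which.max(counts)]
}
forest_class <- apply(tree_votes, 1, majority_vote)
print(forest_class)

# Regression: the forest predicts the average of the tree outputs.
tree_values <- matrix(
  c(10, 12, 11, 13, 14,
    20, 22, 21, 19, 18),
  nrow = 2, byrow = TRUE
)
forest_value <- rowMeans(tree_values)
print(forest_value)
```

With these made-up votes the forest predicts "Yes", "No", "Yes" for the three observations and 12 and 20 for the two regression rows.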

Using different subsets of the training data and random subsets of columns helps to reduce overfitting: the response is averaged over trees built from different samples of the dataset, and a small set of strong predictors is less likely to dominate the splits. But everything has a price. Here, model interpretability is reduced and computational cost increases.

Mechanics of the Algorithm

Without going into many mathematical details of the algorithm, let’s understand how the above points are implemented in the algorithm.

The main feature of this algorithm is that each tree is trained on a different dataset. This is achieved by a statistical method called bootstrap aggregating (bagging).

Imagine a dataset of size N. From this dataset we create a sample of size n (n <= N) by selecting n data points randomly with replacement. “Randomly” signifies that every data point in the dataset has an equal probability of being selected, and “with replacement” means that a particular data point can appear more than once in the sample.

Since each bootstrap sample is created by sampling with replacement, some data points are not selected for it at all. When n = N, each sample contains on average about two-thirds of the distinct data points, and the remaining one-third (the out-of-bag points) is not used to train that tree. These held-out points give us a built-in way to estimate the error of the model.
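
A quick simulation in base R shows the in-bag and out-of-bag shares for a single bootstrap sample. It is a sketch under assumptions: the dataset size N = 1000 and the seed are arbitrary choices, and the sample size is set equal to N.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

N <- 1000                                                # hypothetical dataset size
in_bag <- sample(seq_len(N), size = N, replace = TRUE)   # one bootstrap sample
out_of_bag <- setdiff(seq_len(N), in_bag)                # rows never drawn

in_bag_fraction <- length(unique(in_bag)) / N
oob_fraction <- length(out_of_bag) / N

cat("Distinct rows in the sample:", in_bag_fraction, "\n")
cat("Out-of-bag rows:            ", oob_fraction, "\n")
cat("Theoretical OOB share:      ", (1 - 1 / N)^N, "\n")
```

The theoretical out-of-bag share (1 - 1/N)^N approaches 1/e, roughly 0.37, as N grows, which is where the "about one-third" rule of thumb comes from.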

Using subsets of predictor variables

Bootstrap aggregating (bagging) reduces overfitting to a certain extent, but it does not eliminate it completely. The reason is that a few strong input predictors dominate the tree splits and overshadow the weak predictors. These strong predictors play an important role in the early splits of every tree and so influence the structure and size of all trees in the forest. The result is correlation between the trees: because the same predictors drive the splits, the trees tend to give the same classification result.

Random forest has a solution to this: at each split it considers only a random subset of the predictors, so each split sees a different set of candidates. Strong predictors can no longer overshadow the other fields at every split, and hence we get a more diverse forest.
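
The per-split sampling of predictors can be illustrated with base R. This is a sketch, not the internal code of any package: the predictor names are taken from the Titanic case study below, and the subset size uses floor(sqrt(p)), the default that the randomForest package applies to classification.

```r
set.seed(1)  # arbitrary seed

# Predictor names taken from the Titanic case study used later in the article.
predictors <- c("pclass", "sex", "age", "sibsp", "parch", "fare", "embarked")

# Default used by randomForest for classification: floor(sqrt(p)).
mtry <- floor(sqrt(length(predictors)))

# Each iteration mimics the candidate set offered to one split.
for (split in 1:3) {
  candidates <- sample(predictors, size = mtry)
  cat("Split", split, "considers:", paste(candidates, collapse = ", "), "\n")
}
```

Because a strong predictor such as sex is absent from many candidate sets, other predictors get a chance to define splits, which decorrelates the trees.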

Random Forest Case Study In R

We will proceed as follows to train the Random Forest:

  • Import the data
  • Train the model
  • Tune the Random Forest model
  • Visualize the model
  • Evaluate the model
  • Visualize Result 

Set the control parameter

  • Evaluate the model with the default setting
  • Find the best number of mtry
  • Find the best number of maxnodes
  • Find the best number of ntrees
  • Evaluate the model on the test dataset

Before you begin exploring the parameters, you need to install two libraries:

  • caret: R library for training and tuning machine learning models
  • e1071: R machine learning library

Evaluate Model with Default Setting

The trainControl() function controls the k-fold cross-validation. You can first run the model with the default parameters and check the accuracy score.

The basic syntax is:


train(formula, df, method = "rf", metric = "Accuracy", trControl = trainControl(), tuneGrid = NULL)
Arguments:
- ‘formula’: The model formula (target ~ predictors)
- ‘df’: The data frame that holds the training data
- ‘method’: The model to train; "rf" is random forest, and caret supports many other models
- ‘metric’ = "Accuracy": The metric used to select the optimal model
- ‘trControl = trainControl()’: The control parameters for resampling
- ‘tuneGrid = NULL’: A data frame with the tuning-parameter combinations to evaluate; when NULL, caret builds a default grid

You will use the caret library to evaluate your model. The library has one function, train(), that can fit and evaluate almost all machine learning algorithms. In other words, you can use the same function to train other algorithms.


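library(caret)

# 10-fold cross-validation with grid search, reused by every call to train() below
trControl <- trainControl(method = "cv",
                          number = 10,
                          search = "grid")

set.seed(1234)
# Run the model
rf_default <- train(survived ~ .,
                    data = data_train,
                    method = "rf",
                    metric = "Accuracy",
                    trControl = trControl)
# Print the results
print(rf_default)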

Code Explanation

  • trainControl(method="cv", number=10, search="grid"): Evaluate the model with a grid search using 10-fold cross-validation

  • train(...): Train a random forest model.

Output:

The algorithm uses 500 trees and tests three different values of mtry: 2, 6 and 10. The final value used for the model was mtry = 2, with an accuracy of 0.78. Let's try to get a higher score.

Step 2) Find the best mtry

Let’s test the model with values of mtry from 1 to 10.


set.seed(1234)
tuneGrid <- expand.grid(.mtry = 1:10)
rf_mtry <- train(survived ~ .,
                 data = data_train,
                 method = "rf",
                 metric = "Accuracy",
                 tuneGrid = tuneGrid,
                 trControl = trControl,
                 importance = TRUE,
                 nodesize = 14,
                 ntree = 300)
print(rf_mtry)

Code Explanation: tuneGrid <- expand.grid(.mtry = 1:10): Construct a one-column data frame with the candidate values of mtry from 1 to 10
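
To see what train() receives as its grid, the object can be inspected on its own with base R. The name tune_grid is used here only for illustration; the call mirrors the one in the code above.

```r
# Build the same kind of grid that is passed to train() via tuneGrid.
tune_grid <- expand.grid(.mtry = 1:10)

str(tune_grid)    # a data frame with a single column named .mtry
nrow(tune_grid)   # one row per candidate value of mtry
```

Each row of the grid is one candidate setting that caret evaluates with cross-validation.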

The final value used for the model was mtry = 4.

Output:


## Random Forest
## 836 samples
##   7 predictor
##   2 classes: 'No', 'Yes'
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 753, 752, 753, 752, 752, 752, ...
## Resampling results across tuning parameters:
##   mtry  Accuracy   Kappa
##    1    0.7572576  0.4647368
##    2    0.7979346  0.5662364
##    3    0.8075158  0.5884815
##    4    0.8110729  0.5970664
##    5    0.8074727  0.5900030
##    6    0.8099111  0.5949342
##    7    0.8050918  0.5866415
##    8    0.8050918  0.5855399
##    9    0.8050631  0.5855035
##   10    0.7978916  0.5707336
## Final model was built using mtry = 4.

The best value of mtry is stored in:

rf_mtry$bestTune$mtry

You can store it and use it when you need to tune the other parameters.

max(rf_mtry$results$Accuracy)

Output:


## [1] 0.8110729

best_mtry <- rf_mtry$bestTune$mtry
best_mtry

Output:


## [1] 4

Step 3) Search the best maxnodes

Next, we loop over different values of maxnodes and evaluate each one. Below we will:

  • Create a list to hold the fitted models
  • Create a tuning grid that contains only the best value of mtry
  • Loop over the candidate values of maxnodes
  • Store each fitted model under its maxnodes value
  • Summarize the results

store_maxnode <- list()
# Keep mtry fixed at the best value found in Step 2
tuneGrid <- expand.grid(.mtry = best_mtry)
for (maxnodes in 5:15) {
  set.seed(1234)
  rf_maxnode <- train(survived ~ .,
                      data = data_train,
                      method = "rf",
                      metric = "Accuracy",
                      tuneGrid = tuneGrid,
                      trControl = trControl,
                      importance = TRUE,
                      nodesize = 14,
                      maxnodes = maxnodes,
                      ntree = 300)
  # Use the maxnodes value as the name of the list entry
  current_iteration <- toString(maxnodes)
  store_maxnode[[current_iteration]] <- rf_maxnode
}
# Collect the cross-validation results of all fitted models
results_mtry <- resamples(store_maxnode)
summary(results_mtry)

Output:


## Call:
## summary.resamples(object = results_mtry)
## Models: 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
## Number of resamples: 10
## Accuracy
##         Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## 5  0.6785714 0.7529762 0.7903758 0.7799771 0.8168388 0.8433735    0
## 6  0.6904762 0.7648810 0.7784710 0.7811962 0.8125000 0.8313253    0
## 7  0.6904762 0.7619048 0.7738095 0.7788009 0.8102410 0.8333333    0
## 8  0.6904762 0.7627295 0.7844234 0.7847820 0.8184524 0.8433735    0
## 9  0.7261905 0.7747418 0.8083764 0.7955250 0.8258749 0.8333333    0
## 10 0.6904762 0.7837780 0.7904475 0.7895869 0.8214286 0.8433735    0
## 11 0.7023810 0.7791523 0.8024240 0.7943775 0.8184524 0.8433735    0
## 12 0.7380952 0.7910929 0.8144005 0.8051205 0.8288511 0.8452381    0
## 13 0.7142857 0.8005952 0.8192771 0.8075158 0.8403614 0.8452381    0
## 14 0.7380952 0.7941050 0.8203528 0.8098967 0.8403614 0.8452381    0
## 15 0.7142857 0.8000215 0.8203528 0.8075301 0.8378873 0.8554217    0
##
## Kappa
##         Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## 5  0.3297872 0.4640436 0.5459706 0.5270773 0.6068751 0.6717371    0
## 6  0.3576471 0.4981484 0.5248805 0.5366310 0.6031287 0.6480921    0
## 7  0.3576471 0.4927448 0.5192771 0.5297159 0.5996437 0.6508314    0
## 8  0.3576471 0.4848320 0.5408159 0.5427127 0.6200253 0.6717371    0
## 9  0.4236277 0.5074421 0.5859472 0.5601687 0.6228626 0.6480921    0
## 10 0.3576471 0.5255698 0.5527057 0.5497490 0.6204819 0.6717371    0
## 11 0.3794326 0.5235007 0.5783191 0.5600467 0.6126720 0.6717371    0
## 12 0.4460432 0.5480930 0.5999072 0.5808134 0.6296780 0.6717371    0
## 13 0.4014252 0.5725752 0.6087279 0.5875305 0.6576219 0.6678832    0
## 14 0.4460432 0.5585005 0.6117973 0.5911995 0.6590982 0.6717371    0
## 15 0.4014252 0.5689401 0.6117973 0.5867010 0.6507194 0.6955990    0

The largest values of maxnodes (13 to 15) give the highest mean accuracy, and the curve has not clearly flattened yet. You can try higher values to see if you can get a higher score.


store_maxnode <- list()
tuneGrid <- expand.grid(.mtry = best_mtry)
for (maxnodes in 20:30) {
  set.seed(1234)
  rf_maxnode <- train(survived ~ .,
                      data = data_train,
                      method = "rf",
                      metric = "Accuracy",
                      tuneGrid = tuneGrid,
                      trControl = trControl,
                      importance = TRUE,
                      nodesize = 14,
                      maxnodes = maxnodes,
                      ntree = 300)
  key <- toString(maxnodes)
  store_maxnode[[key]] <- rf_maxnode
}
results_node <- resamples(store_maxnode)
summary(results_node)

Output:


##
## Call:
## summary.resamples(object = results_node)
##
## Models: 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30
## Number of resamples: 10
##
## Accuracy
##         Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## 20 0.7142857 0.7821644 0.8144005 0.8075301 0.8447719 0.8571429    0
## 21 0.7142857 0.8000215 0.8144005 0.8075014 0.8403614 0.8571429    0
## 22 0.7023810 0.7941050 0.8263769 0.8099254 0.8328313 0.8690476    0
## 23 0.7023810 0.7941050 0.8263769 0.8111302 0.8447719 0.8571429    0
## 24 0.7142857 0.7946429 0.8313253 0.8135112 0.8417599 0.8690476    0
## 25 0.7142857 0.7916667 0.8313253 0.8099398 0.8408635 0.8690476    0
## 26 0.7142857 0.7941050 0.8203528 0.8123207 0.8528758 0.8571429    0
## 27 0.7023810 0.8060456 0.8313253 0.8135112 0.8333333 0.8690476    0
## 28 0.7261905 0.7941050 0.8203528 0.8111015 0.8328313 0.8690476    0
## 29 0.7142857 0.7910929 0.8313253 0.8087063 0.8333333 0.8571429    0
## 30 0.6785714 0.7910929 0.8263769 0.8063253 0.8403614 0.8690476    0
##
## Kappa
##         Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## 20 0.3956835 0.5316120 0.5961830 0.5854366 0.6661120 0.6955990    0
## 21 0.3956835 0.5699332 0.5960343 0.5853247 0.6590982 0.6919315    0
## 22 0.3735084 0.5560661 0.6221836 0.5914492 0.6422128 0.7189781    0
## 23 0.3735084 0.5594228 0.6228827 0.5939786 0.6657372 0.6955990    0
## 24 0.3956835 0.5600352 0.6337821 0.5992188 0.6604703 0.7189781    0
## 25 0.3956835 0.5530760 0.6354875 0.5912239 0.6554912 0.7189781    0
## 26 0.3956835 0.5589331 0.6136074 0.5969142 0.6822128 0.6955990    0
## 27 0.3735084 0.5852459 0.6368425 0.5998148 0.6426088 0.7189781    0
## 28 0.4290780 0.5589331 0.6154905 0.5946859 0.6356141 0.7189781    0
## 29 0.4070588 0.5534173 0.6337821 0.5901173 0.6423101 0.6919315    0
## 30 0.3297872 0.5534173 0.6202632 0.5843432 0.6590982 0.7189781    0

We can see that the mean accuracy is highest for maxnodes = 24 (tied with 27), so we continue with maxnodes = 24.

Step 4) Search the best ntrees

After tuning mtry and maxnodes, let's tune the number of trees. The method for tuning ntree is the same as the one used for maxnodes.


store_maxtrees <- list()
for (ntree in c(250, 300, 350, 400, 450, 500, 550, 600, 800, 1000, 2000)) {
  set.seed(5678)
  rf_maxtrees <- train(survived ~ .,
                       data = data_train,
                       method = "rf",
                       metric = "Accuracy",
                       tuneGrid = tuneGrid,
                       trControl = trControl,
                       importance = TRUE,
                       nodesize = 14,
                       maxnodes = 24,
                       ntree = ntree)
  key <- toString(ntree)
  store_maxtrees[[key]] <- rf_maxtrees
}
results_tree <- resamples(store_maxtrees)
summary(results_tree)

Output:


##
## Call:
## summary.resamples(object = results_tree)
##
## Models: 250, 300, 350, 400, 450, 500, 550, 600, 800, 1000, 2000
## Number of resamples: 10
##
## Accuracy
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## 250  0.7380952 0.7976190 0.8083764 0.8087010 0.8292683 0.8674699    0
## 300  0.7500000 0.7886905 0.8024240 0.8027199 0.8203397 0.8452381    0
## 350  0.7500000 0.7886905 0.8024240 0.8027056 0.8277623 0.8452381    0
## 400  0.7500000 0.7886905 0.8083764 0.8051009 0.8292683 0.8452381    0
## 450  0.7500000 0.7886905 0.8024240 0.8039104 0.8292683 0.8452381    0
## 500  0.7619048 0.7886905 0.8024240 0.8062914 0.8292683 0.8571429    0
## 550  0.7619048 0.7886905 0.8083764 0.8099062 0.8323171 0.8571429    0
## 600  0.7619048 0.7886905 0.8083764 0.8099205 0.8323171 0.8674699    0
## 800  0.7619048 0.7976190 0.8083764 0.8110820 0.8292683 0.8674699    0
## 1000 0.7619048 0.7976190 0.8121510 0.8086723 0.8303571 0.8452381    0
## 2000 0.7619048 0.7886905 0.8121510 0.8086723 0.8333333 0.8452381    0
##
## Kappa
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## 250  0.4061697 0.5667400 0.5836013 0.5856103 0.6335363 0.7196807    0
## 300  0.4302326 0.5449376 0.5780349 0.5723307 0.6130767 0.6710843    0
## 350  0.4302326 0.5449376 0.5780349 0.5723185 0.6291592 0.6710843    0
## 400  0.4302326 0.5482030 0.5836013 0.5774782 0.6335363 0.6710843    0
## 450  0.4302326 0.5449376 0.5780349 0.5750587 0.6335363 0.6710843    0
## 500  0.4601542 0.5449376 0.5780349 0.5804340 0.6335363 0.6949153    0
## 550  0.4601542 0.5482030 0.5857118 0.5884507 0.6396872 0.6949153    0
## 600  0.4601542 0.5482030 0.5857118 0.5884374 0.6396872 0.7196807    0
## 800  0.4601542 0.5667400 0.5836013 0.5910088 0.6335363 0.7196807    0
## 1000 0.4601542 0.5667400 0.5961590 0.5857446 0.6343666 0.6678832    0
## 2000 0.4601542 0.5482030 0.5961590 0.5862151 0.6440678 0.6656337    0

We have tuned all the important parameters. Now we can train the random forest with the following values:

  • ntree = 800: 800 trees are grown
  • mtry = 4: 4 predictors are considered as candidates at each split
  • maxnodes = 24: each tree has at most 24 terminal nodes (leaves)
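
These choices can be cross-checked against the Mean column of the accuracy tables in the two resampling summaries above. The sketch below copies those mean values by hand into named vectors (the vector names are only for illustration) and picks the setting with the highest mean.

```r
# Mean cross-validated accuracy copied from the resampling summaries above.
mean_acc_maxnodes <- c(
  "20" = 0.8075301, "21" = 0.8075014, "22" = 0.8099254, "23" = 0.8111302,
  "24" = 0.8135112, "25" = 0.8099398, "26" = 0.8123207, "27" = 0.8135112,
  "28" = 0.8111015, "29" = 0.8087063, "30" = 0.8063253
)
mean_acc_ntree <- c(
  "250" = 0.8087010, "300" = 0.8027199, "350" = 0.8027056, "400" = 0.8051009,
  "450" = 0.8039104, "500" = 0.8062914, "550" = 0.8099062, "600" = 0.8099205,
  "800" = 0.8110820, "1000" = 0.8086723, "2000" = 0.8086723
)

# which.max() returns the first maximum, so ties resolve to the smaller setting.
best_maxnodes <- names(which.max(mean_acc_maxnodes))
best_ntree <- names(which.max(mean_acc_ntree))

best_maxnodes  # "24" (27 ties on mean accuracy)
best_ntree     # "800"
```

The differences between neighbouring settings are small compared with the spread across folds, so these picks are reasonable rather than decisive.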

# tuneGrid still holds the single best value of mtry from Step 3
fit_rf <- train(survived ~ .,
                data = data_train,
                method = "rf",
                metric = "Accuracy",
                tuneGrid = tuneGrid,
                trControl = trControl,
                importance = TRUE,
                nodesize = 14,
                ntree = 800,
                maxnodes = 24)

Step 5) Model Evaluation

The caret library in R has a function to make predictions:


predict(model, newdata = df)
Arguments:
- `model`: The model trained before
- `newdata`: The dataset to make predictions on

prediction <- predict(fit_rf, data_test)

You can use the prediction to compute the confusion matrix and see the accuracy score:

confusionMatrix(prediction, data_test$survived)

Output:


## Confusion Matrix and Statistics
##
##           Reference
## Prediction  No Yes
##        No  110  32
##        Yes  11  56
##
##                Accuracy : 0.7943
##                  95% CI : (0.733, 0.8469)
##     No Information Rate : 0.5789
##     P-Value [Acc > NIR] : 3.959e-11
##
##                   Kappa : 0.5638
##  Mcnemar's Test P-Value : 0.002289
##
##             Sensitivity : 0.9091
##             Specificity : 0.6364
##          Pos Pred Value : 0.7746
##          Neg Pred Value : 0.8358
##              Prevalence : 0.5789
##          Detection Rate : 0.5263
##    Detection Prevalence : 0.6794
##       Balanced Accuracy : 0.7727
##
##        'Positive' Class : No
##

We have got an accuracy of 0.7943 (79.43 percent) on the test set, which is slightly higher than the cross-validated accuracy of about 0.78 obtained with the default settings.
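
The headline statistics can be recomputed by hand from the four counts in the confusion matrix, which helps to see what each one measures. This is a base R sketch that copies the counts from the output above; "No" is treated as the positive class, as in that output.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference).
cm <- matrix(c(110, 11, 32, 56), nrow = 2,
             dimnames = list(Prediction = c("No", "Yes"),
                             Reference  = c("No", "Yes")))

accuracy    <- sum(diag(cm)) / sum(cm)              # correct / all
sensitivity <- cm["No", "No"] / sum(cm[, "No"])     # true "No" found
specificity <- cm["Yes", "Yes"] / sum(cm[, "Yes"])  # true "Yes" found
balanced    <- (sensitivity + specificity) / 2

round(c(Accuracy = accuracy,
        Sensitivity = sensitivity,
        Specificity = specificity,
        Balanced = balanced), 4)
```

The rounded values are 0.7943, 0.9091, 0.6364 and 0.7727, matching the caret output. The gap between sensitivity and specificity shows that the model recognises non-survivors much better than survivors.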

Step 6) Visualize Result

Now let’s find the feature importance with the function varImp(). In the variable importance output, the most relevant features are sex and age. The more important features tend to appear near the root of the tree, while less important features often appear close to the leaves.


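# varImpPlot() belongs to the randomForest package and expects the underlying
# randomForest object, which caret stores in fit_rf$finalModel
randomForest::varImpPlot(fit_rf$finalModel)
varImp(fit_rf)
## rf variable importance
##
##              Importance
## sexmale         100.000
## age              28.014
## pclassMiddle     27.016
## fare             21.557
## pclassUpper      16.324
## sibsp            11.246
## parch             5.522
## embarkedC         4.908
## embarkedQ         1.420
## embarkedS         0.000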

Import the Data

We will use the Titanic dataset for our random forest case study. You can import the dataset directly from the internet.
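
The article does not show the import or the train/test split, so the following is only a sketch under assumptions. The data frame titanic is a randomly generated placeholder standing in for the imported data, and the 80/20 ratio is assumed; it is consistent with the 836 training rows and 209 test rows that appear in the outputs of the case study.

```r
set.seed(1234)

# Placeholder data frame standing in for the imported Titanic data;
# in practice this would come from read.csv() on the file or URL you use.
titanic <- data.frame(
  survived = factor(sample(c("No", "Yes"), 100, replace = TRUE)),
  age      = round(runif(100, 1, 80)),
  fare     = round(runif(100, 5, 500), 2)
)

# Hold out 20% of the rows for testing (the 80/20 ratio is an assumption).
n_train   <- floor(0.8 * nrow(titanic))
train_idx <- sample(seq_len(nrow(titanic)), size = n_train)
data_train <- titanic[train_idx, ]
data_test  <- titanic[-train_idx, ]

dim(data_train)
dim(data_test)
```

The names data_train and data_test match the objects used in the tuning code of the case study.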

Train the model

The random forest has some parameters that can be changed to improve how well the predictions generalize. You can use the function randomForest() to train the model. To use it, you need to install the randomForest package.

A random forest model can be built using all predictors, with the target variable as a categorical outcome. The model can be trained either with the train() function from the caret package or directly with the randomForest() function from the randomForest package.
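
A direct call to randomForest() might look like the sketch below. Because the data import is not shown in this article, the built-in iris dataset stands in for the Titanic data, and the values of ntree and mtry are arbitrary illustrations rather than tuned settings.

```r
library(randomForest)

set.seed(1234)
rf_direct <- randomForest(
  Species ~ .,        # target ~ all other columns
  data = iris,        # built-in dataset used as a stand-in
  ntree = 300,        # number of trees
  mtry = 2,           # predictors tried at each split
  importance = TRUE   # keep variable-importance measures
)

print(rf_direct)         # includes the out-of-bag error estimate
importance(rf_direct)    # importance matrix, one row per predictor
```

The printed summary reports the out-of-bag error, which is computed from the rows each tree did not see during training.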

Tuning RF Model

Tuning the parameters of a model is cumbersome work. There can be many combinations of hyperparameter values, and trying all of them can consume a lot of time and memory. A better approach is to let a search procedure find the best set of parameters. There are two common methods for tuning:

  • Grid Search 
  • Random Search

Random Search

Random search does not evaluate all the combinations of hyperparameters. Instead, it randomly selects one combination at every iteration. The advantage is a lower computational cost, lower memory use and less time required.
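
The idea can be sketched in base R by drawing a few rows at random from a grid of candidates instead of evaluating every row. The grid values and the budget of five trials are hypothetical choices for illustration.

```r
set.seed(99)  # arbitrary seed

# Hypothetical candidate values for two hyperparameters.
grid <- expand.grid(ntree = c(10, 20, 30), mtry = 1:5)

n_trials <- 5                                   # hypothetical budget
random_rows <- sample(nrow(grid), size = n_trials)
random_candidates <- grid[random_rows, ]

print(random_candidates)   # only these combinations would be trained
```

In caret, random search is requested with trainControl(search = "random"), and the number of sampled combinations is set through the tuneLength argument of train().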

Grid Search

Both methods are described here, but in this tutorial we train the model using grid search. Grid search is simple: the model is trained for every combination of the values given in the parameter lists.

For example, if the number of trees takes the values 10, 20 and 30, and mtry (the number of candidate predictors drawn at each split) takes the values 1, 2, 3, 4 and 5, then a total of 3 x 5 = 15 models will be created.
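
The number of combinations in such a grid can be checked with expand.grid() in base R; the object name grid is only for illustration.

```r
# Every combination of the candidate values for the two parameters.
grid <- expand.grid(ntree = c(10, 20, 30), mtry = 1:5)

nrow(grid)   # 3 values of ntree x 5 values of mtry
head(grid)
```

Note that caret's "rf" method tunes only mtry through tuneGrid, which is why the case study loops over ntree and maxnodes by hand.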

The drawback of grid search is the large amount of time and the number of experiments it requires. To overcome this issue we can use random search.

Conclusion

So now, whenever anyone talks about random forest in R, random forest in Python or just random forest, you will have the basic idea of it. Implementing random forest in Python is similar to how it was implemented in R.

Machine learning algorithms such as random forests and neural networks are known for high accuracy, but the problem is that they behave like a black box: it is hard to see how they arrive at a particular prediction, so interpreting the results is a challenge. It is fine not to know the internal statistical details of the algorithm, but knowing how to tune a random forest is of utmost importance. Tuning a random forest is still relatively easy compared to other algorithms.

In spite of being a black box, random forest is a highly popular ensembling technique because of its accuracy. It is sometimes even called a panacea among machine learning algorithms: it is said that if you cannot decide which algorithm to use for classification, you can pick a random forest with your eyes closed.


    Janbask Training

    A dynamic, highly professional, and a global online training course provider committed to propelling the next generation of technology learners with a whole new way of training experience.

