

An Introduction to Decision Trees in Machine Learning

Trees are among the pillars that sustain human life, and the same can be said of trees in decision making. Everyone has been in a scenario, such as an interview or a viva voce, where answering yes or no to one question determines the next question asked. It is therefore no surprise that decision trees have found an extremely comfortable position in the world of machine learning, proving useful for classification as well as regression. As the name suggests, this algorithm uses a tree-like model for making decisions.

In this blog, we will go through decision trees with a focus on their use in data science. First, the representation of an algorithm as a tree will be discussed, followed by the terminology used in decision trees. We will then cover their use in modern-day machine learning, including the code. Finally, the advantages and disadvantages of this algorithm will be presented.

Representation of algorithms as a tree:

Now, one of the biggest questions we encounter is how to represent an algorithm in the form of a tree. Imagine calling the customer support center of a mobile phone repair company with an “intelligent computerized assistant”. The first thing asked is the language preference, i.e. the machine will say something like “press 1 for English, press 2 for Hindi”. This is followed by a number of further questions, such as “press 1 for a new complaint, press 2 for an existing complaint, press 3 for repair status”. This whole analogy can be represented as a tree, as shown in the following figure:

[Figure 1: Analogy to a tree]

If we look closely at Figure 1 (Analogy to a tree), it can be observed that the flow of the intelligent computerized assistant has been depicted as an inverted tree. In this fashion, any problem in the domain of classification, as well as regression, can be depicted as a tree.
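For concreteness, the menu flow of the assistant can be sketched as a nested dictionary in Python. This is a minimal sketch; the menu labels are illustrative, following the example described above:

```python
# The phone-menu analogy as a nested dictionary: keys are the digits
# a caller presses, and the leaves are the final outcomes.
menu_tree = {
    "1": {                      # English
        "1": "new complaint",
        "2": "existing complaint",
        "3": "repair status",
    },
    "2": {                      # Hindi
        "1": "new complaint",
        "2": "existing complaint",
        "3": "repair status",
    },
}

def route(tree, presses):
    """Follow the caller's key presses down the tree to an outcome."""
    node = tree
    for key in presses:
        node = node[key]
    return node

print(route(menu_tree, ["1", "3"]))  # → repair status
```

Each key press walks one level down the tree, exactly as each answer in a decision tree selects the next question.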

Terminologies in decision trees:


As in any field, there exists terminology, and owing to the widespread use of decision trees across numerous domains, that terminology is quite extensive. The following are the most commonly used terms for decision trees:

  1. Root Node: The node from which the tree originates; in Figure 1, it is the node where the caller is asked to select a language. This node represents the entire sample under consideration, which then splits into a number of homogeneous or heterogeneous sets.
  2. Child node: A node derived from another node is called a child node, and the node from which it is derived is called the parent node.
  3. Branch: When a node splits or extends along one or more paths, these paths are known as the branches of the tree.
  4. Sub-tree: A subsection of the entire tree is called a sub-tree.
  5. Leaf node: The end nodes of a tree. These nodes have no further branching and are terminal in nature.
  6. Decision node: Any node at which a test is performed to decide how to proceed or which conclusion to draw.
  7. Pruning: The process of removing a branch or sub-tree from a tree is called pruning.
  8. Splitting: The process by which a node is divided into a number of branches.
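Several of these terms map directly onto a simple data structure. The following is a hypothetical minimal sketch (the class and field names are my own, for illustration only):

```python
# Hypothetical minimal node structure mirroring the terminology above.
class Node:
    def __init__(self, question=None, prediction=None):
        self.question = question      # decision nodes carry a test/question
        self.prediction = prediction  # leaf nodes carry an outcome
        self.children = []            # branches to child nodes

    def is_leaf(self):
        # A leaf node has no further branching.
        return not self.children

# Root node: the question that splits the whole sample.
root = Node(question="Which language?")
# A child (decision) node, reached via a branch from the root.
english = Node(question="Reason for call?")
root.children.append(english)
# A leaf node: terminal, with no further branching.
english.children.append(Node(prediction="new complaint"))

print(root.is_leaf(), english.children[0].is_leaf())  # → False True
```

Pruning, in this picture, simply means emptying a node's `children` list so that it becomes a leaf.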

In the domain of machine learning, there are two main types of decision trees, distinguished by the data they are intended for. These are:

  1. Classification trees: These trees are used when we want to classify things, as in Figure 1, where the call is classified based on what the caller intends. Here the decision variables that come as a query to the tree, as well as the training targets, are categorical, and hence form the basis of classification.
  2. Regression trees: These are used when we are trying to forecast a particular value. The target is continuous, and hence the output is a numeric prediction rather than a class label. The methodology for building a regression tree allows the input variables to contain both categorical and continuous variables. When a decision tree for regression is generated, each internal node contains a test on an input variable's value, and the terminal nodes contain the predicted values.
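As a quick illustration of the difference, scikit-learn (which this article also uses later) provides both tree types. The tiny one-feature dataset below is made up purely for illustration:

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Tiny made-up dataset: one feature, four samples.
X = [[0], [1], [2], [3]]

# Classification tree: the target is a set of class labels.
clf = DecisionTreeClassifier().fit(X, [0, 0, 1, 1])

# Regression tree: the target is continuous; each leaf stores a value.
reg = DecisionTreeRegressor().fit(X, [1.0, 1.5, 3.0, 3.5])

print(clf.predict([[2]]))  # → a class label (1)
print(reg.predict([[2]]))  # → a numeric prediction (3.0)
```

The API is identical; only the nature of the target, and therefore of the leaf contents, differs.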

Working of decision trees:

There are a few well-known algorithms for building decision trees, such as ID3 and CART. Here, the working of decision trees will be explained using ID3 (Iterative Dichotomiser 3). ID3 starts with the entire dataset as the root, and in each iteration the algorithm traverses the unused attributes and calculates the entropy for each of them. Entropy is a measure of the uncertainty in the data and is calculated as:

H(S) = − Σ p(x) log₂ p(x)

where the sum runs over the classes x present in the set S, and p(x) is the proportion of samples in S belonging to class x.

Once the entropy for all the unused attributes is calculated, the attribute whose split yields the smallest entropy (equivalently, the largest information gain) is selected. The set is then split into further subsets for processing. This process is applied recursively until every subset is pure, i.e. contains a single class, or no attributes remain.
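The entropy formula above is straightforward to compute. A minimal sketch in Python:

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum over classes x of p(x) * log2 p(x)."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n)
               for c in Counter(labels).values())

# A pure set has zero entropy; a 50/50 split has maximum entropy (1 bit).
print(entropy(["yes", "yes", "yes", "yes"]))  # → 0.0
print(entropy(["yes", "yes", "no", "no"]))    # → 1.0
```

ID3 computes such entropies for the subsets produced by each candidate attribute and picks the attribute whose weighted average is lowest.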


The practical implementation of this method using sklearn is quite easy, and the generated tree can be visualized using Graphviz.

First of all, let's import the libraries and the iris dataset (it comes with sklearn and is a classic, openly available dataset).

from sklearn.datasets import load_iris
from sklearn import tree
import graphviz

Now, the model for a classification-based decision tree is to be created. The command for a regression tree is the same; the only difference is that the target data supplied to the machine is continuous in nature.

train_iris, target_iris = load_iris(return_X_y=True)
model_tree = tree.DecisionTreeClassifier()
model_tree = model_tree.fit(train_iris, target_iris)

This will train the model, and the decision tree can be visualized using graphviz as follows:

dot_data = tree.export_graphviz(model_tree, out_file=None) 
graph = graphviz.Source(dot_data) 
graph.render(r"d:\iris")

This will save a PDF file on D: as iris.pdf, containing the following decision tree:

[Figure 2: Decision tree generated for the iris dataset]
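If Graphviz is not installed, sklearn can also print the learned rules as plain text via export_text, a lightweight alternative to the rendered figure:

```python
from sklearn.datasets import load_iris
from sklearn import tree

X, y = load_iris(return_X_y=True)
model = tree.DecisionTreeClassifier(max_depth=2).fit(X, y)

# export_text prints the learned if/else rules directly to the console.
print(tree.export_text(model, feature_names=load_iris().feature_names))
```

The output is an indented listing of the tests at each node and the class predicted at each leaf.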


Pros and cons of decision trees

Advantages

  • The rules generated are understandable.
  • Generating and querying a decision tree is not computationally expensive.
  • The model can handle categorical as well as continuous data.
  • It handles multi-output data.
  • The generated model can be inspected, making this a white-box approach.

Disadvantages

  • Regression trees predict a constant value at each leaf, so their piecewise-constant output may not suit problems that need smooth predictions.
  • If the training sample is small, the model is prone to errors in multi-class classification.
  • Pruning is needed in these models for higher accuracy, which adds to the computational cost.
  • The tree-growing procedure can create extremely complex trees that generalize poorly, i.e. overfit.
  • These models are unstable, as a slight change in the data can produce a completely different tree.
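Several of these drawbacks can be mitigated in practice. In sklearn, for instance, limiting tree depth and enabling cost-complexity pruning (the ccp_alpha parameter) curb overfitting; a sketch, with parameter values chosen only for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# An unconstrained tree grows until it fits the training data exactly.
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# Limiting depth and enabling cost-complexity pruning (ccp_alpha)
# trades a little training accuracy for better generalization.
pruned = DecisionTreeClassifier(max_depth=3, ccp_alpha=0.01,
                                random_state=0).fit(X, y)

print(full.get_depth(), pruned.get_depth())  # the pruned tree is shallower
```

Instability, meanwhile, is commonly addressed by averaging many trees, as random forests do.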

Concluding Remarks:

As can be observed from this study of decision trees, they are quite handy owing to their ease of use as well as their white-box approach. There are scenarios where they are not a good fit, but whatever the case may be, these models assist in evaluating possible outcomes and provide a visual representation of them. Hence, they remain a very handy tool.

Please leave your queries and comments in the comment section.




    Janbask Training

    A dynamic, highly professional, and a global online training course provider committed to propelling the next generation of technology learners with a whole new way of training experience.

