

An Introduction to Decision Trees in Machine Learning

Trees are among the pillars that sustain human life, and the same can be said of trees in decision making. Everyone has been in a scenario, such as an interview or a viva voce, where answering yes or no to one question determines the next question asked. It is therefore no surprise that decision trees have found an extremely comfortable position in the world of machine learning, proving useful for classification as well as regression. As the name suggests, this algorithm uses a tree-like model for making decisions.

In this blog, we will go through decision trees with a focus on their use in data science. First, the representation of an algorithm as a tree will be discussed, followed by the terminology used in decision trees. We will then cover their use in modern-day machine learning, including the code. Finally, the advantages and disadvantages of this algorithm will be presented.

Representation of algorithms as a tree:

Now, one of the biggest questions we encounter is how to represent an algorithm in the form of a tree. Imagine calling the customer support center of a mobile phone repair company with an “intelligent computerized assistant”. The first thing asked is the language preference, i.e. the machine will say something like “press 1 for English, press 2 for Hindi”. This is followed by a number of further questions, such as “press 1 for a new complaint, press 2 for an existing complaint, press 3 for repair status”. This whole analogy can be represented as a tree, as shown in the following figure:

[Figure 1: Analogy to a tree]

If we look closely at Figure 1 (Analogy to a tree), it can be observed that the flow of the intelligent computerized assistant has been depicted as an inverted tree. In this fashion, any problem in the domain of classification, as well as regression, can be depicted as a tree.
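For concreteness, the menu flow of the assistant can be sketched as a nested dictionary in Python. This is a minimal sketch; the menu labels are illustrative, following the example described above:

```python
# The phone-menu analogy as a nested dictionary: keys are the digits
# a caller presses, and the leaves are the final outcomes.
menu_tree = {
    "1": {                      # English
        "1": "new complaint",
        "2": "existing complaint",
        "3": "repair status",
    },
    "2": {                      # Hindi
        "1": "new complaint",
        "2": "existing complaint",
        "3": "repair status",
    },
}

def route(tree, presses):
    """Follow the caller's key presses down the tree to an outcome."""
    node = tree
    for key in presses:
        node = node[key]
    return node

print(route(menu_tree, ["1", "3"]))  # → repair status
```

Each key press walks one level down the tree, exactly as each answer in a decision tree selects the next question.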

Terminologies in decision trees:


As in any field, there exists terminology, and owing to the widespread use of decision trees across numerous domains, that terminology is quite extensive. The following are the most commonly used terms for decision trees:

  1. Root Node: The node from which the tree originates; in Figure 1, it is the node where the caller is asked to select a language. This node represents the entire sample under consideration, which then splits into a number of homogeneous or heterogeneous sets.
  2. Child node: A node derived from another node is called a child node, and the node from which it is derived is called the parent node.
  3. Branch: When a node splits or extends along one or more paths, these paths are known as the branches of the tree.
  4. Sub-tree: A subsection of the entire tree is called a sub-tree.
  5. Leaf node: The end nodes of a tree. These nodes have no further branching and are terminal in nature.
  6. Decision node: Any node at which a test is performed to decide how to proceed or which conclusion to draw.
  7. Pruning: The process of removing a branch or sub-tree from a tree is called pruning.
  8. Splitting: The process by which a node is divided into a number of branches.
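Several of these terms map directly onto a simple data structure. The following is a hypothetical minimal sketch (the class and field names are my own, for illustration only):

```python
# Hypothetical minimal node structure mirroring the terminology above.
class Node:
    def __init__(self, question=None, prediction=None):
        self.question = question      # decision nodes carry a test/question
        self.prediction = prediction  # leaf nodes carry an outcome
        self.children = []            # branches to child nodes

    def is_leaf(self):
        # A leaf node has no further branching.
        return not self.children

# Root node: the question that splits the whole sample.
root = Node(question="Which language?")
# A child (decision) node, reached via a branch from the root.
english = Node(question="Reason for call?")
root.children.append(english)
# A leaf node: terminal, with no further branching.
english.children.append(Node(prediction="new complaint"))

print(root.is_leaf(), english.children[0].is_leaf())  # → False True
```

Pruning, in this picture, simply means emptying a node's `children` list so that it becomes a leaf.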

In the domain of machine learning, there are two main types of decision trees, distinguished by the data they are intended for. These are:

  1. Classification trees: These trees are used when we want to classify things, as in Figure 1, where the call is classified based on what the caller intends. Here the decision variables that come as a query to the tree, as well as the training targets, are categorical, and hence form the basis of classification.
  2. Regression trees: These are used when we are trying to forecast a particular value. The target is continuous, and hence the output is a numeric prediction rather than a class label. The methodology for building a regression tree allows the input variables to contain both categorical and continuous variables. When a decision tree for regression is generated, each internal node contains a test on an input variable's value, and the terminal nodes contain the predicted values.
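As a quick illustration of the difference, scikit-learn (which this article also uses later) provides both tree types. The tiny one-feature dataset below is made up purely for illustration:

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Tiny made-up dataset: one feature, four samples.
X = [[0], [1], [2], [3]]

# Classification tree: the target is a set of class labels.
clf = DecisionTreeClassifier().fit(X, [0, 0, 1, 1])

# Regression tree: the target is continuous; each leaf stores a value.
reg = DecisionTreeRegressor().fit(X, [1.0, 1.5, 3.0, 3.5])

print(clf.predict([[2]]))  # → a class label (1)
print(reg.predict([[2]]))  # → a numeric prediction (3.0)
```

The API is identical; only the nature of the target, and therefore of the leaf contents, differs.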

Working of decision trees:

There are a few well-known algorithms for building decision trees, such as ID3 and CART. Here, the working of decision trees will be explained using ID3 (Iterative Dichotomiser 3). ID3 starts with the entire dataset as the root, and in each iteration the algorithm traverses the unused attributes and calculates the entropy for each of them. Entropy is a measure of the uncertainty in the data and is calculated as:

H(S) = − Σ p(x) log₂ p(x)

where the sum runs over the classes x present in the set S, and p(x) is the proportion of samples in S belonging to class x.

Once the entropy for all the unused attributes is calculated, the attribute whose split yields the smallest entropy (equivalently, the largest information gain) is selected. The set is then split into further subsets for processing. This process is applied recursively until every subset is pure, i.e. contains a single class, or no attributes remain.
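The entropy formula above is straightforward to compute. A minimal sketch in Python:

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum over classes x of p(x) * log2 p(x)."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n)
               for c in Counter(labels).values())

# A pure set has zero entropy; a 50/50 split has maximum entropy (1 bit).
print(entropy(["yes", "yes", "yes", "yes"]))  # → 0.0
print(entropy(["yes", "yes", "no", "no"]))    # → 1.0
```

ID3 computes such entropies for the subsets produced by each candidate attribute and picks the attribute whose weighted average is lowest.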


The practical implementation of this method using sklearn is quite easy, and the generated tree can be visualized using Graphviz.

First of all, let's import the libraries and the iris dataset (it comes with sklearn and is a classic, openly available dataset).

from sklearn.datasets import load_iris
from sklearn import tree
import graphviz

Now, the model for a classification-based decision tree is to be created. The command for a regression tree is the same; the only difference is that the target data supplied to the machine is continuous in nature.

train_iris, target_iris = load_iris(return_X_y=True)
model_tree = tree.DecisionTreeClassifier()
model_tree = model_tree.fit(train_iris, target_iris)

This will train the model, and the decision tree can be visualized using graphviz as follows:

dot_data = tree.export_graphviz(model_tree, out_file=None) 
graph = graphviz.Source(dot_data) 
graph.render(r"d:\iris")

This will save a PDF file on D: as iris.pdf, containing the following decision tree:

[Figure 2: Decision tree generated for the iris dataset]
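If Graphviz is not installed, sklearn can also print the learned rules as plain text via export_text, a lightweight alternative to the rendered figure:

```python
from sklearn.datasets import load_iris
from sklearn import tree

X, y = load_iris(return_X_y=True)
model = tree.DecisionTreeClassifier(max_depth=2).fit(X, y)

# export_text prints the learned if/else rules directly to the console.
print(tree.export_text(model, feature_names=load_iris().feature_names))
```

The output is an indented listing of the tests at each node and the class predicted at each leaf.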


Pros and cons of decision trees

Advantages

  • The rules generated are understandable.
  • Generating and querying a decision tree is not computationally expensive.
  • The model can handle categorical as well as continuous data.
  • It handles multi-output data.
  • The generated model can be inspected, making this a white-box approach.

Disadvantages

  • Regression trees predict a constant value at each leaf, so their piecewise-constant output may not suit problems that need smooth predictions.
  • If the training sample is small, the model is prone to errors in multi-class classification.
  • Pruning is needed in these models for higher accuracy, which adds to the computational cost.
  • The tree-growing procedure can create extremely complex trees that generalize poorly, i.e. overfit.
  • These models are unstable, as a slight change in the data can produce a completely different tree.
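Several of these drawbacks can be mitigated in practice. In sklearn, for instance, limiting tree depth and enabling cost-complexity pruning (the ccp_alpha parameter) curb overfitting; a sketch, with parameter values chosen only for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# An unconstrained tree grows until it fits the training data exactly.
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# Limiting depth and enabling cost-complexity pruning (ccp_alpha)
# trades a little training accuracy for better generalization.
pruned = DecisionTreeClassifier(max_depth=3, ccp_alpha=0.01,
                                random_state=0).fit(X, y)

print(full.get_depth(), pruned.get_depth())  # the pruned tree is shallower
```

Instability, meanwhile, is commonly addressed by averaging many trees, as random forests do.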

Concluding Remarks:

As can be observed from this study of decision trees, they are quite handy owing to their ease of use as well as their white-box approach. There are scenarios where they are not a good fit, but whatever the case may be, these models assist in evaluating possible outcomes and provide a visual representation of them. Hence, they remain a very handy tool.

Please leave your queries and comments in the comment section.




    Janbask Training

    A dynamic, highly professional, and a global online training course provider committed to propelling the next generation of technology learners with a whole new way of training experience.

