Our Support: During the COVID-19 outbreak, we request learners to CALL US for Special Discounts!

- Data Science Blogs -

An Easy to Understand the Definition of the Confidence Interval



Introduction

In most walks of life, we talk that I’m quite confident that this thing will happen or I’m quite confident about this. All of us at some point in time have been talking about this. If we look at statements made by us, it will be obvious to anybody that we are always certain with a degree of confidence in our statement. This statement is made in natural language and is quite vague if we can put a number to this, things will become more clear. 

Now, let us say that we have a record of the number of times one made a statement like this was made. As soon as we start this process, we put a number and enter the domain of statistics. In the domain of statistics, it becomes possible for us to define the level of certainly using a number. This number can belong to a particular range. This range to which the number belongs can be said to be the confidence interval.

Technically speaking, a confidence interval is the measure of uncertainty in the data. These are widely used to depict how much confidence one can place in the outcomes reflected in the results of a poll or survey. Confidence intervals are connected to confidence levels due to their inherent nature.

Confidence Intervals – The definition:

Theoretically, the confidence interval is the range of values which have derived from a sample population and depict the behavior of an unknown sample belonging to the broader population group. As can be observed from the natural surrounding that No two groups are the same. Thus, if two different samples are taken from a single population they can also yield a difference in the range. But, if this process is repeated several times, some parts of the resulting confidence interval will be able to replicate the unknown behavior. The percentage is known as the confidence level. 

Differentiating confidence interval from confidence level:

Though the concept of the confidence interval and confidence level are intrinsically linked, they still are different concepts in the domain of statistics whereas in slang language that can still be used interchangeably. Usually, confidence intervals are depicted in percentage terms whereas confidence intervals are sheer numbers. For example, if a survey of lactating mothers is conducted to have an estimate of the number of diapers purchased by each one of them. The results yield a confidence level of 95% and we get a confidence interval of (100,200). This would mean that a lactating mother on an average buys diapers somewhere between 100 and 200 and this could be true for 95% times. Diagrammatically, it is being displayed in Fig. 1.

Known Standard Deviation

How to calculate the confidence interval for a population with a known standard deviation:

<meta charset="utf-8" />

One of the methods to compute the confidence interval is the use of error bound method. The other being the use of a central limit theorem. In this blog, we’ll be using the error bound method. This method gets its name from the fact that it gives the boundary of the interval as obtained from the standard deviation making its application a bit more straightforward. To compute the interval for a population with mean ?, where the standard deviation of the population is known, x and the margin of error is required. IN this case, the margin of error (EBM) is called the error bound for the population. 

The mathematical formulation for margin of error is as:

The margin of error depends upon the confidence level (CL) which is to be achieved. depicts the probability that an unknown population parameter is not contained in the confidence level.

Mathematically, 

<meta charset="utf-8" />=1-CL

For a population with known standard deviation, the confidence interval is based upon the assumption that sampling distribution follows approximately a normal distribution.

To achieve a confidence interval of 95% of the probability distribution, 95% of the area under the curve should be computed. I.e. one should leave a 2.5% area on the tails. This for a distribution whose mean is 10 and distribution is normal would like as in Figure 2:

Figure 2: area under the curve

Steps for calculating the confidence interval using the Error Bound Method:

  • First of all, calculate the mean of the sample data. 
  • Calculate the z-score using the standard normal table corresponding to the confidence level desired.
  • Compute the error bound for EBM
  • Define the confidence interval
  • Explain the final output using standard English literature.

Example:

Let us say that we are interested in finding the mean scores on an examination. Now, a random sample of 40 scores is taken and mean for which is 68 and the standard deviation is 3. Here, the point of interest is to find the confidence interval with 95?curacy.

So,

To calculate the confidence interval, the sample mean is an error bound is required.

  • EBM = 
  • ?=3;n=40 (Given)
  • Confidence interval is 95% (CL = 0.95)
  • Thus, ?=0.05

=> 

=>

Thus, the left area is  is 0.025 and right are is 0.095

Using the probability distribution table for the standard normal distribution we can infer that for a normal distribution the value of z for this case is 1.96.

Thus, my error bound, EBM  = 1.96*() = 12.397 

EBM = 0.93

Lower bound  = 68 – 0.93 = 67.07

Upper bound = 68 + 0.93 = 68.93

Thus, 95% confidence interval is (67.07, 68.93)

Unknown Standard Deviation

While doing statistics over a dataset, the standard deviation of the same is rarely known. Thus, the method as described in the previous section is seldom used. But, unknown standard deviation had never been an issue, owing to the large sizes of datasets from which the standard deviation was calculated making way for calculation of confidence interval. But if the dataset had fewer observations, the final was never good enough to be trusted. 

This problem was faced by William S. Gosset of Guinness brewery in Dublin when he was experimenting with hops and barley and generated very few observations. While performing the experimentation, he realized that normal distribution was not the ideal way of calculating the confidence interval. This problem resulted in the genesis of the student’s t-distribution. 

The student’s t-distribution is the symmetrical curve. It's different from a normal distribution in the sense that its flat over the x-axis and hence, can take a larger number of standard deviations to capture the same amount of area under the curve. In reality, there can be an infinite number of students t curve which can be achieved by altering the number of samples. With the increase in sample size, the t-distribution takes the shape of a normal curve and with fewer samples, it flattens towards the x-axis. The relationship between normal distribution and t-distribution can be depicted as in figure 3. 

Figure 3: Relationship (source:openstax.org)

Developing the student t-distribution:

Step 1: Calculate t using the following formula:

Where Z is the standard normal distribution and x2 is the chi-squared distribution with v degree of freedom.

Step 2: By Substituting the numerator and denominator with expanded values we get:

Step 3:

Finally, for v = n-1, this equation becomes:

Reiterating the confidence interval formula for the mean for scenarios where sample size is less than 30 and standard deviation is unknown, the confidence interval becomes:

Here the point estimate of the population standard deviation, s has been substituted for the population standard deviation, ?, and t?,? has been substituted for Z?.

V represents the degree of freedom of the distribution and varies according to the size of sample. 

Conclusion:

A number of types of estimates are compounded from the statistics of a dataset. One of them which shows the level of certainty in talk is called confidence interval. It proposes a range of values to which an unknown parameter belongs. This is associated with the confidence level that the proposed parameter is in the range. 

Please leave the query and comments in the comment section.


    Janbask Training

    A dynamic, highly professional, and a global online training course provider committed to propelling the next generation of technology learners with a whole new way of training experience.


Comments

Trending Courses

AWS

  • AWS & Fundamentals of Linux
  • Amazon Simple Storage Service
  • Elastic Compute Cloud
  • Databases Overview & Amazon Route 53

Upcoming Class

-0 day 05 Aug 2020

DevOps

  • Intro to DevOps
  • GIT and Maven
  • Jenkins & Ansible
  • Docker and Cloud Computing

Upcoming Class

5 days 10 Aug 2020

Data Science

  • Data Science Introduction
  • Hadoop and Spark Overview
  • Python & Intro to R Programming
  • Machine Learning

Upcoming Class

12 days 17 Aug 2020

Hadoop

  • Architecture, HDFS & MapReduce
  • Unix Shell & Apache Pig Installation
  • HIVE Installation & User-Defined Functions
  • SQOOP & Hbase Installation

Upcoming Class

2 days 07 Aug 2020

Salesforce

  • Salesforce Configuration Introduction
  • Security & Automation Process
  • Sales & Service Cloud
  • Apex Programming, SOQL & SOSL

Upcoming Class

5 days 10 Aug 2020

QA

  • Introduction and Software Testing
  • Software Test Life Cycle
  • Automation Testing and API Testing
  • Selenium framework development using Testing

Upcoming Class

2 days 07 Aug 2020

Business Analyst

  • BA & Stakeholders Overview
  • BPMN, Requirement Elicitation
  • BA Tools & Design Documents
  • Enterprise Analysis, Agile & Scrum

Upcoming Class

-0 day 05 Aug 2020

MS SQL Server

  • Introduction & Database Query
  • Programming, Indexes & System Functions
  • SSIS Package Development Procedures
  • SSRS Report Design

Upcoming Class

9 days 14 Aug 2020

Python

  • Features of Python
  • Python Editors and IDEs
  • Data types and Variables
  • Python File Operation

Upcoming Class

2 days 07 Aug 2020

Artificial Intelligence

  • Components of AI
  • Categories of Machine Learning
  • Recurrent Neural Networks
  • Recurrent Neural Networks

Upcoming Class

2 days 07 Aug 2020

Machine Learning

  • Introduction to Machine Learning & Python
  • Machine Learning: Supervised Learning
  • Machine Learning: Unsupervised Learning

Upcoming Class

11 days 16 Aug 2020

Tableau

  • Introduction to Tableau Desktop
  • Data Transformation Methods
  • Configuring tableau server
  • Integration with R & Hadoop

Upcoming Class

2 days 07 Aug 2020

Search Posts

Reset

Receive Latest Materials and Offers on Data Science Course

Interviews