Context:

ExperienceMyServices reported that a typical American spends an average of 144 minutes (2.4 hours) per day accessing the Internet via a mobile device with a standard deviation of 110 minutes.

To test the validity of this statement, you collected a sample of 30 observations from friends and family. The time each person spends per day accessing the Internet via a mobile device (in minutes) is stored in "InternetMobileTime.csv".

Key Question:

Is there enough statistical evidence to conclude that the population mean time spent per day accessing the Internet via mobile device is different from 144 minutes? Use the p-value approach and a level of significance of 0.05.

Note: We can assume that the samples are randomly selected, independent, and come from a normally distributed population.

Importing necessary libraries

In [ ]:
# import the important packages
import pandas as pd  # library used for data manipulation and analysis
import numpy as np  # library used for working with arrays
import matplotlib.pyplot as plt  # library for visualization
import seaborn as sns  # library for visualization
%matplotlib inline

import scipy.stats as stats  # this library contains a large number of probability distributions as well as a growing library of statistical functions
In [1]:
# Mount Google Drive so the notebook can read files stored there
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive

Importing the Data

In [ ]:
mydata = pd.read_csv('InternetMobileTime.csv')
mydata.head()
Out[ ]:
Minutes
0 72
1 144
2 48
3 72
4 36
In [ ]:
mydata.shape
Out[ ]:
(30, 1)

Step 1: Define null and alternate hypotheses

The null hypothesis states that the mean Internet usage time, $\mu$, is equal to 144.

The alternative hypothesis states that the mean Internet usage time, $\mu$, is not equal to 144.

  • $H_0$: $\mu$ = 144
  • $H_a$: $\mu$ $\neq$ 144

Step 2: Decide the significance level

Here, we are given that $\alpha$ = 0.05.

In [ ]:
print("The sample size for this problem is", len(mydata))
The sample size for this problem is 30

Step 3: Identify the test statistic

The population is assumed to be normally distributed, and the population standard deviation is taken to be the reported value of 110 minutes. Since the population standard deviation is treated as known, we can use the Z-test statistic.
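
For reference, the one-sample z-statistic used in the next step can be written as follows, where $\bar{x}$ is the sample mean, $\mu_0$ = 144 is the hypothesized mean, $\sigma$ = 110 is the assumed population standard deviation, and $n$ = 30 is the sample size. The symbol $\mu_0$ is introduced here for illustration only; the notebook code calls it `mu`.

$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$

With these values, the standard error $\sigma / \sqrt{n}$ is $110 / \sqrt{30} \approx 20.08$ minutes.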

Step 4: Calculate the p-value using z-statistic

In [ ]:
sample_mean = mydata["Minutes"].mean()
In [ ]:
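# calculating the z-statistic

n = len(mydata)  # sample size (30)
mu = 144  # hypothesized population mean
sigma = 110  # population standard deviation (assumed known)

# z = (sample mean - hypothesized mean) / standard error
test_stat = (sample_mean - mu) / (sigma / np.sqrt(n))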
In [ ]:
test_stat
Out[ ]:
1.8157832663959144
In [ ]:
from scipy.stats import norm

# one-tailed p-value: area in the tail beyond |z|
# (abs() keeps the calculation correct when the statistic is negative)
p_value1 = 1 - norm.cdf(abs(test_stat))

# the two-tailed p-value is twice the one-tailed p-value
p_value_ztest = p_value1 * 2
In [ ]:
print('The p-value is: {0} '.format(p_value_ztest))
The p-value is: 0.06940362517785204 
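
The two-tailed calculation above can also be wrapped in a small helper. The following is a minimal sketch, not part of the original notebook: the function name `two_sided_p_value` is hypothetical, and it uses `norm.sf` (the survival function, equal to 1 - cdf) on the absolute value of the statistic so that it gives the same answer whether the z-statistic is positive or negative.

```python
from scipy.stats import norm


def two_sided_p_value(z):
    """Two-sided p-value for a z-statistic, valid for either sign of z."""
    # norm.sf(x) is the upper-tail area 1 - norm.cdf(x); doubling it
    # accounts for both tails of the standard normal distribution
    return 2 * norm.sf(abs(z))


z = 1.8157832663959144  # z-statistic computed in Step 4
print(two_sided_p_value(z))
print(two_sided_p_value(-z))  # same p-value by symmetry
```

Both calls should agree with the p-value printed above, up to floating-point rounding.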

Step 5: Decide to reject or not to reject the null hypothesis based on the z-statistic

In [ ]:
alpha_value = 0.05 # level of significance

print('Level of significance: %.2f' %alpha_value)

# reject the null hypothesis only if the p-value falls below the significance level
if p_value_ztest < alpha_value:
    print('We have enough evidence to reject the null hypothesis since the p-value is less than the level of significance')
else:
    print('We do not have enough evidence to reject the null hypothesis since the p-value is greater than the level of significance')
Level of significance: 0.05
We do not have enough evidence to reject the null hypothesis since the p-value is greater than the level of significance

The z-test above assumes that the population standard deviation is known. In practice, this is rarely the case. When it is unknown, we use the one-sample t-test instead: the test statistic has the same form as the z-statistic, but the sample standard deviation replaces the population standard deviation, and the statistic is compared against a t distribution with $n - 1$ degrees of freedom.

We will use scipy.stats.ttest_1samp, which performs a t-test for the mean of one sample given the sample observations. By default, this function returns the t-statistic and the p-value for a two-tailed test.
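
For reference, the one-sample t-statistic has the following form, where $s$ is the sample standard deviation computed with $n - 1$ in the denominator. The symbol $\mu_0$ for the hypothesized mean (144 here) is introduced for illustration only.

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}, \qquad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$

Under the null hypothesis and the normality assumption, $t$ follows a t distribution with $n - 1$ = 29 degrees of freedom.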

Step 6: Calculate the p-value using t-statistic

In [ ]:
# pass the "Minutes" column (a Series) so that scalars, not one-element arrays, are returned
t_statistic, p_value_ttest = stats.ttest_1samp(mydata["Minutes"], popmean=144)
print('One sample t test \nt statistic: {0:.4f} p value: {1:.4f} '.format(t_statistic, p_value_ttest))
One sample t test 
t statistic: 1.4113 p value: 0.1688 
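
To see what `ttest_1samp` computes, the t-statistic and two-tailed p-value can be reproduced by hand. The sketch below is illustrative only: the first five values in `sample` come from the `head()` output earlier, but the remaining values are made up, so the numbers it prints are not the notebook's results. The names `sample`, `mu0`, `t_manual`, and `p_manual` are hypothetical.

```python
import numpy as np
from scipy import stats

# Illustrative data: first five values from mydata.head(), the rest invented
sample = np.array([72, 144, 48, 72, 36, 360, 12, 100, 300, 240], dtype=float)
mu0 = 144  # hypothesized population mean

n = sample.size
s = sample.std(ddof=1)  # sample standard deviation (n - 1 in the denominator)

# t = (sample mean - hypothesized mean) / estimated standard error
t_manual = (sample.mean() - mu0) / (s / np.sqrt(n))

# two-tailed p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# the library call should agree with the manual calculation
t_scipy, p_scipy = stats.ttest_1samp(sample, popmean=mu0)
print(t_manual, p_manual)
print(t_scipy, p_scipy)
```

The two printed lines should match up to floating-point rounding, which confirms that the only differences from the z-test are the use of the sample standard deviation and the t distribution.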

Step 7: Decide to reject or not to reject the null hypothesis based on t-statistic

In [ ]:
alpha_value = 0.05 # level of significance

print('Level of significance: %.2f' %alpha_value)

# reject the null hypothesis only if the p-value falls below the significance level
if p_value_ttest < alpha_value:
    print('We have enough evidence to reject the null hypothesis since the p-value is less than the level of significance')
else:
    print('We do not have enough evidence to reject the null hypothesis since the p-value is greater than the level of significance')
Level of significance: 0.05
We do not have enough evidence to reject the null hypothesis since the p-value is greater than the level of significance

Observation

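  • At the 5% significance level, we do not have enough statistical evidence to conclude that the population mean time spent per day accessing the Internet via a mobile device is different from 144 minutes. Both the z-test (p-value ≈ 0.069) and the t-test (p-value ≈ 0.169) lead to the same decision.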