Mean, Variance and Standard deviation of a sample
Estimating variance and standard deviation of a sample drawn from a normal distribution
In this notebook, we will see how to correctly estimate variance and standard deviation from a sample drawn from a normal distribution. This notebook is inspired from one of the videos of Joshua Starmer.
# Let's first load our numerical Python library
import numpy as np
# Get a normal distribution with mean 50 and standard deviation of 10.
population = np.random.normal(50,10,1000)
print(population)
Our population is a normal(Gaussian) distribution with a mean of 50 and standard deviation of 10. Let's plot it...
from matplotlib import pyplot as plt # For plotting
plt.hist(population,50)
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.text(15, 45, r'$\mu=50, \sigma^2=10$')
plt.title("histogram")
plt.show()
Now, we will draw a random sample from our normal distribution.
# Draw 100 random samples from population
sample = np.random.choice(population, 100)
# calculate mean of sample
mean_sample = np.mean(sample)
print("sample mean = ", mean_sample)
We can see that the mean of the sample (49.81) is near to the actual mean of population. This was expected as we have less data in the sample. Let's now try to estimate the variance and standard deviation of the population from the sample data. Please note that one can only estimate and not calculate the variance and standard deviation of the population. This is because most of the times, we don't have the original population but only a small sample. The formula to estimate the variance and standard deviation of the population is as follows:
$$\sigma^2_{estimated} = \frac{\Sigma (x - \bar{x})}{n-1} $$
Where, $\bar{x}$ is the mean of the sample, x is the individual value in the sample, and n is total number of measurements in the sample (100 in our case). Please note that in the equation, we use $n-1$ and not $n$. This is because we will underestimate our variance if we divide by $n$. For a very detailed proof, please refer to this video. Let's see if this is true...
# estimate variance and standard deviation of the sample
est_variance = np.var(sample, ddof = 1) # ddof = 1 means we divide by total number of measurements - 1
std_sample = np.sqrt(est_variance)
print("Estimated variance = ", est_variance, "and estimated standard deviation = ", std_sample)
Let's now estimate incorrect variance and standard deviation
# incorrect variance and standard deviation
incorrect_variance = np.var(sample)
incorrect_std_sample = np.std(sample)
print(f'Incorrect variance is {incorrect_variance} and biased standard deviation is {incorrect_std_sample}')
As one can see, we underestimated both variance and standard deviation if we use incorrect formula.