Measures of location describe the most “representative” value in a population.
\[E(X) = \mu = \frac{1}{N}\sum_{i=1}^{N} x_i \]
Measures of scale describe how far away from the mean most of the individuals in the population are.
\[Var(X) = \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2\]
\(\Rightarrow\) Question: Why is the SD a more intuitive measure of dispersion than the variance?
\[\bar{X} = \frac{\sum x_i}{n}, \ \ s^2 = \frac{\sum(x_i - \bar{X})^2}{n-1}\] These statistics are estimators of the true (usually unknown) population parameters.
\(\Rightarrow\) Q: Why does the sample variance have \(n-1\) degrees of freedom?
The median and inter-quartile range (IQR) are alternative measures of location and scale that are robust to outliers.
\(\Rightarrow\) Q: What are the rules of thumb for the IQR?
\(\Rightarrow\) Q: What are the rules of thumb for the SD?
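For reference, all of these statistics are one-liners in R. Here is a quick illustration on a small made-up vector of measurements (the numbers are invented purely for this example):

# a toy sample with one obvious outlier
x = c(4.1, 5.3, 5.0, 6.2, 4.8, 5.5, 12.9)

mean(x)     # sample mean (pulled upward by the outlier)
var(x)      # sample variance (n-1 denominator)
sd(x)       # sample standard deviation
median(x)   # robust measure of location
IQR(x)      # robust measure of scale (inter-quartile range)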
We’ve talked a lot about why taking random samples is important for obtaining representative estimates of population parameters, and about things to watch out for in order to minimize sample bias. (What are the possible sources of sample bias?)
\(\Rightarrow\) Q: What effects does sampling error have on our sample estimates?
Once we have taken a (hopefully unbiased) sample from a population, we can look at the distribution of our measurements for each individual in the sample.
For example, let’s simulate the height distribution of everyone who was in this class in each of the last four years by taking four random samples of 17 people (assuming biologists represent an unbiased sample of the human population!).
# ================================================================ #
# packages used below (these may already be loaded earlier in the document)
library(ggplot2)   # plotting
library(dplyr)     # %>%, group_by(), summarise()
library(ggpubr)    # ggarrange(), annotate_figure(), text_grob()

# number of samples and sample size
sample_size = 17   # sample size
sample_num  = 4    # number of samples
sample_mean = 169  # population mean height
sample_sd   = 14   # population SD
# ================================================================ #
# matrix of random samples
sample_data = replicate(sample_num,
                        rnorm(sample_size, mean = sample_mean, sd = sample_sd))
sample_data = matrix(sample_data, ncol=sample_num)
colnames(sample_data) = paste0("s", 1:sample_num) # label the columns

# reshape to long format for ggplot
height_long = stack(data.frame(sample_data, stringsAsFactors = TRUE))
names(height_long) = c("Height_mm","Sample")
# ================================================================ #
# make a second data frame holding height means for each sample
height_means = height_long %>%
  group_by(Sample) %>%
  summarise(mean_height = mean(Height_mm))
# height_means
# ================================================================ #
# plots
p1 = ggplot(height_long %>% filter(Sample == "s1"), aes(x=Height_mm)) +
  geom_histogram(fill="lightblue", color="darkgray", binwidth = 2) +
  geom_vline(data = height_means %>% filter(Sample == "s1"),
             aes(xintercept=mean_height)) +
  xlab("Height (mm)")

p2 = ggplot(height_long %>% filter(Sample == "s2"), aes(x=Height_mm)) +
  geom_histogram(fill="lightblue", color="darkgray", binwidth = 2) +
  geom_vline(data = height_means %>% filter(Sample == "s2"),
             aes(xintercept=mean_height)) +
  xlab("Height (mm)")

p3 = ggplot(height_long %>% filter(Sample == "s3"), aes(x=Height_mm)) +
  geom_histogram(fill="lightblue", color="darkgray", binwidth = 2) +
  geom_vline(data = height_means %>% filter(Sample == "s3"),
             aes(xintercept=mean_height)) +
  xlab("Height (mm)")

p4 = ggplot(height_long %>% filter(Sample == "s4"), aes(x=Height_mm)) +
  geom_histogram(fill="lightblue", color="darkgray", binwidth = 2) +
  geom_vline(data = height_means %>% filter(Sample == "s4"),
             aes(xintercept=mean_height)) +
  xlab("Height (mm)")

composite_plot = ggarrange(p1, p2, p3, p4, nrow=2, ncol=2)
annotate_figure(composite_plot, top = text_grob("Class heights, 2018-2021",
                                                face = "bold", size = 14))
Each time we take a new sample, there will be some random variation in the sample mean, which will usually not match the population mean precisely.
Let’s take another look at how much the average height of our class varied over the last four years. We will use a density plot rather than a histogram to make the patterns come out better.
# ================================================================ #
ggplot(height_long, aes(x=Height_mm, fill=Sample, color=Sample)) +
  geom_density(alpha=0.2) +
  geom_vline(data = height_means, aes(xintercept=mean_height, color=Sample)) +
  xlab("Height (mm)")
Since the mean height of each class will always be a little different, what can we do to figure out how well our sample estimates represent the true population parameters?
Is there a way for us to know which mean estimate is closest to the true population average? And how much variation might we expect to see in the population?
Let’s say we know that the average height of humans around the world is 169mm. If we take many, many samples of the same size as before (17 people) from the same population, and record the mean value from each of these, we can plot the distribution of these sample means:
x = replicate(10000, {
  rsample = rnorm(sample_size, mean = sample_mean, sd = sample_sd)
  mean(rsample)
})

mean_height = data.frame(Mean_Height = x)

ggplot(mean_height, aes(x=Mean_Height)) +
  geom_histogram(fill="lightblue", color="darkgray") +
  geom_vline(data = mean_height, aes(xintercept=mean(Mean_Height))) +
  labs(title = "Distribution of 10,000 sample means", x="Mean height (mm)")
Now the average value of all of these sample means looks pretty spot on!
However, the variation in our sample means is still rather large, so we might not have that much confidence in the precision of our estimate from any single sample.
The sampling distribution of a sample estimator for a population parameter represents the distribution of all possible values for the sample statistic derived from a particular sample size, given an infinite number of samples drawn from the same population.
Fortunately, statistics for sampling distributions can be computed from a single sample - so you don’t actually need to take numerous samples of \(N\) individuals (or independent measurements from a population) to get a good idea of how precise your sample estimates are! Whew.
Below we will illustrate empirically that this is the case.
One of the most important sampling distributions is the sampling distribution of the sample mean, \(\bar{X}\). When we talk about the distribution of sample means, the sample mean is now our random variable!
Let’s just let that sink in: the sample mean is itself a random variable, and its sampling distribution describes how it varies from sample to sample.
\(\Rightarrow\) Q: Why is the sampling distribution of \(\bar{X}\) of particular interest to us?
\(\Rightarrow\) Q: How do we quantify the precision of \(\bar{X}\)?
\(\Rightarrow\) Q: How does knowing the precision of \(\bar{X}\) help us quantify our confidence that our sample estimate reflects the true population mean?
With a sufficiently large sample size \(N\), the sample mean \(\bar{X}\) will converge on the population mean \(\mu\), and its sampling distribution will have variance \(\sigma^2/N\). Obviously, when \(N\) equals the entire population, the sample mean will exactly equal the population mean!
In the notation for distributions (which we will learn more about soon) we can describe the distribution of \(\bar{X}\) as follows:
\[\bar{X} \sim \mathcal{N}(\mu,\frac{\sigma^2}{N})\]
In words, this equation says that: “The sample mean X bar follows a normal distribution with mean mu and variance equal to sigma squared divided by the sample size N.”
This point is worth repeating: “The uncertainty in \(\bar{X}\) decreases with increasing sample size.”
The standard deviation of the sampling distribution of a statistic is called its standard error; for the sample mean, this is the standard error of the mean (SEM).
The SEM provides a measure of how much the variable \(\bar{X}\) is expected to differ from sample to sample, i.e. the precision with which we are able to estimate the true population mean \(\mu\).
As noted above, the variance of \(\bar{X}\) is dependent on the sample size \(N\), and is equal to the population variance \(\sigma^2\) divided by \(N\):
\[Var(\bar{X}) = \frac{\sigma^2}{N} \] The SEM is simply the square root of \(Var(\bar{X})\):
\[\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{N}}\]
When the sample includes the entire population, then we know the population mean precisely. In practice, however, we usually do not have access to the entire population.
Instead, we can use the sample standard deviation, \(s\), as an estimate for the population parameter \(\sigma\):
\[ SE_{\bar{X}} = \frac{s}{\sqrt{N}} \approx \frac{\sigma}{\sqrt{N}} \]
This also allows us to approximate the SD of the sampling distribution of \(\bar{X}\) (the SEM) using the SD of a single sample.
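For example, for the simulated height data above (population SD of 14) and samples of size \(N = 20\), the SEM works out to:
\[SE_{\bar{X}} = \frac{14}{\sqrt{20}} \approx 3.1\]
which closely matches the SD of the sampling distribution of 10,000 simulated sample means that we compute below.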
Let’s compare the mean and SD of two sampling distributions of different sample sizes.
# distribution of x-bar for sample of size 20
x20 = replicate(10000, {
  rsample = rnorm(n=20, mean=169, sd=14)
  mean(rsample)
})
mean_height20 = data.frame(Mean_Height = x20)

# distribution of x-bar for sample of size 100
x100 = replicate(10000, {
  rsample = rnorm(n=100, mean=169, sd=14)
  mean(rsample)
})
mean_height100 = data.frame(Mean_Height = x100)
# plot for x20
hist20 = ggplot(mean_height20, aes(x=Mean_Height)) +
  geom_histogram(fill="lightblue", color="darkgray") +
  geom_vline(aes(xintercept=mean(Mean_Height))) +
  labs(title = "Sample size: 20", x="Mean height (mm)") +
  xlim(c(154,184))

# plot for x100
hist100 = ggplot(mean_height100, aes(x=Mean_Height)) +
  geom_histogram(fill="lightblue", color="darkgray") +
  geom_vline(aes(xintercept=mean(Mean_Height))) +
  labs(title = "Sample size: 100", x="Mean height (mm)") +
  xlim(c(154,184))

composite_hist = ggarrange(hist20, hist100, nrow=1, ncol=2)
annotate_figure(composite_hist,
                top = text_grob("Sampling distribution of the sample mean (10,000 samples)",
                                face = "bold", size = 12))
Now let’s compare the SD of each sampling distribution to the SEM computed from a single sample:
# sd of x20 vs. sem from a single sample of size 20
x = rnorm(20, 169, 14)
sem20 = sd(x)/sqrt(20)

sd(x20)
## [1] 3.11395
sem20
## [1] 2.290034

# sd of x100 vs. sem from a single sample of size 100
y = rnorm(100, 169, 14)
sem100 = sd(y)/sqrt(100)   # SEM estimated from the size-100 sample

sd(x100)
## [1] 1.397376
sem100
## [1] 1.024134
These are not exactly the same because they are different random samples, but they are pretty close! And using either method, we can clearly see that the SEM is much smaller for the larger sample size.
A Confidence Interval (CI) gives a range of values within which we would expect the true population parameter to fall most of the time. Confidence intervals can be calculated for all kinds of sample statistics!
\(\Rightarrow\) Q: What is the most common mistake people make in interpreting confidence intervals?
This animation contains a very intuitive visualization of what a CI of a sample is. Take a look at this and play around with changing the parameters.
\(\Rightarrow\) Q: How do the confidence intervals change with increasing sample size?
\(\Rightarrow\) Q: How do you think the sample mean affects the range of a confidence interval?
Since the sampling distribution of the sample mean approximates a normal distribution, we can easily deduce that the 95% CI is approximately equal to the sample mean plus or minus two times the SEM:
\[95\% CI \approx \bar{X} \pm 2*SE_\bar{X}\] To be more precise, we can use a \(z\)-score of 1.96 instead of 2 for the limits of the range, which more exactly demarcates the central 95% of the distribution.
For a 95%CI, we expect the true population mean to fall somewhere within this range:
\[ \bar{X} - 1.96 * SE_\bar{X} < \mu < \bar{X} + 1.96 * SE_\bar{X} \]
\(\Rightarrow\) Q: Why does it make sense that the 95% CI spans \(\pm\) 2 times the SEM?
\(\Rightarrow\) Q: How many 95% CI’s computed from 100 random samples of height measurements in the human population will contain the true population mean?
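We can check this kind of statement empirically with a quick simulation. The sketch below reuses the height parameters assumed above (mean 169, SD 14), draws 100 random samples of size 100, and counts how many of the resulting 95% CIs contain the true mean:

# simulate 100 samples and count how many 95% CIs contain the true mean
true_mean = 169
true_sd   = 14
n         = 100

contains_mu = replicate(100, {
  s  = rnorm(n, mean = true_mean, sd = true_sd)
  ci = mean(s) + c(-1.96, 1.96) * sd(s)/sqrt(n)   # 95% CI from this sample
  ci[1] < true_mean & true_mean < ci[2]
})
sum(contains_mu)   # usually somewhere around 95, but it varies from run to run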
Both the SE and the CI are heavily dependent on sample size. Consequently, this should be an important consideration in designing experiments, which we will get to later in the course.
The SEM and CI for smaller samples will always be less precise than estimates based on larger samples.
In hypothesis testing, which we will discuss next week, we choose a significance threshold like \(p=0.05\) to reject the null hypothesis that our sample comes from the null distribution.
Correspondingly, if the 95% CI does not contain the mean for the null hypothesis, then the \(p\)-value for our sample statistic is less than 0.05. We will discuss this in a lot more detail in coming weeks.
The CLT is one of the most fundamental concepts in statistics and encapsulates the issues discussed above.
Briefly, it says that as the sample size increases, the sample mean will converge on the true population mean, and its sampling distribution will be approximately normal.
The CLT highlights two important properties of means of independent and identically distributed (iid) random variables: (1) their sampling distribution approaches a normal distribution as the sample size increases, regardless of the shape of the distribution the observations are drawn from; and (2) the sample mean converges on the true population mean as the sample size grows.
Since the SEM decreases as the square root of the sample size, in the limit our uncertainty in the sample mean \(\bar{X}\) will go to zero as \(N\) goes to infinity (and therefore \(\bar{X}\) will exactly equal the true population mean \(\mu\)):
\[\lim_{N \rightarrow \infty}SEM = \lim_{N \rightarrow \infty}\frac{\sigma}{\sqrt{N}} = 0\]
This property, that the standard error of the mean approaches zero as the sample size increases, is known in statistics books as the law of large numbers, or the law of averages.
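As a minimal illustration of the law of large numbers, we can track the running mean of a growing number of iid draws; here we use draws from a Uniform(0,1) distribution (whose true mean is 0.5), which is also the distribution we turn to in the next section:

# law of large numbers: the running mean converges to the true mean (0.5)
draws = runif(10000, min=0, max=1)
running_mean = cumsum(draws) / seq_along(draws)
running_mean[c(10, 100, 1000, 10000)]   # gets closer and closer to 0.5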
Here we illustrate the concepts we have just discussed by sampling from a uniform distribution, where the probability of any outcome is the same.
\(\Rightarrow\) Q: Can you think of any cases where the distribution we are sampling from is uniform?
Regardless of the specific uniform distribution we sample from, the lesson is the same.
Below we sample from a continuous uniform distribution, in which all possible real values in an interval are represented with equal probability.
\(\Rightarrow\) Q: Which of the above scenarios are examples of a continuous uniform distribution?
To help us visualize the results, we can write a function that takes a sample size and returns a ggplot object. Then, we can just call the function a bunch of times and draw the plots.
# a function to draw a histogram for a sample of size 'size' between zero and one
# the function takes the desired sample size and returns a ggplot object
runif.hist = function(size){
  return(
    ggplot(data.frame(trial = rep(1:size),
                      value = runif(size, 0, 1)),
           aes(x=value)) +
      geom_histogram(binwidth=.1, boundary=0,
                     fill="lightseagreen", color="black") +
      geom_vline(aes(xintercept = mean(value)), color="red") +
      ggtitle(paste("Sample size:", size, sep=" ")) +
      theme_classic() #+
      # theme(plot.title = element_text(size = 12))
  )
}
# test out drawing samples of several different sizes
ggarrange(runif.hist(10),runif.hist(100),
runif.hist(1000),runif.hist(10000),
nrow=2, ncol=2)
If we repeated this exercise over and over, we would see exactly what we already expect: the larger the sample, the more uniform the histogram looks, and the closer the sample mean lies to 0.5.
We can also make an interactive plot (using RShiny) and experiment with changing the sample size on the fly [see the Shiny version of this document for interactive plot].
So far, we’ve just looked at the distributions of individual samples of a particular size drawn from a uniform continuous distribution.
Now let’s look at the mean of multiple samples. This will give us some idea of how precisely we can estimate the population mean, given a particular sample size.
Of course, the true mean of a continuous uniform distribution is the midpoint of its range (here, that would be 0.5).
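More generally, a continuous uniform distribution on the interval \([a, b]\) has mean and variance
\[\mu = \frac{a+b}{2}, \qquad \sigma^2 = \frac{(b-a)^2}{12}\]
so for a Uniform(0,1) distribution, \(\mu = 0.5\) and \(\sigma = 1/\sqrt{12} \approx 0.29\).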
We can use the replicate() function to get the sample mean for each of 100 samples of size 10, and then visualize the results as a histogram:

# Sample means for each of n samples of 10 observations
n.samples   = 100
sample.size = 10
sample.means = replicate( n.samples, mean( runif(sample.size, min=0, max=1) ) )
#sample.means

# make a histogram of the results
ggplot(data.frame(sample.name = (1:n.samples),
                  sample.mean = sample.means),
       aes(x=sample.mean)) +
  geom_histogram(binwidth=0.05, boundary=0,
                 fill="rosybrown", color="black") +
  xlim(0,1) +
  ggtitle(paste(n.samples,"sample means for random samples of size",sample.size,sep=" "))
We can see from this that there is quite a lot of variation in the sample means!
Now let’s turn this into a function, which will make it a lot easier to visualize our results for different numbers and sizes of samples.
# create a function that takes two parameters (with defaults):
# n.samples = number of times to replicate the sampling
# sample.size = number of observations for each sample
# return a ggplot histogram object
mean.runif.hist = function(n.samples=100, sample.size=10) {

  # generate n.samples and compute the sample mean for each sample
  x.bar = replicate(n.samples, mean(runif(sample.size, min=0, max=1)))
  sample.means = data.frame(sample.name = 1:n.samples,
                            sample.mean = x.bar )

  # plot the distribution of sample means
  ggplot(sample.means, aes(x=x.bar)) +
    geom_histogram(binwidth=0.02, fill="indianred1", color="black", alpha=0.5) +
    xlim(0,1) +
    # below is a trick to limit the number of significant digits
    # displayed for the mean and SD (2 for 100, 3 for 1000, etc.)
    ggtitle(paste("n=",n.samples,", size=",sample.size,"\n",
                  "(mean=", signif(mean(x.bar), log10(n.samples)),
                  ", sd=", signif(sd(x.bar), log10(n.samples)),")",
                  sep="")) +
    theme(text=element_text(size=9)) +
    # draw vlines for the mean and SD of the sample means
    geom_vline(aes(xintercept=mean(x.bar)), color="turquoise1", size=1) +
    geom_vline(aes(xintercept=mean(x.bar) + sd(x.bar)),
               color="blue", linetype="dotted", size=0.5) +
    geom_vline(aes(xintercept=mean(x.bar) - sd(x.bar)),
               color="blue", linetype="dotted", size=0.5)
}
Below we experiment with how the distributions of the sample means change as the \(n\) and \(size\) parameters vary, using different combinations of \(n\) and \(size\) across several orders of magnitude (e.g. 10, 100, 1000). First, we hold sample.size constant at 10 (top row) or 100 (bottom row) and vary n.samples from 10 to 1000:
# vary the number of samples ('n.samples') for fixed sample size ('sample.size')
mean_plots = ggarrange(

  # size = 10
  mean.runif.hist(n.samples=10,   sample.size=10),
  mean.runif.hist(n.samples=100,  sample.size=10),
  mean.runif.hist(n.samples=1000, sample.size=10),

  # size = 100
  mean.runif.hist(n.samples=10,   sample.size=100),
  mean.runif.hist(n.samples=100,  sample.size=100),
  mean.runif.hist(n.samples=1000, sample.size=100),

  nrow=2, ncol=3 )

annotate_figure(mean_plots, top = text_grob("Sample means drawn from a uniform distribution",
                                            face = "bold", size = 14))
What happens if we instead hold the number of samples n.samples constant at 10 or 100, and then plot the means of samples of different sizes (e.g. 10, 100, or 1000)?
# vary the sample size ('sample.size') for fixed number of samples ('n.samples')
mean_plots = ggarrange(

  # n = 10
  mean.runif.hist(n.samples=10,  sample.size=10),
  mean.runif.hist(n.samples=10,  sample.size=100),
  mean.runif.hist(n.samples=10,  sample.size=1000),

  # n = 100
  mean.runif.hist(n.samples=100, sample.size=10),
  mean.runif.hist(n.samples=100, sample.size=100),
  mean.runif.hist(n.samples=100, sample.size=1000),

  nrow=2, ncol=3 )

annotate_figure(mean_plots, top = text_grob("Sample means drawn from a uniform distribution",
                                            face = "bold", size = 14))
We can use this function to make an interactive histogram of the sampling distribution of sample means and experiment with how the plot changes when we vary the \(n\) and \(size\) parameters across several orders of magnitude [see the Shiny version of this document for interactive plot].
\(\Rightarrow\) Q: What happened when you kept the sample size the same, but increased the number of sample sets?
\(\Rightarrow\ \) Q: How did the sample distribution change with increasing sample size when you held the number of samples constant?
We can also use box plots to summarize these distributions, which make it a little easier to compare them visually.
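Note that the code below calls sample.means.runif(), which is not defined in the code shown here; it simply returns a vector of sample means. A minimal definition consistent with how it is used would be:

# assumed helper: return the means of 'n.samples' samples of size 'sample.size'
# drawn from a Uniform(0,1) distribution
sample.means.runif <- function(n.samples, sample.size) {
  replicate(n.samples, mean(runif(sample.size, min=0, max=1)))
}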
# make 100 samples of different sizes
a <- sample.means.runif(100, 10)
b <- sample.means.runif(100, 100)
c <- sample.means.runif(100, 1000)
d <- sample.means.runif(100, 10000)

# boxplot() draws the groups from bottom to top in the order given,
# so the largest sample size (d) appears at the bottom
boxplot(d, c, b, a,
        horizontal=TRUE, ylim=c(0.25,0.75), range=1, notch=T,
        names=c("10k","1k","100","10"),
        xlab = "Distribution of sample means",
        ylab = "Sample Size",
        main="Distribution of sample means for \n 100 samples of increasing sample size"
)
We can also make an interactive boxplot to visualize how varying the sample sizes for 100 samples changes the distribution of the sample means [see the Shiny version of this document for interactive plot].
The following table summarizes these results.
| Sample Size | Mean Range | Mean of Sample Means | SEM    |
|-------------|------------|----------------------|--------|
| 1           | 0-1        | NA                   | ~0.3   |
| 10          | 0.2-0.8    | ~0.5                 | ~0.1   |
| 100         | 0.4-0.6    | ~0.5                 | ~0.03  |
| 1000        | 0.47-0.53  | 0.50                 | ~0.01  |
| 10000       | 0.49-0.51  | 0.50                 | ~0.003 |
These tests show empirically that we need a 100-fold increase in the sample size in order to get a 10-fold decrease in the SEM.
So, the SEM indeed decreases as the square root of the sample size.
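To see why, note that since \(SEM = \sigma/\sqrt{N}\), increasing the sample size from 100 to 10,000 shrinks the SEM by a factor of
\[\frac{SEM_{100}}{SEM_{10000}} = \frac{\sigma/\sqrt{100}}{\sigma/\sqrt{10000}} = \sqrt{\frac{10000}{100}} = 10.\]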
It is very rare that we know the true population parameters. We can report our uncertainty about how well a random variable estimates a population parameter using a confidence interval (CI).
We expect that a 95% CI of the mean will contain the true population mean 95% of the time. It is typical to see 90%, 95%, and 99% confidence intervals.
Recall that the sampling distribution of \(\bar{X}\) is approximately normal, and that about 95% of a normal distribution falls within roughly two standard deviations of its mean.
Therefore, our random variable, \(\bar{X}\), should be contained within around two standard errors of the true mean 95% of the time (even though every once in a while it will be rather far off, because we are taking random samples).
The 95% CI specifies that we expect 95% of random intervals \(\overline{X} \pm \sim 1.96 * SEM\) to contain the true population mean. Since \(SEM = s_x/\sqrt{N}\), combining these gives us:
\[95\%\ CI \approx \overline{X} \pm 1.96\frac{s_x}{\sqrt{N}}\]
So, with the sample mean, SD, and sample size in hand, we are almost ready to find the limits of the 95% CI.
There is one additional piece of information that will help us: the precise z-score (number of SDs away from the mean) that provides the exact boundary for the central 95% of the data (since even 1.96 is an approximation!)
Another way to represent the 95% CI is to write it as a \(100(1-\alpha)\%\) CI, where \(\alpha = 0.05\) is the 5% of the distribution we want to exclude.
Since we want to split the remaining 5% between the two tails (2.5% at the bottom and 2.5% at the top), we want to find the \(z\)-score corresponding to the 97.5th percentile, or \(1-\alpha/2\), which we can add or subtract from the mean to get the CI.
So the 95% CI becomes:
\[ 100(1-\alpha)\%\ CI = \bar{X} \pm z_{1-(\alpha/2)}*\frac{\sigma}{\sqrt{N}}\]
where \(z_{1-(\alpha/2)}\) is the \(z\)-quantile function at probability \(1-(\alpha/2)\).
For a 95% CI, \(\alpha = 0.05\). So, to find the correct z-score for the limits of the 95% interval, we want to find the \(z\)-score for the 97.5th percentile.
To do this, we will use the quantile function qnorm(). What we want is the value of qnorm() at the 97.5th percentile, which will be something very close to 1.96:

qnorm(0.975)
## [1] 1.959964
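Because the normal distribution is symmetric, the z-score marking the lower 2.5% boundary is just the negative of this value:

qnorm(0.025)
## [1] -1.959964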
Now we can plug this back into our equation for the 95% CI to get the ranges of the CI’s for different sample sizes.
\(\Rightarrow\ \) Calculate the 95% CI for 4 samples ranging in size from 10-10,000.
# use qnorm to find the z-score for the 95% CI (2.5 - 97.5 percentile range)
# Since normal is symmetric, we can add and subtract this to get the CI.
Q <- qnorm(0.975)  # 1.959964...

# compute the mean, SEM, and CI of our samples
for ( i in c(10, 100, 1000, 10000) ) {
  sample      <- runif(i, min=0, max=1)   # random sample from uniform dist
  mean_sample <- mean(sample)             # sample mean
  sem         <- sd(sample)/sqrt(i)       # standard error of the mean
  interval    <- c(mean_sample - Q*sem, mean_sample + Q*sem)   # 95% CI

  # print the results
  cat("Sample size:",i,"\nMean:",mean_sample,
      "\n SEM:",sem,"\n CI:",interval,"\n\n",fill=FALSE)
}
## Sample size: 10
## Mean: 0.4493746
## SEM: 0.09552043
## CI: 0.262158 0.6365912
##
## Sample size: 100
## Mean: 0.5134436
## SEM: 0.02877359
## CI: 0.4570484 0.5698388
##
## Sample size: 1000
## Mean: 0.5111267
## SEM: 0.00914435
## CI: 0.4932041 0.5290493
##
## Sample size: 10000
## Mean: 0.5024249
## SEM: 0.002889351
## CI: 0.4967619 0.5080879
\(\Rightarrow\ \) What do we observe from these comparisons?
Key concepts:

- The sampling distribution of the sample mean approximates a normal distribution and hence has predictable statistical properties.
- The sample mean converges toward the population mean as the sample size increases.
- Correspondingly, the variance of the sample mean is inversely proportional to the sample size (and the SEM to the square root of the sample size).
- The standard error of the mean, which represents the expected variation in the sample mean from sample to sample, can be computed from a single sample of independent observations because of its direct dependence on the sample size.
Whitlock & Schluter: Chapter 4
Aho: