I. Review: Summary statistics

Population parameters

Location / central tendency: mean

Measures of location describe the most “representative” value in a population.

  • The mean is the expected value of the population: the arithmetic average of all individuals in it.

\[E(X) = \mu = {\frac{1}{N}}\sum{x_i} \]

Scale / dispersion: variance and SD

Measures of scale describe how far away from the mean most of the individuals in the population are.

  • The variance of the population is the sum of squared differences between each individual and the population mean, divided by the total number of individuals in the population.

\[Var(X) = \sigma^2 = \frac{1}{N}\sum(x_i - \mu)^2\]

  • The standard deviation (SD), \(\sigma\), is simply the square root of the variance.
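To make these definitions concrete, here is a minimal R sketch using a small made-up vector that we treat as a complete population (note that R's built-in var() and sd() divide by \(n-1\), so we compute the population versions by hand):

# population mean, variance, and SD from first principles
# (a toy "population" of 8 values, made up for illustration)
x = c(2, 4, 4, 4, 5, 5, 7, 9)
N = length(x)
mu = sum(x) / N               # population mean: 5
sigma2 = sum((x - mu)^2) / N  # population variance: 4
sigma = sqrt(sigma2)          # population SD: 2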

\(\Rightarrow\) Question: Why is SD a more intuitive measure of dispersion than variance?

Answer
  • Because it is in the same units as the original measurements.

Sample mean and variance

  • The sample mean \(\bar{X}\) and variance \(s^2\) describe the sample distribution of an unbiased random sample of individuals taken from a population.

\[\bar{X} = \frac{\sum{x_i}}{n}, \ \ s^2 = \frac{\sum(x_i - \bar{X})^2}{n-1}\] These statistics are estimators of the true (usually unknown) population parameters.

\(\Rightarrow\) Q: Why does the sample variance have \(n-1\) degrees of freedom?

Answer
  • We have only \(n-1\) independent values in the calculation, since we use up one degree of freedom in calculating the sample mean, \(\bar{X}\).
  • Because of this, dividing by \(n\) instead of \(n-1\) would underestimate the true variation in the population.
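We can see this bias empirically with a short simulation (a sketch using a standard normal population, so the true variance is 1):

# compare dividing by n vs. n-1 across many random samples
v = replicate(10000, {
  x = rnorm(10)                           # sample of size n = 10
  c(biased = sum((x - mean(x))^2) / 10,   # divide by n
    unbiased = var(x))                    # var() divides by n-1
})
rowMeans(v)   # biased averages ~0.9 (too low); unbiased averages ~1.0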

Robust estimators

The median and inter-quartile range (IQR) are alternative measures of location and scale that are robust to outliers.

  • The median is the middle value of a dataset: 50% of the datapoints are below and 50% are above this value.
  • The IQR represents the range of data between the 25th and the 75th percentiles.
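A quick sketch of this robustness, using a small made-up dataset with one extreme outlier added:

# the median and IQR barely move when an outlier is added; the mean and SD do
x = c(4, 5, 5, 6, 6, 7, 7, 8)
x.out = c(x, 100)               # same data plus one extreme outlier
c(mean(x), mean(x.out))         # mean is pulled way up
c(sd(x), sd(x.out))             # SD inflates dramatically
c(median(x), median(x.out))     # median is nearly unchanged
c(IQR(x), IQR(x.out))           # IQR is nearly unchanged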

IQR vs. SD

IQR and SD for a normal distribution (from Wikipedia)

\(\Rightarrow\) Q: What are the rules of thumb for the IQR?

Answer
  • The central 50% of the data are within the IQR.
  • Around 99% of the data are within the box-plot whisker range, i.e. the IQR extended by 1.5*IQR on either side.

\(\Rightarrow\) Q: What are the rules of thumb for the SD?

Answer
  • Around 2/3 of the data are within 1 SD of the mean.
  • Around 95% of the data are within 2 SD.
  • Around 99% of the data are within 2.5 SD.
  • Around 99.7% of the data are within 3 SD.
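We can check both sets of rules of thumb against the exact values for a standard normal distribution using pnorm() and qnorm():

# fraction of a normal distribution within k SD of the mean
k = c(1, 2, 2.5, 3)
round(pnorm(k) - pnorm(-k), 4)   # 0.6827 0.9545 0.9876 0.9973

# fraction within the box-plot whisker range [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q = qnorm(c(0.25, 0.75))         # quartiles of the standard normal
iqr = diff(q)                    # the IQR is ~1.35 SD
round(pnorm(q[2] + 1.5*iqr) - pnorm(q[1] - 1.5*iqr), 4)   # ~0.993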

II. Random sampling

We’ve talked a lot about why taking random samples is important for obtaining representative estimates of population parameters, and about what to watch out for in order to minimize sample bias. (What are the possible sources of sample bias?)

\(\Rightarrow\) Q: What effects does sampling error have on our sample estimates?

Answer
  • Sampling error, due to bias or other sources of uncertainty, can affect both the accuracy and the precision of sample estimates.

Distribution of a random sample

Once we have taken a (hopefully unbiased) sample from a population, we can look at the distribution of our measurements for each individual in the sample.

For example, let’s simulate the height distribution of everyone who was in this class in each of the last four years, by taking four random samples of 17 people (assuming biologists represent an unbiased sample of the human population!).

# ================================================================ #
# packages used below
library(dplyr)    # group_by, summarise, %>%
library(ggplot2)  # plotting
library(ggpubr)   # ggarrange, annotate_figure

# number of samples and sample size
sample_size = 17    # sample size
sample_num  = 4     # number of samples
sample_mean = 169   # population mean height (cm)
sample_sd   = 14    # population SD (cm)

# ================================================================ #
# matrix of random samples (replicate returns one column per sample)
sample_data = replicate(sample_num,
                        rnorm(sample_size, mean = sample_mean, sd = sample_sd))
colnames(sample_data) = paste0("s", 1:sample_num)  # label the columns

# reshape to long format: one row per individual measurement
height_long = stack(data.frame(sample_data))
names(height_long) = c("Height_cm","Sample")

# ================================================================ #
# make a second data frame holding height means for each sample
height_means = height_long %>% 
  group_by(Sample) %>%
  summarise(mean_height = mean(Height_cm))
# height_means

# ================================================================ #
# plots
# helper: histogram of one sample's heights, with a line at the sample mean
height_hist = function(s) {
  ggplot(height_long %>% filter(Sample == s), aes(x=Height_cm)) +
    geom_histogram(fill="lightblue", color="darkgray", binwidth = 2) +
    geom_vline(data = height_means %>% filter(Sample == s),
               aes(xintercept=mean_height)) +
    xlab("Height (cm)")
}

# one plot per sample
p1 = height_hist("s1")
p2 = height_hist("s2")
p3 = height_hist("s3")
p4 = height_hist("s4")

composite_plot = ggarrange(p1,p2,p3,p4, nrow=2, ncol=2 )
annotate_figure(composite_plot, top = text_grob("Class heights, 2018-2021", 
               face = "bold", size = 14))

Variation among samples

Each time we take a new sample, there will be some random variation in the sample mean, which will usually not match the population mean precisely.

Let’s take another look at how much the average height of our class varied over the last four years. We will use a density plot rather than a histogram to make the patterns come out better.

# ================================================================ #
ggplot(height_long, aes(x=Height_cm, fill=Sample, color=Sample)) +
  geom_density(alpha=0.2) +
  geom_vline(data = height_means, aes(xintercept=mean_height, color=Sample)) +
  xlab("Height (cm)")

Variability of sample means

Since the mean height of each class will always be a little different, what can we do to figure out how well our sample estimates represent the true population parameters?

Is there a way for us to know which mean estimate is closest to the true population average? And how much variation might we expect to see in the population?

Let’s say we know that the average height of humans around the world is 169 cm. If we take many, many samples of 17 people (the same sample size as above) from the same population, and record the mean value from each of these, we can plot the distribution of these sample means:

x = replicate(10000, {
  rsample = rnorm(sample_size, mean = sample_mean, sd = sample_sd)
  mean(rsample)
  })
mean_height = data.frame(Mean_Height = x)
ggplot(mean_height, aes(x=Mean_Height)) +
  geom_histogram(fill="lightblue", color="darkgray") +
  geom_vline(data = mean_height, aes(xintercept=mean(Mean_Height))) +
  labs(title = "Distribution of 10,000 sample means", x="Mean height (cm)")

Now the average value of all of these sample means looks pretty spot on!

However, the variation in our sample means is still rather large, so we might not have that much confidence in the precision of our estimate from any single sample.

III. Sampling distributions of sample estimators

The sampling distribution of a sample estimator for a population parameter represents the distribution of all possible values for the sample statistic derived from a particular sample size, given an infinite number of samples drawn from the same population.

  • Sampling distributions can be computed for all kinds of sample statistics, and give us an idea of how closely we can expect our sample statistics to represent the true population parameters.

Fortunately, statistics for sampling distributions can be computed from a single sample - so you don’t actually need to take numerous samples of \(N\) individuals (or independent measurements from a population) to get a good idea of how precise your sample estimates are! Whew.

Below we will illustrate empirically that this is the case.

Sampling distribution of the sample mean

One of the most important sampling distributions is the sampling distribution of the sample mean, \(\bar{X}\). When we talk about the distribution of sample means, the sample mean is now our random variable!

Let’s just let that sink in: the sample mean is itself a random variable, and its distribution across repeated samples is what we call the sampling distribution of the sample mean.

\(\Rightarrow\) Q: Why is the sampling distribution of \(\bar{X}\) of particular interest to us?

Answer
  • Because it allows us to determine how well our estimate represents the true central tendency of the population from which our samples are drawn, i.e. the precision of our estimate.

\(\Rightarrow\) Q: How do we quantify the precision of \(\bar{X}\)?

Answer
  • We use the distribution to compute the variation in \(\bar{X}\), which gives us an estimate of its precision.

\(\Rightarrow\) Q: How does knowing the precision of \(\bar{X}\) help us quantify our confidence that our sample estimate reflects the true population mean?

Answer
  • It allows us to make an educated guess about the range of values in which we expect the true population mean to be found.
  • This is called a confidence interval.

\(\bar{X}\) follows a normal distribution

With sufficiently large sample size \(N\), the distribution of the sample mean \(\bar{X}\) will converge on the population mean \(\mu\), with variance \(\sigma^2/N\). Obviously, when \(N\) equals the entire population, the sample mean will exactly equal the population mean!

In the notation for distributions (which we will learn more about soon) we can describe the distribution of \(\bar{X}\) as follows:

\[\bar{X} \sim \mathcal{N}(\mu,\frac{\sigma^2}{N})\]

In words, this equation says that: “The sample mean X bar follows a normal distribution with mean mu and variance equal to sigma squared divided by the sample size N.”

  • This means that if you take a whole bunch of random samples from a population, then the distribution of the sample means will look pretty much like a bell curve.
  • Moreover, since the variation in \(\bar{X}\) is inversely proportional to \(N\), the amount of uncertainty in your estimator will shrink as the sample size gets bigger.

This point is worth repeating: “The uncertainty in \(\bar{X}\) decreases with increasing sample size.”

  • For example, no matter how many samples of size 20 you take, your uncertainty in \(\bar{X}\) will always be greater than if you took just one sample of 100 individuals. If you are still skeptical about this, read on!

IV. Standard Error of the Mean (SEM)

The standard deviation of a sample statistic is called its standard error.

  • For the sample mean \(\bar{X}\), the standard error is called the standard error of the mean and is abbreviated as SEM, or often simply as SE.

The SEM provides a measure of how much the variable \(\bar{X}\) is expected to differ from sample to sample, i.e. the precision with which we are able to estimate the true population mean \(\mu\).

As noted above, the variance of \(\bar{X}\) is dependent on the sample size \(N\), and is equal to the population variance \(\sigma^2\) divided by \(N\):

\[Var(\bar{X}) = \frac{\sigma^2}{N} \] The SEM is simply the square root of \(Var(\bar{X})\):

\[\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{N}}\]

Estimating the SEM using sample data

If our sample included the entire population, we would know the population parameters exactly. In practice, however, we usually do not have access to the entire population.

Instead, we can use the sample standard deviation, \(s\), as an estimate for the population parameter \(\sigma\):

\[ SE_{\bar{X}} = \frac{s}{\sqrt{N}} \approx \frac{\sigma}{\sqrt{N}} \]

This also allows us to approximate the SD of the sampling distribution of \(\bar{X}\) (the SEM) using the SD of a single sample.

Illustration

Let’s compare the mean and SD of two sampling distributions of different sample sizes.

# distribution of x-bar for sample of size 20
x20 = replicate(10000, {
  rsample = rnorm(n=20, mean=169, sd=14)
  mean(rsample)
  })
mean_height20 = data.frame(Mean_Height = x20)

# distribution of x-bar for sample of size 100
x100 = replicate(10000, {
  rsample = rnorm(n=100, mean=169, sd=14)
  mean(rsample)
  })
mean_height100 = data.frame(Mean_Height = x100)

# plot for x20
hist20 = ggplot(mean_height20, aes(x=Mean_Height)) +
  geom_histogram(fill="lightblue", color="darkgray") +
  geom_vline(aes(xintercept=mean(Mean_Height))) +
  labs(title = "Sample size: 20", x="Mean height (cm)") +
  xlim(c(154,184))

# plot for x100
hist100 = ggplot(mean_height100, aes(x=Mean_Height)) +
  geom_histogram(fill="lightblue", color="darkgray") +
  geom_vline(aes(xintercept=mean(Mean_Height))) +
  labs(title = "Sample size: 100", x="Mean height (cm)") +
  xlim(c(154,184))

composite_hist = ggarrange(hist20, hist100, nrow=1, ncol=2)
annotate_figure(composite_hist, 
  top = text_grob("Sampling distribution of the sample mean (10,000 samples)", 
  face = "bold", size = 12))

Now let’s compare the SD of each sampling distribution to the SEM computed from a single sample:

# sd of x20 vs. sem from a single sample of size 20
x = rnorm(20, 169, 14)
sem20 = sd(x)/sqrt(20)
sd(x20)
## [1] 3.11395
sem20
## [1] 2.290034

# sd of x100 vs. sem from a single sample of size 100
y = rnorm(100, 169, 14)
sem100 = sd(y)/sqrt(100)
sd(x100)
## [1] 1.397376
sem100
## [1] 1.024134

These are not exactly the same because they are different random samples, but they are pretty close! And using either method, we can clearly see that the SEM is much smaller for the larger sample size.

V. Confidence Intervals

A Confidence Interval (CI) gives a range of values within which we would expect the true population parameter to fall most of the time. Confidence intervals can be calculated for all kinds of sample statistics!

  • The most common CI you will encounter is the 95% CI of the mean, which gives us an estimate for a range of values within which the true population mean will fall 95% of the time.
  • In other words, 95 out of 100 confidence intervals based on independent random samples will contain the true population mean.

\(\Rightarrow\) Q: What is the most common mistake people make in interpreting confidence intervals?

Answer
  • People often assume that any particular 95% CI has a “95% chance of containing the true population mean”. But this is not the case!
  • Any given CI either does or does not contain the true population mean; the 95% refers to the long-run proportion of such intervals, across many random samples, that do.

This animation contains a very intuitive visualization of what a CI of a sample is. Take a look at this and play around with changing the parameters.

\(\Rightarrow\) Q: How do the confidence intervals change with increasing sample size?

Answer
  • The width of the CI decreases with increasing sample size.
  • This is because the variation in the mean for small sample sizes is a lot bigger!
  • So, a larger sample will be more representative of the larger population, and will provide a more accurate estimate of the true population parameter.

\(\Rightarrow\) Q: How do you think the sample mean affects the range of a confidence interval?

Answer
  • The sample mean is NOT related to sample variance.
  • So, the widths of CIs will vary a lot from sample to sample due to random variation…
  • … meaning that narrow CIs are not necessarily more accurate, as they may still be far away from the true mean.

95% CI of the mean

Since the sampling distribution of the sample mean approximates a normal distribution, we can easily deduce that the 95% CI is approximately equal to the sample mean plus or minus two times the SEM:

\[95\%\ CI \approx \bar{X} \pm 2*SE_\bar{X}\] To be more precise, we can use a \(z\)-score of 1.96 instead of 2 for the limits of the range, which more closely demarcates the central 95% of a normal distribution.

For a 95% CI, we expect the true population mean to fall somewhere within this range:

\[ \bar{X} - 1.96 * SE_\bar{X} < \mu < \bar{X} + 1.96 * SE_\bar{X} \]

\(\Rightarrow\) Q: Why does it make sense that the 95% CI spans \(\pm\) 2 times the SEM?

Answer
  • Because ~95% of the probability density of a normal distribution lies within two SD of the mean.

\(\Rightarrow\) Q: How many 95% CI’s computed from 100 random samples of height measurements in the human population will contain the true population mean?

Answer
  • We expect the 95% CIs for about 95 of the 100 samples, on average, to contain the true population mean.
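We can check this claim with a quick simulation (a sketch reusing the simulated height population from above: mean 169, SD 14, samples of size 20):

# how many of 100 intervals X-bar +/- 1.96*SEM contain the true mean?
covered = replicate(100, {
  x = rnorm(20, mean = 169, sd = 14)
  ci = mean(x) + c(-1, 1) * 1.96 * sd(x) / sqrt(20)
  ci[1] < 169 & 169 < ci[2]
})
sum(covered)   # typically ~95 (often slightly fewer, since z rather than t is used for n = 20)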

The SE and CI depend on sample size

Both the SE and the CI are heavily dependent on sample size. Consequently, this should be an important consideration in designing experiments, which we will get to later in the course.

Smaller samples yield a larger SEM and a wider CI, so estimates based on them will always be less precise than estimates based on larger samples.

Connection between the CI and the \(p\)-value

In hypothesis testing, which we will discuss next week, we choose a significance threshold like \(p=0.05\) to reject the null hypothesis that our sample comes from the null distribution.

Correspondingly, if the 95% CI does not contain the mean for the null hypothesis, then the \(p\)-value for our sample statistic is less than 0.05. We will discuss this in a lot more detail in coming weeks.

VI. The Central Limit Theorem

The CLT is one of the most fundamental concepts in statistics and encapsulates the issues discussed above.

Briefly, it says that as the sample size increases, a sample statistic will converge on the true population parameter, and its sampling distribution will be approximately normal.

  • Most commonly, the CLT is applied to the sampling distribution of the sample mean.

The CLT highlights two important properties of independent and identically distributed (iid) random variables:

  • The variation in the distribution of a sample statistic will be inversely proportional to the sample size \(N\).
    • This means that our confidence in that value will increase because the variation among the sample estimates will get smaller.
  • Moreover, the distribution of the sample statistic will follow a normal distribution centered around the population mean.
    • In other words, repeated estimates of your sample statistic will show a normal distribution, even if the underlying data distribution is not normal.
  • Side note: It turns out that the normal distribution has maximum entropy among all distributions with a given mean and variance. Loosely speaking, it encodes no structure beyond the mean and variance themselves, i.e. individual samples carry no information about each other, as you would expect for i.i.d. random variables.
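To see the second point in action, here is a quick sketch using a strongly skewed (exponential) distribution, chosen here purely for illustration: the raw data are far from normal, but the sample means still form a bell curve.

# 10,000 sample means (n = 30) drawn from an exponential distribution
x.bar = replicate(10000, mean(rexp(30, rate = 1)))
ggplot(data.frame(Mean = x.bar), aes(x = Mean)) +
  geom_histogram(fill = "lightblue", color = "darkgray", bins = 50) +
  labs(title = "CLT: means of skewed samples are approximately normal",
       x = "Sample mean")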

Since the SEM decreases as the square root of the sample size, in the limit our uncertainty in the sample mean \(\bar{X}\) will go to zero as \(N\) goes to infinity (and therefore \(\bar{X}\) will exactly equal the true population mean \(\mu\)):

\[\lim_{N \rightarrow \infty}SEM = \lim_{N \rightarrow \infty}\frac{\sigma}{\sqrt{N}} = 0\]

Law of large numbers

This property, that the standard error of the mean approaches zero as the sample size increases, is known in statistics books as the law of large numbers or the law of averages.
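A quick sketch of the law of large numbers in action: the running mean of a growing sample settles onto the population mean (parameters reused from the height example above).

# running mean of up to 10,000 draws from N(169, 14)
draws = rnorm(10000, mean = 169, sd = 14)
running.mean = cumsum(draws) / seq_along(draws)
ggplot(data.frame(N = seq_along(draws), Mean = running.mean), aes(x = N, y = Mean)) +
  geom_line() +
  geom_hline(yintercept = 169, linetype = "dashed", color = "red") +
  labs(title = "Law of large numbers: running mean vs. sample size",
       x = "Sample size (N)", y = "Running mean (cm)")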

VII. Example

A. Uniform distribution

Here we illustrate the concepts we have just discussed by sampling from a uniform distribution, where the probability of any outcome is the same.

\(\Rightarrow\) Q: Can you think of any cases where the sampling distribution is uniform?

Some examples
  • The chance of a fair die landing on any of its faces
  • The chance of winning a lottery ticket
  • The month of the year in which anyone in the Biology Department was born
  • The time to the next cell division of a single yeast cell in log phase, from the moment you first look at it
  • The spatial distribution of territorial animals, or of plants in strong competition for water, light, or nutrients

Regardless of the specific uniform distribution we sample from, the lesson is the same.

Random sampling

Below we sample from a continuous uniform distribution, in which all possible real values in an interval are represented with equal probability.

\(\Rightarrow\) Q: Which of the above scenarios are examples of a continuous uniform distribution?

Answer
  • The time to wait for log-phase cells to divide
  • The spatial distribution of territorial animals or plants in competition for resources

  • To get a sample, all we have to do is just pick a bunch of random numbers between 0 and 1.
  • We will then compare random samples of different sizes.

To help us visualize the results, we can write a function that takes a sample size and returns a ggplot object. Then, we can just call the function a bunch of times and draw the plots.

# a function to draw a histogram for a sample of size 'size' between zero and one
# the function takes the desired sample size and returns a ggplot object
runif.hist = function(size){
  return(
    ggplot(data.frame(trial = rep(1:size), 
                      value = runif(size, 0, 1)), 
         aes(x=value)) +
    geom_histogram(binwidth=.1, boundary=0,
                   fill="lightseagreen", color="black") +
    geom_vline(aes(xintercept = mean(value)), color="red") +
    ggtitle(paste("Sample size:", size, sep=" ")) +
    theme_classic() #+
#    theme(plot.title = element_text(size = 12))
  )
}

# test out drawing samples of several different sizes
ggarrange(runif.hist(10),runif.hist(100),
          runif.hist(1000),runif.hist(10000), 
          nrow=2, ncol=2)

If we repeated this exercise over and over, we would see exactly what we already expect:

  • each sample is slightly different
  • smaller samples show greater variation in the sample mean than larger samples
  • larger samples more closely approximate the ideal distribution

We can also make an interactive plot (using RShiny) and experiment with changing the sample size on the fly [see the Shiny version of this document for interactive plot].

B. Sampling distribution of the sample mean

So far, we’ve just looked at the distributions of individual samples of a particular size drawn from a uniform continuous distribution.

Now let’s look at the mean of multiple samples. This will give us some idea of how precisely we can estimate the population mean, given a particular sample size.

  • For example, this could be the average wait time we would measure for a random cell to divide under the microscope if we looked at tens, hundreds, or thousands of cells.

Of course, the true mean of a continuous uniform distribution is the midpoint of its interval, \((min + max)/2\) (here, that would be 0.5).

Histograms

  • To get a better feel for how much the sample mean varies from sample to sample, we can use the replicate() function to get the sample mean for each of 100 samples of size 10, and then visualize the results as a histogram:
# Sample means for each of n samples of 10 observations
n.samples = 100
sample.size = 10
sample.means = replicate( n.samples, mean( runif(sample.size, min=0, max=1) ) )
#sample.means

# make a histogram of the results
ggplot(data.frame(sample.name = (1:n.samples), 
                  sample.mean = sample.means), 
       aes(x=sample.mean)) +
  geom_histogram(binwidth=0.05, boundary=0,
                 fill="rosybrown", color="black") +
  xlim(0,1) +
  ggtitle(paste(n.samples,"sample means for random samples of size",sample.size,sep=" "))

We can see from this that there is quite a lot of variation in the sample means!

Now let’s turn this into a function, which will make it a lot easier to visualize our results for different numbers and sizes of samples.

  • At the same time, let’s also put some vertical lines on the plots to show the mean and SD of the distribution.
# create a function that takes two parameters (with defaults):
#   n.samples = number of times to replicate the sampling
#   sample.size = number of observations for each sample
# return a ggplot histogram object
mean.runif.hist = function(n.samples=100, sample.size=10) {

  # generate n.samples and compute the sample mean for each sample
  x.bar = replicate(n.samples, mean(runif(sample.size, min=0, max=1)))
  sample.means = data.frame(sample.name = 1:n.samples,
                            sample.mean = x.bar )
  
  # plot the distribution of sample means
  ggplot(sample.means, aes(x=sample.mean)) +
    geom_histogram(binwidth=0.02, fill="indianred1", color="black", alpha=0.5) +
    xlim(0,1) +
    
    # below is a trick to limit the number of significant digits
    # displayed for the mean and SD (2 for 100, 3 for 1000, etc.)
    ggtitle(paste("n=",n.samples,", size=",sample.size,"\n",
                  "(mean=", signif(mean(x.bar), log10(n.samples)),
                  ", sd=", signif(sd(x.bar), log10(n.samples)),")",
                  sep="")) +
    
    theme(text=element_text(size=9)) +
    
    # draw vlines for the mean and SD of the sample means
    geom_vline(aes(xintercept=mean(x.bar)), color="turquoise1", size=1) +
    geom_vline(aes(xintercept=mean(x.bar) + sd(x.bar)), 
               color="blue", linetype="dotted", size=0.5) +
    geom_vline(aes(xintercept=mean(x.bar) - sd(x.bar)), 
               color="blue", linetype="dotted", size=0.5)
}
Varying sample size vs. number of samples

Below we experiment with how the distribution of the sample means changes as the \(n\) and \(size\) parameters vary, using different combinations of \(n\) and \(size\) across several orders of magnitude (e.g. 10, 100, 1000).

  • First, we hold the sample.size constant at 10 (top) or 100 (bottom) and vary n.samples from 10-1000:
# vary the number of samples ('n.samples') for fixed sample size ('sample.size')
mean_plots = ggarrange(
  
  # size = 10
  mean.runif.hist(n.samples=10,sample.size=10),
  mean.runif.hist(n.samples=100,sample.size=10),
  mean.runif.hist(n.samples=1000,sample.size=10),
  
  # size = 100
  mean.runif.hist(n.samples=10,sample.size=100),
  mean.runif.hist(n.samples=100,sample.size=100),
  mean.runif.hist(n.samples=1000,sample.size=100), 
  
  nrow=2, ncol=3 )

  annotate_figure(mean_plots, top = text_grob("Sample means drawn from a uniform distribution", 
                                        face = "bold", size = 14))


  • What happens if we instead hold the number of samples n.samples constant at 10 or 100, and then plot the means of samples of different sizes (e.g. 10, 100, or 1000)?

# vary the sample size ('sample.size') for fixed number of samples ('n.samples')
mean_plots = ggarrange(
  
  # n = 10
  mean.runif.hist(n.samples=10,sample.size=10),
  mean.runif.hist(n.samples=10,sample.size=100),
  mean.runif.hist(n.samples=10,sample.size=1000),
  
  # n = 100
  mean.runif.hist(n.samples=100,sample.size=10),
  mean.runif.hist(n.samples=100,sample.size=100),
  mean.runif.hist(n.samples=100,sample.size=1000),
  
  nrow=2, ncol=3 )

annotate_figure(mean_plots, top = text_grob("Sample means drawn from a uniform distribution", 
                                        face = "bold", size = 14))

Interactive plot

We can use this function to make an interactive histogram of the sampling distribution of sample means and experiment with how the plot changes when we vary the \(n\) and \(size\) parameters across several orders of magnitude [see the Shiny version of this document for interactive plot].

\(\Rightarrow\) Q: What happened when you kept the sample size the same, but increased the number of sample sets?

Answer
  • Increasing the number of sample sets doesn’t make that much difference.
  • The distribution of \(\bar{X}\) does not change very much, and the SEM stays about the same.

\(\Rightarrow\ \) Q: How did the sample distribution change with increasing sample size when you held the number of samples constant?

Answer
  • Increasing the sample size narrows the distribution of \(\bar{X}\) considerably - the SEM becomes smaller and smaller.
  • So, the bigger the sample, the more precise the estimate of the population mean. Eventually it will converge on the true population mean.

Box plots

We can also use box plots to summarize these distributions, which make it a little easier to compare them visually.

Varying sample size vs. number of samples
# helper: return the means of 'n.samples' samples of size 'sample.size'
sample.means.runif = function(n.samples, sample.size) {
  replicate(n.samples, mean(runif(sample.size, min=0, max=1)))
}

# make 100 samples of different sizes
a <- sample.means.runif(100,10)
b <- sample.means.runif(100,100)
c <- sample.means.runif(100,1000)
d <- sample.means.runif(100,10000)

boxplot(d,c,b,a, 
        horizontal=TRUE, ylim=c(0.25,0.75), range=1, notch=T,
        names=c("10","100","1k","10k"),
        xlab = "Distribution of sample means",
        ylab = "Sample Size",
        main="Distribution of sample means for \n 100 samples of increasing sample size"
        )

Interactive plot

We can also make an interactive boxplot to visualize how varying the sample sizes for 100 samples changes the distribution of the sample means [see the Shiny version of this document for interactive plot].

The following table summarizes these results.

Sample Size   Range of Sample Means   Mean of Sample Means     SEM
          1          0-1                       NA             ~0.3
         10        0.2-0.8                    ~0.5            ~0.1
        100        0.4-0.6                    ~0.5            ~0.03
       1000       0.47-0.53                   0.50            ~0.01
      10000       0.49-0.51                   0.50            ~0.003

These tests show empirically that we need a 100-fold increase in the sample size in order to get a 10-fold decrease in the SEM.

So, the SEM indeed decreases as the square root of the sample size.

  • Also note that the SEM computed with the sample SD approximates the SEM computed using the true population SD for the larger sample sizes (n = 100, 1000, 10000).
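As a final check, we can compare the empirical SD of the sample means to the theoretical SEM \(\sigma/\sqrt{N}\); for a Uniform(0,1) distribution, \(\sigma = 1/\sqrt{12} \approx 0.289\):

# empirical vs. theoretical SEM for samples from Uniform(0,1)
for (N in c(10, 100, 1000, 10000)) {
  x.bar = replicate(1000, mean(runif(N)))
  cat("N =", format(N, width=5),
      " empirical SEM:", signif(sd(x.bar), 2),
      " theoretical SEM:", signif(1/sqrt(12*N), 2), "\n")
}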

C. Confidence Intervals

It is very rare that we know the true population parameters. We can report our uncertainty about how well a random variable estimates a population parameter using a confidence interval (CI).

We expect that a 95% CI of the mean will contain the true population mean 95% of the time. It is typical to see 90%, 95%, and 99% confidence intervals.

How do we calculate the CI?

Recall that:

  • The sampling distribution of the sample mean is normally distributed, and
  • 95% of the probability density of any normal distribution falls within around 2 SD of the mean.

IQR and SD for a normal distribution (from Wikipedia)

  • Therefore, our random variable, \(\bar{X}\), should be contained within around two standard deviations of the true mean 95% of the time (even though every once in a while it will be rather far off because we are taking random samples).

    • Note that since our sample estimate is a random variable, the edges of the interval are also random. Any particular CI either does or does not contain the true population mean, and around 5% of randomly sampled intervals will not contain the mean.

The 95% CI specifies that we expect 95% of random intervals \(\overline{X} \pm \sim 1.96 * SEM\) to contain the true population mean. Since \(SEM = s_x/\sqrt{N}\), combining these gives us:

\[95\%\ CI \approx \overline{X} \pm 1.96\frac{s_x}{\sqrt{N}}\]

So, with the sample mean, SD, and sample size in hand, we are almost ready to find the limits of the 95% CI.

The quantile function

There is one additional piece of information that will help us: the precise z-score (number of SDs away from the mean) that provides the exact boundary for the central 95% of the data (since even 1.96 is an approximation!)

Another way to represent the 95% CI is to write it as the \(100(1-\alpha)\%\) CI, where \(\alpha = 0.05\) is the 5% of the data we want to exclude.

Since we want to split the remaining 5% between the two tails (2.5% at the bottom and 2.5% at the top), we want to find the \(z\)-score corresponding to the 97.5th percentile, or \(1-\alpha/2\), which we can add or subtract from the mean to get the CI.

So the 95% CI becomes:

\[100(1-\alpha)\%\ CI = \bar{X} \pm z_{1-(\alpha/2)}*\frac{\sigma}{\sqrt{N}}\]

where \(z_{1-(\alpha/2)}\) is the \(z\)-quantile function at probability \(1-(\alpha/2)\).

For a 95% CI, \(\alpha = 0.05\). So, to find the correct z-score for the limits of the 95% interval, we want to find the \(z\)-score for the 97.5th percentile.

To do this, we will use the quantile function qnorm(). What we want is the value of qnorm() at the 97.5th percentile, which will be something very close to 1.96:

qnorm(0.975)
## [1] 1.959964

Now we can plug this back into our equation for the 95% CI to get the ranges of the CI’s for different sample sizes.

Exercise

\(\Rightarrow\ \) Calculate the 95% CI for 4 samples ranging in size from 10 to 10,000.

Answer
# use qnorm to find the z-score for the 95% CI (2.5 - 97.5 percentile range)
# Since normal is symmetric, we can add and subtract this to get the CI.
Q <- qnorm(0.975)  # 1.959964...

# compute the mean, SEM, and CI of our samples
for ( i in c(10, 100,1000,10000) ) {
  
  sample <- (runif(i, min=0, max=1)) # random sample from uniform dist
  mean_sample <- mean(sample)        # sample mean
  sem <- sd(sample)/sqrt(i)          # standard error of the mean
  interval <- c(mean_sample - Q*sem, mean_sample + Q*sem)  # 95% CI
  
  # print the results
  cat("Sample size:",i,"\nMean:",mean_sample,
      "\n  SEM:",sem,"\n  CI:",interval,"\n\n",fill=FALSE)
}
## Sample size: 10 
## Mean: 0.4493746 
##   SEM: 0.09552043 
##   CI: 0.262158 0.6365912 
## 
## Sample size: 100 
## Mean: 0.5134436 
##   SEM: 0.02877359 
##   CI: 0.4570484 0.5698388 
## 
## Sample size: 1000 
## Mean: 0.5111267 
##   SEM: 0.00914435 
##   CI: 0.4932041 0.5290493 
## 
## Sample size: 10000 
## Mean: 0.5024249 
##   SEM: 0.002889351 
##   CI: 0.4967619 0.5080879

\(\Rightarrow\ \) What do we observe from these comparisons?

Answer
  • We observe that the width of the 95% CI decreases as the sample size increases. If we repeated each of these samplings 100 times, then on average about 95 of the 100 intervals would contain the true population mean.
  • This is because the SEM, which sets the width of the CI, also gets smaller as the sample size increases.

Summary

Key concepts:

  • The sampling distribution of the sample mean approximates a normal distribution and hence has predictable statistical properties.

  • The sample mean converges toward the population mean as the sample size increases.

  • Correspondingly, the variation in the mean is inversely proportional to the sample size.

  • The standard error of the mean, representing the expected variation in the mean from sample to sample, can be computed from a single sample of independent observations due to its direct dependency on sample size.


References

Whitlock & Schluter: Chapter 4

Aho:

  • Section 3.2.2.2 (Normal distribution)
  • Section 5.2 (Sampling Distributions)
    • 5.2.2 (Sampling Distribution of \(\bar{X}\))
    • 5.2.2.1 (Central Limit Theorem)
  • Section 5.3 (Confidence Intervals)