In 1954, Jonas Salk’s vaccine was tested on elementary-school students across the United States and Canada. In the study, 401,974 students were divided randomly into two groups: kids in one group received the vaccine, whereas those in the other group (the control group) were injected with saline solution instead. The students were unaware of which group they were in.
Of those who received the vaccine, 0.016% developed paralytic polio during the study, whereas 0.057% of the control group developed the disease (Brownlee 1955). The vaccine seemed to reduce the rate of disease by two-thirds, but the difference between groups was quite small, only about 4 cases per 10,000.
Did the vaccine work, or did such a small difference arise purely by chance?
Before we address this question, let’s review the basics of hypothesis testing and work through a simple example.
To answer this question we need to formulate a null hypothesis, often denoted \(H_o\), and an alternative hypothesis, \(H_A\).
If the probability of observing our data under \(H_o\) is very low, then we can reject \(H_o\) and accept the alternative hypothesis \(H_A\).
Since we know that random samples display variation, this question becomes, “What is the probability that a difference at least as great as the one observed would occur just by chance, if \(H_o\) is true?”
There are two kinds of tests we can perform to compare \(H_o\) and \(H_A\):
A test statistic is a value based on the observed data that is compared to the null distribution to see how consistent the data are with expectation under \(H_o\).
To figure out how likely our observed data are under \(H_o\), we need to know something about the null distribution, i.e. the frequency of occurrence of all possible outcomes that WOULD be consistent with the null hypothesis.
Therefore we want to compare our results with the \(H_o\) distribution and see how often we would see a difference from the null expectation that is at least as great as the one we observed, just by chance.
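The idea of comparing an observed result to a null distribution can be sketched with a toy simulation in R (the specific numbers here, 20 coin flips repeated 10,000 times, are illustrative assumptions, not from any study in these notes):

```r
set.seed(42)  # for reproducibility

# simulate the null distribution for a toy experiment:
# the number of heads in 20 flips of a fair coin, repeated 10,000 times
null.dist <- rbinom(n = 10000, size = 20, prob = 0.5)

# how often, just by chance, do we see a result at least as extreme
# as 15 heads (or, symmetrically, 5 or fewer)?
mean(null.dist >= 15 | null.dist <= 5)
```

The simulated proportion should be close to the exact two-tailed probability, `2 * pbinom(5, 20, 0.5)`, which is about 0.04.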
Bisazza et al. (1996) tested the possibility of handedness in European toads, Bufo bufo, by sampling and measuring 18 toads from the wild. We will assume that this was a random sample. It was found that individual toads tended to use one forelimb more than the other. Of the 18 toads tested, 14 were right-handed and 4 were left-handed.
At this point the question became: do right-handed and left-handed toads occur with equal frequency in the toad population, or is one type more frequent than the other, as in the human population? Are these results evidence of a predominance of one type of handedness in toads?
Our null and alternative hypotheses are:
\(H_o\): Left- and right-handed toads are equally frequent in the population.
\(H_A\): Left- and right-handed toads are not equally frequent in the population.
The null expectation (no handedness) is that, out of 18 individuals sampled, we should observe 9 right-handed and 9 left-handed individuals.
The test statistic is 14, the number of right-handed individuals.
We will perform a two-sided test, since we are looking for a difference in either direction.
The null distribution and observed outcome for the handedness problem are illustrated below:
The null distribution above assumes that the probability of right- or left-handed toads is 0.5, and so we would expect most often to see 9 right-handed toads in any sample of 18 individuals.
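We can compute this null distribution exactly in R with dbinom(), the binomial probability mass function:

```r
# exact null distribution for a sample of 18 toads under H_o (p = 0.5)
k <- 0:18
null.probs <- dbinom(k, size = 18, prob = 0.5)

# the most probable outcome under H_o is indeed 9 right-handed toads
k[which.max(null.probs)]  # 9

# probability of seeing exactly 9 right-handed toads in a sample of 18
dbinom(9, size = 18, prob = 0.5)
```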
This distribution is an example of a binomial distribution with \(p=0.5\) (we will discuss the binomial in great detail in the next class). Therefore, we can use R’s function for the CDF of a binomial distribution, pbinom(), to calculate the total probability of these two outcomes.
Since we will perform a two-tailed test, we are interested in the probability of seeing \(\ge 14\) OR \(\le 4\) right-handed toads:
# cdf of total probability up to and including 4
pbinom(q = 4, size = 18, prob = 0.5)
## [1] 0.01544189
# cdf of greater than 13 (same as >= 14)
pbinom(13, 18, 0.5, lower.tail = FALSE)
## [1] 0.01544189
# alternative method for CDF >= 14
1 - pbinom(13, 18, 0.5, lower.tail = TRUE)
## [1] 0.01544189
# combined total probability of both tails
pbinom(4, 18, 0.5) + pbinom(13, 18, 0.5, lower.tail = FALSE)
## [1] 0.03088379
The \(p\)-value of our test statistic is 0.031. This is lower than the conventional cutoff of \(p = 0.05\), also known as the significance level, \(\alpha\).
Thus we reject the \(H_o\) that there is no handedness and accept the \(H_A\) that there is a preferential handedness for toads.
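The same two-sided \(p\)-value can be obtained in a single step with R’s built-in exact binomial test, binom.test():

```r
# exact binomial test: 14 right-handed toads out of 18,
# with H_o proportion = 0.5
binom.test(x = 14, n = 18, p = 0.5)
# the reported two-sided p-value matches the tail sum above (about 0.031)
```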
We can formulate the question about vaccine effectiveness in a similar manner as for the handedness example above. Take a few minutes to formulate the null and alternative hypotheses, and then compute a \(p\)-value for the observed data on your own. Assume the study included 200,000 unvaccinated and 200,000 vaccinated children.
What are the null and alternative hypotheses?
\(H_o\): The proportion of children who contracted polio was the same in both groups.
\(H_A\): The proportion of children who contracted polio was lower in the vaccine group than in the control group.
# sample size
n = 200000
# proportion of individuals with polio in each group
p.polio.ctl = 0.00057
p.polio.vac = 0.00016
# number of individuals with polio in each group
num.polio.ctl = p.polio.ctl * n
num.polio.vac = p.polio.vac * n
num.polio.ctl # 114
## [1] 114
num.polio.vac # 32
## [1] 32
# binomial test for unvaccinated group given H_o
#pbinom(num.polio.ctl, n, prob=p.polio.ctl) # control group
pbinom(114, 200000, prob=0.00057) # control group (hard-coded)
## [1] 0.5248817
# binomial test for vaccinated group given H_o
#pbinom(num.polio.vac, n, prob=p.polio.ctl) # vaccinated group
pbinom(32, 200000, prob=0.00057) # vaccinated group (hard-coded)
## [1] 1.059181e-19
Note: if we wanted to allow for the possibility that the vaccine might actually result in more cases than expected, we would combine the lower- and upper-tailed probabilities. The observed count of 32 cases is 82 below the null expectation of 114, so we want the total probability of seeing either \(114-82=32\) or fewer cases, or \(114+82=196\) or more cases:
# total prob of seeing at most 32 or at least 196 cases
pbinom(32, 200000, prob=0.00057) + pbinom(195, 200000, prob=0.00057, lower.tail = FALSE)
## [1] 2.004552e-12
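As an aside that goes beyond the one-sample calculations above (and is not part of the original exercise), the two groups can also be compared directly with R’s two-sample test of equal proportions, prop.test():

```r
# two-sample test of equal proportions: 32 cases out of 200,000
# vaccinated children vs. 114 cases out of 200,000 controls
prop.test(x = c(32, 114), n = c(200000, 200000))
# the tiny p-value again argues strongly against H_o
```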
There are two basic kinds of errors we can make in hypothesis testing: a Type I error, in which we reject \(H_o\) even though it is true (a false positive), and a Type II error, in which we fail to reject \(H_o\) even though \(H_A\) is true (a false negative).
This is also illustrated in the table below:
Type I error (False positives)
The number \(\alpha\) is an arbitrary threshold at which we call an observed difference significant.
One can try to reduce the Type I error simply by reducing \(\alpha\), for example setting \(\alpha=0.01\).
However, this makes the Type II error worse! If the \(\alpha\) threshold is too conservative, more real effects will be missed (false negatives).
On the other hand, if you are too liberal in setting \(\alpha\), you risk increasing false positives (rejecting a true \(H_o\)).
Type II error (False negatives)
The number \(\beta\) represents the proportion of the time that we do NOT reject \(H_o\), even though \(H_A\) is true.
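As a sketch of what \(\beta\) means, suppose (a hypothetical value chosen purely for illustration) that the true proportion of right-handed toads were 0.8. With \(n = 18\) and \(\alpha = 0.05\), the two-tailed rejection region is \(X \le 4\) or \(X \ge 14\) (total probability 0.031 under \(H_o\)), and we can ask how often a real effect of this size would land in it:

```r
# hypothetical true proportion of right-handed toads (an assumption)
p.true <- 0.8

# power: probability of landing in the rejection region (X <= 4 or X >= 14)
power <- pbinom(4, 18, p.true) + pbinom(13, 18, p.true, lower.tail = FALSE)

# beta: probability of a Type II error
beta <- 1 - power
power; beta
```

With these numbers, power comes out to roughly 0.7, so even a real handedness effect of this size would be missed almost a third of the time.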
We will talk more about power when we discuss experimental designs.
Note that a high \(p\)-value does not prove that the null hypothesis is true.
As we have already seen, confidence intervals enable us to estimate a range within which we would expect the true population parameter to fall some specified percentage of the time.
\(p\)-values are closely related to confidence intervals, and the two play complementary roles in statistics: roughly speaking, a 95% confidence interval excludes the null parameter value when the corresponding two-sided test gives \(p < 0.05\), and vice versa.
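For the toad example, binom.test() also reports an exact 95% confidence interval for the proportion, which we can check against the \(p\)-value:

```r
# exact (Clopper-Pearson) 95% CI for the proportion of right-handed toads
ci <- binom.test(x = 14, n = 18, p = 0.5)$conf.int
ci
# the interval excludes 0.5, consistent with the two-sided p-value of 0.031
```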
To investigate whether mothers can identify their own children by smell, Porter and Moore (1981) gave new T-shirts to the children of nine mothers. Each child wore his or her shirt to bed for three consecutive nights. During the day, from waking until bedtime, the shirts were kept in individually sealed plastic bags. No scented soaps or perfumes were used during the study. Each mother was then given the shirt of her child and that of another, randomly chosen child and asked to identify her own by smell. Eight of nine mothers identified their children correctly.
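One way to sketch a calculation for these data (assuming, as a null hypothesis, that each mother has a 1/2 chance of picking the correct shirt by guessing):

```r
# sketch: probability of 8 or more correct identifications out of 9,
# if each mother were simply guessing (H_o: p = 0.5)
pbinom(7, 9, 0.5, lower.tail = FALSE)  # 10/512, about 0.02
```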
In the exercise we will examine the following questions:
The examples used here are from Whitlock, Michael C. & Schluter, Dolph, The Analysis of Biological Data (Chapter 6).