Pulling yourself up by the straps of your boots

Bootstrapping is a technique to measure the accuracy of any statistical estimate. It relies on random resampling with replacement. The idea behind bootstrapping is very simple. As you know, collecting more samples from the population allows you to obtain better (more accurate) population estimates. However, this is not always possible in practice. So, instead of sampling from the population, you can sample from your existing sample. Is this approach as good as drawing more random samples from the population? Definitely not. Is it still useful though? Yes. Random resampling and re-calculation of the parameter you are estimating for each new sample allows you to obtain a distribution of the values of that parameter in different samples. Estimating the spread of that distribution (e.g. the 2.5%-97.5% percentiles) provides a measure of accuracy / uncertainty. Thus, you are making use of your existing sample to estimate the accuracy of the inferred parameter.

The light data set

We are going to work with data collected by Simon Newcomb in 1882, when he was measuring the speed of light. The data are recorded as deviations from 24,800 nanoseconds, which was the time that light took to pass through a system of mirrors in Newcomb’s experimental device. Take a look at the data and see if anything catches your eye.

data(light)

str(light)
##  num [1:66] 28 26 33 24 34 -44 27 16 40 -2 ...
##  - attr(*, "comment")= chr "Units: deviations from 24800 nanoseconds"
light
##  [1]  28  26  33  24  34 -44  27  16  40  -2  29  22  24  21  25  30  23  29  31
## [20]  19  24  20  36  32  36  28  25  21  28  29  37  25  28  26  30  32  36  26
## [39]  30  22  36  23  27  27  28  27  31  27  26  33  26  32  32  24  39  28  24
## [58]  25  32  25  29  27  28  29  16  23
hist(x = light,
     breaks = 30)

There are two obvious outliers. Dealing with outliers is always tricky. Sometimes, it is possible to find a plausible justification to remove them (as long as you are not removing them simply because they are outliers!) But in many other cases, leaving the outliers in is a more “honest” option. However, leaving the outliers in is going to make our data significantly deviate from normality.

# Shapiro test with the full data set
shapiro.test(light)
## 
##  Shapiro-Wilk normality test
## 
## data:  light
## W = 0.59115, p-value = 2.842e-12
# Shapiro test with outliers removed (for comparison)
light_without_outliers <- light[light>0]
shapiro.test(light_without_outliers)
## 
##  Shapiro-Wilk normality test
## 
## data:  light_without_outliers
## W = 0.98462, p-value = 0.6082

Moreover, if you summarize these data as mean ± SD, you will be describing a distribution that is quite different from the one you actually have - the mean will be about right, but the spread of the observations around the mean will differ a lot from the real data.

# select an arbitrary range of values on the X axis
x_range <- -50:50
# calculate the probability of observing each value given a normal distribution with the same mean and SD as "light"
normal_density <- dnorm(x = x_range,
                        mean = mean(light),
                        sd = sd(light) )

# plot the inferred normal distribution on top of the real one
hist(x = light,
     breaks = 30,
     xlim = c(-50, 50),
     freq = FALSE) # display the proportion of observations on the Y axis instead of frequency (default)
points(x = x_range,
       y = normal_density,
       type = "l")

# do the same for the data set without outliers,
# just to demonstrate that summarizing your data as mean ± SD would be appropriate in that case

normal_density_without_outliers <- dnorm(x = x_range,
                                         mean = mean(light_without_outliers),
                                         sd = sd(light_without_outliers) )

hist(x = light_without_outliers,
     breaks = 10,
     xlim = c(-50, 50),
     freq = FALSE)
points(x = x_range,
       y = normal_density_without_outliers,
       type = "l")

So, what can we do? Perhaps, a more appropriate way to describe the central tendency of the sample that includes the outliers would be to use median instead of mean.

median(light)
## [1] 27

But how would you measure the uncertainty of the median estimate?

Non-parametric bootstrap

Let us perform random sampling with replacement from the sample we have and calculate the median every time. This is the most basic bootstrapping strategy and it is non-parametric because we do not care about the shape of the distribution.

n_bootstraps <- 1000 # pick an arbitrary number of bootstrap replicates, the exact number does not matter too much

bootstrap_medians <- c() # create an empty vector to store medians of each sample
for (i in 1:n_bootstraps){
  # resample with replacement
  sample_i <- sample(x = light,
                     size = length(light),
                     replace = TRUE)
  # calculate the median
  sample_i_median <- median(sample_i)
  # append to "bootstrap_medians"
  bootstrap_medians <- c(bootstrap_medians, sample_i_median)
}

head(bootstrap_medians)
## [1] 26.0 26.0 27.0 27.0 27.5 26.0
tail(bootstrap_medians)
## [1] 27 27 27 28 28 28

How are the medians distributed?

# epdf
hist(bootstrap_medians)

# ecdf
plot(ecdf(bootstrap_medians))

# median of all medians is actually the same as the median of the sample
median(bootstrap_medians)
## [1] 27
median(light)
## [1] 27

As you see, depending on what sample you take, the median estimate range somewhere between 26 and 28, and they are centered around 27. To obtain a precise estimate of the distribution spread, we can look at 2.5% and 97.5% percentiles:

quantile(x = bootstrap_medians,
         probs = c(0.025,0.975))
##  2.5% 97.5% 
##    26    29

And here is our measure of the median accuracy! The way you could describe this in the paper is the following:

The median is 27 and the 95% bootstrap confidence interval is 26-28.5.

Final note: you can use bootstrapping to estimate the accuracy of any parameter, not just median or mean, we are only using median for illustrative purposes.

Gaussian bootstrap

In case you were bothered by the “peakiness” of the distribution of the resampled medians above, there is a trick that can make it smoother (although this is often unnecessary). You can simply add Gaussian (normally distributed) noise. The noise has to be centered around 0 in order to avoid shifting your median right or left along the X axis. The standard deviation of the noise is an arbitrary number.

n_bootstraps <- 1000 # pick an arbitrary number of bootstrap replicates, the exact number does not matter too much

smooth_bootstrap_medians <- c() # create an empty vector to store medians of each sample
for (i in 1:n_bootstraps){
  # resample with replacement
  sample_i <- sample(x = light,
                     size = length(light),
                     replace = TRUE)
  # add normally distributed noise centered around 0
  sample_i_noisy <- sample_i + rnorm(n = length(light),
                                     mean = 0, # must be 0
                                     sd = 0.1) # arbitrary number
  # calculate the median
  sample_i_noisy_median <- median(sample_i_noisy)
  # append to "smooth_bootstrap_medians"
  smooth_bootstrap_medians <- c(smooth_bootstrap_medians, sample_i_noisy_median)
}

# epdf
hist(smooth_bootstrap_medians)

# ecdf
plot(ecdf(smooth_bootstrap_medians))

# compare the different median estimates
median(smooth_bootstrap_medians)
## [1] 27.10821
median(bootstrap_medians)
## [1] 27
median(light)
## [1] 27
# 95% CI
quantile(x = smooth_bootstrap_medians,
         probs = c(0.025,0.975))
##     2.5%    97.5% 
## 25.98883 28.46340

The obtained values are very similar to the ones without noise, so you are not distorting your data in any meaningful way. Feel free to add random noise in those cases when the shape of the resampled distribution matters.