So far we have used \(t\)-tests to compare two samples, and we’ve explored transforming our data to make it look more normal so that we can proceed with standard parametric tests.

Two other important approaches are also available that do not assume the data are normally distributed (“distribution-free” methods): non-parametric tests and permutation tests, which look at empirical distributions using resampling methods.

Now that computing power is plentiful and cheap, resampling methods are becoming more common in biology, but standard statistical tests are still the go-to methods for many comparisons.

Non-parametric tests

Suppose we don’t want to transform our data and can’t assume that the data are normally distributed.

  • Instead of using the magnitude of differences between groups directly, nonparametric tests that look at the data in terms of their ranks (or ranked differences).

  • Rank tests compare medians instead of means between samples because they treat continuous data as ordinal rather than quantitative (interval) data.

The sign test and the Wilcoxon sign rank test can both be used instead of one-sample or paired \(t\)-tests when the data do not meet the assumptions of a \(t\)-test (i.e. the data are normally distributed).

Null hypothesis

The intuition behind the null hypothesis for all of these tests is that the median difference between two samples (or between a sample and an expected value) should be zero, if they belong to the same parent population.

  • This means that there should be no bias in the direction (sign) of the differences between two groups (or for one-sample tests, between a sample and the expected population mean).

  • In addition, if we were to rank the differences between groups, or just the raw values in each group (depending on the test), they should somehow average out between the two groups. This works well when the two groups have a similar skew and variance.

Test statistics for non-parametric tests

Different tests use different methods to generate a test statistic that is used to obtain a \(p\)-value for observed differences. All of them compute a statistic for both positive and negative differences, and then one of these is chosen to compute the \(p\)-value.

Since the amount of bias will be equally extreme in both directions, it doesn’t matter which statistic is chosen, but it is important to choose the correct tail of an expected distribution for comparison.

  • By default, the statistic that can be compared to a lower-tail of a distribution is chosen as the test statistic for obtaining a \(p\)-value.

Example: Plasma Triglyceride levels

Last time we used log-transformed values of this dataset to perform \(t\)-tests. Here, we will use different non-parametric tests to compare the pre- and post- data.

#####################################################################
# plasma triglyceride levels in the population (mg/ml)
# borderline high = 2-4 vs. normal < 2
# testing before and after diet and exercise changes (expect a decrease)
pre = c(2.55,3.38,2.37,4.11,3.27,2.58,4.20,3.22,5.10,2.62,3.06,1.23,2.27,2.24,1.39,2.63,2.61,4.30,1.46,3.35,2.79,2.42,4.63,1.57)
post = c(1.59,3.51,1.44,2.32,1.75,1.67,1.90,1.37,2.72,1.80,2.40,2.01,2.41,1.38,1.18,4.31,2.09,2.32,2.63,2.25,1.71,2.01,1.95,3.46)
#####################################################################

Nonparametric tests for paired data

Sign test

The sign test is the simplest nonparametric one-sample or paired two-sample test. The hypotheses to be tested are:

\(\ \ \ H_o\): The median paired difference between the groups is zero.

\(\ \ \ H_A\): The median paired difference between the groups is NOT zero.

The sign test simply takes the signs of the paired differences and counts up the number of negative and positive deviations from zero. This generates binary data, so the p-value is equivalent to the binomial probability for the observed number of negative deviations with an expected probability of \(p=0.5\).

The steps are:

  • Calculate the pairwise differences between the samples.
  • Count the number of nonzero differences \(n\).
  • The test statistic, let’s call it \(k\), is the smaller number of the positive and negative differences.
  • Calculate the binomial probability of observing \(k\) out of \(n\) differences given an expected probability of \(\pi = 0.5\).

First, let’s compute the test statistic manually. This will be the smaller of the negative and positive counts.

# difference in the two samples
tri_diff = 

# count of negative differences
neg_count = 
neg_count
## Error in eval(expr, envir, enclos): object 'neg_count' not found

# count of positive differences
pos_count = 
pos_count
## Error in eval(expr, envir, enclos): object 'pos_count' not found
# test statistic
# will be used to get p(X <= x) using lower tail of expected distribution
test_stat = 
test_stat
## Error in eval(expr, envir, enclos): object 'test_stat' not found

Since the sign test is just a binomial exact test using the positive and negative counts, the probability can be computed either using pbinom() or using the binomial proportions test, binom.test().

Now we use the test statistic to obtain a two-tailed \(p\)-value for the observed data under \(H_o\). This is the two-tailed binomial probability of seeing a value as extreme as our test statistic (here, neg_count).

# since pbinom(..., lower.tail=TRUE) gives p(X <= x), 
# for a 2-tailed test we multiply the probability by 2


# binomial test for counts (default is 2-sided)

Wilcoxon signed rank test

The Wilcoxon signed rank test is similar to the sign test except that it takes the sum of all the signed ranks, and it can easily be performed directly in R.

This test is a non-parametric equivalent to a paired \(t\)-test. When the underlying data are normally distributed, it has slightly less power to detect true differences between groups, but otherwise the opposite is true. We will discuss power more in the next class.

Assumptions of the test

It is often stated that a caveat of this test is that it assumes a symmetric distribution of paired differences. However, in practice this is not really a big limitation because under the null hypothesis that two groups are drawn from the same population, then the pairwise differences are guaranteed to be symmetrical, with both mean and median = 0.

Let’s see if the distribution of paired differences for the triglyceride data looks approximately symmetrical:

tri_diff = post - pre
hist(tri_diff, breaks=10)

abline(v = ... , lwd=2, col="blue")
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
abline(v = ... , lwd=2, col="red")
## Error in eval(expr, envir, enclos): '...' used in an incorrect context

Well, these differences look sort of symmetrical, but not clearly so. However, what we are actually interested in here is symmetry under \(H_o\), so we can use this test even when the actual paired differences are not symmetric about the mean/median.

The W statistic

Instead of looking at the actual magnitudes of the paired differences between paired samples, here we will look at the ranks of the differences.

Our null hypothesis is that the sum of the ranks from both groups are the same.

The steps are:

  • Calculate the \(N\) differences between the pairs.
  • Rank the absolute values of the \(n\) non-zero differences. Assign the average if there is a tie.
  • Also record the sign of the differences, \(sgn(x_{2i} - x_{1i})\) for \(i \in \{1..N\}\).
  • Compute the test statistic \(W\), defined as:

\[ W = \sum_{i=1}^Nsgn(x_{2i} - x_{1i}) * R_i \]

# compute paired differences
tri_diff = post - pre

# rank absolute values
rank_diff = 

# Calculate the Wilcoxon W statistic
w_stat = 
w_stat
## Error in eval(expr, envir, enclos): object 'w_stat' not found

If there is no difference between the two samples, we would expect that \(W=0\), since there should be no bias toward lower or higher ranks in either sample.

The V statistic

The R implementation of the Wilcoxon Rank Sum test uses something called a \(V\)-statistic, which is a little different than the \(W\)-statistic. The test will give the same \(p\)-value whether we use \(W\) or \(V\), since the formulations are equivalent.

To compute \(V\), instead of just adding up all the positive ranks and subtracting the negative ones to get our test statistic (\(H_o: W=0\)), we just look at the positive and negative summed ranks separately. If both groups are the from the same distribution, then we’d expect these to be pretty much the same.

The test statistic, \(V\), can range from \(0\) to \(n(n+1)/2\). Depending on the question we want to ask, we choose \(V\) as follows:

  • 2-tailed test: \(V = min(T_-, T_+)\), the minimum of the summed ranks with either sign
  • Upper-tailed test: \(V = T_-\)
  • Lower-tailed test: \(V = T_+\)

Let’s go ahead and compute the \(V\) statistic and a corresponding \(p\)-value using the normal approximation for \(V\):

# negative rank sums

t_neg = 
t_neg
## Error in eval(expr, envir, enclos): object 't_neg' not found

# positive rank sums

t_pos = 
t_pos
## Error in eval(expr, envir, enclos): object 't_pos' not found

# V statistic
v_stat = 
v_stat
## Error in eval(expr, envir, enclos): object 'v_stat' not found

Sampling distribution of Wilcoxon statistics

When the number of items sampled is more than ~20-25, the sampling distributions of the \(W\) and \(V\) statistics have a specific expected distribution under the null that is well approximated by a normal distribution.

We can simulate the sampling distribution for \(V\) to illustrate this for ourselves. It turns out that R actually has a family of functions for the \(V\) distribution! To generate the samples, we will use the rsignrank() function, which takes random samples from a Wilcoxon Sign Rank distribution for a certain sample size \(n\).

We will then plot a histogram of the sampling distribution with the mean overlaid in blue, and the value of our actual \(V\)-statistic overlaid in red.

n=24  # size of triglyceride dataset
  
# sampling distribution of the test statistic W
set.seed(13000)                       # Set seed for reproducibility
N = 100000                            # Specify sample size
y_rsignrank =  ...    # Take N random samples from W(sample size=n) 
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
head(y_rsignrank)                     # Print values to RStudio console
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'head': object 'y_rsignrank' not found
hist(y_rsignrank,                     # Plot of randomly drawn density
     breaks = 50,
     main = "")
## Error in hist(y_rsignrank, breaks = 50, main = ""): object 'y_rsignrank' not found
abline(v=mean(y_rsignrank), col="blue", lwd=3)  # mean of sampling distribution
## Error in mean(y_rsignrank): object 'y_rsignrank' not found
abline(v=mean(v_stat), col="red", lwd=3)        # V-statistic for comparison
## Error in mean(v_stat): object 'v_stat' not found

Normal approximations for the \(p\)-value

In the absence of tied ranks (which we ignore for now to keep things simple), the population mean and variance of the \(W\)-statistic are:

\[ \mu_{W} = 0 ;\ \sigma_{W}^2 = \frac{n(n+1)(2n+1)}{6} \]

And for the \(V\)-statistic, they are:

\[ \mu_V = \frac{n(n+1)}{4}\ ;\ \ \sigma^2_V = \frac{n(n+1)(2n+1)}{24}\]

For the triglyceride data, the sample size is \(n=24\). Let’s use the normal approximations to compute a \(z\)-score and a \(p\)-value using both \(W\) and \(V\):

# p-value (2-tailed) with normal approximation for W
n = length(tri_diff)           # sample size
sigma_w = n*(n+1)*(2*n+1) / 6  # expected std dev of W statistic

z_w =   # z-stat


# p-value (2-tailed probability)
  
## Error: <text>:10:0: unexpected end of input
## 8: # p-value (2-tailed probability)
## 9:   
##   ^
# p-value (2-tailed) with normal approximation for V
mean_v = n*(n+1)/4              # mu = max_t / 2
sigma_v = n*(n+1)*(2*n+1) / 24  # sigma without ties

z_v =   # z-stat

# p-value (2-tailed probability)
  
## Error: <text>:9:0: unexpected end of input
## 7: # p-value (2-tailed probability)
## 8:   
##   ^

We can also remake our histogram above as a density plot, with a normal curve using the mean and SD for the V-distribution:

hist(y_rsignrank,
     breaks = 50, freq=FALSE,
     main = "")
## Error in hist(y_rsignrank, breaks = 50, freq = FALSE, main = ""): object 'y_rsignrank' not found
abline(v=mean(y_rsignrank), col="blue", lwd=3)
## Error in mean(y_rsignrank): object 'y_rsignrank' not found
abline(v=mean(v_stat), col="red", lwd=3)
## Error in mean(v_stat): object 'v_stat' not found

# overlay normal distribution with expected mean and SD
curve(dnorm(x, mean= ... , sd= ... ),
      col="darkblue", lwd=2, add=TRUE)
## Error in eval(expr, envir = ll, enclos = parent.frame()): '...' used in an incorrect context

This looks pretty good!

Wilcoxon Signed Rank Test in R

To compute a \(p\)-value for our test statistic, we can use the wilcox.test() function with the paired=TRUE option to perform a paired test on the triglyceride data. The test gives the same result when samples are entered in either order.

# wilcoxon signed rank test
  # post vs. pre
  # pre vs. post

Let’s see what psignrank() gives us for a two-tailed probability:

# 2-tailed p-value with CDF

Notice that \(p\)-values returned by the paired test and the CDF function are the same, and both methods gave us an “exact” probability, as reported by the output of the wilcox.test().

However, computing exact \(p\)-values for Wilcoxon statistics can become computationally expensive, especially for larger sample sizes, and particularly for the Rank Sum test for unpaired samples (see below).

By default, the wilcoxtest() will compute an exact \(p\)-value if the sample size is <50 and there are no ties. Otherwise, R will use a normal approximation with continuity correction to obtain a \(p\)-value.

Manual calculations vs. R functions

How do the manual calculations for the \(p\)-value compare to those you got using the wilcox.test() and psignrank() functions in R? Why might these differ?

# your answer here

Unpaired data: Mann-Whitney U / Wilcoxon Rank Sum test

What if our data are not normal and also not paired?

We can use a Wilcoxon rank-sum test or the equivalent Mann-Whitney U-test. Some people therefore like to refer to this test as a Mann-Whitney-Wilcoxon test.

This test is simlar to the paired signed-rank test, but instead of comparing ranks for the positive and negative differences between pairs of values, we will first combine the data, rank them together, and then compare how the summed ranks between the two independent groups.

This makes intuitive sense because if the data are drawn from a homogeneous population, we would expect the ranks of the measurements from two random samples to be relatively well interleaved (as if we had shuffled a deck of cards). So, we would expect that the overall sum of ranks should be about the same for both groups.

The hypotheses to be tested are:

\(\ \ \ H_o\): The sample distributions are the same.

\(\ \ \ H_A\): The sample distributions are NOT the same.

W-statistics

To perform the test, we first compute the \(W\)-statistic for both groups:

  • Combine the data.
  • Assign ranks from smallest (top-ranked) to largest (lowest-ranked).
  • Assign ties the midrank (the average of the ranks).
  • Compute the sum of ranks \(T_1\) for Sample 1 and \(T_2\) for Sample 2.
  • Calculate \(W\) using the formulas below.

\[ W_{1} = T_{1} - \frac{n_{1}(n_{1}+1)}{2} \] \[ W_{2} = T_{2} - \frac{n_{2}(n_{2}+1)}{2} \] where \(n_{1}\) and \(n_{2}\) are the sizes of the two groups.

Note: The sum of ranks for a sequential set of numbers starting with 1 is \(N(N+1)/2\). Check this out for yourself on some small sets of numbers (e.g. 1,2,3,4,5). Intuitively, then, if most the measurements from Sample 1 are smaller smaller than those from Sample 2, then the minimum sum of ranks for Sample 1 would be \(R_1 = n_1(n_1+1)/2\).

Computing the W-statistics

Ranks and rank sums:

Data = c(pre, post)

n_pre = length(pre)
n_post = length(post)
n = n_pre  # here, n is the same for both sets
  
# rank the combined data
DataRank = 
DataRank
## Error in eval(expr, envir, enclos): object 'DataRank' not found

# assign ranks to groups and take the sum
DataRankSums = tapply(DataRank,rep(c("pre","post"),each=n),sum)
## Error in tapply(DataRank, rep(c("pre", "post"), each = n), sum): object 'DataRank' not found
DataRankSums
## Error in eval(expr, envir, enclos): object 'DataRankSums' not found

Test statistics:

# compute t-values (used to get W)
t_pre   = 
t_post  = 

t_min = n*(n + 1)/2      # here n and t_min are the same for both groups

w_pre = 
w_pre
## Error in eval(expr, envir, enclos): object 'w_pre' not found
w_post = 
w_post
## Error in eval(expr, envir, enclos): object 'w_post' not found

Normal approximation

Again, the sampling distribution of \(W\) for samples > ~20-25 each is approximately normal, so we can compute a \(p\)-value for our test statistic using a normal approximation.

The population mean and variance (again, ignoring the term for ties) are:

\[ \mu_W = \frac{n_1n_2}{2}\ ;\ \ \sigma_{W}^2 = \frac{n_{1}n_{2}}{12} (n_1 + n_2 + 1) \]

Now we can calculate a \(z\)-score and use it to get a \(p\)-value. By convention, we use the group with the lower W (here that is w_post):

# mean and SD for W sampling distribution
mu_w = n_pre * n_post / 2
mu_w
## [1] 288

sd_w_squared = (n_pre*n_post/12) * (n_pre + n_post + 1)
sd_w = sqrt(sd_w_squared)
sd_w
## [1] 48.49742

# z-score for lower W value
z_w = 
z_w
## Error in eval(expr, envir, enclos): object 'z_w' not found

# p-value with CDF
  # using z-score
  # using W-stat

Note that using the upper-tail probability for the higher \(W\) gives the same \(p\)-value:

# z-score for higher W value
z_w =
z_w
## Error in eval(expr, envir, enclos): object 'z_w' not found

# p-value (since the sign of the z-score is reversed, we use the opposite tail)

U-statistics

The Mann-Whitney \(U\)-test uses a \(U\)-statistic instead, which measures the difference between the observed and minimum ranks. Mathematically, it is the same as the \(W\)-statistic, except the formulation is slightly different. The test statistic is the smaller of the rank sums.

\[ U_1 = n_1n_2 + \frac{n_1(n_1+1)}{2} - T_1\] \[U_2 = n_1n_2 - U_1\] where \(n_{1}\) and \(n_{2}\) are the size of the two groups.

The population mean and variance (again, ignoring the term for ties for now) are:

\[ \mu_U = \frac{n_1n_2}{2}\ ; \ \sigma_{U}^2 = \frac{n_{1}n_{2}}{12} (n_1 + n_2 + 1) \]

First, let’s compute the \(U\) statistics by hand:

# see above for calculation of t_pre and t_post

t_min = n*(n+1)/2 # minimum rank

u_pre = 
u_pre
## Error in eval(expr, envir, enclos): object 'u_pre' not found

u_post = 
u_post
## Error in eval(expr, envir, enclos): object 'u_post' not found

You can see that the values for the \(U\)-statistics are the same as the \(W\)-statistics, so using the normal approximation to compute the \(z\)-score and \(p\)-value will give exactly the same results as above using \(W\)-statistics!

Wilcoxon Rank Sum test in R

The test is implemented in R as wilcox.test(), with the (default) unpaired option. The \(p\)-value is the same when the groups are supplied in either order.

  # paired=FALSE is the default
  # the p-value is the same in either orientation

R also has the wilcox family of functions for the distribution of the Wilcoxon Rank Sum statistic. Since this is a discrete distribution, the presence of ties in the data makes the \(p\)-values inexact, as non-integer values are simply truncated.

w_pre
## Error in eval(expr, envir, enclos): object 'w_pre' not found
w_post
## Error in eval(expr, envir, enclos): object 'w_post' not found
## unadjusted values
2*pwilcox(w_pre, n_pre, n_post, lower.tail=F)       # truncates to 415
## Error in pwilcox(w_pre, n_pre, n_post, lower.tail = F): object 'w_pre' not found
2*pwilcox(w_post, n_pre, n_post)                    # truncates to 160
## Error in pwilcox(w_post, n_pre, n_post): object 'w_post' not found
## adjusted values
# p(X => x): need to subtract 1 or else get p(X > x)
2*pwilcox(w_pre - 1.5, n_pre, n_post, lower.tail=F) # decreases to 414
## Error in pwilcox(w_pre - 1.5, n_pre, n_post, lower.tail = F): object 'w_pre' not found

# p(X <= x)
2*pwilcox(w_post + 0.5, n_pre, n_post)              # increases to 161
## Error in pwilcox(w_post + 0.5, n_pre, n_post): object 'w_post' not found

The R documentation warns that the wilcox distribution functions are computationally inefficient and can sometimes crash R (!), so it is probably not a good idea to use them in general.

Manual calculations vs. R functions

How do the manual results compare with the results from the R functions? Why are they not identical?

# your answer here

What if we were to use the log-transformed data?

Try this out and see what you find. First, draw a histogram of the transformed data.

# histogram as above, but using log-transformed data

Now perform the test in R.

# Wilcoxon rank sum test
  # untransformed
  # log-transformed

How do the results compare with those for the untransformed data?

# your answer here

Permutation (shuffle) test 1

The mosaic package contains some convenient functions for sampling. The Sleep (capital S) dataset is included in this package. Here is the description of the study from the R documentation for this dataset:

In an experiment on memory (Mednicj et al, 2008), students were given lists of 24 words to memorize. After hearing the words they were assigned at random to different groups. One group of 12 students took a nap for 1.5 hours while a second group of 12 students stayed awake and was given a caffeine pill. The data set records the number of words each participant was able to recall after the break.

Let’s make a boxplot of these data and take a look at them.

data(Sleep)

# simple boxplot with base R (formula syntax)


# ggplot strip plot with boxplot

If the two samples came from the same population, we would expect that if we jumble up the labels on the data points, then the mean difference between them would be zero.

We can do this many times and then find a \(p\)-value for the observed difference between the two groups.

This method uses permutation to randomly reassign the labels. Because this essentially has the effect of shuffling a deck of cards, the test is often called a shuffle test.

Let’s try this out below. First, let’s look at the difference in the means and then perform a regular \(t\)-test on them.

Note that here we use the formula syntax for the mean() and t.test() functions, which uses a tilde (~) symbol. This is a convenient notation to specify a relationship between two variables. The “dependent” variable (the outcome) goes on the left, and the independent variable goes on the right.

# scramble Group with respect to outcome, Words
        # means of the two samples
obs =   # observed difference
obs
## Error in eval(expr, envir, enclos): object 'obs' not found
# t-test for HA: caffeine < sleep

Looks like getting some sleep is better for your memory than taking caffeine!

If we randomly shuffle the group labels, we will get a different mean between the sample groups every time. We use the shuffle() function to do the permutation:

# one shuffle

Now let’s try the permutation test and look at the results.

# permutation test: scramble Group with respect to outcome, Words

# 2,000 shuffles
sleep_null =

# from ggformula package, included with mosaic, enables formula-style expressions in ggplot
# here we just have one variable, but we can also use this to look at two variables
gf_histogram(gformula = ... , fill = ... , data = sleep_null, 
  binwidth = 0.4,
  xlab = "Distribution of difference in means under the null hypothesis")
## Error in eval(expr, envir, enclos): '...' used in an incorrect context
# empirical p-value

This is our empirical \(p\)-value for the difference in the sample means! How does it compare to the results of the \(t\)-test above?