Consider two different experimental designs for determining whether high blood pressure is a risk factor for strokes:
The data might look something like this:
Design 1

 | HiBP | LoBP |
---|---|---|
Stroke | 100 | 50 |
NoStroke | 600 | 1250 |

Design 2

 | HiBP | LoBP |
---|---|---|
Stroke | 200 | 800 |
NoStroke | 50 | 950 |
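The code below assumes these tables have already been entered as 2x2 matrices named data_matrix1 and data_matrix2. If they have not been, a minimal sketch to create them would be:

# assumed setup: 2x2 count matrices matching the tables above
# (rows = outcome, columns = blood-pressure group)
data_matrix1 = matrix(c(100, 600, 50, 1250), nrow=2, ncol=2,
                      dimnames=list(c("Stroke","NoStroke"), c("HiBP","LoBP")))
data_matrix2 = matrix(c(200, 50, 800, 950), nrow=2, ncol=2,
                      dimnames=list(c("Stroke","NoStroke"), c("HiBP","LoBP")))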
Let’s compute both \(RR = \frac{p_{Stroke|HiBP}}{p_{Stroke|LoBP}}\) and \(OR = \frac{p_{Stroke|HiBP}/(1-p_{Stroke|HiBP})}{p_{Stroke|LoBP}/(1-p_{Stroke|LoBP})}\) for both scenarios:
## Design 1 ================= #
a=data_matrix1[1,1]
b=data_matrix1[1,2]
c=data_matrix1[2,1]
d=data_matrix1[2,2]
p1 = a/(a+c)
p2 = b/(b+d)
o1 = a/c
o2 = b/d
RR = p1 / p2
#OR = (p1/(1-p1)) /(p2/(1-p2))
OR = (a/c) / (b/d) # equivalent
# p1
# p2
# o1
# o2
cat("Design 1:\n", "RR =",RR,"\n", "OR =",OR,"\n")
## Design 1:
## RR = 3.714286
## OR = 4.166667
## Design 2 ================= #
a=data_matrix2[1,1]
b=data_matrix2[1,2]
c=data_matrix2[2,1]
d=data_matrix2[2,2]
p1 = a/(a+c)
p2 = b/(b+d)
o1 = a/c
o2 = b/d
RR = p1 / p2
#OR = (p1/(1-p1)) /(p2/(1-p2))
OR = (a/c) / (b/d) # equivalent
# p1
# p2
# o1
# o2
cat("Design 2:\n", "RR =",RR,"\n", "OR =",OR,"\n")
## Design 2:
## RR = 1.75
## OR = 4.75
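Since the exact same arithmetic is repeated for both designs, it can also be wrapped in a small helper function. This is just a convenience sketch (the name rr_or is made up for this note), not part of the original exercise:

# helper: compute RR and OR from a 2x2 table laid out like the ones above
# (row 1 = outcome present, row 2 = outcome absent; column 1 = exposed group)
rr_or = function(m) {
  a = m[1,1]; b = m[1,2]; c = m[2,1]; d = m[2,2]
  RR = (a/(a+c)) / (b/(b+d))  # risk of outcome, exposed vs. unexposed
  OR = (a/c) / (b/d)          # odds of outcome, exposed vs. unexposed
  c(RR = RR, OR = OR)
}
rr_or(data_matrix1)  # RR ~3.71, OR ~4.17 (matches Design 1 above)
rr_or(data_matrix2)  # RR 1.75, OR 4.75 (matches Design 2 above)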
We can use the riskratio() and oddsratio() functions from the “epitools” package to compute the risk ratio and odds ratio automatically.

Note 1: Look at the documentation for these functions. They expect the table in a different format than what we have here! We can account for this by setting the rev parameter, and we also need to transpose the table first.
library(epitools)
data_matrix1
## HiBP LoBP
## Stroke 100 50
## NoStroke 600 1250
t(data_matrix1)
## Stroke NoStroke
## HiBP 100 600
## LoBP 50 1250
riskratio(t(data_matrix1),correction = F, rev="both")
## $data
## NoStroke Stroke Total
## LoBP 1250 50 1300
## HiBP 600 100 700
## Total 1850 150 2000
##
## $measure
## NA
## risk ratio with 95% C.I. estimate lower upper
## LoBP 1.000000 NA NA
## HiBP 3.714286 2.678835 5.149969
##
## $p.value
## NA
## two-sided midp.exact fisher.exact chi.square
## LoBP NA NA NA
## HiBP 2.220446e-16 2.375474e-16 2.803567e-17
##
## $correction
## [1] FALSE
##
## attr(,"method")
## [1] "Unconditional MLE & normal approximation (Wald) CI"
oddsratio(t(data_matrix1),correction = F, rev="both")
## $data
## NoStroke Stroke Total
## LoBP 1250 50 1300
## HiBP 600 100 700
## Total 1850 150 2000
##
## $measure
## NA
## odds ratio with 95% C.I. estimate lower upper
## LoBP 1.000000 NA NA
## HiBP 4.157007 2.933727 5.961773
##
## $p.value
## NA
## two-sided midp.exact fisher.exact chi.square
## LoBP NA NA NA
## HiBP 2.220446e-16 2.375474e-16 2.803567e-17
##
## $correction
## [1] FALSE
##
## attr(,"method")
## [1] "median-unbiased estimate & mid-p exact CI"
riskratio(t(data_matrix2),correction = F, rev="both")
## $data
## NoStroke Stroke Total
## LoBP 950 800 1750
## HiBP 50 200 250
## Total 1000 1000 2000
##
## $measure
## NA
## risk ratio with 95% C.I. estimate lower upper
## LoBP 1.00 NA NA
## HiBP 1.75 1.614968 1.896322
##
## $p.value
## NA
## two-sided midp.exact fisher.exact chi.square
## LoBP NA NA NA
## HiBP 0 3.232811e-25 3.602083e-24
##
## $correction
## [1] FALSE
##
## attr(,"method")
## [1] "Unconditional MLE & normal approximation (Wald) CI"
oddsratio(t(data_matrix2),correction = F, rev="both")
## $data
## NoStroke Stroke Total
## LoBP 950 800 1750
## HiBP 50 200 250
## Total 1000 1000 2000
##
## $measure
## NA
## odds ratio with 95% C.I. estimate lower upper
## LoBP 1.000000 NA NA
## HiBP 4.735764 3.452523 6.61294
##
## $p.value
## NA
## two-sided midp.exact fisher.exact chi.square
## LoBP NA NA NA
## HiBP 0 3.232811e-25 3.602083e-24
##
## $correction
## [1] FALSE
##
## attr(,"method")
## [1] "median-unbiased estimate & mid-p exact CI"
The above method seems most straightforward to me logically … but you can also reverse the matrix beforehand and then execute the commands without adding rev="both":
riskratio(rev(data_matrix1), correction=F)
oddsratio(rev(data_matrix1), correction=F)
riskratio(rev(data_matrix2), correction=F)
oddsratio(rev(data_matrix2), correction=F)
Note 2: Beware!!! There is something funny about the results when you reverse the matrix first and then feed it to riskratio() as a vector, vs. reversing it and feeding it as a matrix, or just transposing it. This can be really confusing!!!
library(epitools)
# compare different input formats============================================ #
data_matrix1
rev(data_matrix1) # reverses order as d-b-c-a (returns a vector, not a matrix)
matrix(rev(data_matrix1), # swaps a-d and b-c
       nrow=2, ncol=2)
matrix(rev(data_matrix1), # swaps a-d only
       nrow=2, ncol=2,
       byrow = TRUE)
t(data_matrix1) # swaps b-c
# risk ratio ================================================================ #
#riskratio(data_matrix1, correction=F, rev="both") # this is not correct
#riskratio(t(data_matrix1), correction=F) # this is also incorrect
riskratio(t(data_matrix1), correction=F, rev="both") # this works!
#riskratio(matrix(rev(data_matrix1), nrow=2, ncol=2), # this also does not work! ... even with rev="both"
# correction=F) # (default is byrow = FALSE)
# # then you need to transpose again, too complicated!
riskratio(rev(data_matrix1), correction=F) # this works!
# riskratio(matrix(rev(data_matrix1), # this works too ... need to make matrix by row
# nrow=2, ncol=2,
# byrow = TRUE),
# correction=F)
# riskratio(t(matrix(rev(data_matrix1), # this works, but too complicated!
# nrow=2, ncol=2, # you need to transpose the matrix also
# byrow = FALSE)), # (default is byrow = FALSE)
# correction=F)
# odds ratio ================================================================ #
oddsratio(t(data_matrix1), correction=F, rev="both")
oddsratio(t(data_matrix2), correction=F, rev="both")
We can see that for Design 1, the RR and OR are similar. In Design 2, however, the RR is very different from the OR. In contrast, the OR is about the same in the two studies.
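This is the key property of the odds ratio: it does not change when one outcome group is oversampled, which is essentially what Design 2 does (it fixes the sample at 1,000 stroke and 1,000 no-stroke subjects instead of sampling the population at large). A quick sketch of this, scaling the Stroke row of the Design 1 table by an arbitrary illustrative factor of 10:

# oversampling strokes changes the RR but leaves the OR untouched
scaled_matrix = data_matrix1
scaled_matrix[1, ] = 10 * scaled_matrix[1, ]  # 10x as many stroke cases
a=scaled_matrix[1,1]; b=scaled_matrix[1,2]
c=scaled_matrix[2,1]; d=scaled_matrix[2,2]
cat("Scaled Design 1:\n",
    "RR =", (a/(a+c)) / (b/(b+d)), "\n",  # ~2.19, no longer 3.71
    "OR =", (a/c) / (b/d), "\n")          # still ~4.17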
Now use Fisher’s test to get a \(p\)-value and OR estimate for these data. Try out both a 2-sided and a 1-sided test (alternative="greater") for both datasets.
fisher.test(data_matrix1)
##
## Fisher's Exact Test for Count Data
##
## data: data_matrix1
## p-value = 2.375e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 2.892459 6.055446
## sample estimates:
## odds ratio
## 4.163592
fisher.test(data_matrix2)
##
## Fisher's Exact Test for Count Data
##
## data: data_matrix2
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 3.412951 6.704704
## sample estimates:
## odds ratio
## 4.746561
fisher.test(data_matrix1, alternative="greater")
##
## Fisher's Exact Test for Count Data
##
## data: data_matrix1
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
## 3.056374 Inf
## sample estimates:
## odds ratio
## 4.163592
fisher.test(data_matrix2, alternative="greater")
##
## Fisher's Exact Test for Count Data
##
## data: data_matrix2
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
## 3.585626 Inf
## sample estimates:
## odds ratio
## 4.746561
fisher.test(data_matrix1)$p.value
## [1] 2.375474e-16
fisher.test(data_matrix2)$p.value
## [1] 3.232811e-25
fisher.test(data_matrix1, alternative="greater")$p.value
## [1] 2.087913e-16
fisher.test(data_matrix2, alternative="greater")$p.value
## [1] 1.616406e-25
These are very similar to the results we got previously. The fisher.exact \(p\)-values reported by the “epitools” package are identical to those from a 2-sided fisher.test(). The odds ratio estimates differ slightly, though, because the estimators differ: fisher.test() reports the conditional maximum-likelihood estimate, while oddsratio() defaults to a median-unbiased estimate (see the “method” attribute in its output above), and neither is exactly the simple a*d/(b*c) we computed by hand. Note also that the printed summary often just says p-value < 2.2e-16, but you can get the precise value using the $p.value notation.
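In the same way, the exact odds ratio estimate and its confidence interval can be pulled out of the fisher.test() result object (these are standard components of its return value):

fisher.test(data_matrix1)$estimate  # conditional MLE of the odds ratio
fisher.test(data_matrix1)$conf.int  # 95 percent confidence interval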
The phyper() function

Now use the hypergeometric distribution to get a \(p\)-value for these data. How do the results compare with your results above?
data_matrix1
## HiBP LoBP
## Stroke 100 50
## NoStroke 600 1250
data_matrix2
## HiBP LoBP
## Stroke 200 800
## NoStroke 50 950
# phyper(q, m, n, k, lower.tail = FALSE)
# phyper(a - 1, a+b, c+d, a+c, lower.tail = FALSE)
# q = observation (a = stroke & high BP)
# m = white balls (a+b = stroke, a.k.a. "success")
# n = black balls (c+d = healthy, a.k.a. "failure")
# k = number of draws (a+c = High BP)
# note: with lower.tail = FALSE, phyper() returns P(X > q), so we pass a - 1
# to get P(X >= a), i.e. tables at least as extreme as the one observed
# Design 1
a=data_matrix1[1,1]
b=data_matrix1[1,2]
c=data_matrix1[2,1]
d=data_matrix1[2,2]
phyper(a - 1, a+b, c+d, a+c, lower.tail = FALSE)
## [1] 2.087913e-16
# Design 2
a=data_matrix2[1,1]
b=data_matrix2[1,2]
c=data_matrix2[2,1]
d=data_matrix2[2,2]
phyper(a - 1, a+b, c+d, a+c, lower.tail = FALSE)
## [1] 1.616406e-25
These are exactly the same as the \(p\)-values for a one-sided Fisher’s test.
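We can confirm this equivalence directly (a quick sanity check, not part of the original output):

# the one-sided Fisher p-value is the same hypergeometric tail probability
a=data_matrix1[1,1]; b=data_matrix1[1,2]
c=data_matrix1[2,1]; d=data_matrix1[2,2]
all.equal(phyper(a - 1, a+b, c+d, a+c, lower.tail = FALSE),
          fisher.test(data_matrix1, alternative = "greater")$p.value)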
Now run a chi-squared test on the same data using chisq.test(). How do the values compare?
chisq.test(data_matrix1)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: data_matrix1
## X-squared = 69.981, df = 1, p-value < 2.2e-16
chisq.test(data_matrix2)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: data_matrix2
## X-squared = 101.49, df = 1, p-value < 2.2e-16
chisq.test(data_matrix1)$p.value
## [1] 5.987268e-17
chisq.test(data_matrix2)$p.value
## [1] 7.181476e-24
Again, very similar and highly significant results, but they are not exactly the same since this is an approximate test. Also note that the result is equivalent to a two-sided test, since you cannot perform a one-sided test with this method.
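As a final side check (an assumption worth verifying, since we passed correction = F to the epitools functions earlier), turning off Yates’ continuity correction in chisq.test() should reproduce the chi.square p-values shown in the riskratio()/oddsratio() output above:

# Pearson chi-squared without Yates' correction, for comparison with epitools
chisq.test(data_matrix1, correct = FALSE)$p.value  # expect ~2.8e-17
chisq.test(data_matrix2, correct = FALSE)$p.value  # expect ~3.6e-24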