Let Y be the dependent (response) variable, and A and B two independent (predictor) variables.
Notation | Meaning |
---|---|
Y ~ 1 | Y is not being predicted by any independent variables. |
Y ~ A | Y is being predicted by A. |
Y ~ A + B | Y is being predicted by A and B independently of each other. |
Y ~ A : B | Y is being predicted by an interaction between A and B. |
Y ~ A * B | The same as Y ~ A + B + A:B. |
Note: this is not an extensive list of notations, please refer to the R manual for additional details.
You may remember that in application to a single categorical variable, ANOVA is all about comparing between-group variance to within-group variance. When it comes to multiple variables, things get a little tricky. In such situations, we use ANOVA to compare between-model variance to within-model variance, where models are linear regression models with different sets of independent variables included in them. There are always two models in a comparison. The first one is simpler, i.e. it includes fewer variables or interactions. It is also referred to as reduced model. The second one includes an extra variable or interaction and it is called full model. There are several strategies to formulate these models, which is why we distinguish between three types of multivariate ANOVA.
anova()
] processes Y ~ A*B:Step | Reduced (simpler) mode | Full (more complex) model | Relevant output |
---|---|---|---|
1 | Y ~ 1 | Y ~ 1 + A | Pr(>F) for A |
2 | Y ~ 1 + A | Y ~ 1 + A + B | Pr(>F) for B |
3 | Y ~ 1 + A + B | Y ~ 1 + A + B + A:B | Pr(>F) for A:B |
Note: type I ANOVA does not compare Y ~ 1 with Y ~ 1 + B. In other words, it does not estimate the main effect of B. So, the order in which you specify independent variables matters, and Y ~ A*B and Y ~ B*A may produce different results. In most cases, you do not want to make any assumptions about the relative importance of your variables and you should not use type I ANOVA.
car::Anova(...,type="II")
] processes Y ~ A*B:Step | Reduced (simpler) mode | Full (more complex) model | Relevant output |
---|---|---|---|
1 | Y ~ 1 + B | Y ~ 1 + A + B | Pr(>F) for A |
2 | Y ~ 1 + A | Y ~ 1 + A + B | Pr(>F) for B |
3 | Y ~ 1 + A + B | Y ~ 1 + A + B + A:B | Pr(>F) for A:B |
Note: no interaction is assumed in the first two comparisons. If there is significant interaction, the results may be misleading. But as long as there is no significant interaction, type II ANOVA is usually the most appropriate and powerful type of ANOVA.
car::Anova(...,type="III")
and summary.lm()
] processes Y ~ A*B:Step | Reduced (simpler) mode | Full (more complex) model | Relevant output |
---|---|---|---|
1 | Y ~ 1 + B + A:B | Y ~ 1 + A + B + A:B | Pr(>F) for A |
2 | Y ~ 1 + A + A:B | Y ~ 1 + A + B + A:B | Pr(>F) for B |
3 | Y ~ 1 + A + B | Y ~ 1 + A + B + A:B | Pr(>F) for A:B |
Note: If there is no significant interaction, type III is less powerful than type II. If there is significant interaction, type III ANOVA is usually the most appropriate and powerful type of ANOVA.
We will work with the data set where researchers measured the height of male and female African elephants of different ages from birth to adulthood.
First, load and plot the data.
# load the data
elephants <- read.csv('Data_Elephants.csv')
# look at the data
str(elephants)
## 'data.frame': 288 obs. of 3 variables:
## $ Age : num 1.4 17.5 12.8 11.2 12.7 ...
## $ Height: num 120 227 235 210 220 ...
## $ Sex : chr "M" "M" "M" "M" ...
head(elephants)
## Age Height Sex
## 1 1.40 120 M
## 2 17.50 227 M
## 3 12.75 235 M
## 4 11.17 210 M
## 5 12.67 220 M
## 6 12.67 189 M
# plot Height as the function of Age, and color the two sexes separately
ggplot(data = elephants,
mapping = aes(x = Age,
y = Height,
color = Sex)) +
geom_point() +
scale_color_manual(values = c("darkgreen", "cornflowerblue"))
The relationship is obviously not linear. Try transforming one or both variables to make the relationship (sort of) linear.
# add a new column that contains the square root of Age
elephants$sqrt_Age <- sqrt(elephants$Age)
# plot the data
ggplot(data = elephants,
mapping = aes(x = sqrt_Age,
y = Height,
color = Sex)) +
geom_point() +
scale_color_manual(values = c("darkgreen", "cornflowerblue"))
First, let us subset young elephants, defined as 11 years old or younger. Let us examine the combined effect of age and sex on the body height in this age cohort.
# subset individuals that are <=11 years old
indices_of_young_individuals <- elephants$Age <= 11
elephants_young <- elephants[indices_of_young_individuals,]
# plot the data
ggplot(data = elephants_young,
mapping = aes(x = sqrt_Age,
y = Height,
color = Sex)) +
geom_point() +
scale_color_manual(values = c("darkgreen", "cornflowerblue"))
# perform type I ANOVA with Age first and Sex second
anova(lm(formula = Height ~ sqrt_Age * Sex,
data = elephants_young))
## Analysis of Variance Table
##
## Response: Height
## Df Sum Sq Mean Sq F value Pr(>F)
## sqrt_Age 1 268131 268131 1276.8417 < 2e-16 ***
## Sex 1 967 967 4.6057 0.03336 *
## sqrt_Age:Sex 1 275 275 1.3077 0.25452
## Residuals 161 33809 210
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# perform type I ANOVA with Sex first and Age second - note the different p-values!
anova(lm(formula = Height ~ Sex * sqrt_Age,
data = elephants_young))
## Analysis of Variance Table
##
## Response: Height
## Df Sum Sq Mean Sq F value Pr(>F)
## Sex 1 358 358 1.7062 0.1933
## sqrt_Age 1 268740 268740 1279.7412 <2e-16 ***
## Sex:sqrt_Age 1 275 275 1.3077 0.2545
## Residuals 161 33809 210
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# perform type II ANOVA with Age first and Sex second
Anova(lm(formula = Height ~ sqrt_Age * Sex,
data = elephants_young),
type = "II")
## Anova Table (Type II tests)
##
## Response: Height
## Sum Sq Df F value Pr(>F)
## sqrt_Age 268740 1 1279.7412 < 2e-16 ***
## Sex 967 1 4.6057 0.03336 *
## sqrt_Age:Sex 275 1 1.3077 0.25452
## Residuals 33809 161
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# perform type II ANOVA with Sex first and Age second - note that the result is the same as above
Anova(lm(formula = Height ~ Sex * sqrt_Age,
data = elephants_young),
type = "II")
## Anova Table (Type II tests)
##
## Response: Height
## Sum Sq Df F value Pr(>F)
## Sex 967 1 4.6057 0.03336 *
## sqrt_Age 268740 1 1279.7412 < 2e-16 ***
## Sex:sqrt_Age 275 1 1.3077 0.25452
## Residuals 33809 161
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# perform type III ANOVA with Age first and Sex second - note the higher p-value for Sex than with type II
Anova(lm(formula = Height ~ sqrt_Age * Sex,
data = elephants_young),
type = "III")
## Anova Table (Type III tests)
##
## Response: Height
## Sum Sq Df F value Pr(>F)
## (Intercept) 75189 1 358.0499 <2e-16 ***
## sqrt_Age 107292 1 510.9267 <2e-16 ***
## Sex 2 1 0.0084 0.9272
## sqrt_Age:Sex 275 1 1.3077 0.2545
## Residuals 33809 161
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# again, the order does not matter
Anova(lm(formula = Height ~ Sex * sqrt_Age,
data = elephants_young),
type = "III")
## Anova Table (Type III tests)
##
## Response: Height
## Sum Sq Df F value Pr(>F)
## (Intercept) 75189 1 358.0499 <2e-16 ***
## Sex 2 1 0.0084 0.9272
## sqrt_Age 107292 1 510.9267 <2e-16 ***
## Sex:sqrt_Age 275 1 1.3077 0.2545
## Residuals 33809 161
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# running summary() produces p-values equivalent to type III ANOVA, but it uses t-statistics
summary(lm(formula = Height ~ sqrt_Age * Sex,
data = elephants_young))
##
## Call:
## lm(formula = Height ~ sqrt_Age * Sex, data = elephants_young)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.858 -9.823 -0.169 10.008 34.352
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 74.9352 3.9602 18.922 <2e-16 ***
## sqrt_Age 39.3462 1.7407 22.604 <2e-16 ***
## SexM -0.4769 5.2105 -0.092 0.927
## sqrt_Age:SexM 2.6373 2.3062 1.144 0.255
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.49 on 161 degrees of freedom
## Multiple R-squared: 0.8885, Adjusted R-squared: 0.8864
## F-statistic: 427.6 on 3 and 161 DF, p-value: < 2.2e-16
# again, the order does not matter
summary(lm(formula = Height ~ Sex * sqrt_Age,
data = elephants_young))
##
## Call:
## lm(formula = Height ~ Sex * sqrt_Age, data = elephants_young)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.858 -9.823 -0.169 10.008 34.352
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 74.9352 3.9602 18.922 <2e-16 ***
## SexM -0.4769 5.2105 -0.092 0.927
## sqrt_Age 39.3462 1.7407 22.604 <2e-16 ***
## SexM:sqrt_Age 2.6373 2.3062 1.144 0.255
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.49 on 161 degrees of freedom
## Multiple R-squared: 0.8885, Adjusted R-squared: 0.8864
## F-statistic: 427.6 on 3 and 161 DF, p-value: < 2.2e-16
Summary: the order of factors matters for type I ANOVA - only use type I if you have a really good reason to do so (it is also fine if you are testing an interaction between two categorical variables with a fully balanced experimental design). In this case, there is no significant interaction between age and sex, so applying type II ANOVA provided a more sensitive test for the effect of sex than type III.
Now, let us subset more mature elephants, defined as 12.5 years old or older Let us examine the combined effect of age and sex on the body height in this age cohort.
# subset individuals that are >=12.5 years old
indices_of_mature_individuals <- elephants$Age >= 12.5
elephants_mature <- elephants[indices_of_mature_individuals,]
# plot the data
ggplot(data = elephants_mature,
mapping = aes(x = sqrt_Age,
y = Height,
color = Sex)) +
geom_point() +
scale_color_manual(values = c("darkgreen", "cornflowerblue"))
# perform type I ANOVA with Age first and Sex second
anova(lm(formula = Height ~ sqrt_Age * Sex,
data = elephants_mature))
## Analysis of Variance Table
##
## Response: Height
## Df Sum Sq Mean Sq F value Pr(>F)
## sqrt_Age 1 10067 10066.8 27.989 8.669e-07 ***
## Sex 1 20301 20300.8 56.442 4.290e-11 ***
## sqrt_Age:Sex 1 1762 1761.7 4.898 0.02945 *
## Residuals 89 32011 359.7
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# perform type I ANOVA with Sex first and Age second - note the different p-values!
anova(lm(formula = Height ~ Sex * sqrt_Age,
data = elephants_mature))
## Analysis of Variance Table
##
## Response: Height
## Df Sum Sq Mean Sq F value Pr(>F)
## Sex 1 15821 15821.0 43.987 2.486e-09 ***
## sqrt_Age 1 14547 14546.7 40.444 8.490e-09 ***
## Sex:sqrt_Age 1 1762 1761.7 4.898 0.02945 *
## Residuals 89 32011 359.7
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# perform type II ANOVA with Age first and Sex second
Anova(lm(formula = Height ~ sqrt_Age * Sex,
data = elephants_mature),
type = "II")
## Anova Table (Type II tests)
##
## Response: Height
## Sum Sq Df F value Pr(>F)
## sqrt_Age 14547 1 40.444 8.49e-09 ***
## Sex 20301 1 56.442 4.29e-11 ***
## sqrt_Age:Sex 1762 1 4.898 0.02945 *
## Residuals 32011 89
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# perform type II ANOVA with Sex first and Age second - note that the result is the same as above
Anova(lm(formula = Height ~ Sex * sqrt_Age,
data = elephants_mature),
type = "II")
## Anova Table (Type II tests)
##
## Response: Height
## Sum Sq Df F value Pr(>F)
## Sex 20301 1 56.442 4.29e-11 ***
## sqrt_Age 14547 1 40.444 8.49e-09 ***
## Sex:sqrt_Age 1762 1 4.898 0.02945 *
## Residuals 32011 89
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# perform type III ANOVA with Age first and Sex second - note the higher p-value for Sex than with type II
Anova(lm(formula = Height ~ sqrt_Age * Sex,
data = elephants_mature),
type = "III")
## Anova Table (Type III tests)
##
## Response: Height
## Sum Sq Df F value Pr(>F)
## (Intercept) 21372 1 59.4218 1.714e-11 ***
## sqrt_Age 5546 1 15.4200 0.0001695 ***
## Sex 524 1 1.4567 0.2306592
## sqrt_Age:Sex 1762 1 4.8980 0.0294455 *
## Residuals 32011 89
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# again, the order does not matter
Anova(lm(formula = Height ~ Sex * sqrt_Age,
data = elephants_mature),
type = "III")
## Anova Table (Type III tests)
##
## Response: Height
## Sum Sq Df F value Pr(>F)
## (Intercept) 21372 1 59.4218 1.714e-11 ***
## Sex 524 1 1.4567 0.2306592
## sqrt_Age 5546 1 15.4200 0.0001695 ***
## Sex:sqrt_Age 1762 1 4.8980 0.0294455 *
## Residuals 32011 89
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# running summary() produces p-values equivalent to type III ANOVA, but it uses t-statistics
summary(lm(formula = Height ~ sqrt_Age * Sex,
data = elephants_mature))
##
## Call:
## lm(formula = Height ~ sqrt_Age * Sex, data = elephants_mature)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.282 -12.747 1.113 12.962 53.491
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 148.948 19.322 7.709 1.71e-11 ***
## sqrt_Age 16.283 4.147 3.927 0.000169 ***
## SexM -39.447 32.684 -1.207 0.230659
## sqrt_Age:SexM 15.944 7.204 2.213 0.029446 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.97 on 89 degrees of freedom
## Multiple R-squared: 0.5009, Adjusted R-squared: 0.4841
## F-statistic: 29.78 on 3 and 89 DF, p-value: 2.01e-13
# again, the order does not matter
summary(lm(formula = Height ~ Sex * sqrt_Age,
data = elephants_mature))
##
## Call:
## lm(formula = Height ~ Sex * sqrt_Age, data = elephants_mature)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.282 -12.747 1.113 12.962 53.491
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 148.948 19.322 7.709 1.71e-11 ***
## SexM -39.447 32.684 -1.207 0.230659
## sqrt_Age 16.283 4.147 3.927 0.000169 ***
## SexM:sqrt_Age 15.944 7.204 2.213 0.029446 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.97 on 89 degrees of freedom
## Multiple R-squared: 0.5009, Adjusted R-squared: 0.4841
## F-statistic: 29.78 on 3 and 89 DF, p-value: 2.01e-13
Summary: the order of factors matters for type I ANOVA - only use type I if you have a really good reason to do so (it is also fine if you are testing an interaction between two categorical variables with a fully balanced experimental design). In this case, there is a significant interaction between age and sex, so applying type III ANOVA provides a more appropriate test of main effects than type II.
Library interactions
provides a handy function interact_plot()
to plot regression lines that take into account interactions. Importantly, it returns a ggplot
object, which you can tweak to suit your aesthetic and scientific needs!
# specify the linear regression models
elephants_young_model <- lm(formula = Height ~ sqrt_Age * Sex,
data = elephants_young)
elephants_mature_model <- lm(formula = Height ~ sqrt_Age * Sex,
data = elephants_mature)
# plot
interact_plot(elephants_young_model,
pred = sqrt_Age,
modx = Sex,
plot.points = TRUE)
interact_plot(elephants_mature_model,
pred = sqrt_Age,
modx = Sex,
plot.points = TRUE)
# this is a ggplot object, so you can modify it using the syntax you already know
interact_plot(elephants_mature_model,
pred = sqrt_Age,
modx = Sex,
plot.points = TRUE) +
scale_color_manual(values = c("darkgreen", "cornflowerblue")) +
theme_grey()
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.