Primer on interactions

Primer on model formulae in R

Let Y be the dependent (response) variable, and A and B two independent (predictor) variables.

Notation Meaning
Y ~ 1 Y is not being predicted by any independent variables.
Y ~ A Y is being predicted by A.
Y ~ A + B Y is being predicted by A and B independently of each other.
Y ~ A : B Y is being predicted by an interaction between A and B.
Y ~ A * B The same as Y ~ A + B + A:B.

Note: this is not an extensive list of notations, please refer to the R manual for additional details.

Types of ANOVA

You may remember that in application to a single categorical variable, ANOVA is all about comparing between-group variance to within-group variance. When it comes to multiple variables, things get a little tricky. In such situations, we use ANOVA to compare between-model variance to within-model variance, where models are linear regression models with different sets of independent variables included in them. There are always two models in a comparison. The first one is simpler, i.e. it includes fewer variables or interactions. It is also referred to as reduced model. The second one includes an extra variable or interaction and it is called full model. There are several strategies to formulate these models, which is why we distinguish between three types of multivariate ANOVA.


How type I ANOVA [anova()] processes Y ~ A*B:

Step Reduced (simpler) mode Full (more complex) model Relevant output
1 Y ~ 1 Y ~ 1 + A Pr(>F) for A
2 Y ~ 1 + A Y ~ 1 + A + B Pr(>F) for B
3 Y ~ 1 + A + B Y ~ 1 + A + B + A:B Pr(>F) for A:B

Note: type I ANOVA does not compare Y ~ 1 with Y ~ 1 + B. In other words, it does not estimate the main effect of B. So, the order in which you specify independent variables matters, and Y ~ A*B and Y ~ B*A may produce different results. In most cases, you do not want to make any assumptions about the relative importance of your variables and you should not use type I ANOVA.


How type II ANOVA [car::Anova(...,type="II")] processes Y ~ A*B:

Step Reduced (simpler) mode Full (more complex) model Relevant output
1 Y ~ 1 + B Y ~ 1 + A + B Pr(>F) for A
2 Y ~ 1 + A Y ~ 1 + A + B Pr(>F) for B
3 Y ~ 1 + A + B Y ~ 1 + A + B + A:B Pr(>F) for A:B

Note: no interaction is assumed in the first two comparisons. If there is significant interaction, the results may be misleading. But as long as there is no significant interaction, type II ANOVA is usually the most appropriate and powerful type of ANOVA.


How type III ANOVA [car::Anova(...,type="III") and summary.lm()] processes Y ~ A*B:

Step Reduced (simpler) mode Full (more complex) model Relevant output
1 Y ~ 1 + B + A:B Y ~ 1 + A + B + A:B Pr(>F) for A
2 Y ~ 1 + A + A:B Y ~ 1 + A + B + A:B Pr(>F) for B
3 Y ~ 1 + A + B Y ~ 1 + A + B + A:B Pr(>F) for A:B

Note: If there is no significant interaction, type III is less powerful than type II. If there is significant interaction, type III ANOVA is usually the most appropriate and powerful type of ANOVA.

The elephants data set

We will work with the data set where researchers measured the height of male and female African elephants of different ages from birth to adulthood.

First, load and plot the data.

# load the data
elephants <- read.csv('Data_Elephants.csv')

# look at the data
str(elephants)
## 'data.frame':    288 obs. of  3 variables:
##  $ Age   : num  1.4 17.5 12.8 11.2 12.7 ...
##  $ Height: num  120 227 235 210 220 ...
##  $ Sex   : chr  "M" "M" "M" "M" ...
head(elephants)
##     Age Height Sex
## 1  1.40    120   M
## 2 17.50    227   M
## 3 12.75    235   M
## 4 11.17    210   M
## 5 12.67    220   M
## 6 12.67    189   M
# plot Height as the function of Age, and color the two sexes separately
ggplot(data = elephants,
       mapping = aes(x = Age,
                     y = Height,
                     color = Sex)) +
  geom_point() +
  scale_color_manual(values = c("darkgreen", "cornflowerblue"))

The relationship is obviously not linear. Try transforming one or both variables to make the relationship (sort of) linear.

# add a new column that contains the square root of Age
elephants$sqrt_Age <- sqrt(elephants$Age)

# plot the data
ggplot(data = elephants,
       mapping = aes(x = sqrt_Age,
                     y = Height,
                     color = Sex)) +
  geom_point() +
  scale_color_manual(values = c("darkgreen", "cornflowerblue"))

Effect of age and sex on height in young elephants

First, let us subset young elephants, defined as 11 years old or younger. Let us examine the combined effect of age and sex on the body height in this age cohort.

# subset individuals that are <=11 years old
indices_of_young_individuals <- elephants$Age <= 11
elephants_young <- elephants[indices_of_young_individuals,]

# plot the data
ggplot(data = elephants_young,
       mapping = aes(x = sqrt_Age,
                     y = Height,
                     color = Sex)) +
  geom_point() +
  scale_color_manual(values = c("darkgreen", "cornflowerblue"))

# perform type I ANOVA with Age first and Sex second
anova(lm(formula = Height ~ sqrt_Age * Sex,
         data = elephants_young))
## Analysis of Variance Table
## 
## Response: Height
##               Df Sum Sq Mean Sq   F value  Pr(>F)    
## sqrt_Age       1 268131  268131 1276.8417 < 2e-16 ***
## Sex            1    967     967    4.6057 0.03336 *  
## sqrt_Age:Sex   1    275     275    1.3077 0.25452    
## Residuals    161  33809     210                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# perform type I ANOVA with Sex first and Age second - note the different p-values!
anova(lm(formula = Height ~ Sex * sqrt_Age,
         data = elephants_young))
## Analysis of Variance Table
## 
## Response: Height
##               Df Sum Sq Mean Sq   F value Pr(>F)    
## Sex            1    358     358    1.7062 0.1933    
## sqrt_Age       1 268740  268740 1279.7412 <2e-16 ***
## Sex:sqrt_Age   1    275     275    1.3077 0.2545    
## Residuals    161  33809     210                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# perform type II ANOVA with Age first and Sex second
Anova(lm(formula = Height ~ sqrt_Age * Sex,
         data = elephants_young),
      type = "II")
## Anova Table (Type II tests)
## 
## Response: Height
##              Sum Sq  Df   F value  Pr(>F)    
## sqrt_Age     268740   1 1279.7412 < 2e-16 ***
## Sex             967   1    4.6057 0.03336 *  
## sqrt_Age:Sex    275   1    1.3077 0.25452    
## Residuals     33809 161                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# perform type II ANOVA with Sex first and Age second - note that the result is the same as above
Anova(lm(formula = Height ~ Sex * sqrt_Age,
         data = elephants_young),
      type = "II")
## Anova Table (Type II tests)
## 
## Response: Height
##              Sum Sq  Df   F value  Pr(>F)    
## Sex             967   1    4.6057 0.03336 *  
## sqrt_Age     268740   1 1279.7412 < 2e-16 ***
## Sex:sqrt_Age    275   1    1.3077 0.25452    
## Residuals     33809 161                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# perform type III ANOVA with Age first and Sex second - note the higher p-value for Sex than with type II
Anova(lm(formula = Height ~ sqrt_Age * Sex,
         data = elephants_young),
      type = "III")
## Anova Table (Type III tests)
## 
## Response: Height
##              Sum Sq  Df  F value Pr(>F)    
## (Intercept)   75189   1 358.0499 <2e-16 ***
## sqrt_Age     107292   1 510.9267 <2e-16 ***
## Sex               2   1   0.0084 0.9272    
## sqrt_Age:Sex    275   1   1.3077 0.2545    
## Residuals     33809 161                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# again, the order does not matter
Anova(lm(formula = Height ~ Sex * sqrt_Age,
         data = elephants_young),
      type = "III")
## Anova Table (Type III tests)
## 
## Response: Height
##              Sum Sq  Df  F value Pr(>F)    
## (Intercept)   75189   1 358.0499 <2e-16 ***
## Sex               2   1   0.0084 0.9272    
## sqrt_Age     107292   1 510.9267 <2e-16 ***
## Sex:sqrt_Age    275   1   1.3077 0.2545    
## Residuals     33809 161                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# running summary() produces p-values equivalent to type III ANOVA, but it uses t-statistics
summary(lm(formula = Height ~ sqrt_Age * Sex,
         data = elephants_young))
## 
## Call:
## lm(formula = Height ~ sqrt_Age * Sex, data = elephants_young)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.858  -9.823  -0.169  10.008  34.352 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    74.9352     3.9602  18.922   <2e-16 ***
## sqrt_Age       39.3462     1.7407  22.604   <2e-16 ***
## SexM           -0.4769     5.2105  -0.092    0.927    
## sqrt_Age:SexM   2.6373     2.3062   1.144    0.255    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.49 on 161 degrees of freedom
## Multiple R-squared:  0.8885, Adjusted R-squared:  0.8864 
## F-statistic: 427.6 on 3 and 161 DF,  p-value: < 2.2e-16
# again, the order does not matter
summary(lm(formula = Height ~ Sex * sqrt_Age,
         data = elephants_young))
## 
## Call:
## lm(formula = Height ~ Sex * sqrt_Age, data = elephants_young)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.858  -9.823  -0.169  10.008  34.352 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    74.9352     3.9602  18.922   <2e-16 ***
## SexM           -0.4769     5.2105  -0.092    0.927    
## sqrt_Age       39.3462     1.7407  22.604   <2e-16 ***
## SexM:sqrt_Age   2.6373     2.3062   1.144    0.255    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.49 on 161 degrees of freedom
## Multiple R-squared:  0.8885, Adjusted R-squared:  0.8864 
## F-statistic: 427.6 on 3 and 161 DF,  p-value: < 2.2e-16

Summary: the order of factors matters for type I ANOVA - only use type I if you have a really good reason to do so (it is also fine if you are testing an interaction between two categorical variables with a fully balanced experimental design). In this case, there is no significant interaction between age and sex, so applying type II ANOVA provided a more sensitive test for the effect of sex than type III.

Effect of age and sex on height in more mature elephants

Now, let us subset more mature elephants, defined as 12.5 years old or older Let us examine the combined effect of age and sex on the body height in this age cohort.

# subset individuals that are >=12.5 years old
indices_of_mature_individuals <- elephants$Age >= 12.5
elephants_mature <- elephants[indices_of_mature_individuals,]

# plot the data
ggplot(data = elephants_mature,
       mapping = aes(x = sqrt_Age,
                     y = Height,
                     color = Sex)) +
  geom_point() +
  scale_color_manual(values = c("darkgreen", "cornflowerblue"))

# perform type I ANOVA with Age first and Sex second
anova(lm(formula = Height ~ sqrt_Age * Sex,
         data = elephants_mature))
## Analysis of Variance Table
## 
## Response: Height
##              Df Sum Sq Mean Sq F value    Pr(>F)    
## sqrt_Age      1  10067 10066.8  27.989 8.669e-07 ***
## Sex           1  20301 20300.8  56.442 4.290e-11 ***
## sqrt_Age:Sex  1   1762  1761.7   4.898   0.02945 *  
## Residuals    89  32011   359.7                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# perform type I ANOVA with Sex first and Age second - note the different p-values!
anova(lm(formula = Height ~ Sex * sqrt_Age,
         data = elephants_mature))
## Analysis of Variance Table
## 
## Response: Height
##              Df Sum Sq Mean Sq F value    Pr(>F)    
## Sex           1  15821 15821.0  43.987 2.486e-09 ***
## sqrt_Age      1  14547 14546.7  40.444 8.490e-09 ***
## Sex:sqrt_Age  1   1762  1761.7   4.898   0.02945 *  
## Residuals    89  32011   359.7                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# perform type II ANOVA with Age first and Sex second
Anova(lm(formula = Height ~ sqrt_Age * Sex,
         data = elephants_mature),
      type = "II")
## Anova Table (Type II tests)
## 
## Response: Height
##              Sum Sq Df F value   Pr(>F)    
## sqrt_Age      14547  1  40.444 8.49e-09 ***
## Sex           20301  1  56.442 4.29e-11 ***
## sqrt_Age:Sex   1762  1   4.898  0.02945 *  
## Residuals     32011 89                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# perform type II ANOVA with Sex first and Age second - note that the result is the same as above
Anova(lm(formula = Height ~ Sex * sqrt_Age,
         data = elephants_mature),
      type = "II")
## Anova Table (Type II tests)
## 
## Response: Height
##              Sum Sq Df F value   Pr(>F)    
## Sex           20301  1  56.442 4.29e-11 ***
## sqrt_Age      14547  1  40.444 8.49e-09 ***
## Sex:sqrt_Age   1762  1   4.898  0.02945 *  
## Residuals     32011 89                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# perform type III ANOVA with Age first and Sex second - note the higher p-value for Sex than with type II
Anova(lm(formula = Height ~ sqrt_Age * Sex,
         data = elephants_mature),
      type = "III")
## Anova Table (Type III tests)
## 
## Response: Height
##              Sum Sq Df F value    Pr(>F)    
## (Intercept)   21372  1 59.4218 1.714e-11 ***
## sqrt_Age       5546  1 15.4200 0.0001695 ***
## Sex             524  1  1.4567 0.2306592    
## sqrt_Age:Sex   1762  1  4.8980 0.0294455 *  
## Residuals     32011 89                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# again, the order does not matter
Anova(lm(formula = Height ~ Sex * sqrt_Age,
         data = elephants_mature),
      type = "III")
## Anova Table (Type III tests)
## 
## Response: Height
##              Sum Sq Df F value    Pr(>F)    
## (Intercept)   21372  1 59.4218 1.714e-11 ***
## Sex             524  1  1.4567 0.2306592    
## sqrt_Age       5546  1 15.4200 0.0001695 ***
## Sex:sqrt_Age   1762  1  4.8980 0.0294455 *  
## Residuals     32011 89                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# running summary() produces p-values equivalent to type III ANOVA, but it uses t-statistics
summary(lm(formula = Height ~ sqrt_Age * Sex,
         data = elephants_mature))
## 
## Call:
## lm(formula = Height ~ sqrt_Age * Sex, data = elephants_mature)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.282 -12.747   1.113  12.962  53.491 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    148.948     19.322   7.709 1.71e-11 ***
## sqrt_Age        16.283      4.147   3.927 0.000169 ***
## SexM           -39.447     32.684  -1.207 0.230659    
## sqrt_Age:SexM   15.944      7.204   2.213 0.029446 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.97 on 89 degrees of freedom
## Multiple R-squared:  0.5009, Adjusted R-squared:  0.4841 
## F-statistic: 29.78 on 3 and 89 DF,  p-value: 2.01e-13
# again, the order does not matter
summary(lm(formula = Height ~ Sex * sqrt_Age,
         data = elephants_mature))
## 
## Call:
## lm(formula = Height ~ Sex * sqrt_Age, data = elephants_mature)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.282 -12.747   1.113  12.962  53.491 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    148.948     19.322   7.709 1.71e-11 ***
## SexM           -39.447     32.684  -1.207 0.230659    
## sqrt_Age        16.283      4.147   3.927 0.000169 ***
## SexM:sqrt_Age   15.944      7.204   2.213 0.029446 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.97 on 89 degrees of freedom
## Multiple R-squared:  0.5009, Adjusted R-squared:  0.4841 
## F-statistic: 29.78 on 3 and 89 DF,  p-value: 2.01e-13

Summary: the order of factors matters for type I ANOVA - only use type I if you have a really good reason to do so (it is also fine if you are testing an interaction between two categorical variables with a fully balanced experimental design). In this case, there is a significant interaction between age and sex, so applying type III ANOVA provides a more appropriate test of main effects than type II.

Plotting interactions

Library interactions provides a handy function interact_plot() to plot regression lines that take into account interactions. Importantly, it returns a ggplot object, which you can tweak to suit your aesthetic and scientific needs!

# specify the linear regression models
elephants_young_model <- lm(formula = Height ~ sqrt_Age * Sex,
                            data = elephants_young)

elephants_mature_model <- lm(formula = Height ~ sqrt_Age * Sex,
                             data = elephants_mature)

# plot
interact_plot(elephants_young_model,
              pred = sqrt_Age,
              modx = Sex,
              plot.points = TRUE)

interact_plot(elephants_mature_model,
              pred = sqrt_Age,
              modx = Sex,
              plot.points = TRUE)

# this is a ggplot object, so you can modify it using the syntax you already know
interact_plot(elephants_mature_model,
              pred = sqrt_Age,
              modx = Sex,
              plot.points = TRUE) +
  scale_color_manual(values = c("darkgreen", "cornflowerblue")) +
  theme_grey()
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.