The following example is taken from the Khan Academy videos on analysis of variance (ANOVA).

We previously discussed how we can use the \(t\)-test to determine if two sample distributions come from populations with the same mean (in which case, assuming equal variances, we can say that they come from the same population).

In many cases, we will have multiple sample groups and we will want to ask a similar question: are the means of the different samples the same?

To answer this question we will look at a very simple case with three conditions – \(a\), \(b\), and \(c\) – and ask if their means are significantly different.

# measurements for three conditions
a=c(3,2,1)
b=c(5,3,4)
c=c(5,6,7)

# combine the data into a 3x3 matrix (one column per condition)
anova_mat = cbind(a, b, c)
anova_mat
##      a b c
## [1,] 3 5 5
## [2,] 2 3 6
## [3,] 1 4 7

Let’s take a quick look at the data using a boxplot.
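The plot can be drawn with base R. The following is a minimal sketch, assuming the three groups are combined column-wise; the names `groups` and `bp` are illustrative and not part of the original example.

```r
# measurements for three conditions (same values as above)
a = c(3,2,1)
b = c(5,3,4)
c = c(5,6,7)

# one box per condition; cbind() keeps the variable names as column labels
groups = cbind(a, b, c)   # illustrative name
bp = boxplot(groups, xlab="condition", ylab="measurement")

# the thick line in each box is the group median (third row of bp$stats)
bp$stats[3,]
```

`boxplot` draws the figure and also returns the statistics it used, which is handy for checking what the plot shows.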

Looking at the boxplots above, it is clear that the groups are centered at different values. So the question we want to ask is: are the differences between the group means significant?

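Instead of looking at the difference between the sample means, as we did with the \(t\)-test, we will compare variances. There are three different measures of variation (sums of squares) that we can calculate: SST (the total sum of squares), SSW (the within-group sum of squares), and SSB (the between-group sum of squares).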

We also need the degrees of freedom: given that you know the mean, how many values do you still need to know? The answer is one less than the number of items being considered for each comparison, because the last value can always be calculated from the mean and the other values.
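To make this concrete, here is a small sketch using group `a` from above: once the mean and all but one of the values are known, the remaining value is fixed. The names `known` and `last_value` are illustrative.

```r
a = c(3,2,1)
a_mean = mean(a)

# suppose only the first two values and the mean are known
known = a[1:2]

# the sum of all values must equal n * mean, so the last value is determined
last_value = length(a) * a_mean - sum(known)
last_value
```

Only two of the three values are free to vary, which is why the group has 3 - 1 = 2 degrees of freedom.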

To calculate SST, we simply take the difference of all the values from the overall mean, square them, and then take the sum.

# overall mean of the data
anova_mat_mean = mean(anova_mat)

# total variation = sum of squared deviations 
#                   of each data point from the overall mean
SST = sum((anova_mat - anova_mat_mean)^2)
SST
## [1] 30

Since this is a sample of the entire population, our degrees of freedom equal the total number of values minus one.

# total degrees of freedom = (# of data points) - 1
SST_df = length(anova_mat) - 1
SST_df
## [1] 8

SSW ( Within Group Sum of Squares ) = variation of the data within each group. Here we calculate the variation of each point relative to the mean of its own group and simply add up the squared differences across all the groups.

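# mean of each group (one group per column)
anova_mat_col_mean = colMeans(anova_mat)
anova_mat_col_mean
## a b c 
## 2 4 6 

SSW=0
for ( i in 1:ncol(anova_mat) ) {
  # add the squared deviations of group i from its own mean
  SSW = SSW + sum((anova_mat[,i] - anova_mat_col_mean[[i]])^2)
}
SSW
## [1] 6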

When calculating the degrees of freedom, remember that we calculated the sum of squared differences relative to each group’s own mean, so if we have m groups and n samples in each group, then df = m*(n-1).

# within-group degrees of freedom = m*(n-1)
SSW_df = ncol(anova_mat) * (nrow(anova_mat) - 1)
SSW_df
## [1] 6

SSB ( Between Group Sum of Squares ) = variation of the group means around the overall mean. First, we find the squared difference between each group mean and the overall mean. We then multiply each squared difference by the number of values in the group, so that every original data point contributes one term to the sum, and add up the results.

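SSB = 0
for ( i in 1:ncol(anova_mat) ) {
  # squared deviation of group mean i from the overall mean,
  # counted once for each data point in the group
  SSB = SSB + nrow(anova_mat) * (anova_mat_col_mean[[i]] - anova_mat_mean)^2
}
SSB
## [1] 24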

For the between-group degrees of freedom, remember that we have m group means, so it is simply m-1.

# between-group degrees of freedom = m - 1
SSB_df = ncol(anova_mat) - 1
SSB_df
## [1] 2
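The three sums of squares are linked: the total variation splits into the within-group and between-group parts, SST = SSW + SSB, and the degrees of freedom add up in the same way. The sketch below checks this on the example data using vectorized equivalents of the loops; the names `m`, `grand_mean`, `group_means` and the `_check` variables are illustrative.

```r
a = c(3,2,1); b = c(5,3,4); c = c(5,6,7)
m = cbind(a, b, c)              # one column per group

grand_mean  = mean(m)
group_means = colMeans(m)

SST_check = sum((m - grand_mean)^2)
# sweep() subtracts each column's own mean from that column
SSW_check = sum(sweep(m, 2, group_means)^2)
SSB_check = sum(nrow(m) * (group_means - grand_mean)^2)

c(SST = SST_check, SSW = SSW_check, SSB = SSB_check)

# total = within + between, for the sums of squares and the degrees of freedom
all.equal(SST_check, SSW_check + SSB_check)
(length(m) - 1) == ncol(m) * (nrow(m) - 1) + (ncol(m) - 1)
```

If the two comparisons do not return `TRUE`, one of the sums of squares has been computed incorrectly.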

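Finally, since our variance calculations are sums of squares, each of them, once scaled by the population variance, can be considered to follow a \(\chi^2\) distribution (assuming normally distributed data). If the variance within each group is the same and the means of the groups are the same, then the variance estimated between the groups (SSB divided by its degrees of freedom) should be about the same as the variance estimated within the groups (SSW divided by its degrees of freedom).

We can take this one step further: if the variance between the groups is much greater than the variance within the groups, this is evidence that the means of the groups differ. Under the null hypothesis of equal means, the ratio of the two variance estimates follows an F-distribution, so a \(p\)-value can be calculated.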

Fortunately, R has a family of functions for the F-distribution just like for any other distribution!
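As a brief illustration, the functions follow R's usual naming scheme: `df` (density), `pf` (cumulative probability), `qf` (quantile) and `rf` (random draws). The degrees of freedom 2 and 6 used below are those of this example (m-1 and m*(n-1) with m = n = 3); the 5% significance level and the name `F_crit` are assumptions chosen for illustration.

```r
df1 = 2   # between-group degrees of freedom
df2 = 6   # within-group degrees of freedom

# critical value: F-ratios above this are significant at the 5% level
F_crit = qf(0.95, df1, df2)
F_crit

# upper-tail probability at the critical value (recovers 0.05)
pf(F_crit, df1, df2, lower.tail=FALSE)

# density of the F-distribution at 1
df(1, df1, df2)

# five random draws from the same distribution
set.seed(1)
rf(5, df1, df2)
```

Using `lower.tail=FALSE` gives the probability of observing a ratio at least as large as the one supplied, which is the quantity needed for the \(p\)-value.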

# F-statistic = (between-group mean square) / (within-group mean square)
Fstat = (SSB/SSB_df) / (SSW/SSW_df)
Fstat
## [1] 12

# probability (p-value) [df1 = df(B); df2 = df(W) ]
pf(Fstat, df1=SSB_df, df2=SSW_df, lower.tail=FALSE)
## [1] 0.008

We can confirm our results using the aov function.

library(reshape2)

# we use the melt function to reshape the matrix into a data frame with three columns:
# Var1 = the row index of each data point within its group (1, 2, 3)
# Var2 = the three groups, indexed by their variable name
# value = the value of each data point
anova_mat.melt = melt(anova_mat)
anova_mat.melt  # look at this new data structure
# look at the result of the ANOVA command `aov`
# the syntax is to do the analysis of the values in response to the factors (groups a,b,c)
summary(aov( value ~ Var2, data = anova_mat.melt ))
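The same check can be run without `reshape2`. The sketch below is an alternative to the `melt` approach, not the original method: it uses base R's `stack` to build the long-format data frame and then pulls the F value and \(p\)-value out of the ANOVA table. The names `long`, `fit`, `res`, `F_val` and `p_val` are illustrative.

```r
a = c(3,2,1); b = c(5,3,4); c = c(5,6,7)

# stack() builds a long data frame with columns `values` and `ind` (the group)
long = stack(data.frame(a, b, c))

# one-way ANOVA of the values in response to the group factor
fit = aov(values ~ ind, data = long)
res = summary(fit)
res

# extract the F value and p-value of the group effect from the ANOVA table
F_val = res[[1]][["F value"]][1]
p_val = res[[1]][["Pr(>F)"]][1]
F_val
p_val
```

The first row of the table describes the group effect and the second row the residuals, so the sums of squares reported there can be compared directly with SSB and SSW.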