In general, statistics are numerical summaries of data that capture their structure, such as measures of central tendency and dispersion.
Whitlock & Schluter, Chapter 3
So, statistics are estimates of population parameters based on random samples, and we use random variables to represent them.
We use mathematical shorthand to describe essential concepts.
Most of the math we will use will involve addition, subtraction, multiplication, and division. We will also use exponentiation, logarithms, and integrals (which are just sums, extended to continuous variables).
There are certain conventions that you should be aware of and get used to as a matter of habit. For example:
These are typically represented with Greek letters:
Here, \(X\) represents all of the measurements in the population. \(E(X)\) is read as “the expected value of \(X\)”.
These are instead represented with Latin letters:
In general, notation varies somewhat among different texts. We will almost certainly abuse formal notation throughout the course.
The most basic descriptor of a sample is a measure of its representative value. The most familiar measures of central tendency are the mean and median.
What are their essential properties and how do they differ?
\[\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n}\]
Another way to describe the mean is that it is an estimator of the expected value of all the elements in a population (i.e. the population average \(\mu\)).
The mean is an unbiased estimator, meaning that in the limit, the mean of the sampling distribution of \(\bar{X}\) is equal to the population parameter \(\mu\). We will talk about the sampling distribution of the mean later when we consider the Central Limit Theorem.
It is, however, very sensitive to outliers: extreme values (with large deviations from the mean) make a big contribution.
The mean has a finite-sample breakdown point of \(1/n\). This means that a single arbitrarily extreme observation (a fraction \(1/n\) of the data) can make the sample mean arbitrarily large or small, irrespective of where the bulk of the data lie.
To determine which is the best measure of central tendency, we need to determine if the population has a lot of outliers that would distort our estimate of the population mean, and whether it is strongly skewed.
If data are symmetrically distributed, the mean and median are approximately the same, but they begin to diverge when this is not the case.
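The outlier sensitivity of the mean, and the robustness of the median, can be seen in a small sketch (shown here in Python for illustration; in R the corresponding functions are mean() and median(), and the data are invented):

```python
# Replacing a single value with an extreme outlier shifts the mean
# substantially, while the median is unchanged. Data are made up.
import statistics

data = [4.1, 4.5, 4.8, 5.0, 5.2, 5.5, 5.9]
mean_clean = statistics.mean(data)       # 5.0
median_clean = statistics.median(data)   # 5.0

# Contaminate 1/n of the data (one of seven values) with an outlier:
data_outlier = data[:-1] + [59.0]
mean_out = statistics.mean(data_outlier)      # jumps to ~12.6
median_out = statistics.median(data_outlier)  # still 5.0
```

One extreme point, a proportion of only \(1/n\) of the sample, moves the mean by a factor of more than two here, while the median does not move at all.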
In some cases, other measures of central tendency can be useful. We will not use these here, but it’s good to know they exist, in case you find a need for them later.
These methods are robust estimators of location that eliminate or uniformly transform outliers. In biology, these are rarely used, since we are often very interested in unusual values and removing or altering them could lead to loss of informative data points.
Trimmed mean: The mean calculated after removal of outliers, i.e. extreme sample measurements that fall outside of the central bulk of the data.
Winsorized mean: The mean calculated after setting all extreme values equal to the boundary values of the allowed central range.
In both cases, the central part of an ordered sample is defined as \(1-2\kappa\), where \(0 < \kappa < 0.5\) and \(\kappa\) is the proportion of values to be trimmed from each tail of the distribution. For example, if 20% of the values in each tail are to be considered outliers, then \(\kappa = 0.2\) and all values in the bottom or top 20% will be either removed (trimmed mean) or set equal to the 20th or 80th percentile (Winsorized mean) of all values.
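Both robust means can be sketched in a few lines (Python for illustration; the data and function names are invented, and in R one could use mean(x, trim = kappa) for the trimmed mean):

```python
# Trimmed vs. Winsorized mean with kappa = 0.2 and n = 10:
# the 2 smallest and 2 largest values are removed (trimmed) or
# replaced by the nearest retained value (Winsorized).
import statistics

def trimmed_mean(values, kappa):
    x = sorted(values)
    k = int(len(x) * kappa)              # number dropped from each tail
    return statistics.mean(x[k:len(x) - k])

def winsorized_mean(values, kappa):
    x = sorted(values)
    k = int(len(x) * kappa)
    x[:k] = [x[k]] * k                   # floor the bottom tail
    x[len(x) - k:] = [x[len(x) - k - 1]] * k  # cap the top tail
    return statistics.mean(x)

data = [1, 12, 13, 14, 15, 15, 16, 17, 18, 100]
# Ordinary mean = 22.1, dragged up by the outlier 100;
# both robust means land near the bulk of the data (15.0).
```

Here the outliers 1 and 100 pull the ordinary mean to 22.1, while both robust estimators recover the central value of 15.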
If the phenomena of interest involve a multiplicative process or reciprocal values, then the arithmetic mean is not an appropriate measure of central tendency. In these cases, sample means can be calculated using a log transform to calculate products (geometric mean) or using the inverse of the arithmetic mean of the reciprocals (harmonic mean).
Processes that can be described using exponentials, such as growth rates, require the geometric mean. The geometric mean is the antilog (exponential) of the arithmetic mean of the log-transformed measurements (because multiplication becomes additive after taking the log). Example: you want to find the average growth rate of wild-type and mutant yeast to evaluate the fitness of specific mutants evolved in a chemostat with different carbon sources.
The harmonic mean is used for processes that are described by an inverse relation, such as velocity (e.g. \(v = d/t\), where \(d\) is distance traveled and \(t\) is the time). Example: you want to measure how long it takes a wild-type fly or one with an olfactory mutation to traverse from a fixed starting point to the source of the odor. The average time traveled for a fixed distance is then the distance divided by the harmonic mean of the speed, e.g. seconds = mm / (average mm/sec).
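Both alternative means are easy to compute directly (Python for illustration; in R one can use exp(mean(log(x))) and 1/mean(1/x), and the growth factors and speeds below are invented):

```python
# Geometric mean: back-transformed average of the logs.
# Harmonic mean: reciprocal of the mean of the reciprocals.
import math
import statistics

# Hypothetical per-generation fold-changes of a yeast culture:
growth = [2.0, 8.0]
geo = math.exp(statistics.mean(math.log(g) for g in growth))
# geo = exp((log 2 + log 8)/2) = 4.0, not the arithmetic mean 5.0

# Hypothetical fly speeds (mm/sec) over two runs of a fixed distance:
speeds = [2.0, 6.0]
harm = statistics.harmonic_mean(speeds)  # 2 / (1/2 + 1/6) = 3.0
```

Note that the arithmetic mean of the speeds (4.0 mm/sec) would understate the average travel time; the harmonic mean (3.0 mm/sec) gives the correct time for a fixed distance.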
Measures of scale are used to quantify data variability and dispersion (i.e. how spread out the data are).
A simple way to calculate dispersion is to sum up the absolute difference between all the individual measurements and some measure of central tendency (either the sample mean or sample median), and then take the mean or median of these values.
These measures are called the mean or median absolute deviation around the mean or median, and you might see any combination of these abbreviated as MAD (which can be confusing!).
The median absolute deviation around the median (MAD median) is a measure of the spread of the data that is robust to outliers. It is defined as the median of the absolute distances between each of the data points and the sample median: \(\mathrm{MAD} = \mathrm{median}_i\,|x_i - \mathrm{median}(X)|\).
To estimate the standard deviation of a distribution using the MAD, it is multiplied by a constant \(c\) as a scaling factor: \(s \approx c \times \mathrm{MAD}\). For a normal distribution, \(c = 1.4826\) and \(s \approx 1.4826 \times \mathrm{MAD}\).
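A minimal sketch of the MAD and its scaled version (Python for illustration, with made-up data; note that R's mad() applies the 1.4826 factor by default):

```python
# Median absolute deviation around the median, scaled by 1.4826
# to approximate the standard deviation under normality.
import statistics

def mad(values):
    m = statistics.median(values)
    return statistics.median(abs(x - m) for x in values)

data = [2, 4, 5, 6, 8, 9, 100]       # one extreme outlier
raw_mad = mad(data)                  # robust spread = 2
sigma_hat = 1.4826 * raw_mad         # approximate sd, ~2.97
```

The single outlier (100) would inflate the ordinary standard deviation enormously, but the MAD-based estimate stays close to the spread of the bulk of the data.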
\[\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2\]
\[s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{X})^2}{n-1}\]
Here \(x_i\) represents an individual measurement, \(N\) represents the number of items in the total population, and \(n\) represents the total number of sampled items.
For the sample variance, we use \(n-1\) in the denominator because, after using all \(n\) sample measurements to calculate the mean, only \(n-1\) independent deviations remain, i.e. \(n-1\) degrees of freedom.
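The difference between the \(N\) and \(n-1\) denominators can be checked directly (Python for illustration, with made-up data; in R, var() uses the \(n-1\) denominator):

```python
# Population variance divides by N; sample variance divides by n-1.
import statistics

x = [2, 4, 4, 4, 5, 5, 7, 9]         # hypothetical measurements
pop_var = statistics.pvariance(x)    # sum of squares / 8  = 4.0
samp_var = statistics.variance(x)    # sum of squares / 7  ~ 4.57
```

The sum of squared deviations here is 32, so dividing by \(N = 8\) gives 4.0, while dividing by \(n-1 = 7\) gives the slightly larger, unbiased sample estimate.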
The variance, based on the sum of squared deviations, is the most commonly used measure of dispersion and has useful mathematical properties.
It should be noted that, like the mean, the variance is very sensitive to outliers and thus is not a robust measure of dispersion.
The figure shows the difference in the amount of data captured by the IQR vs. the standard deviation:
The \(z\)-score represents the number of standard deviations that some value is away from the mean.
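Computing a \(z\)-score is a one-liner (Python for illustration, with made-up data; in R: scale(), or (x - mean(x)) / sd(x)):

```python
# z = (value - mean) / standard deviation:
# the number of standard deviations a value lies from the mean.
import statistics

x = [2, 4, 4, 4, 5, 5, 7, 9]         # hypothetical measurements
mu = statistics.mean(x)              # 5.0
s = statistics.pstdev(x)             # population sd = 2.0
z = (9 - mu) / s                     # 9 lies 2 sd above the mean
```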
The functions for computing basic descriptive statistics in R are:
| Measures | R commands |
|---|---|
| Mean, \(\bar{X}\) | mean() |
| Variance, \(s^2\) | var() |
| Std Dev, \(s\) | sd() |
| IQR | IQR() |
| Mean, Median, Quartiles, Min, Max | summary() |
Authors: Kris Gunsalus & Manpreet Katari