At its core, statistics is about probabilities and probabilistic inference. So we need to build some common vocabulary to talk about probability.
Let’s start with some of the basics:
\(\Rightarrow\) Q: What is a variable?
For example, the formula for instantaneous velocity of a falling body is:
\[f(t) = gt\]
\(\Rightarrow\) Q: What is a deterministic model?
\(\Rightarrow\) Q: What is a probabilistic model?
For example, a probabilistic model for the instantaneous velocity of a falling body is:
\[f(t) = gt + \varepsilon\]
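To make the difference concrete, here is a minimal simulation sketch of this probabilistic model (the value g = 9.8 m/s², the measurement times, and the error standard deviation of 0.5 are illustrative assumptions, not part of the model itself):

# simulate noisy measurements of the velocity of a falling body: f(t) = g*t + error
g <- 9.8                              # acceleration due to gravity (m/s^2)
t <- seq(0, 5, by = 0.5)              # measurement times in seconds
deterministic <- g * t                # deterministic model: exact velocities
probabilistic <- g * t + rnorm(length(t), mean = 0, sd = 0.5)  # add random error
round(cbind(t, deterministic, probabilistic), 2)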
When we do experiments with an unknown outcome, we need a common framework for thinking about the process and outcomes. Let’s review some basic concepts:
\(\Rightarrow\) Q: What is a random variable?
\(\Rightarrow\) Q: What is a random trial?
\(\Rightarrow\) Q: What is an event?
\(\Rightarrow\) Q: What is a probability, in terms of random trials?
In mathematical notation, we can write:
\[P[A] = \frac{N(A)}{N}\]
Which is to say, “the probability of event \(A\)” equals the number of trials with outcome \(A\) divided by the total number of trials.
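For example, we can approximate the probability of rolling a six with a fair die by simulating many random trials and computing this proportion directly (a quick illustration of the definition; the number of trials is arbitrary):

# estimate P[A] = N(A)/N for the event A = "roll a six"
N <- 100000                                 # number of random trials
rolls <- sample(1:6, N, replace = TRUE)     # simulate the trials
N.A <- sum(rolls == 6)                      # number of trials with outcome A
N.A / N                                     # should be close to 1/6 = 0.167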
A probability distribution describes the probabilities of all possible outcomes of a random variable.
Discrete vs. continuous distributions describe situations in which the possible outcomes take a countable number of distinct values vs. any value within a continuous range.
To write mathematical formulas for distributions, we need some notation to describe random variables, outcomes, and probability distributions:
A probability density function (PDF) is a mathematical expression that describes the distribution of all possible outcomes for a random variable.
The density function, written as \(f(x)\), gives the height of a PDF for any outcome \(x\). Both types of PDFs will be valid \(iff\) (“if and only if”) the density is never negative and the total probability over all possible outcomes equals one.
In other words,
\[f(x) \ge 0 \text{ for all } x, \quad \text{and} \quad \sum_{i} f(x_i) = 1 \text{ (discrete)} \quad \text{or} \quad \int_{-\infty}^{\infty} f(x)dx = 1 \text{ (continuous)}.\]
Since all possible values of \(x\) are discrete, the density at any point, \(x_i\), is equivalent to the probability that our random variable \(X\) equals any individual value \(x_i\). Thus we can write:
\[f(x) = P(X=x),\ \ \ x \in X = \{x_1, x_2, ...\}.\]
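For instance, for a single fair six-sided die, \(f(x) = P(X=x) = 1/6\) for each face, and the densities sum to one, as required. A minimal check (the continuous case is included just to show the integral condition):

# discrete density (PMF) for one fair six-sided die
f <- rep(1/6, 6)                      # f(x) = P(X = x) = 1/6 for x = 1, ..., 6
names(f) <- 1:6
f                                     # density at each possible outcome
sum(f)                                # total probability equals 1

# a continuous density integrates to 1 instead of summing to 1
integrate(dnorm, -Inf, Inf)$value     # should be 1 (within numerical error)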
Usually, discrete distributions are illustrated using histograms, and continuous PDFs are illustrated using line graphs.
Let’s take a simple example of rolling two fair 6-sided dice. Below are two plots illustrating this scenario:
This is an example of an empirical distribution, which represents experimental outcomes rather than ideal outcomes.
# ============================================================================ #
# distributions for rolling a pair of dice
library(ggplot2)   # plotting
library(ggpubr)    # ggarrange(), for placing plots side by side

num.dice <- 2L
num.rolls <- 10000L

# simulate 10,000 rolls of two dice and take the sum of each pair
outcomes <- matrix(sample(1:6, num.dice * num.rolls, replace = TRUE),
                   nrow = num.rolls, ncol = num.dice)
sums <- rowSums(outcomes)
# barplot(table(sums))

dd <- data.frame(table(sums))   # frequency table of the sums

# frequency (count) histogram
freq.plot <- ggplot(dd) +
  geom_bar(aes(x = sums, y = Freq), stat = 'identity',
           fill = "peachpuff", color = "black") +
  labs(x = "Total face value", y = "Count") +
  theme_classic()

# relative frequency (probability) histogram
prop.plot <- ggplot(dd) +
  geom_bar(aes(x = sums, y = Freq / num.rolls), stat = 'identity',
           fill = "peachpuff", color = "black") +
  labs(x = "Total face value", y = "Probability") +
  theme_classic()

ggarrange(freq.plot, prop.plot, nrow = 1, ncol = 2)
Plots for variables that can take on any real value are often drawn using line plots, particularly for ideal distributions.
Empirical distributions for continuous data can also be plotted by binning the values and using histograms, which shows the variation in the actual data a little better. This is illustrated below using a large number of random samples from a standard normal distribution.
# ============================================================================ #
# standard normal distribution
set.seed(9-18-2021)   # seed the random number generator
                      # (this makes the simulation reproducible)
z.scores = seq(-5, 5, by = .01)       # a sequence of length 1001
samples  = rnorm(length(z.scores))    # 1001 random samples from a standard normal
hist(samples, prob = TRUE, col = "blanchedalmond",
     breaks = 40,
     xlim = c(-4, 4),
     xlab = "z-score",
     main = "Standard normal distribution\n(random samples and ideal curve)")
lines(density(samples), col = "cyan", lwd = 2)   # smoothed curve from the samples
curve(dnorm(x), add = TRUE,                      # ideal curve
      col = "darkblue", lwd = 2)
A really important fact to keep in mind is that, for a continuous variable, the probability of any single value of \(x\) is zero.
\(\Rightarrow\) Q: Why is this the case?
Note that for continuous distributions, we will see that the density may sometimes exceed one across a range of values. Nevertheless, the total area under any PDF is always equal to one.
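Both points can be illustrated numerically (a small sketch; the choice of points and interval widths is arbitrary):

# the probability of landing in a shrinking window around a single point goes to 0
x <- 1
for (delta in c(0.1, 0.01, 0.001, 0.0001)) {
  print(pnorm(x + delta) - pnorm(x - delta))   # -> 0 as delta -> 0
}
dnorm(x)                       # the density at x = 1 is positive (~0.24), but it is not a probability
dnorm(0, mean = 0, sd = 0.1)   # a narrow normal has density ~3.99 > 1 at its center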
The cumulative distribution function answers the question, “What is the probability of observing a value less than or equal to \(x\) as an outcome?” This is called a lower-tailed probability.
It can also be used to answer the question, “What is the probability of observing a value greater than \(x\)?” This is the upper-tailed probability and is obtained by subtracting the value of the CDF at \(x\) from 1.
The CDF for a random variable \(X\) is denoted \(F(x)\) and gives the lower-tail probability \(P(X \le x)\) for the corresponding PDF. This probability is given by the total area underneath a density, for all outcomes up to and including the value \(x\).
A CDF drawn using actual sampled data is called an empirical CDF.
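A minimal sketch of an empirical CDF, using R's built-in ecdf() function on a small random sample (the sample size and the variable name y.samp are arbitrary):

# empirical CDF from 100 standard normal samples, with the ideal CDF for comparison
y.samp <- rnorm(100)
plot(ecdf(y.samp), xlab = "z-score", ylab = "Total probability",
     main = "Empirical CDF (100 samples)")
curve(pnorm(x), add = TRUE, col = "darkblue", lwd = 2)   # ideal CDF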
For a discrete random variable, the total probability is the sum of the probabilities for each possible value from the smallest one up to the value of interest, \(x\):
\[ F(x) = P(X \le x) = \sum_{x_i \le x} f(x_i) \]
The CDF for the example of rolling two dice looks like this:
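One way to draw it from the dice simulation above (reusing the sums vector defined earlier; this is just a sketch, not the original figure code) is:

# empirical CDF for the sum of two dice, built from the simulated rolls above
probs <- table(sums) / length(sums)      # empirical probability of each sum (2..12)
cum.probs <- cumsum(probs)               # running total: F(x) = P(X <= x)
plot(as.numeric(names(cum.probs)), cum.probs, type = "s",
     xlab = "Total face value", ylab = "Total probability",
     main = "CDF for the sum of two dice")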
For a continuous random variable, the CDF gives the total probability that \(X\) will take a value of \(x\) or less.
We find the total probability using integration, which gives us the area under a curve:
\[ F(x) = P(X \le x) = \int_{-\infty}^x f(t)dt\]
Closed forms of some continuous distribution functions (such as a uniform distribution) allow solutions to be found without the need for integration.
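For example, a uniform distribution on \([a, b]\) has the closed form \(F(x) = (x-a)/(b-a)\) for \(a \le x \le b\), which we can check against R's built-in punif (the interval \([0, 10]\) and the value 7.5 are arbitrary choices):

# closed-form CDF of a uniform distribution on [0, 10]
a <- 0; b <- 10; x <- 7.5
(x - a) / (b - a)                 # closed-form answer: 0.75
punif(x, min = a, max = b)        # R's built-in CDF gives the same answer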
The figure¹ below illustrates these relationships:
Say we want to find the total probability that an observed outcome falls within an interval between two values. How do we find this?
For discrete distributions, we just sum up the individual probabilities for each possible outcome:
\[ P(x_a \lt X \le x_b) = \sum_{x_a \lt x_i \le x_b} f(x_i) = F(x_b) - F(x_a)\]
We find the total area under the curve for a continuous distribution by integrating our density function across the range of interest:
\[ P(x_a \lt X \le x_b) = \int_{x_a}^{x_b} f(x)dx = F(x_b) - F(x_a) = \int_{-\infty}^{x_b} f(x)dx - \int_{-\infty}^{x_a} f(x)dx\]
Say we want to find the central 95% of some normally distributed data:
y <- rnorm(100000)
den <- density(y) # Density estimation
# Middle 95% of data
value1 <- qnorm(0.025)
value2 <- qnorm(0.975)
# Lower and higher indices on the X-axis
l <- min(which(den$x >= value1))
h <- max(which(den$x < value2))
# Plot
plot(den,
xlab = "z-score",
main="Middle 95% of the data")
polygon(c(den$x[c(l, l:h, h)]),
c(0, den$y[l:h], 0),
col = "slateblue1")
To do this we would just take the area from \(-\infty\) to \(b\), and subtract the area from \(-\infty\) to \(a\):
par(mfrow=c(1,2))
# plot for the upper bound, b (97.5% quantile)
plot(den,
xlab = "z-score",
main="Area up to b=1.96 (q=0.975)\n(97.5% quantile)")
# total area less than or equal to b
value = value2
polygon(c(den$x[den$x <= value ], value),
c(den$y[den$x <= value ], 0),
col = "slateblue1",
border = 1)
# plot for the lower bound, a (2.5% quantile)
plot(den,
xlab = "z-score",
main="Area up to a=-1.96 (q=0.025)\n(2.5% quantile)")
# total area less than or equal to a
value = value1
polygon(c(den$x[den$x <= value ], value),
c(den$y[den$x <= value ], 0),
col = "slateblue1",
border = 1)
Fortunately, many distribution functions are already built into R, so we don’t usually have to worry about manual summation or integration to figure out probabilities! (R also provides the capability to perform integration directly, if you want to check that the built-in functions are giving you the correct answer.)
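For example, we can check the built-in CDF against direct numerical integration of the density (a quick sanity check):

# compare the built-in CDF with numerical integration of the density
pnorm(1.96) - pnorm(-1.96)             # built-in answer: about 0.95
integrate(dnorm, -1.96, 1.96)$value    # direct integration gives the same result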
R has four families of commands relating to distributions that you should become familiar with. For a normal distribution, these are: dnorm (the density function, or PDF), pnorm (the total probability function, or CDF), qnorm (the quantile function, or inverse CDF), and rnorm (random sampling from the distribution).
Understanding the relationship between the PDF, the CDF, and the inverse CDF is easier if we depict them graphically.
The d functions (here, dnorm) give the height of a discrete or continuous distribution (the PDF) at any particular value, whereas the p functions (pnorm) give the total probability (the CDF).
Note: this can be a little confusing at first, because one tends to automatically associate “PDF” with “pnorm”. You can help yourself by thinking “density” vs. “total probability” function.
par(mfrow=c(1,2))
z.scores = seq(-5,5,by=.01) # a sequence of length 1001
PDF = dnorm(z.scores)
CDF = pnorm(z.scores)
plot(z.scores, PDF,
xlim=c(-4,4),
type = "l",
xlab="Z-score", ylab="Density",
main = "Standard Normal PDF")
abline(v=0, lty="dashed", col="red")
plot(z.scores, CDF,
xlim=c(-4,4),
type = "l",
xlab="Z-score", ylab="Total probability",
main = "Standard Normal CDF")
abline(v=0, lty="dashed", col="red")
abline(h=0.5, lty="dashed", col="black")
The diagram below shows the shape of the CDF for a standard normal distribution, which represents the total probability as a function of the \(z\)-score.
Recall that we used the qnorm function last week to get the \(z\)-score for the bounds of our 95% confidence intervals. Conversely, we can use the CDF to get the total probability, or area under the curve, at each of these quantiles:
qnorm(0.975) # upper bound for 95%CI
## [1] 1.959964
qnorm(0.025) # lower bound for 95%CI (the negative of the upper bound, since the normal is symmetric)
## [1] -1.959964
pnorm(c(-1.959964,1.959964))
## [1] 0.025 0.975
We can see that these are really the inverse of each other by applying them sequentially:
pnorm(qnorm(0.975))
## [1] 0.975
qnorm(pnorm(1.959964))
## [1] 1.959964
You can verify this yourself in R:
# CDF and QF for the standard normal
# CDF for z-scores ranging from -3 to 3
pnorm(-3:3, mean=0, sd=1)
## [1] 0.001349898 0.022750132 0.158655254 0.500000000 0.841344746 0.977249868
## [7] 0.998650102
# QF for probabilities ranging from 0 to 1 in steps of 0.1
qnorm(seq(0,1, by=0.1), mean=0, sd=1)
## [1] -Inf -1.2815516 -0.8416212 -0.5244005 -0.2533471 0.0000000
## [7] 0.2533471 0.5244005 0.8416212 1.2815516 Inf
Say we want to find the central 95% of a normal distribution. We can use qnorm to find the \(z\)-scores corresponding to the lower 2.5% and upper 97.5% bounds of the data, which we already know should be around -1.96 and 1.96:
qnorm(c(0.025,0.975))
## [1] -1.959964 1.959964
We can plot this on the normal PDF, but we will really use the CDF to compute the area under the curve for these limits²:
Once we have the \(z\)-scores (or actual \(x\)-values for a non-standardized distribution), finding the total area is as simple as subtracting the values of pnorm() between the upper and lower bounds of the interval:
# compute area under the standard normal for central 95% of data
# the lower z-score is found by qnorm(0.025) and the upper by qnorm(0.975)
lower_bound = qnorm(0.025, mean=0, sd=1)
upper_bound = qnorm(0.975, mean=0, sd=1)
c(lower_bound, upper_bound)
## [1] -1.959964 1.959964
# total probability between lower and upper bound z-scores
pnorm(upper_bound) - pnorm(lower_bound)
## [1] 0.95
# this is the same
pnorm(qnorm(0.975)) - pnorm(qnorm(0.025))
## [1] 0.95
Say we are making compact cars, and we want to know what proportion of humans are taller than 195 cm and therefore might not fit in our car. Let's assume \(\mu = 169\) cm and \(\sigma = 14\) cm.
We can do this in two ways: we can use pnorm to get the upper tail probability directly (the probability of getting a value at least as high as the one observed), or we can get the lower tail probability and subtract it from 1.
# equivalent ways to find total probability X > x
x = 195   # height threshold in cm
pnorm(x, mean=169, sd=14, lower.tail=FALSE) # upper tail probability
## [1] 0.03164542
1 - pnorm(x, mean=169, sd=14, lower.tail=TRUE) # 1 - lower tail probability
## [1] 0.03164542
On the other extreme, some people are just too short to see above the steering wheel and reach the pedals.
Say we want to accommodate the middle 80% of the population and let the other 20% find other modes of transportation. What are the shortest and tallest people that can fit in the car?
# boundaries for 10%-90% of human heights
qnorm(c(0.1,0.9), mean=169, sd=14)
## [1] 151.0583 186.9417
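As a quick check (not in the original output), pnorm confirms that the proportion of the population between these two bounds is indeed 80%:

# proportion of the population between the two height bounds
pnorm(186.9417, mean = 169, sd = 14) - pnorm(151.0583, mean = 169, sd = 14)   # should be 0.8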
Say we want to compare a sample of the height of 100 healthy human individuals to a theoretical normal distribution.
\(\Rightarrow\) Q: Why might this particular question be of interest to us?
If the data do not conform approximately to a normal distribution, we cannot use tests that require this assumption and must therefore use other methods.
One way we can tell how well the shape of a distribution conforms to a theoretical distribution is to use a QQ plot, which compares theoretical quantiles on the x-axis with experimental quantiles on the y-axis.
To make the QQ plot of your data against a normal distribution, just use the qqnorm() function.
par(mfrow=c(2,2))
# draw four independent samples of 100 individuals each and compare to a normal
for (i in 1:4) {
  y <- rnorm(100)                     # sample 100 individuals
  qqnorm(y,
         xlab = "Theoretical Quantiles",
         ylab = "Sample Quantiles",
         main = "Normal Q-Q Plot")
  qqline(y)                           # add an ideal qq line for this sample
}
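For contrast, here is a sketch of what the QQ plot looks like when the data are clearly not normal (a right-skewed exponential sample is used as an arbitrary example):

# QQ plot for right-skewed (non-normal) data
y.skew <- rexp(100)          # 100 samples from an exponential distribution
qqnorm(y.skew, main = "Normal Q-Q Plot (skewed data)")
qqline(y.skew)               # points curve away from the line, especially in the upper tail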
¹ Image from a short tutorial by A. Kujowska in Towards Data Science.
² Image from a short tutorial by A. Kujowska in Towards Data Science.