Several relatively recent GWAS studies have determined that 80% of height is heritable, and that many loci contribute to height variation.
Let’s look at the heights of 1000 fathers and their adult sons (in inches). This dataset can be loaded from the “UsingR” package by John Verzani.
Load the dataset and take a look at it.
# add the dataset to your environment
data(father.son)
# inspect it
Put the data in long form and make a histogram.
# long form
# plot
Using the example in the class notes, make a normal QQ plot of the full dataset. Do these data look approximately normally distributed?
# qqplot and qqline
Take a look at the QQplot. What units are the quantiles in on the x- and y-axes? What do they represent?
Assuming the 1000 measurements we have are representative of the population, take a random sample of 100 heights and make a new QQ plot. Does your random sample look normal too?
# random sample of 100 heights
# qqplot and qqline
Sometimes we will want to resample from a finite dataset to simulate taking a bunch of random samples from a larger population. In such cases we often want to sample “with replacement” to get a more representative sample. Can you think of a reason why this could make a difference?
Try this out below, setting replace=TRUE
. Do your results look much different here?
# sample with replacement
# qqplot and qqline
What are the mean and sd of the combined data?
# mean and sd of heights
Use R’s dqpr functions to find the boundaries for the IQR of the combined data, based on an idealized normal distribution with the same mean and sd as the actual dataset.
# IQR lower and upper for heights
Assuming that the idealized distribution is a close approximation of the larger population, what proportion of men in the population are expected to be shorter than 65 inches?
# total probability < 65
What proportion of the men are expected to measure between over 75 inches?
# total probability > 75
What proportion of the men are expected to measure between 65 and 75 inches?
# total probability 65-75
Do these sum to 100% of the total population?
What height corresponds to the lowest 10% of heights? The highest 10%?
# 10% and 90% quantiles
R contains a function to compute an empirical CDF (ECDF) called ecdf()
. Use this to plot the ECDF for this dataset.
fs.ecdf = ecdf(fs.long$values)
## Error in sort(x): object 'fs.long' not found
plot(fs.ecdf)
## Error in plot(fs.ecdf): object 'fs.ecdf' not found
summary(fs.ecdf)
## Error in summary(fs.ecdf): object 'fs.ecdf' not found
# what is it?
class(fs.ecdf) # a function!
## Error in eval(expr, envir, enclos): object 'fs.ecdf' not found
quantile(fs.ecdf)
## Error in quantile(fs.ecdf): object 'fs.ecdf' not found
quantile(fs.ecdf)[3] # 50% quantile
## Error in quantile(fs.ecdf): object 'fs.ecdf' not found
# proportion below a certain value
fs.ecdf(c(60,65,70,75))
## Error in fs.ecdf(c(60, 65, 70, 75)): could not find function "fs.ecdf"
fs.ecdf(mean(fs.long$values)) # what should this be?
## Error in fs.ecdf(mean(fs.long$values)): could not find function "fs.ecdf"
It’s often useful to simulate data to explore possibilities. Let’s do this below for some normally distributed data.
# set a seed
set.seed(9-18-1963)
# simulate a standard normal distribution
n = 100
z = rnorm(n, mean=0, sd=1) # take 100 samples from std normal (with replacement!)
mean(z) # check
## [1] -0.02739443
var(z)
## [1] 0.7666996
Let’s compare the ECDF and the theoretical ECDF.
x = seq(-3, 3, length = n)
ecdf.z.fun = ecdf(z)
class(ecdf.z.fun) # it's a function
## [1] "ecdf" "stepfun" "function"
ecdf.z.fun(0) # what should this be?
## [1] 0.52
Now let’s plot it:
# We can plot it
plot(x, ecdf.z.fun(x), lwd=2, col="red", type="l", ylab="CDF", main="ECDF")
lines(x, pnorm(x), lwd=2)
legend("topleft", legend=c("Empirical", "Actual"), lwd=2,
col=c("red","black"))
We can do something similar with the PDF:
# make a histogram object, but don't plot it
hist.obj = hist(z, breaks=30, plot=FALSE)
class(hist.obj) # a list
## [1] "histogram"
hist.obj$breaks # breakpoints used
## [1] -1.8 -1.6 -1.4 -1.2 -1.0 -0.8 -0.6 -0.4 -0.2 0.0 0.2 0.4 0.6 0.8 1.0
## [16] 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6
hist.obj$density # estimated probabilities
## [1] 0.10 0.15 0.20 0.20 0.35 0.45 0.30 0.30 0.55 0.65 0.25 0.40 0.15 0.40 0.10
## [16] 0.20 0.00 0.05 0.10 0.05 0.00 0.05
Now we can make the plot with the ideal distribution overlaid:
# We can plot it
plot(hist.obj, col="lightblue", freq=FALSE, main="Histogram")
lines(x, dnorm(x), lwd=2)
legend("topleft", legend=c("Histogram", "Actual"), lwd=2,
col=c("lightblue","black"))
Note that if you re-run this code several times with a new sample each time, the results can vary quite a bit. You can make your simulations come out the same every time using set.seed(<some integer>)
. Try adding this before you generate your random sample and see what happens.
Now that you’ve seen how to simulate a normally distributed sample and compare it to an ideal one, see if you can similar graphs using the height dataset at your own pace after class.