Several relatively recent GWAS studies have determined that 80% of height is heritable, and that many loci contribute to height variation.
Let’s look at the heights of 1000 fathers and their adult sons (in inches). This dataset can be loaded from the “UsingR” package by John Verzani.
Load the dataset and take a look at it.
data(father.son)
head(father.son)
## fheight sheight
## 1 65.04851 59.77827
## 2 63.25094 63.21404
## 3 64.95532 63.34242
## 4 65.75250 62.79238
## 5 61.13723 64.28113
## 6 63.02254 64.24221
Put the data in long form and make a histogram.
= stack(father.son)
fs.long
ggplot(data = fs.long, aes(x=values, fill=ind)) +
geom_histogram(position="identity",
alpha = 0.5,
binwidth = 1)
Make a normal QQ plot of the full dataset.
par(mfrow = c(1,3))
# all
qqnorm(fs.long$values)
qqline(fs.long$values)
# we can also look at fathers and sons separately
qqnorm(father.son$fheight)
qqline(father.son$fheight)
qqnorm(father.son$sheight)
qqline(father.son$sheight)
Take a look at the QQplot. What units are the quantiles in on the x- and y-axes, and what do they represent? Do these data look approximately normally distributed?
Assuming the 1000 measurements we have are representative of the population, take a random sample of 100 heights and make a new QQ plot. Does your random sample look normal too?
# sample 100 random individuals
= sample(fs.long$values, 100)
f.sample qqnorm(f.sample)
qqline(f.sample)
Sometimes we will want to resample from a finite dataset to simulate taking a bunch of random samples from a larger population. In such cases we often want to sample “with replacement” to get a more representative sample. Can you think of a reason why this could make a difference?
# If the random sample is large relative to the "population" from which it is
# being drawn, sampling without replacement will not give a random sample,
# because the available pool of options will shrink with each new draw and
# therefore each trial will not be independent.
Try this out below, setting replace=TRUE
. Do your results look much different here?
= sample(fs.long$values, 100, replace=TRUE)
f.sample qqnorm(f.sample)
qqline(f.sample)
# results are very similar in this case.
What are the mean and sd of the combined data?
= mean(fs.long$values)
mean.fs = sd(fs.long$values)
sd.fs c(mean.fs, sd.fs)
## [1] 68.185583 2.823728
Use R’s dqpr functions to find the boundaries for the IQR of the combined data, based on an idealized normal distribution with the same mean and sd as the actual dataset.
= qnorm(0.25, mean.fs, sd.fs)
iqr.lower = qnorm(0.75, mean.fs, sd.fs)
iqr.upper c(iqr.lower,iqr.upper)
## [1] 66.28101 70.09016
Assuming that the idealized distribution is a close approximation of the larger population, what proportion of men in the population are expected to be shorter than 65 inches?
pnorm(65, mean.fs, sd.fs)
## [1] 0.1296287
What proportion of the men are expected to measure between over 75 inches?
pnorm(75, mean.fs, sd.fs, lower.tail = FALSE)
## [1] 0.007905067
1 - pnorm(75, mean.fs, sd.fs)
## [1] 0.007905067
What proportion of the men are expected to measure between 65 and 75 inches?
pnorm(75, mean.fs, sd.fs) - pnorm(65, mean.fs, sd.fs)
## [1] 0.8624662
Do these sum to 100% of the total population?
pnorm(65, mean.fs, sd.fs) + # under 65
pnorm(75, mean.fs, sd.fs) - pnorm(65, mean.fs, sd.fs) + # between 65-75
pnorm(75, mean.fs, sd.fs, lower.tail = FALSE) # over 75
## [1] 1
What height corresponds to the lowest 10% of heights? The highest 10%?
qnorm(0.1, mean.fs, sd.fs)
## [1] 64.56683
qnorm(0.9, mean.fs, sd.fs)
## [1] 71.80434
R contains a function to compute an empirical CDF (ECDF) called ecdf()
. It takes a vector of measurements and returns an ecdf
object that can itself be run as a function on a new dataset.
Use this function to plot the ECDF for this dataset.
= ecdf(fs.long$values)
fs.ecdf plot(fs.ecdf)
summary(fs.ecdf)
## Empirical CDF: 2153 unique values with summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 58.51 66.32 68.20 68.18 70.02 78.36
# what is it?
class(fs.ecdf) # a function!
## [1] "ecdf" "stepfun" "function"
# output is very similar to that of `summary()`
quantile(fs.ecdf)
## 0% 25% 50% 75% 100%
## 58.50708 66.31346 68.19702 70.02889 78.36479
quantile(fs.ecdf)[3] # 50% quantile
## 50%
## 68.19702
# proportion of data below a certain value
fs.ecdf(c(60,65,70,75))
## [1] 0.003710575 0.129406308 0.744434137 0.992115028
fs.ecdf(mean(fs.long$values)) # what should this be?
## [1] 0.4967532
It’s often useful to simulate data to explore possibilities. Let’s do this below for some normally distributed data.
Note that if you write some code involving random numbers and then re-run it several times to generate a new sample each time, the results can vary quite a bit.
You can make your simulations come out the same every time using set.seed(<some integer>)
. Try adding this before you generate your random sample and see what happens in comparison with not using a seed.
# set a seed (or not)
set.seed(9-18-1963)
# simulate a standard normal distribution
= 100 # number of trials
n = rnorm(n, mean=0, sd=1) # 100 samples from std normal
z.scores # (with replacement!)
mean(z.scores) # check the results
## [1] -0.02739443
var(z.scores)
## [1] 0.7666996
Let’s compare the ECDF and the theoretical ECDF. To do this, we first compute the cumulative probabilities for our random sample of z-scores.
We assign the result to a variable, which can also be run as a function later on a new dataset.
# generate empirical ecdf for the sample from the standard normal (above)
= ecdf(z.scores)
ecdf.z.fun
class(ecdf.z.fun) # it's a function
## [1] "ecdf" "stepfun" "function"
ecdf.z.fun(0) # what should this be?
## [1] 0.52
To do this, we will first create a sequence of numbers to use on the x-axis. For a standard normal, these will be in units of standard deviations, be we could also do this for actual measurements like heights (we will also try this below).
We then run the function we created above (using our random sample of z-scores) on a new set of values contained in the x.seq
vector. This will generate an empirical CDF across the range of numbers specified in x.seq
, which follows the same eCDF as the original data we fed it.
We will also superimpose a curve representing the ideal CDF for a standard normal distribution. We will use the same set of values in x.seq
to generate the total probabilities for the ideal CDF.
# sequence of numbers for the x-axis
= seq(-3, 3, length = n)
x.seq
# plot x=sequence of z-scores (x.seq) vs.
# y=total probs for each value (run ecdf.z.fun on new set of values)
plot(x.seq, ecdf.z.fun(x.seq),
type="l", lwd=2, col="red",
ylab="CDF", main="ECDF")
# superimpose line for ideal std normal
# (x=z-scores vs. y=ideal CDF values)
lines(x.seq, pnorm(x.seq), lwd=2)
legend("topleft", legend=c("Empirical", "Actual"), lwd=2,
col=c("red","black"))
We can do something similar with the PDF:
# make a histogram object, but don't plot it
= hist(z.scores, breaks=30, plot=FALSE)
hist.obj class(hist.obj) # a list
## [1] "histogram"
$breaks # breakpoints used hist.obj
## [1] -1.8 -1.6 -1.4 -1.2 -1.0 -0.8 -0.6 -0.4 -0.2 0.0 0.2 0.4 0.6 0.8 1.0
## [16] 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6
$density # estimated probabilities hist.obj
## [1] 0.10 0.15 0.20 0.20 0.35 0.45 0.30 0.30 0.55 0.65 0.25 0.40 0.15 0.40 0.10
## [16] 0.20 0.00 0.05 0.10 0.05 0.00 0.05
Now we can make the plot with the ideal distribution overlaid:
# We can plot it
plot(hist.obj, col="lightblue", freq=FALSE, main="Histogram")
lines(x.seq, dnorm(x.seq), lwd=2)
legend("topleft", legend=c("Histogram", "Actual"), lwd=2,
col=c("lightblue","black"))
Now that you’ve seen how to simulate a normally distributed sample and compare it to an ideal one, see if you can similar graphs using the height dataset at your own pace after class.
# father data
# generate a sequence of x-values using father range
# length needs to match that of the number of height measurements
# so we subtract 1 from the denominator
= ((max(father.son$fheight) - min(father.son$fheight))/
step length(father.son$fheight) - 1))
(= seq(from = min(father.son$fheight),
x.seq to = max(father.son$fheight),
by = step)
# generate ecdf using father height data
= ecdf(father.son$fheight)
f.ecdf
# plot the x.seq sequence along the x-axis, and
# y = the ecdf function applied to new data in sequence x.seq
plot(x.seq, f.ecdf(x.seq),
lwd=2, col="red", type="l",
ylab="CDF", main="ECDF for father heights")
# add ideal line (x=sequence, y=CDF)
lines(x.seq, pnorm(x.seq,mean=mean(father.son$fheight),
sd=sd(father.son$fheight)), lwd=2)
legend("topleft", legend=c("Empirical", "Actual"), lwd=2,
col=c("red","black"))
# son data
# generate a sequence of x-values using father range
# length needs to match that of the number of height measurements
= ((max(father.son$sheight) - min(father.son$sheight))/
step length(father.son$sheight) - 1))
(= seq(from = min(father.son$sheight),
x to = max(father.son$sheight),
by = step)
# ecdf
= ecdf(father.son$sheight)
s.ecdf
# plot the sequence along the x-axis, and
# y = the ecdf function applied to new data in sequence x.seq
plot(x.seq, s.ecdf(x.seq),
lwd=2, col="red", type="l",
ylab="CDF", main="ECDF for son heights")
# add ideal line (x=sequence, y=CDF)
lines(x.seq, pnorm(x.seq,mean=mean(father.son$sheight),
sd=sd(father.son$sheight)), lwd=2)
legend("topleft", legend=c("Empirical", "Actual"), lwd=2,
col=c("red","black"))
## first, let us analyze the father data
# determine the mean of the real distribution
= mean(father.son$fheight)
father_mean # determine the SD of the real distribution
= sd(father.son$fheight)
father_sd
# create a sequence of equally spaced numbers between
# the lowest and the highest real value
= seq(from = min(father.son$fheight),
father_range to = max(father.son$fheight),
by = 0.1) # an arbitrary number - the smaller the spacing,
# the smoother the curve will look on the plot
# simulate an ideal distribution with the same range, mean and SD as the real distribution
= dnorm(x = father_range,
father_norm_distribution mean = father_mean,
sd = father_sd)
# plot
hist(x = father.son$fheight,
col = "lightblue",
freq = FALSE,
main = "Father height")
lines(x = father_range,
y = father_norm_distribution,
lwd = 2)
legend("topleft",
legend = c("Histogram","Actual"),
lwd = 2,
col = c("lightblue","black"))
## do the same for the son data
# determine the mean of the real distribution
= mean(father.son$sheight)
son_mean # determine the SD of the real distribution
= sd(father.son$sheight)
son_sd
# create a sequence of equally spaced numbers between the lowest and the highest real value
= seq(from = min(father.son$sheight),
son_range to = max(father.son$sheight),
by = 0.1)
# simulate an ideal distribution with the same range, mean and SD as the real distribution
= dnorm(x = son_range,
son_norm_distribution mean = son_mean,
sd = son_sd)
# plot
hist(x = father.son$sheight,
col = "lightblue",
freq = FALSE,
main = "Son height")
lines(x = son_range,
y = son_norm_distribution,
lwd = 2)
legend("topleft",
legend = c("Histogram","Actual"),
lwd = 2,
col = c("lightblue","black"))
Et voilà !!!