Practice problems

1. Nucleic acids: Sampling, permutations and combinations

  1. Restriction enzymes recognize specific sequences in DNA and cut the DNA within or near those sites. Many restriction enzymes recognize 6-bp sequences. How many possible combinations of 6 nt sequences are there?
4^6
## [1] 4096
# There are 4 possibilities for each position, and they are drawn independently,
# so there are 4*4*4*4*4*4 = 4^6 = 4096 possibilities.
  1. At what frequency would you expect to find a binding site for a particular restriction enzyme that recognizes a 6 bp sequence? Note: Restriction enzymes usually recognize a palindromic sequence, so you don’t need to worry about looking on both strands.
1/4^6
## [1] 0.0002441406
# Since there are 4096 possible 6-mers, you would expect to find a specific one
# around 1/4096th of the time.
# This is 2.4 sites per 10,000 bases, or one site every 4096 bases.
  1. What is the average length of a fragment produced by a 6-cutter?
4^6
## [1] 4096
# One cut every 4096 bases on average translates to an expected fragment size of 4096.
  1. The EcoR1 enzyme cuts the sequence “GAATTC”. How many different sequences could be made out of this set of nucleotides (assume that you treat each A and T as distinct individuals)?
# nPerm = n!
factorial(6) # 720
## [1] 720
  1. You are interested in synthesizing a bunch of random oligonucleotides for a SELEX experiment. How many different oligos of length 22 could you synthesize?
4^22 # 1.76e+13
## [1] 1.759219e+13
  1. How many ways could you make a sequence of 6 nt by grabbing them at random from a bag of 12 nucleotides (assuming each base is treated as a distinct unique entity, and you can only pick each nt once)? This is an example of “random sampling without replacement”.
# nPerm = n!/(n-k)!
factorial(12)/factorial(12-6) # 665280
## [1] 665280
  1. How many combinations of 6nt are there in a set of 12 random nucleotides? (That is, pick any group of 6 nt, where each one is treated as unique, and you don’t care about the sequence in which you pick them.)
# nComb = nPerm/k! = n!/k!(n-k)!
factorial(12)/( factorial(6) * factorial(12-6) ) # 924
## [1] 924
  1. How are your answers in parts e and f related?
# your answer here

The number of combinations is smaller than the number of permutations by a factor of k! since all the permutations of the k-mer sequence get collapsed into one combination.

2. Bernoulli distribution

The frequency of a minor allele at the CFTR gene was found to be 0.17. What is the probability that any allele chosen at random from the population will NOT be that same allele?

0.17^0 * (1-0.17)^1
## [1] 0.83

\[ P(X=0) = f(0) = (0.17)^0(1-0.17)^1 = 0.83\]

3. Binomial distribution

Consider a genome in which the four bases \(A,C,G,T\) are present in equal proportions (this is actually rather uncommon, but let’s go with it for now).

For the following questions, think of them as “What is the probability of X and not Y?” (So, you are asking a binary question about one possible outcome vs. all others.)

If you were to pick a sequence of 10 nt completely at random, what is the chance that exactly 3 bases will be an \(A\) (or any other homopolymer)?

dbinom(3,10,.25)
## [1] 0.2502823

What is the chance that you will find less than 5 \(G\)s (i.e. 4 or fewer)?

pbinom(4,10,.25)
## [1] 0.9218731

What is the chance that you will find between 2 and 4 \(T\)s?

\[ P(2 \le X \le 4) = \sum_{k=2}^4 {10 \choose k}(0.25)^k(0.75)^{10-k} = \sum_{k=0}^4 {10 \choose k}(0.25)^k(0.75)^{10-k} - \sum_{k=0}^1 {10 \choose k}(0.25)^k(0.75)^{10-k}\]

# these are equivalent: a) sum up the probabilities for all values
sum(dbinom(2:4,10,.25))
## [1] 0.6778479
# b) get the cumulative mass up to and including 4, then
#    subtract the cumulative up to 1 (to get 2 inclusive)
pbinom(4,10,.25) - pbinom(1,10,.25)
## [1] 0.6778479

The quantile function qbinom(p, size, prob) returns the smallest value of \(q\) such that \(Pr(X \le q) \ge p\). The quantile is defined as the smallest value \(x\) such that \(F(x) \ge p\). For example, \(F(x) \ge 0.75\) means that 75% of the distribution is less than \(x\).

What is the maximum number of any single base you would expect to find in 75% of 10-mers picked at random from the genome?

qbinom(.75,10,.25)
## [1] 3

What is the maximum number of any single base you would expect to find in 99% of random 10-mers?

qbinom(.99,10,.25)
## [1] 6

Binomial Proportions Test

Say you are analyzing the GC content across different regions of the genome and you find a region of 100 bases upstream of your favorite gene that has 63% CG content. How likely is it that this region has no significant CG bias?

First use the pbinom() function to get a p-value, and then perform a two-sided test using binom.test().

## pbinom 
# use P(X < or = x) for lower tail and P(X > x-1) for upper tail
# two-sided test is looking for values more extreme than the one observed
# at either end of the distribution
# lower.tail = 100-63 = 37
pbinom(37,100, 0.5) + pbinom(62,100, 0.5, lower.tail=FALSE) # lt + ut
## [1] 0.01203298
pbinom(62,100, 0.5, lower.tail=FALSE) * 2                   # 2*upper tail
## [1] 0.01203298
## binom.test
# gives P(X <= n-x) + P(X >= x) for a two-sided test,
# so use the actual value you are looking for and select the type of test
binom.test(63,100, 0.5, alternative = c("two.sided"))
## 
##  Exact binomial test
## 
## data:  63 and 100
## number of successes = 63, number of trials = 100, p-value = 0.01203
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.5276484 0.7244334
## sample estimates:
## probability of success 
##                   0.63

Are the results the same?

## Yes they are exactly the same. The `binom.test()` function performs an exact test and should give the same results as `pbinom()`. 

# It is also easier to use because you don't have to remember to subtract one for an upper-tailed test.