Analytical vs. numerical solutions

In principle, there are two approaches to solving mathematical problems. Finding an analytical solution requires formulating the problem in a certain well-understood way that allows calculating an exact solution using formulas. An example is finding the mean and standard deviation of a normal distribution using the formulas you find in every statistics textbook. A numerical solution is obtained by making a series of guesses and checking each time whether the current guess solves the problem better than the previous one. Once the problem is solved well enough (even if the solution is not exactly right!), a numerical solution is declared to be found. It is possible to guess the mean and standard deviation of a normal distribution numerically, but numerical methods are more often used in situations where the analytical solution is either too complicated or does not exist. For example, the R function nls(), which fits the parameters of a nonlinear regression model, uses an iterative numerical procedure to find parameter values that describe the real data decently well.

Finding the probability of a Bernoulli trial: the analytical solution

First, let us simulate 10 Bernoulli trials (e.g. flips of a biased coin) with an 80% probability of success:

p_real <- 0.8 # the "real" probability of success

set.seed(99) # set seed for reproducibility

# a Bernoulli distribution is a binomial distribution with the number of trials (the size parameter) equal to 1
my_distr <- rbinom(n = 10,
                   size = 1,
                   prob = p_real)

my_distr
##  [1] 1 1 1 0 1 0 1 1 1 1

Now, for the rest of this exercise, let us forget that we know the “real” probability of success and formulate the question in the following way:

Given our data (heads-heads-heads-tails-heads-tails-heads-heads-heads-heads), what is the most likely probability of success in a single trial?

The analytical solution is simply the mean of our vector of 0s and 1s, i.e. the proportion of successes (8 out of 10):

mean(my_distr)
## [1] 0.8

Finding the probability of a Bernoulli trial: the numerical solution

As you can see, the analytical solution to this problem is so simple that you would normally not resort to numerical analysis at all. However, we will go ahead and do it anyway: precisely because the problem is so simple, its numerical solution can be computed with relatively simple code, which makes the general principle behind numerical analysis easy to see.

The approach we will use is called maximum likelihood estimation. In simple words, we calculate the likelihood of observing our data under different values of the probability parameter and then pick the probability value that maximizes that likelihood.
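Concretely, for a sequence of independent trials containing k successes and m failures (k and m are just shorthand introduced for this formula), the likelihood of observing that exact sequence given a success probability p is:

likelihood(p) = p^k * (1 - p)^m

This is exactly what the function below computes.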

First, let us write a function to calculate the likelihood of observing our data (the likelihood of observing this exact sequence of outcomes) with a given probability of success:

likelihood_function <- function(sequence, probability){
  n_successes <- sum(sequence == 1) # number of 1s (successes) in the sequence
  n_failures <- sum(sequence == 0)  # number of 0s (failures) in the sequence
  likelihood <- probability^n_successes * (1 - probability)^n_failures
  return(likelihood)
}

# does it perform as expected, i.e. is the likelihood the highest at probability = 0.8 and lower at other values?
likelihood_function(sequence = my_distr,
                    probability = 0.8)
## [1] 0.006710886
likelihood_function(sequence = my_distr,
                    probability = 0.7)
## [1] 0.005188321
likelihood_function(sequence = my_distr,
                    probability = 0.9)
## [1] 0.004304672
likelihood_function(sequence = my_distr,
                    probability = 0)
## [1] 0
likelihood_function(sequence = my_distr,
                    probability = 1)
## [1] 0

Second, let us use the function above to calculate the likelihood of observing our data for a whole grid of evenly spaced probability values between 0 and 1:

# generate an evenly spaced grid of probabilities between 0 and 1
hypothetical_p_range <- seq(from = 0,
                            to = 1,
                            by = 0.01)

# calculate the likelihood of observing our data with each of the probability values above
likelihood_range <- c()
for (p in hypothetical_p_range){
  likelihood_p <- likelihood_function(sequence = my_distr,
                                      probability = p)
  likelihood_range <- c(likelihood_range,
                        likelihood_p)
}
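
A side note: the loop above is written for clarity. Because R's arithmetic operators are vectorized, likelihood_function() can also accept the whole grid of probabilities at once, which produces the same result in a single call:

# equivalent, loop-free alternative: the expression inside likelihood_function()
# is vectorized over probability, so the whole grid can be passed at once
likelihood_range <- likelihood_function(sequence = my_distr,
                                        probability = hypothetical_p_range)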

likelihood_range
##   [1] 0.000000e+00 9.801000e-17 2.458624e-14 6.173245e-13 6.039798e-12
##   [6] 3.525391e-11 1.484109e-10 4.985976e-10 1.420024e-09 3.564699e-09
##  [11] 8.100000e-09 1.697937e-08 3.329778e-08 6.174266e-08 1.091494e-07
##  [16] 1.851688e-07 3.030529e-07 4.805599e-07 7.409821e-07 1.114292e-06
##  [21] 1.638400e-06 2.360525e-06 3.338648e-06 4.643058e-06 6.357950e-06
##  [26] 8.583069e-06 1.143537e-05 1.505067e-05 1.958526e-05 2.521742e-05
##  [31] 3.214890e-05 4.060614e-05 5.084142e-05 6.313368e-05 7.778918e-05
##  [36] 9.514174e-05 1.155527e-04 1.394103e-04 1.671291e-04 1.991483e-04
##  [41] 2.359296e-04 2.779552e-04 3.257244e-04 3.797496e-04 4.405523e-04
##  [46] 5.086575e-04 5.845884e-04 6.688590e-04 7.619677e-04 8.643885e-04
##  [51] 9.765625e-04 1.098888e-03 1.231712e-03 1.375317e-03 1.529910e-03
##  [56] 1.695612e-03 1.872447e-03 2.060325e-03 2.259033e-03 2.468220e-03
##  [61] 2.687386e-03 2.915868e-03 3.152831e-03 3.397253e-03 3.647916e-03
##  [66] 3.903399e-03 4.162069e-03 4.422077e-03 4.681352e-03 4.937602e-03
##  [71] 5.188321e-03 5.430787e-03 5.662080e-03 5.879094e-03 6.078556e-03
##  [76] 6.257057e-03 6.411080e-03 6.537045e-03 6.631354e-03 6.690450e-03
##  [81] 6.710886e-03 6.689403e-03 6.623016e-03 6.509125e-03 6.345623e-03
##  [86] 6.131037e-03 5.864671e-03 5.546777e-03 5.178737e-03 4.763272e-03
##  [91] 4.304672e-03 3.809045e-03 3.284601e-03 2.741951e-03 2.194448e-03
##  [96] 1.658551e-03 1.154223e-03 7.053690e-04 3.403052e-04 9.227447e-05
## [101] 0.000000e+00

Third, plot the probability of success in a single trial on the X axis and the likelihood of observing our data on the Y axis:

library(ggplot2) # the plotting code below uses ggplot2

my_data <- data.frame(likelihood_range, hypothetical_p_range)

ggplot(my_data,
       aes(x = hypothetical_p_range,
           y = likelihood_range)) +
  geom_point() +
  xlab("Probability of success in a single trial") +
  ylab("Likelihood of observing our data")

The solution is already pretty obvious, but its precision is limited by the step size we used on the X axis. To find a solution that is as precise as possible, let us use the built-in function optimize(), which uses a combination of golden-section search and successive parabolic interpolation to find the value of X that minimizes or maximizes the value of Y:

# "fix" the parameter sequence inside likelihood_function() and make it always equal to my_distr
# we have to do this because we only want the optimize() function below to find the probability parameter
likelihood_function_for_optimization <- function(p){
  likelihood_function(sequence = my_distr,
                      probability = p)
}

optimize(f = likelihood_function_for_optimization,
         interval = c(0,1), # search between 0 and 1
         maximum = TRUE) # find the probability value that maximizes, and not minimizes, the likelihood
## $maximum
## [1] 0.799982
## 
## $objective
## [1] 0.006710886
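
A practical note: with only 10 observations the raw likelihood works fine, but for long sequences the product of many values below 1 underflows towards zero. For that reason, numerical maximum likelihood estimation is usually performed on the log-likelihood, which peaks at the same parameter value. A minimal sketch with our data (the log_likelihood() helper is introduced here just for illustration):

# log(L) = k*log(p) + m*log(1-p) has its maximum at the same p as L itself,
# but does not underflow for long sequences
log_likelihood <- function(p){
  sum(my_distr == 1) * log(p) + sum(my_distr == 0) * log(1 - p)
}

optimize(f = log_likelihood,
         interval = c(0, 1),
         maximum = TRUE)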

Do the same using fitdist() from the fitdistrplus package

The code above was just an illustration of the general idea behind maximum likelihood estimation and other numerical methods. In practice, you will rarely need to write custom code for problems like this. Popular functions for fitting distribution parameters are fitdistr() from the MASS package and fitdist() from fitdistrplus. We will use the latter.

The parameter distr is either a custom density function or the name of one of the “d/p/q/r” family distributions (e.g. “binom”, “nbinom”, “norm”, etc.). In the latter case, the fixed and starting arguments are arguments of the corresponding “d/p/q/r” functions. For example, we have to pass size = 1 in the list of fixed arguments because all of our trials are Bernoulli trials. We also have to pass prob = <something> in the list of starting values because we want the function to find the best value of prob, and fitdist() may fail if it starts the parameter search with a prob value close to either 0 or 1 (this is not always the case, but it is here).

library(fitdistrplus) # provides fitdist()

fitdist(data = my_distr,
        distr = "binom",
        fix.arg = list(size = 1),
        start = list(prob = 0.5))
## $start.arg
## $start.arg$prob
## [1] 0.5
## 
## 
## $fix.arg
## $fix.arg$size
## [1] 1
## Fitting of the distribution ' binom ' by maximum likelihood 
## Parameters:
##       estimate Std. Error
## prob 0.7999928   0.126492
## Fixed parameters:
##      value
## size     1

Importantly, fitdist() also reports the standard error of the parameter estimate (derived from the log-likelihood itself, not from bootstrapping). If you want bootstrap-based uncertainty estimates, the package provides a separate function, bootdist().
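
A minimal sketch of such a bootstrap, assuming we first store the fit in an object (the object name fit and the number of iterations are arbitrary choices here):

# store the fit, then bootstrap it to quantify the uncertainty of the estimate
fit <- fitdist(data = my_distr,
               distr = "binom",
               fix.arg = list(size = 1),
               start = list(prob = 0.5))

# parametric bootstrap; see ?bootdist for details
summary(bootdist(fit, niter = 1000))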