The purpose of this lesson is to explain the concept of probability density and likelihood, using a simple example of Normal probability density function.
We will use data on housefly wing length [mm], which you can find here.
The original data are from Sokal & Hunter (1955) A morphometric analysis of DDT-resistant and non-resistant housefly strains Ann. Entomol. Soc. Amer. 48: 499-507.
To fetch the data directly from the web, run:
wings <- read.table("http://goo.gl/4lPBG6", header=FALSE)[,1]*0.1
Let’s examine the data:
wings
## [1] 3.6 3.7 3.8 3.8 3.9 3.9 4.0 4.0 4.0 4.0 4.1 4.1 4.1 4.1 4.1 4.1 4.2
## [18] 4.2 4.2 4.2 4.2 4.2 4.2 4.3 4.3 4.3 4.3 4.3 4.3 4.3 4.3 4.4 4.4 4.4
## [35] 4.4 4.4 4.4 4.4 4.4 4.4 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.6
## [52] 4.6 4.6 4.6 4.6 4.6 4.6 4.6 4.6 4.6 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7
## [69] 4.7 4.8 4.8 4.8 4.8 4.8 4.8 4.8 4.8 4.9 4.9 4.9 4.9 4.9 4.9 4.9 5.0
## [86] 5.0 5.0 5.0 5.0 5.0 5.1 5.1 5.1 5.1 5.2 5.2 5.3 5.3 5.4 5.5
par(mfrow=c(1,2))
hist(wings, freq=TRUE, col="grey")
points(wings, jitter(rep(0, times=100), factor=10))
hist(wings, freq=FALSE, col="grey")
points(wings, jitter(rep(0, times=100), factor=0.7))
This is our formal model definition:
\[ p(x) = \frac{1}{\sqrt{2\pi}\sigma}e^{-(x-\mu)^2/(2\sigma^2)} \]
where \(p()\) is probability density function (also known as PDF). The function has two parameters: \(\mu\) (mean) and \(\sigma\) (standard deviation).
For \(\mu = 0\) and \(\sigma = 1\) the model looks like this:
curve(dnorm, from=-4, to=4, ylab="p(x)")
Remember:
Probability density is different from probability.
Probability density is denoted by \(p\), while probability is \(P\).
Probability density can be higher than 1.
Probability density function must integrate to 1.
Probability distribution is a very ambiguous and confusing term.
KEY PROBLEM: How do we decide which parametrization is the best for our data?
The likelihood function is the density evaluated at the data \(x_1\), … ,\(x_n\), viewed as a function of model parameters (\(\mu\) and \(\sigma\) in case of the Normal model). We write it as \(L(\mu, \sigma | x) = p(x | \mu, \sigma)\).
Calculation of likelihood in R is easy! The R functions dnorm()
, dpois()
, dunif()
, dbinom()
, dgamma()
, dbeta()
etc. are exactly for that!
Example: What is the likelihood of the first data value in the wings
dataset, given the \(Normal(\mu=4, \sigma=1)\) model?
my.mean = 4
my.sd = 1
Here is the data point that we will examine. Note that I use the letter i to denote index:
wings.i <- wings[1]
wings.i
## [1] 3.6
Here is how you calculate the likelihood for the data point wings.i
using the function dnorm
:
L <- dnorm(x=wings.i, mean=my.mean, sd=my.sd)
L
## [1] 0.3682701
Let’s plot it:
curve(dnorm(x, my.mean, my.sd), from=0, to=7,
ylab="p(wings | mu, sigma)", xlab="wings",
main=paste("p(wings.i | mu, sigma) = ", round(L, 4)))
points(wings.i, 0)
abline(v=wings.i, col="red")
abline(h=L, col="red")
Basic probability theory tells us that:
\[P(A \cap B) = P(A) \times P(B|A) = P(B) \times P(A|B) \]
Where \(P(A \cap B)\) is joint probability, associated with AND, meaning “at the same time”.
The problem is that joint probability for more than two events, e.g. \(P( A \cap B \cap C \cap D )\), can be almost impossible to calculate, with the exception of A and B being independent! Then: \[P(A \cap B) = P(A) \times P(B) \] and hence \[ P( A \cap B \cap C \cap D ) = P(A) \times P(B) \times P(C) \times P(D)\]
It follows that it is useful to subject probability density \(p()\) to the same rules as probability \(P()\). Hence, we can calculate the likelihood of the whole dataset as a product of likelihoods of all individual data points!
wings
## [1] 3.6 3.7 3.8 3.8 3.9 3.9 4.0 4.0 4.0 4.0 4.1 4.1 4.1 4.1 4.1 4.1 4.2
## [18] 4.2 4.2 4.2 4.2 4.2 4.2 4.3 4.3 4.3 4.3 4.3 4.3 4.3 4.3 4.4 4.4 4.4
## [35] 4.4 4.4 4.4 4.4 4.4 4.4 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.6
## [52] 4.6 4.6 4.6 4.6 4.6 4.6 4.6 4.6 4.6 4.7 4.7 4.7 4.7 4.7 4.7 4.7 4.7
## [69] 4.7 4.8 4.8 4.8 4.8 4.8 4.8 4.8 4.8 4.9 4.9 4.9 4.9 4.9 4.9 4.9 5.0
## [86] 5.0 5.0 5.0 5.0 5.0 5.1 5.1 5.1 5.1 5.2 5.2 5.3 5.3 5.4 5.5
L <- dnorm(x=wings, mean=my.mean, sd=my.sd)
L
## [1] 0.3682701 0.3813878 0.3910427 0.3910427 0.3969525 0.3969525 0.3989423
## [8] 0.3989423 0.3989423 0.3989423 0.3969525 0.3969525 0.3969525 0.3969525
## [15] 0.3969525 0.3969525 0.3910427 0.3910427 0.3910427 0.3910427 0.3910427
## [22] 0.3910427 0.3910427 0.3813878 0.3813878 0.3813878 0.3813878 0.3813878
## [29] 0.3813878 0.3813878 0.3813878 0.3682701 0.3682701 0.3682701 0.3682701
## [36] 0.3682701 0.3682701 0.3682701 0.3682701 0.3682701 0.3520653 0.3520653
## [43] 0.3520653 0.3520653 0.3520653 0.3520653 0.3520653 0.3520653 0.3520653
## [50] 0.3520653 0.3332246 0.3332246 0.3332246 0.3332246 0.3332246 0.3332246
## [57] 0.3332246 0.3332246 0.3332246 0.3332246 0.3122539 0.3122539 0.3122539
## [64] 0.3122539 0.3122539 0.3122539 0.3122539 0.3122539 0.3122539 0.2896916
## [71] 0.2896916 0.2896916 0.2896916 0.2896916 0.2896916 0.2896916 0.2896916
## [78] 0.2660852 0.2660852 0.2660852 0.2660852 0.2660852 0.2660852 0.2660852
## [85] 0.2419707 0.2419707 0.2419707 0.2419707 0.2419707 0.2419707 0.2178522
## [92] 0.2178522 0.2178522 0.2178522 0.1941861 0.1941861 0.1713686 0.1713686
## [99] 0.1497275 0.1295176
prod(L)
## [1] 1.657651e-50
This is a ridiculously small number! Which is why we have the Negative Log Likelihood, also known as the deviance:
- sum(log(L))
## [1] 114.6239
We can encapsulate it into a single function:
deviance.function <- function(x, mean, sd)
{
LL <- dnorm(x=x, mean=mean, sd=sd, log=TRUE) # note the log!!!
deviance <- - sum(LL)
return(deviance)
}
# it's a function of model parameters, so try to play
# around with different paramter values
deviance.function(wings, mean=0, sd=1)
## [1] 1134.624
Deviance (negative log-likelihood) can be then minimized (likelihood is maximized) in order to find the most likely model parameters - these are the Maximum Likelihood Estimators (MLE) of model parameters.
Find and plot the MLE of the Normal model and the ‘wings’ data. Use the following functions and scripts:
deviance.function()
hist()
curve()
optim()
to find MLEoptim
needs the deviance.function
to take only one object, which can be a vector (or a list). So I will just put all of the data into dat
object, and the parameters into par
object.
deviance.function.for.optim <- function(par, dat)
{
LL <- dnorm(x=dat, mean=par[1], sd=par[2], log=TRUE) # note the log!!!
deviance <- - sum(LL)
return(deviance)
}
And run the actual optimization:
optim(par=c(mean=0,var=1),
fn=deviance.function.for.optim,
dat=wings)
## Warning in dnorm(x = dat, mean = par[1], sd = par[2], log = TRUE): NaNs
## produced
## Warning in dnorm(x = dat, mean = par[1], sd = par[2], log = TRUE): NaNs
## produced
## $par
## mean var
## 4.5500648 0.3899714
##
## $value
## [1] 47.733
##
## $counts
## function gradient
## 103 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
There are other functions derived from probability density functions. I have a tutorial post on that.
Type ?pnorm
, ?qnorm
or ?rnorm
and check it out: