Petr Keil, Jan Smycka

January 2016

The rule for joint (AND) probability is
\[ P(A \cap B) = P(A) \times P(B|A) \]
Because \( A \) and \( B \) can be swapped arbitrarily, we also have
\[ P(A \cap B) = P(B) \times P(A|B) \]
and so
\[ P(B) \times P(A|B) = P(A) \times P(B|A) \]
which we can rearrange to get
\[ P(A|B) = \frac {P(A) \times P(B|A)}{P(B)} \]
which is **Bayes' rule**.
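
As a quick plug-in check with invented numbers: if \( P(A) = 0.3 \), \( P(B|A) = 0.5 \) and \( P(B) = 0.6 \), then

\[ P(A|B) = \frac{0.3 \times 0.5}{0.6} = 0.25 \]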

We can replace \( A \) and \( B \) by model parameters \( \theta \) and the data \( y \) to get

\[ p(\theta|y) = \frac {p(\theta) \times p(y|\theta)}{p(y)} \]

where

\( p(y|\theta) \) … likelihood

\( p(\theta) \) … prior

\( p(\theta|y) \) … posterior

\( p(y) \) … the horrible thing

For discrete \( \theta \) it is

\[ p(y)=\sum_\theta p(\theta) \times p(y|\theta) \]

and for continuous \( \theta \)

\[ p(y)=\int_\theta p(\theta) \times p(y|\theta) \, d\theta \]
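
When \( \theta \) takes only a manageable set of discrete values, the sum is directly computable. A minimal sketch in Python (the coin-flip data and flat prior are invented for illustration):

```python
import numpy as np

# Hypothetical mini-example: posterior of a coin's probability of heads,
# theta, after observing 7 heads in 10 flips (data invented for illustration).
theta = np.linspace(0, 1, 101)                # discrete grid of parameter values
prior = np.full_like(theta, 1 / len(theta))   # flat prior p(theta)
likelihood = theta**7 * (1 - theta)**3        # p(y|theta) for 7 heads, 3 tails

p_y = np.sum(prior * likelihood)              # "the horrible thing", by summation
posterior = prior * likelihood / p_y          # Bayes' rule; sums to 1

print(theta[np.argmax(posterior)])            # posterior mode, 0.7
```

With many parameters the grid explodes combinatorially, which is why \( p(y) \) is usually out of reach.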

In most cases we can't calculate \( p(y) \). But we can calculate the ratio of \( p(\theta_1|y) \) and \( p(\theta_2|y) \), because \( p(y) \) cancels out:

\[ \frac{p(\theta_1|y) }{ p(\theta_2|y)}=\frac{p(\theta_1) \times p(y|\theta_1)}{p(\theta_2) \times p(y|\theta_2)} = \alpha \]

We can also say that

\[ p(\theta|y) \propto p(\theta) \times p(y|\theta) \]

We can use the ratio \( \frac{p(\theta_1|y) }{ p(\theta_2|y)} \) to sample from the posterior distribution using a family of numerical sampling algorithms known as Markov chain Monte Carlo (**MCMC**).

**Metropolis-Hastings algorithm** – see the YouTube video, from 1:00 onwards.

**Gibbs algorithm**

Figure: An example of Metropolis-Hastings MCMC estimation of the mean (\( \mu \)) and standard deviation (\( \sigma \)) of a Normal distribution (by M. Joseph; R code here).
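
The linked R code is not reproduced here, but as a rough stand-in, here is a minimal random-walk Metropolis sketch in Python for the same problem (simulated data, flat priors and proposal widths are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
y = rng.normal(loc=5.0, scale=2.0, size=100)  # simulated data (illustrative)

def log_post(mu, sigma):
    # Unnormalized log-posterior log p(theta) + log p(y|theta),
    # with flat priors on mu and on sigma > 0 (assumed for illustration).
    if sigma <= 0:
        return -np.inf
    return np.sum(-np.log(sigma) - (y - mu) ** 2 / (2 * sigma**2))

mu, sigma = 0.0, 1.0                          # arbitrary starting point of the chain
chain = []
for _ in range(10_000):
    mu_prop = mu + rng.normal(scale=0.3)      # symmetric random-walk proposal
    sigma_prop = sigma + rng.normal(scale=0.3)
    # log of alpha = p(theta_prop|y) / p(theta_current|y); p(y) cancels,
    # so only the proportional form of the posterior is needed
    log_alpha = log_post(mu_prop, sigma_prop) - log_post(mu, sigma)
    if np.log(rng.uniform()) < log_alpha:     # accept with probability min(1, alpha)
        mu, sigma = mu_prop, sigma_prop
    chain.append((mu, sigma))

print(np.mean(chain[2000:], axis=0))          # posterior means after burn-in, near (5, 2)
```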

Figure: The result of three Markov chains running on the 3D Rosenbrock function using the Metropolis-Hastings algorithm. (Source: Wikipedia.)

Importantly, MCMC sampling only requires

\[ p(\theta|y) \propto p(\theta) \times p(y|\theta) \]

and not

\[ p(\theta|y) = \frac {p(\theta) \times p(y|\theta)}{p(y)} \]

so the horrible \( p(y) \) never has to be computed.

**WinBUGS, OpenBUGS** www.openbugs.net – uses BUGS language

**JAGS** mcmc-jags.sourceforge.net/ – uses BUGS language

**STAN** – mc-stan.org

- uses more sophisticated MCMC sampling (Hamiltonian Monte Carlo) than Gibbs or Metropolis-Hastings
- suffers less from autocorrelation between steps
- may be more efficient for complicated posteriors (cigars, camels)
- its language is more difficult to command than BUGS

**INLA** (Integrated Nested Laplace Approximation) – www.r-inla.org

It uses the idea that in certain types of models \( p(\theta_a|\theta_b, y) \) can be approximated by a normal distribution (note: \( \theta_a \) and \( \theta_b \) are two different parameters of the same model).

This makes the expression \( p(y|\theta) \times p(\theta) \) integrable.

- It is fast.
- It does not suffer from sampling issues.
- It is applicable only to some classes of models (for details see http://www.r-inla.org/models/latent-models).
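
As a toy illustration of the underlying Laplace idea (a much-simplified building block of INLA, not INLA itself), a posterior can be approximated by a normal distribution centred at its mode, with variance taken from the curvature there. The model, data and flat prior below are invented for illustration:

```python
import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([3.1, 4.9, 4.2, 5.6, 3.8])   # invented data

def neg_log_post(mu):
    # Unnormalized negative log-posterior for the mean of a Normal(mu, 1)
    # model with a flat prior (assumptions for illustration).
    return np.sum((y - mu) ** 2) / 2

mode = minimize_scalar(neg_log_post).x    # find the posterior mode
h = 1e-4                                  # numerical second derivative at the mode
curv = (neg_log_post(mode + h) - 2 * neg_log_post(mode) + neg_log_post(mode - h)) / h**2
sd = 1 / np.sqrt(curv)                    # Laplace: p(mu|y) ~ Normal(mode, sd)

print(mode, sd)                           # here exact: mean(y) and 1/sqrt(n)
```

For this toy Normal model the approximation is exact; INLA's contribution is applying (nested) approximations of this kind to whole classes of latent Gaussian models.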