Petr Keil, Jan Smycka

January 2016

The rule for joint (AND) probability is
\[ P(A \cap B) = P(A) \times P(B|A) \]
Because \( A \) and \( B \) can be swapped arbitrarily, we also have
\[ P(A \cap B) = P(B) \times P(A|B) \]
and so
\[ P(B) \times P(A|B) = P(A) \times P(B|A) \]
which we can rearrange to get
\[ P(A|B) = \frac {P(A) \times P(B|A)}{P(B)} \]
which is **Bayes' rule**.
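
As a quick plug-in check with invented numbers: if \( P(A) = 0.3 \), \( P(B|A) = 0.5 \) and \( P(B) = 0.6 \), then

\[ P(A|B) = \frac{0.3 \times 0.5}{0.6} = 0.25 \]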

We can replace \( A \) and \( B \) by model parameters \( \theta \) and the data \( y \) to get

\[ p(\theta|y) = \frac {p(\theta) \times p(y|\theta)}{p(y)} \]

where

\( p(y|\theta) \) … likelihood

\( p(\theta) \) … prior

\( p(\theta|y) \) … posterior

\( p(y) \) … the horrible thing

For discrete \( \theta \) it is

\[ p(y)=\sum_\theta p(\theta) \times p(y|\theta) \]

and for continuous \( \theta \)

\[ p(y)=\int_\theta p(\theta) \times p(y|\theta) \, d\theta \]
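
When \( \theta \) takes only a manageable set of discrete values, the sum is directly computable. A minimal sketch in Python (the coin-flip data and flat prior are invented for illustration):

```python
import numpy as np

# Hypothetical mini-example: posterior of a coin's probability of heads,
# theta, after observing 7 heads in 10 flips (data invented for illustration).
theta = np.linspace(0, 1, 101)                # discrete grid of parameter values
prior = np.full_like(theta, 1 / len(theta))   # flat prior p(theta)
likelihood = theta**7 * (1 - theta)**3        # p(y|theta) for 7 heads, 3 tails

p_y = np.sum(prior * likelihood)              # "the horrible thing", by summation
posterior = prior * likelihood / p_y          # Bayes' rule; sums to 1

print(theta[np.argmax(posterior)])            # posterior mode, 0.7
```

With many parameters the grid explodes combinatorially, which is why \( p(y) \) is usually out of reach.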

In most cases we can't calculate \( p(y) \). But we can calculate the ratio of \( p(\theta_1|y) \) and \( p(\theta_2|y) \), because \( p(y) \) cancels out:

\[ \frac{p(\theta_1|y) }{ p(\theta_2|y)}=\frac{p(\theta_1) \times p(y|\theta_1)}{p(\theta_2) \times p(y|\theta_2)} = \alpha \]

We can also say that

\[ p(\theta|y) \propto p(\theta) \times p(y|\theta) \]

We can use the ratio \( \frac{p(\theta_1|y) }{ p(\theta_2|y)} \) to sample from the posterior distribution using a family of numerical sampling algorithms known as Markov chain Monte Carlo (**MCMC**).

**Metropolis-Hastings algorithm** – see the YouTube video, from 1:00 onwards.

**Gibbs algorithm**

Figure: An example of Metropolis-Hastings MCMC estimation of the mean (\( \mu \)) and standard deviation (\( \sigma \)) of a Normal distribution (by M. Joseph; R code here).
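
The linked R code is not reproduced here, but as a rough stand-in, here is a minimal random-walk Metropolis sketch in Python for the same problem (simulated data, flat priors and proposal widths are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
y = rng.normal(loc=5.0, scale=2.0, size=100)  # simulated data (illustrative)

def log_post(mu, sigma):
    # Unnormalized log-posterior log p(theta) + log p(y|theta),
    # with flat priors on mu and on sigma > 0 (assumed for illustration).
    if sigma <= 0:
        return -np.inf
    return np.sum(-np.log(sigma) - (y - mu) ** 2 / (2 * sigma**2))

mu, sigma = 0.0, 1.0                          # arbitrary starting point of the chain
chain = []
for _ in range(10_000):
    mu_prop = mu + rng.normal(scale=0.3)      # symmetric random-walk proposal
    sigma_prop = sigma + rng.normal(scale=0.3)
    # log of alpha = p(theta_prop|y) / p(theta_current|y); p(y) cancels,
    # so only the proportional form of the posterior is needed
    log_alpha = log_post(mu_prop, sigma_prop) - log_post(mu, sigma)
    if np.log(rng.uniform()) < log_alpha:     # accept with probability min(1, alpha)
        mu, sigma = mu_prop, sigma_prop
    chain.append((mu, sigma))

print(np.mean(chain[2000:], axis=0))          # posterior means after burn-in, near (5, 2)
```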

Figure: The result of three Markov chains running on the 3D Rosenbrock function using the Metropolis-Hastings algorithm. (Source: Wikipedia.)

Importantly, MCMC sampling only requires

\[ p(\theta|y) \propto p(\theta) \times p(y|\theta) \]

and not

\[ p(\theta|y) = \frac {p(\theta) \times p(y|\theta)}{p(y)} \]

so the horrible \( p(y) \) never has to be computed.

**WinBUGS, OpenBUGS** www.openbugs.net – uses BUGS language

**JAGS** mcmc-jags.sourceforge.net/ – uses BUGS language

**STAN** – mc-stan.org

- uses more sophisticated MCMC sampling (Hamiltonian Monte Carlo) than Gibbs or Metropolis-Hastings
- suffers less from autocorrelation between steps
- may be more efficient for complicated posteriors (cigars, camels)
- its language is more difficult to command than BUGS

**INLA** (Integrated Nested Laplace Approximation) – www.r-inla.org

It uses the idea that in certain types of models \( p(\theta_a|\theta_b, y) \) can be approximated by a normal distribution (note: \( \theta_a \) and \( \theta_b \) are two different parameters of the same model).

This makes the expression \( p(y|\theta) \times p(\theta) \) integrable.

- It is fast.
- It does not suffer from sampling issues.
- It is applicable only to some classes of models (for details see http://www.r-inla.org/models/latent-models).
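
As a toy illustration of the underlying Laplace idea (a much-simplified building block of INLA, not INLA itself), a posterior can be approximated by a normal distribution centred at its mode, with variance taken from the curvature there. The model, data and flat prior below are invented for illustration:

```python
import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([3.1, 4.9, 4.2, 5.6, 3.8])   # invented data

def neg_log_post(mu):
    # Unnormalized negative log-posterior for the mean of a Normal(mu, 1)
    # model with a flat prior (assumptions for illustration).
    return np.sum((y - mu) ** 2) / 2

mode = minimize_scalar(neg_log_post).x    # find the posterior mode
h = 1e-4                                  # numerical second derivative at the mode
curv = (neg_log_post(mode + h) - 2 * neg_log_post(mode) + neg_log_post(mode - h)) / h**2
sd = 1 / np.sqrt(curv)                    # Laplace: p(mu|y) ~ Normal(mode, sd)

print(mode, sd)                           # here exact: mean(y) and 1/sqrt(n)
```

For this toy Normal model the approximation is exact; INLA's contribution is applying (nested) approximations of this kind to whole classes of latent Gaussian models.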