Jan Smycka
The rule for joint (AND) probability is \[ P(A \cap B) = P(A) \times P(B|A) \] \( A \) and \( B \) can be swapped arbitrarily \[ P(A \cap B) = P(B) \times P(A|B) \] and so \[ P(B) \times P(A|B) = P(A) \times P(B|A) \] which we can rearrange to get \[ P(A|B) = \frac {P(B|A) \times P(A)}{P(B)} \] which is Bayes' rule.
We can also say that \[ P(B)=P(B|A) \times P(A)+P(B|\neg A) \times P(\neg A) \] so \[ P(A|B) =\frac {P(B|A) \times P(A)}{P(B)} \]
\[ =\frac {P(B|A) \times P(A)}{P(B|A) \times P(A)+P(B|\neg A) \times P(\neg A)} \]
If we write the denominator as a sum over the possible values of \( A \) \[ P(A|B)=\frac {P(B|A) \times P(A)}{\sum_A P(B|A) \times P(A)} \]
we can replace \( A \) and \( B \) by model parameters \( \theta \) and the data \( y \) to get
\( p(\theta|y) = \frac {p(\theta) \times p(y|\theta)}{p(y)} \)
where
\( p(y|\theta) \) … likelihood
\( p(\theta) \) … prior
\( p(\theta|y) \) … posterior
\( p(y) \) … the horrible thing (the normalizing constant, also known as the evidence or marginal likelihood)
\[ p(y)=\sum_\theta p(\theta) \times p(y|\theta) \quad (\text{for discrete } \theta) \]
\[ p(y)=\int_\theta p(\theta) \times p(y|\theta) \, d\theta \quad (\text{for continuous } \theta) \]
We have five data points from a Poisson distribution
x <- rpois(5, 20)
x
[1] 19 20 24 27 21
We would like to know the posterior distribution of \( \lambda \)
The posterior for the Poisson rate \( \lambda \) can be expressed as
\[ p(\lambda|x)=\dfrac{p(x|\lambda) \times p(\lambda)}{p(x)} \]
where
\[ p(x|\lambda)=\prod\limits_{i=1}^n \dfrac{\lambda^{x_i}e^{-\lambda}}{x_i!}; \qquad p(\lambda)=0.01 \]
(the prior is uniform on \( (0, 100) \), hence the constant density 0.01)
\[ p(x)=\int_\lambda \left(\prod\limits_{i=1}^n \dfrac{\lambda^{x_i}e^{-\lambda}}{x_i!} \times 0.01\right) d\lambda \]
This analytical solution is accurate, but cannot be obtained for most model types
We can try to compute it numerically
The numerator can be expressed as an R function
baypo <- function(l){
  # likelihood p(x|lambda)
  lh <- function(l){
    prod(((l^x)*exp(-l))/factorial(x))
  }
  # likelihood times prior, scaled by a large constant so the values are not
  # vanishingly small (the constant cancels out after normalization)
  100000000*lh(l)*dunif(l, 0, 100)
}
Now we can generate a grid of \( \lambda \) values and evaluate the function on it
lambda <- seq(0, 100, 0.01)
postdist <- NULL
for (i in 1:length(lambda)){
  postdist[i] <- baypo(lambda[i])
}
# normalize so the values sum to one
prob <- postdist/sum(postdist)
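A quick way to inspect the grid approximation (a minimal sketch assuming the lambda and prob vectors from above; the plot call and the posterior-mean line are illustrative additions, not part of the original code)
# plot the normalized grid posterior of lambda
plot(lambda, prob, type = "l", xlab = "lambda", ylab = "p(lambda|x) on the grid")
# posterior mean of lambda, approximated on the grid
sum(lambda * prob)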
theoretically always works
can be parallelized on clusters
is terribly slow and wasteful
So what about more clever ways of sampling?
If we want to know where the mode is, we accept each \( \lambda_n \) which satisfies
\[ \frac{p(\lambda_n|x)}{p(\lambda_o|x)}=\frac{p(x|\lambda_n) \times p(\lambda_n)}{p(x|\lambda_o) \times p(\lambda_o)}>1 \]
If we want to hit each value of \( \lambda \) with probability \( p(\lambda|x) \), we accept each \( \lambda_n \) satisfying
\[ \frac{p(\lambda_n|x)}{p(\lambda_o|x)}=\frac{p(x|\lambda_n) \times p(\lambda_n)}{p(x|\lambda_o) \times p(\lambda_o)}>1 \]
with probability 1, and each \( \lambda_n \) satisfying
\[ \frac{p(\lambda_n|x)}{p(\lambda_o|x)}=\frac{p(x|\lambda_n) \times p(\lambda_n)}{p(x|\lambda_o) \times p(\lambda_o)}<1 \]
with probability \( \frac{p(\lambda_n|x)}{p(\lambda_o|x)} \) (note that the horrible \( p(x) \) cancels in the ratio)
And now switch to “histogram thinking”: the accepted values, collected over many steps, form a histogram of \( p(\lambda|x) \)
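A minimal Metropolis sketch for the Poisson example above (an illustrative implementation, not the original code: the starting value 10, the jump standard deviation of 1 and the 10000 iterations are arbitrary choices)
# unnormalized posterior: likelihood times the uniform(0,100) prior density
post <- function(l){
  if (l <= 0 || l >= 100) return(0)   # outside the prior support
  prod(dpois(x, l)) * 0.01
}
n.iter <- 10000
lambda.chain <- numeric(n.iter)
lambda.chain[1] <- 10                 # arbitrary starting value
for (i in 2:n.iter){
  l.old <- lambda.chain[i-1]
  l.new <- rnorm(1, l.old, 1)         # propose a jump from the old value
  ratio <- post(l.new)/post(l.old)    # p(x) cancels, so this is p(lambda_n|x)/p(lambda_o|x)
  # accept with probability min(1, ratio), otherwise stay at the old value
  lambda.chain[i] <- if (runif(1) < ratio) l.new else l.old
}
hist(lambda.chain[-(1:1000)])         # "histogram thinking": discard the burn-in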
is more efficient than numerical integration
problem with camels (multimodal posteriors)
need to set up the jump (proposal) distribution
can be slow if we have many variables
samples from the beginning do not reflect the distribution (burn-in)
correlation between subsequent steps (autocorrelation)
Let's imagine we have more than one variable. For example, data from a normal distribution, where we want to estimate the joint posterior of \( \mu \) and \( \sigma \).
\[ p(\mu, \sigma|y)=\dfrac{p(y|\mu, \sigma) \times p(\mu) \times p(\sigma)}{p(y)} \]
Sampling from a univariate distribution is usually quite simple.
If we knew \( \sigma \) we could sample \( \mu \) from \( p(\mu|\sigma,y) \)
If we knew \( \mu \) we could sample \( \sigma \) from \( p(\sigma|\mu,y) \)
True, if sampling from a univariate distribution is simple, we did not really need M-H in the previous chapter. But unlike direct sampling, M-H generalizes easily to more dimensions.
Get initial values of \( \mu \) and \( \sigma \).
Take initial \( \sigma \) and sample \( \mu \) from \( p(\mu|\sigma,y) \)
Take sampled \( \mu \) and sample \( \sigma \) from \( p(\sigma|\mu,y) \)
Repeat
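A minimal Gibbs sketch of these two steps for normal data (an illustrative assumption: the prior on \( \mu \) is normal with mean 0 and variance 1000, and, unlike the BUGS model shown later, the prior on \( \sigma^2 \) is inverse-gamma(0.01, 0.01), so that both conditionals have a known form)
y <- rnorm(5, 48.1, 6.1)               # data from a normal distribution
n <- length(y)
n.iter <- 5000
mu <- numeric(n.iter); sigma2 <- numeric(n.iter)
mu[1] <- mean(y); sigma2[1] <- var(y)  # initial values
mu0 <- 0; tau0 <- 1000                 # prior mean and variance of mu
a0 <- 0.01; b0 <- 0.01                 # inverse-gamma prior on sigma^2
for (t in 2:n.iter){
  # sample mu from p(mu|sigma, y): a normal distribution
  v <- 1/(n/sigma2[t-1] + 1/tau0)
  m <- v*(sum(y)/sigma2[t-1] + mu0/tau0)
  mu[t] <- rnorm(1, m, sqrt(v))
  # sample sigma^2 from p(sigma|mu, y): an inverse-gamma distribution
  a <- a0 + n/2
  b <- b0 + sum((y - mu[t])^2)/2
  sigma2[t] <- 1/rgamma(1, shape = a, rate = b)
}
hist(mu[-(1:500)])                     # posterior samples of mu after burn-in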
for many parameters it is more efficient than numerical integration and M-H
problem with camels (multimodal posteriors)
problem with cigars (strongly correlated parameters)
samples from the beginning do not reflect the distribution (burn-in)
correlation between subsequent steps (autocorrelation)
[Animation from Maxwell Joseph's GitHub: http://mbjoseph.github.io/images/metrop.gif]
There are plenty of environments that can do Gibbs sampling for you
… and all of them use the BUGS language
y <- rnorm(5, 48.1, 6.1)
N <- 5
my.data <- list(y=y, N=N)
my.data
$y
[1] 36.12103 48.29577 56.31363 52.99604 46.24476
$N
[1] 5
The BUGS language works with graphs: a model is described as a directed acyclic graph of nodes
model
{
  # p(mu) ... mu prior (dnorm is parameterized by mean and precision)
  mu ~ dnorm(0, 0.001)
  # p(sigma) ... sigma prior
  sigma ~ dunif(0, 100)
  # convert sigma to the precision tau = 1/sigma^2
  tau <- 1/(sigma*sigma)
  # p(y|mu, sigma) ... likelihood
  for(i in 1:N)
  {
    y[i] ~ dnorm(mu, tau)
  }
}
We will dump the model to a file using cat("", file="")
cat("
model
{
# p(mu) ... mu prior
mu ~ dnorm (0, 0.001)
# p(sigma) ... sigma prior
sigma ~ dunif(0, 100)
tau<-1/(sigma*sigma)
# p(y|mu, sigma) ... likelihood
for(i in 1:N)
{
y[i] ~ dnorm(mu, tau)
}
}
", file="my_model.txt")
library(R2jags)
fitted.model <- jags(data=my.data, model.file="my_model.txt",
                     parameters.to.save=c("mu", "sigma"),
                     n.chains=1, n.iter=1000, n.burnin=100)
Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph Size: 14
Initializing model
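Once the model has run, the fitted object can be inspected, for example like this (a minimal sketch using standard R2jags accessors; output not shown here)
# posterior summary table for mu and sigma
print(fitted.model)
# trace plots of the chain
traceplot(fitted.model)
# raw posterior samples, e.g. a histogram of mu
hist(fitted.model$BUGSoutput$sims.list$mu)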
uses more sophisticated MCMC sampling than Gibbs or M-H
suffers less from autocorrelation between steps
may be more efficient for complicated posteriors (cigars, camels)
the language is more difficult to command than BUGS
INLA (Integrated Nested Laplace Approximation)
Uses the idea that in certain types of models \( p(\text{some parameters} \mid \text{other parameters}, \text{data}) \) can be approximated by a normal distribution.
This makes the expression \( p(\text{data} \mid \text{parameters}) \times p(\text{parameters}) \) integrable.
fast
does not suffer from sampling issues
applicable only to some classes of models (for details see http://www.r-inla.org/models/latent-models)
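For illustration, the same normal-mean model could be fitted with R-INLA roughly like this (a sketch under the assumption that the INLA package from r-inla.org is installed; the intercept of the model corresponds to mu)
library(INLA)
# intercept-only gaussian model fitted to the y data from above
result <- inla(y ~ 1, family = "gaussian", data = data.frame(y = y))
# posterior summaries of the intercept and of the observation precision
summary(result)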