SFG, 13 November 2015

Highlights

  • Give you a flavour of Bayesian thinking.
  • Go through some simple examples (including code) of how to do this in R.
  • But first, a question…

If all of your friends jumped off a bridge, would you jump too?

  • Bayesians don't ignore prior information!

Introduction

Bayes' rule is very easily derived

\(P(A|B) = \frac{P(A \cap B)}{P(B)}\)

\(P(B|A) = \frac{P(A \cap B)}{P(A)}\)

\(\Rightarrow P(A \cap B) = P(A|B)P(B) = P(B|A)P(A)\)

\(\Rightarrow P(A|B) = \frac{P(B|A)P(A)}{P(B)}\)

  • The general structure provided by this rule can be applied to any statistical problem.

A canonical application: Mammograms and false positives

  • What is the probability that a woman has breast cancer (A) if she gets a positive mammogram (B)?
  • We have \(P(A) = 1.4\%\), \(P(\overline{A}) = 98.6\%\), \(P(B|A) = 75\%\), \(P(B|\overline{A}) = 10\%\).
  • Plug into Bayes' theorem: \[ P(A|B) = \frac{P(B|A)P(A)}{P(B)} = \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|\overline{A})P(\overline{A})} \] \[ \therefore P(A|B) = \frac{0.75 \times 0.014}{0.75 \times 0.014 + 0.1 \times 0.986} \approx 10\% \]
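
A quick numerical check in R (variable names are my own, not from the slides):

p_A      <- 0.014                                ## P(A): prevalence of breast cancer
p_B_A    <- 0.75                                 ## P(B|A): true positive rate
p_B_notA <- 0.10                                 ## P(B|~A): false positive rate
p_B      <- p_B_A * p_A + p_B_notA * (1 - p_A)   ## total probability of a positive test
p_B_A * p_A / p_B                                ## Bayes' rule: about 0.096, i.e. roughly 10%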

Bayesian versus frequentist

  • Bayesians
    • Probability is a degree of (subjective) belief.
    • Parameters (e.g. regression coefficients) are treated as random variables.
  • Frequentists
    • Probability is the limiting frequency in a large number of repeated draws.
    • Parameters are treated as fixed, but unknown.
  • Bayesian concepts are intuitive, matching our everyday understanding of "probability" or "belief in a hypothesis".
    • Compare confusion over p-values and confidence intervals in frequentist paradigm.
    • Bayesian methods are also very flexible and can accommodate any type of distribution.

Conjugate priors

  • In some cases, when the prior and the likelihood are simple enough that the posterior belongs to the same "family" of distributions as the prior, we can derive the posterior analytically. Priors with this property are called conjugate priors.
  • Example: The binomial distribution is conjugate with the beta distribution.
    • Binomial: \(f(k|\theta) = \binom{N}{k}\theta^k (1-\theta)^{N-k}\)
    • Beta: \(f(\theta|\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha-1}(1-\theta)^{\beta-1}\)
    • Posterior: \(f(\theta|k,\alpha,\beta) \propto \theta^{k+\alpha-1}(1-\theta)^{N-k+\beta-1}\)
  • We can apply these distributions to a game of coin flips. We are interested in whether a coin is "fair" or not.
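  • Matching the posterior kernel above to the Beta density shows that the posterior is itself a Beta distribution, \(\theta|k \thicksim Beta(\alpha + k,\, \beta + N - k)\): updating just adds the observed heads to \(\alpha\) and the observed tails to \(\beta\).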

Conjugate priors (cont.)

  • We have two participants. Let's call them Matt and Chris.
  • Matt is a trusting ecologist from Canada. He is willing to bet that the coin is probably fair.
    • Matt's prior is centered around 0.5: \(Beta(\alpha = 20, \beta = 20)\).
  • Chris is a more cautious (cynical?) economist and is holding off judgement until he sees some data.
    • Chris has a "flat" prior: \(Beta(\alpha = 1, \beta = 1)\).
  • It turns out that the coin is actually unfair… It only has a 25% chance of landing on heads.

Conjugate priors (cont.)

  • Now let's simulate some coin flips in R and see how the players update their beliefs…

We can simulate this in R

coin_sim <- function(p, N, M_alpha, M_beta, C_alpha, C_beta) {
  ## Simulate coin outcomes in advance. "Success" = heads.
  outcomes <- sample(1:0, N, prob = c(p, 1 - p), replace = TRUE)
  success <- cumsum(outcomes)
  ## Matt prior
  curve(dbeta(x, M_alpha, M_beta), 
        col = "red", lwd = 2, lty = 2, xlim = c(0, 1), ylim = c(0, 20), 
        xlab = "P(Heads)", ylab = "Density")
  ## Chris prior
  curve(dbeta(x, C_alpha, C_beta),
        col = "dodgerblue", lwd = 2, lty = 2, add = TRUE)
  ## True probability of coin
  lines(x = c(p, p), y = c(0, 20), lty = 2, lwd = 1, col = "grey60")
  ## Matt posterior
  curve(dbeta(x, M_alpha+success[N], M_beta+(N-success[N])),
        add = TRUE, col = "red", lwd = 2)
  ## Chris posterior
  curve(dbeta(x, C_alpha+success[N], C_beta+(N-success[N])),
        add = TRUE, col = "dodgerblue", lwd = 2)
  ## Legend
  legend('topleft', lty = 1, col = c("red", "dodgerblue"),
         legend = c(paste0("Matt ~ Beta(", M_alpha, ", ", M_beta, ")"),
                    paste0("Chris ~ Beta(", C_alpha, ", ", C_beta, ")")))
  text(0.75, 17, labels = paste(N, "flips:", success[N], "heads,", N - success[N], "tails"))
}

We can simulate this in R

coin_sim(p = 0.25, N = 1, 
         M_alpha = 20, M_beta = 20,
         C_alpha = 1, C_beta = 1
         ) 

We can simulate this in R

coin_sim(p = 0.25, N = 10, 
         M_alpha = 20, M_beta = 20,
         C_alpha = 1, C_beta = 1
         ) 

We can simulate this in R

coin_sim(p = 0.25, N = 50, 
         M_alpha = 20, M_beta = 20,
         C_alpha = 1, C_beta = 1
         ) 

We can simulate this in R

coin_sim(p = 0.25, N = 100, 
         M_alpha = 20, M_beta = 20,
         C_alpha = 1, C_beta = 1
         ) 

We can simulate this in R

coin_sim(p = 0.25, N = 200, 
         M_alpha = 20, M_beta = 20,
         C_alpha = 1, C_beta = 1
         ) 

Some important, generalisable points

  • The posterior can be thought of as a weighted average of the prior and data (through the likelihood function).
    • We update our priors in combination with new data.
  • Which component dominates (prior versus data) will depend on their relative uncertainties.
    • The more confident we are in our prior, the more we will tend to stick to it.
  • With "enough" data, we generally – though not always – get posterior convergence regardless of prior.
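
To make the "weighted average" point concrete with the coin example: the Beta-Binomial posterior mean is

\[ E[\theta|k] = \frac{\alpha + k}{\alpha + \beta + N} = \frac{\alpha + \beta}{\alpha + \beta + N} \cdot \frac{\alpha}{\alpha + \beta} + \frac{N}{\alpha + \beta + N} \cdot \frac{k}{N}, \]

i.e. the prior mean \(\frac{\alpha}{\alpha + \beta}\) and the sample proportion \(\frac{k}{N}\), weighted by the prior's \(\alpha + \beta\) "pseudo-flips" and the \(N\) actual flips. Matt's \(Beta(20, 20)\) prior is worth 40 pseudo-flips, which is why his posterior moves towards 0.25 more slowly than Chris's.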

Bayesian regression

Bayesian regression equation

\[ p(\theta|X) = \frac{p(X|\theta)p(\theta)}{p(X)} \]

  • \(p(\theta|X)\) ~ Posterior probability. "How probable are our parameters \((\theta)\), given the data \((X)\)?"
  • \(p(X|\theta)\) ~ Likelihood function. "How likely are the data for a given set of parameters or state of the world (e.g. if we assume a normal distribution)?"
  • \(p(\theta)\) ~ Prior probability. "What do we know about our parameters before we see the data?"
  • Ignoring the normalising constant \(p(X)\), we are left with…

Bayesian regression equation (cont.)

\[ p(\theta|X) \propto p(X|\theta)p(\theta) \]

  • "The posterior is proportional to the likelihood times the prior!"
  • Calculating the posterior probability in a Bayesian setup is really just a matter of specifying a prior and then estimating the likelihood function (i.e. running a regression).
  • Sounds simple! So what's the catch?
    • Choice of prior can be contentious. Do we use subjective priors or noninformative ones?
    • Deriving an analytical solution for the posterior density is often impossible (non-conjugate). Luckily, we are able to solve for this using Markov Chain Monte Carlo (MCMC) simulation.
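
To give a flavour of what MCMC simulation is doing before handing things over to JAGS, here is a minimal random-walk Metropolis sketch for the coin example (my own illustration with made-up data, not part of the original material):

## Target: posterior of theta given k heads in N flips, with a Beta(a, b) prior
set.seed(123)
k <- 12; N <- 50              ## made-up data: 12 heads in 50 flips
a <- 1; b <- 1                ## flat prior
log_post <- function(theta) { ## log-posterior, up to a normalising constant
  if (theta <= 0 || theta >= 1) return(-Inf)
  dbinom(k, N, theta, log = TRUE) + dbeta(theta, a, b, log = TRUE)
}
draws <- numeric(10000)
theta <- 0.5                  ## starting value
for (i in seq_along(draws)) {
  proposal <- theta + rnorm(1, sd = 0.1)  ## symmetric random-walk proposal
  if (log(runif(1)) < log_post(proposal) - log_post(theta)) theta <- proposal
  draws[i] <- theta           ## keep the current value (whether or not we accepted)
}
mean(draws[-(1:1000)])        ## after burn-in, close to the analytical mean (a + k)/(a + b + N) = 0.25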

MCMC, JAGS and rjags

  • JAGS (Just Another Gibbs Sampler) is the workhorse programme for Bayesian MCMC simulation. Interacts with R through the rjags package.
    • Built off the legacy BUGS programme (R package: BRugs).
  • Other options include STAN (RStan), MCMCpack, LearnBayes, etc.
    • All work in similar ways, but I'd advise sticking with rjags for newbies (v. flexible, good support and documentation, etc.)
  • Let's walk through a simple example.

JAGS example

  • Say we have data that is described by a linear model: \[y_i = 2 + 4x_i + \epsilon_i,\] where \(\epsilon_i \thicksim \mathcal{N}(0, \tau)\) and \(\tau = \frac{1}{\sigma^2}\).
  • We estimate this model (using noninformative priors) with rjags as follows.
## Load packages
require(rjags)
require(dplyr) ## For manipulating data frames
require(tidyr) ## For tidying data frames
require(ggplot2) ## For plotting

## Create some fake data
rm(list=ls())
set.seed(123)
x <- runif(50)
y <- 2 + 4*x + rnorm(50, mean = 0, sd = 1)

my_data <- data.frame(y, x)
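
As a quick frequentist benchmark (my own addition, not in the original code), we can also fit the same model by OLS and keep the estimates to compare with the posterior summaries later:

ols_fit <- lm(y ~ x, data = my_data)  ## classical OLS fit of the same linear model
coef(ols_fit)                         ## point estimates to compare with the posterior means below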

JAGS example (cont.)

plot(my_data$x, my_data$y)

JAGS example (cont.)

  • JAGS needs to read a .txt file containing the model specification.
  • Good practice to write this in the R file itself so that everything is self-contained.
model_string <- paste(
  "model {

  for (i in 1:N){
  
  y[i] ~ dnorm(mu.y[i], tau.y) ## Likelihood (normal linear model, analogous to OLS)
  mu.y[i] <- alpha + beta*x[i]
  
  }

  ## Noninformative priors for all parameters (e.g. mean zero, large variance)
  alpha ~ dnorm( 0, 0.001 ) 
  beta ~ dnorm( 0, 0.001 )
  tau.y ~ dgamma(0.001, 0.001)  
  
  }" 
  )  

writeLines(model_string, con = "jags_example.txt" )

JAGS example (cont.)

## Run the JAGS model
N <- length(x)

my_list <- list("N" = N, "y" = my_data$y, "x" = my_data$x)
jags_inits <- function() {list (alpha = 0, beta = 0, tau.y = 1)}
params <- c("alpha", "beta", "tau.y")

jags_mod <- jags.model(file = "jags_example.txt", data = my_list, 
                       inits = jags_inits, n.chains = 1, n.adapt = 1000)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 50
##    Unobserved stochastic nodes: 3
##    Total graph size: 210
## 
## Initializing model
update(jags_mod, n.iter = 1000) ## burn in
jags_samples <- coda.samples(jags_mod, variable.names = params, n.iter = 15000)

JAGS example (cont.)

## Some model diagnostics
geweke.diag(jags_samples)
## [[1]]
## 
## Fraction in 1st window = 0.1
## Fraction in 2nd window = 0.5 
## 
##   alpha    beta   tau.y 
## -1.7784  1.7243 -0.9083
heidel.diag(as.mcmc(jags_samples))
##                                     
##       Stationarity start     p-value
##       test         iteration        
## alpha passed       1501      0.0859 
## beta  passed       1501      0.1311 
## tau.y passed          1      0.7417 
##                               
##       Halfwidth Mean Halfwidth
##       test                    
## alpha passed    1.86 0.01271  
## beta  passed    4.39 0.02117  
## tau.y passed    1.13 0.00383
raftery.diag(as.mcmc(jags_samples))
## 
## Quantile (q) = 0.025
## Accuracy (r) = +/- 0.005
## Probability (s) = 0.95 
##                                              
##        Burn-in  Total Lower bound  Dependence
##        (M)      (N)   (Nmin)       factor (I)
##  alpha 10       11942 3746         3.19      
##  beta  10       11104 3746         2.96      
##  tau.y 2        3844  3746         1.03
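
Beyond these numerical diagnostics, coda also provides visual checks and effective sample sizes (my own addition; plots not shown):

plot(jags_samples)           ## coda's default trace and density plots for each parameter
effectiveSize(jags_samples)  ## effective number of independent draws per parameter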

JAGS example (cont.)

summary(jags_samples)
## 
## Iterations = 1001:16000
## Thinning interval = 1 
## Number of chains = 1 
## Sample size per chain = 15000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##        Mean     SD Naive SE Time-series SE
## alpha 1.852 0.2808 0.002293       0.006218
## beta  4.392 0.4670 0.003813       0.010279
## tau.y 1.131 0.2297 0.001875       0.001954
## 
## 2. Quantiles for each variable:
## 
##         2.5%    25%   50%   75% 97.5%
## alpha 1.3020 1.6645 1.851 2.038 2.415
## beta  3.4653 4.0853 4.398 4.703 5.303
## tau.y 0.7278 0.9698 1.116 1.276 1.633

JAGS example (cont.)

reg_df <- tbl_df(as.data.frame(jags_samples[[1]])) 
reg_df <- reg_df %>% gather(coef, value)
ggplot(reg_df, aes(x = value)) + geom_density() + facet_wrap(~ coef, scales = "free") 
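
One extra step worth knowing (not in the original material): JAGS parameterises the normal with the precision \(\tau = 1/\sigma^2\), so the posterior for \(\sigma\) can be recovered directly from the saved draws:

sigma_draws <- 1 / sqrt(as.matrix(jags_samples)[, "tau.y"])  ## convert precision draws to a sd
quantile(sigma_draws, c(0.025, 0.5, 0.975))                  ## should sit close to the true sd of 1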

JAGS example (cont.)

  • Results
    • Our simple model does a decent enough job of estimating the model parameters.
    • The true values of \(\alpha=2\) and \(\beta=4\) fall comfortably inside the relevant posterior densities.
    • The model fit should improve with more data points.
    • On the other hand, we could also try subjective (informative) priors instead of noninformative ones and see how that affects the final result.
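
As a sketch of that last point (hypothetical prior values of my own), only the prior lines of the model string need to change; remember that dnorm() in JAGS takes a mean and a precision (1/variance), not a standard deviation:

  ## Example of informative ("sceptical") priors inside the model string:
  alpha ~ dnorm(0, 1)   ## intercept centred on 0 with sd = 1 (precision = 1)
  beta  ~ dnorm(0, 1)   ## slope centred on 0 with sd = 1

With only 50 data points, these priors would pull the posteriors for \(\alpha\) and \(\beta\) somewhat towards zero relative to the noninformative fit.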