SFG, 13 November 2015

Highlights

  • Give you a flavour of Bayesian thinking.
  • Go through some simple examples (including code) of how to do this in R.
  • But first, a question…

If all of your friends jumped off a bridge, would you jump too?

  • Bayesians don't ignore prior information!

Introduction

Bayes' rule is very easily derived

\(P(A|B) = \frac{P(A \cap B)}{P(B)}\)

\(P(B|A) = \frac{P(A \cap B)}{P(A)}\)

\(\Rightarrow P(A \cap B) = P(A|B)P(B) = P(B|A)P(A)\)

\(\Rightarrow P(A|B) = \frac{P(B|A)P(A)}{P(B)}\)

  • The general structure provided by this rule can be applied to any statistical problem.

A canonical application: Mammograms and false positives

  • What is the probability that a woman has breast cancer (A) if she gets a positive mammogram (B)?
  • We have \(P(A) = 1.4\%\), \(P(\overline{A}) = 98.6\%\), \(P(B|A) = 75\%\), \(P(B|\overline{A}) = 10\%\).
  • Plug into Bayes' theorem: \[ P(A|B) = \frac{P(B|A)P(A)}{P(B)} = \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|\overline{A})P(\overline{A})} \] \[ \therefore P(A|B) = \frac{0.75 \times 0.014}{0.75 \times 0.014 + 0.1 \times 0.986} \approx 10\% \]
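
A quick numerical check in R (variable names are my own, not from the slides):

p_A      <- 0.014                                ## P(A): prevalence of breast cancer
p_B_A    <- 0.75                                 ## P(B|A): true positive rate
p_B_notA <- 0.10                                 ## P(B|~A): false positive rate
p_B      <- p_B_A * p_A + p_B_notA * (1 - p_A)   ## total probability of a positive test
p_B_A * p_A / p_B                                ## Bayes' rule: about 0.096, i.e. roughly 10%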

Bayesian versus frequentist

  • Bayesians
    • Probability is a degree of (subjective) belief.
    • Parameters (e.g. regression coefficients) are treated as random variables.
  • Frequentists
    • Probability is the limiting frequency in a large number of repeated draws.
    • Parameters are treated as fixed, but unknown.
  • Bayesian concepts are intuitive, matching our everyday understanding of "probability" or "belief in a hypothesis".
    • Compare confusion over p-values and confidence intervals in frequentist paradigm.
    • Bayesian methods are also very flexible and can accommodate any type of distribution.

Conjugate priors

  • In some cases, when the prior and the likelihood are simple enough that the posterior belongs to the same "family" of distributions as the prior, we can derive the posterior analytically. Priors with this property are called conjugate priors.
  • Example: The binomial distribution is conjugate with the beta distribution.
    • Binomial: \(f(k|\theta) = \binom{N}{k}\theta^k (1-\theta)^{N-k}\)
    • Beta: \(f(\theta|\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha-1}(1-\theta)^{\beta-1}\)
    • Posterior: \(f(\theta|k,\alpha,\beta) \propto \theta^{k+\alpha-1}(1-\theta)^{N-k+\beta-1}\)
  • We can apply these distributions to a game of coin flips. We are interested in whether a coin is "fair" or not.
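  • Matching the posterior kernel above to the Beta density shows that the posterior is itself a Beta distribution, \(\theta|k \thicksim Beta(\alpha + k,\, \beta + N - k)\): updating just adds the observed heads to \(\alpha\) and the observed tails to \(\beta\).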

Conjugate priors (cont.)

  • We have two participants. Let's call them Matt and Chris.
  • Matt is a trusting ecologist from Canada. He is willing to bet that the coin is probably fair.
    • Matt's prior is centered around 0.5: \(Beta(\alpha = 20, \beta = 20)\).
  • Chris is a more cautious (cynical?) economist and is holding off judgement until he sees some data.
    • Chris has a "flat" prior: \(Beta(\alpha = 1, \beta = 1)\).
  • It turns out that the coin is actually unfair… It only has a 25% chance of landing on heads.

Conjugate priors (cont.)

  • Now let's simulate some coin flips in R and see how the players update their beliefs…

We can simulate this in R

coin_sim <- function(p, N, M_alpha, M_beta, C_alpha, C_beta) {
  ## Simulate coin outcomes in advance. "Success" = heads.
  outcomes <- sample(1:0, N, prob = c(p, 1 - p), replace = TRUE)
  success <- cumsum(outcomes)
  ## Matt prior
  curve(dbeta(x, M_alpha, M_beta), 
        col = "red", lwd = 2, lty = 2, xlim = c(0, 1), ylim = c(0, 20), 
        xlab = "P(Heads)", ylab = "Density")
  ## Chris prior
  curve(dbeta(x, C_alpha, C_beta),
        col = "dodgerblue", lwd = 2, lty = 2, add = TRUE)
  ## True probability of coin
  lines(x = c(p, p), y = c(0, 20), lty = 2, lwd = 1, col = "grey60")
  ## Matt posterior
  curve(dbeta(x, M_alpha+success[N], M_beta+(N-success[N])),
        add = TRUE, col = "red", lwd = 2)
  ## Chris posterior
  curve(dbeta(x, C_alpha+success[N], C_beta+(N-success[N])),
        add = TRUE, col = "dodgerblue", lwd = 2)
  ## Legend
  legend('topleft', lty = 1, col = c("red", "dodgerblue"),
         legend = c(paste0("Matt ~ Beta(", M_alpha, ", ", M_beta, ")"),
                    paste0("Chris ~ Beta(", C_alpha, ", ", C_beta, ")")))
  text(0.75, 17, labels = paste(N, "flips:", success[N], "heads,", N - success[N], "tails"))
}

We can simulate this in R

coin_sim(p = 0.25, N = 1, 
         M_alpha = 20, M_beta = 20,
         C_alpha = 1, C_beta = 1
         ) 

We can simulate this in R

coin_sim(p = 0.25, N = 10, 
         M_alpha = 20, M_beta = 20,
         C_alpha = 1, C_beta = 1
         ) 

We can simulate this in R

coin_sim(p = 0.25, N = 50, 
         M_alpha = 20, M_beta = 20,
         C_alpha = 1, C_beta = 1
         ) 

We can simulate this in R

coin_sim(p = 0.25, N = 100, 
         M_alpha = 20, M_beta = 20,
         C_alpha = 1, C_beta = 1
         ) 

We can simulate this in R

coin_sim(p = 0.25, N = 200, 
         M_alpha = 20, M_beta = 20,
         C_alpha = 1, C_beta = 1
         ) 

Some important, generalisable points

  • The posterior can be thought of as a weighted average of the prior and data (through the likelihood function).
    • We update our priors in combination with new data.
  • Which component dominates (prior versus data) will depend on their relative uncertainties.
    • The more confident we are in our prior, the more we will tend to stick to it.
  • With "enough" data, we generally – though not always – get posterior convergence regardless of prior.
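
To make the "weighted average" point concrete with the coin example: the Beta-Binomial posterior mean is

\[ E[\theta|k] = \frac{\alpha + k}{\alpha + \beta + N} = \frac{\alpha + \beta}{\alpha + \beta + N} \cdot \frac{\alpha}{\alpha + \beta} + \frac{N}{\alpha + \beta + N} \cdot \frac{k}{N}, \]

i.e. the prior mean \(\frac{\alpha}{\alpha + \beta}\) and the sample proportion \(\frac{k}{N}\), weighted by the prior's \(\alpha + \beta\) "pseudo-flips" and the \(N\) actual flips. Matt's \(Beta(20, 20)\) prior is worth 40 pseudo-flips, which is why his posterior moves towards 0.25 more slowly than Chris's.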

Bayesian regression

Bayesian regression equation

\[ p(\theta|X) = \frac{p(X|\theta)p(\theta)}{p(X)} \]

  • \(p(\theta|X)\) ~ Posterior probability. "How probable are our parameters \((\theta)\), given the data \((X)\)?"
  • \(p(X|\theta)\) ~ Likelihood function. "How likely are the data for a given set of parameters or state of the world (e.g. if we assume a normal distribution)?"
  • \(p(\theta)\) ~ Prior probability. "What do we know about our parameters before we see the data?"
  • Ignoring the normalising constant \(p(X)\), we are left with…

Bayesian regression equation (cont.)

\[ p(\theta|X) \propto p(X|\theta)p(\theta) \]

  • "The posterior is proportional to the likelihood times the prior!"
  • Calculating the posterior probability in a Bayesian setup is really just a matter of specifying a prior and then estimating the likelihood function (i.e. running a regression).
  • Sounds simple! So what's the catch?
    • Choice of prior can be contentious. Do we use subjective priors or noninformative ones?
    • Deriving an analytical solution for the posterior density is often impossible (non-conjugate). Luckily, we are able to solve for this using Markov Chain Monte Carlo (MCMC) simulation.
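
To give a flavour of what MCMC simulation is doing before handing things over to JAGS, here is a minimal random-walk Metropolis sketch for the coin example (my own illustration with made-up data, not part of the original material):

## Target: posterior of theta given k heads in N flips, with a Beta(a, b) prior
set.seed(123)
k <- 12; N <- 50              ## made-up data: 12 heads in 50 flips
a <- 1; b <- 1                ## flat prior
log_post <- function(theta) { ## log-posterior, up to a normalising constant
  if (theta <= 0 || theta >= 1) return(-Inf)
  dbinom(k, N, theta, log = TRUE) + dbeta(theta, a, b, log = TRUE)
}
draws <- numeric(10000)
theta <- 0.5                  ## starting value
for (i in seq_along(draws)) {
  proposal <- theta + rnorm(1, sd = 0.1)  ## symmetric random-walk proposal
  if (log(runif(1)) < log_post(proposal) - log_post(theta)) theta <- proposal
  draws[i] <- theta           ## keep the current value (whether or not we accepted)
}
mean(draws[-(1:1000)])        ## after burn-in, close to the analytical mean (a + k)/(a + b + N) = 0.25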

MCMC, JAGS and rjags

  • JAGS (Just Another Gibbs Sampler) is the workhorse programme for Bayesian MCMC simulation. Interacts with R through the rjags package.
    • Built off the legacy BUGS programme (R package: BRugs).
  • Other options include STAN (RStan), MCMCpack, LearnBayes, etc.
    • All work in similar ways, but I'd advise sticking with rjags for newbies (v. flexible, good support and documentation, etc.)
  • Let's walk through a simple example.

JAGS example

  • Say we have data that is described by a linear model: \[y_i = 2 + 4x_i + \epsilon_i,\] where \(\epsilon_i \thicksim \mathcal{N}(0, \tau)\) and \(\tau = \frac{1}{\sigma^2}\).
  • We estimate this model (using noninformative priors) with rjags as follows.
## Load packages
require(rjags)
require(dplyr) ## For manipulating data frames
require(tidyr) ## For tidying data frames
require(ggplot2) ## For plotting

## Create some fake data
rm(list=ls())
set.seed(123)
x <- runif(50)
y <- 2 + 4*x + rnorm(50, mean = 0, sd = 1)

my_data <- data.frame(y, x)
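
As a quick frequentist benchmark (my own addition, not in the original code), we can also fit the same model by OLS and keep the estimates to compare with the posterior summaries later:

ols_fit <- lm(y ~ x, data = my_data)  ## classical OLS fit of the same linear model
coef(ols_fit)                         ## point estimates to compare with the posterior means below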

JAGS example (cont.)

plot(my_data$x, my_data$y)

JAGS example (cont.)

  • JAGS needs to read a .txt file containing the model specification.
  • Good practice to write this in the R file itself so that everything is self-contained.
model_string <- paste(
  "model {

  for (i in 1:N){
  
  y[i] ~ dnorm(mu.y[i], tau.y) ## Likelihood (normal linear model, analogous to OLS)
  mu.y[i] <- alpha + beta*x[i]
  
  }

  ## Noninformative priors for all parameters (e.g. mean zero, large variance)
  alpha ~ dnorm( 0, 0.001 ) 
  beta ~ dnorm( 0, 0.001 )
  tau.y ~ dgamma(0.001, 0.001)  
  
  }" 
  )  

writeLines(model_string, con = "jags_example.txt" )

JAGS example (cont.)

## Run the JAGS model
N <- length(x)

my_list <- list("N" = N, "y" = my_data$y, "x" = my_data$x)
jags_inits <- function() {list (alpha = 0, beta = 0, tau.y = 1)}
params <- c("alpha", "beta", "tau.y")

jags_mod <- jags.model(file = "jags_example.txt", data = my_list, 
                       inits = jags_inits, n.chains = 1, n.adapt = 1000)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 50
##    Unobserved stochastic nodes: 3
##    Total graph size: 210
## 
## Initializing model
update(jags_mod, n.iter = 1000) ## burn in
jags_samples <- coda.samples(jags_mod, variable.names = params, n.iter = 15000)

JAGS example (cont.)

## Some model diagnostics
geweke.diag(jags_samples)
## [[1]]
## 
## Fraction in 1st window = 0.1
## Fraction in 2nd window = 0.5 
## 
##   alpha    beta   tau.y 
## -1.7784  1.7243 -0.9083
heidel.diag(as.mcmc(jags_samples))
##                                     
##       Stationarity start     p-value
##       test         iteration        
## alpha passed       1501      0.0859 
## beta  passed       1501      0.1311 
## tau.y passed          1      0.7417 
##                               
##       Halfwidth Mean Halfwidth
##       test                    
## alpha passed    1.86 0.01271  
## beta  passed    4.39 0.02117  
## tau.y passed    1.13 0.00383
raftery.diag(as.mcmc(jags_samples))
## 
## Quantile (q) = 0.025
## Accuracy (r) = +/- 0.005
## Probability (s) = 0.95 
##                                              
##        Burn-in  Total Lower bound  Dependence
##        (M)      (N)   (Nmin)       factor (I)
##  alpha 10       11942 3746         3.19      
##  beta  10       11104 3746         2.96      
##  tau.y 2        3844  3746         1.03
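
Beyond these numerical diagnostics, coda also provides visual checks and effective sample sizes (my own addition; plots not shown):

plot(jags_samples)           ## coda's default trace and density plots for each parameter
effectiveSize(jags_samples)  ## effective number of independent draws per parameter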

JAGS example (cont.)

summary(jags_samples)
## 
## Iterations = 1001:16000
## Thinning interval = 1 
## Number of chains = 1 
## Sample size per chain = 15000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##        Mean     SD Naive SE Time-series SE
## alpha 1.852 0.2808 0.002293       0.006218
## beta  4.392 0.4670 0.003813       0.010279
## tau.y 1.131 0.2297 0.001875       0.001954
## 
## 2. Quantiles for each variable:
## 
##         2.5%    25%   50%   75% 97.5%
## alpha 1.3020 1.6645 1.851 2.038 2.415
## beta  3.4653 4.0853 4.398 4.703 5.303
## tau.y 0.7278 0.9698 1.116 1.276 1.633

JAGS example (cont.)

reg_df <- tbl_df(as.data.frame(jags_samples[[1]])) 
reg_df <- reg_df %>% gather(coef, value)
ggplot(reg_df, aes(x = value)) + geom_density() + facet_wrap(~ coef, scales = "free") 
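
One extra step worth knowing (not in the original material): JAGS parameterises the normal with the precision \(\tau = 1/\sigma^2\), so the posterior for \(\sigma\) can be recovered directly from the saved draws:

sigma_draws <- 1 / sqrt(as.matrix(jags_samples)[, "tau.y"])  ## convert precision draws to a sd
quantile(sigma_draws, c(0.025, 0.5, 0.975))                  ## should sit close to the true sd of 1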

JAGS example (cont.)

  • Results
    • Our simple model does a decent enough job of estimating the model parameters.
    • The true values of \(\alpha=2\) and \(\beta=4\) fall comfortably inside the relevant posterior densities.
    • The model fit should improve with more data points.
    • On the other hand, we could also try subjective (informative) priors instead of noninformative ones and see how that affects the final result.
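
As a sketch of that last point (hypothetical prior values of my own), only the prior lines of the model string need to change; remember that dnorm() in JAGS takes a mean and a precision (1/variance), not a standard deviation:

  ## Example of informative ("sceptical") priors inside the model string:
  alpha ~ dnorm(0, 1)   ## intercept centred on 0 with sd = 1 (precision = 1)
  beta  ~ dnorm(0, 1)   ## slope centred on 0 with sd = 1

With only 50 data points, these priors would pull the posteriors for \(\alpha\) and \(\beta\) somewhat towards zero relative to the noninformative fit.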