17 July 2018

## What is Stan

• Stan is a Turing-complete, probabilistic programming language that is used mostly for Bayesian inference
• Stan compiles in two stages: you write your model in the Stan language, which is translated into C++ and then compiled to native code for your platform
• Stan includes a matrix and math library that supports automatic differentiation
• Stan includes interfaces to R and other languages
• There are post-inference plotting libraries that can be used with Stan, like bayesplot
• Stan has thousands of users, and the number is growing. Stan is used in clinical drug trials, genomics and cancer biology, population dynamics, psycholinguistics, social networks, finance and econometrics, professional sports, publishing, recommender systems, educational testing, climate models, and many more
• There are companies building commercial products on top of Stan

## What is Known

• As a matter of notation we call:
  • Observed data $$\boldsymbol{y}$$
  • Unobserved but observable data $$\boldsymbol{\widetilde{y}}$$
  • Unobserved and unobservable parameters $$\boldsymbol{\theta}$$
  • Covariates $$\boldsymbol{x}$$
  • Probability distribution (density) $$\boldsymbol{p(\cdot)}$$
• Estimation is the process of figuring out the unknowns, i.e. the unobserved quantities
• In classical machine learning and frequentist inference (including prediction), the problem is framed in terms of finding the most likely value of $$\boldsymbol{\theta}$$
• Bayesians are extremely greedy people: they are not satisfied with the maximum, they want the whole distribution

## What is Bayes

$p(\theta\mid y, X) = \frac{p(y \mid X, \theta) * p(\theta)}{p(y)} = \frac{p(y \mid X, \theta) * p(\theta)}{\int p(y \mid X, \theta) * p(\theta) \space d\theta} \propto p(y \mid X, \theta) * p(\theta)$

• Bayesian inference is an approach to figuring out the updated distribution of $$\boldsymbol{\theta}$$, namely $$\boldsymbol{p(\theta \mid y, X)}$$, after observing $$\boldsymbol{y}$$ and $$\boldsymbol{X}$$
• When $$\boldsymbol{p(y \mid X, \theta)}$$ is viewed as a function of $$\boldsymbol{\theta}$$ with the observed $$\boldsymbol{y}$$ held fixed, it is called the likelihood function - this encodes our data generating process
• An MCMC algorithm draws from an implied probability distribution $$\boldsymbol{p(\theta \mid y, X)}$$
• In Stan we specify: $\log[p(\theta) * p(y \mid X, \theta)] = \log[p(\theta)] + \sum_{i=1}^{N}\log[p(y_i \mid x_i, \theta)]$
• As you can see, Bayes is an approach to inference, not to modeling: the data generating process $$\boldsymbol{p(y \mid X, \theta)}$$ still has to be specified separately
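To make the MCMC bullet concrete, here is a minimal random-walk Metropolis sketch in Python for a Bernoulli model with a flat prior. This is not Stan's algorithm (Stan uses Hamiltonian Monte Carlo / NUTS), and the data here are hypothetical; the point is only that a sampler needs the log posterior up to a constant:

```python
import numpy as np

rng = np.random.default_rng(1)
y = np.array([0, 1, 0, 1, 1])  # hypothetical Bernoulli data

def log_post(theta):
    # log p(theta) + sum_i log p(y_i | theta), with a flat prior on (0, 1)
    if theta <= 0 or theta >= 1:
        return -np.inf
    return np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))

# Random-walk Metropolis: accept a proposal with probability
# min(1, p(proposal | y) / p(current | y))
draws, theta = [], 0.5
for _ in range(20_000):
    prop = theta + rng.normal(0, 0.2)
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        theta = prop
    draws.append(theta)

print(np.mean(draws[2_000:]))  # close to the exact posterior mean 4/7
```

Because the flat prior is Beta(1, 1) and the data contain 3 ones and 2 zeros, conjugacy gives the exact posterior Beta(4, 3), so the post-burn-in draw mean should be near 4/7.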

## You Do Not Need Stan to Specify a Model

• Here we specify a Bernoulli model in R and sample from the posterior to infer the proportion $$\theta$$. You can do the same in pretty much any language.
log_p <- function(theta, data) {
  # Bernoulli log-likelihood: sum_i y_i * log(theta) + (1 - y_i) * log(1 - theta)
  lp <- 0
  for (i in 1:data$N) lp <- lp + log(theta) * data$y[i] + log(1 - theta) * (1 - data$y[i])
  return(lp)
}
data <- list(y = c(0, 1, 0, 1, 1), N = 5)      # observed data
theta <- seq(0.001, 0.999, length.out = 250)   # grid of parameter values
log_lik <- log_p(theta = theta, data = data)
log_prior <- log(dbeta(theta, 1, 1))           # flat Beta(1, 1) prior
log_posterior <- log_lik + log_prior
posterior <- exp(log_posterior)
posterior <- posterior / sum(posterior)        # normalize over the grid
post_draws <- sample(theta, size = 1e5, replace = TRUE, prob = posterior)
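As a sanity check on this grid approximation, the same computation can be sketched in Python. With the flat Beta(1, 1) prior and data containing 3 ones out of 5, conjugacy says the exact posterior is Beta(1 + 3, 1 + 2) = Beta(4, 3), so the grid-posterior mean should be close to 4/7:

```python
import numpy as np

# Same data and theta grid as the R code above
y = np.array([0, 1, 0, 1, 1])
theta = np.linspace(0.001, 0.999, 250)

# Bernoulli log-likelihood in terms of counts: s successes out of n trials
s, n = y.sum(), y.size
log_lik = s * np.log(theta) + (n - s) * np.log(1 - theta)
log_prior = 0.0  # Beta(1, 1) density is flat, so its log is 0

posterior = np.exp(log_lik + log_prior)
posterior /= posterior.sum()  # normalize over the grid

grid_mean = (theta * posterior).sum()
# Exact posterior is Beta(4, 3) with mean 4/7, so grid_mean should be near 0.571
print(round(grid_mean, 3))
```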

## Inference vs Making Decisions

• In academia we often care about estimating some unknown quantity
• In industry, we are often interested in making a decision, not just estimating unknowns
• Decisions need to account for the decision maker's risk preferences and take exogenous costs and benefits into account
• The decision problem is logically separate from the inference task. Why?
• You do not want to evaluate utility at the average values of the parameters; instead, evaluate the utility at each value of the parameters and then average. Why?
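A tiny sketch of why that order matters, using hypothetical posterior draws and a hypothetical concave utility (neither is from the notes): for a nonlinear utility, the utility at the posterior-mean parameter differs from the posterior mean of the utility, and only the latter is the correct expected utility for decision making.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical posterior draws for theta (a stand-in for MCMC output)
theta_draws = rng.beta(4, 3, size=100_000)

def utility(theta):
    # Hypothetical concave utility of acting as if the rate were theta
    return np.log(theta)

u_of_mean = utility(theta_draws.mean())   # wrong: utility at the average parameter
mean_of_u = utility(theta_draws).mean()   # right: average the utility over draws

# Jensen's inequality: for concave u, E[u(theta)] <= u(E[theta])
print(u_of_mean, mean_of_u)
```

With a concave utility the naive plug-in value systematically overstates expected utility, which is exactly the kind of error that matters when costs and risk preferences enter the decision.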

## Stan Manual

• Let's take a look at the manual

## Linear Regression (Section 9.1)

• Linear regression can be written in many different ways.
• Linear predictor plus Normal noise $y_n = \alpha + \beta x_n + \epsilon_n, \space \mathrm{where} \space \epsilon_n \sim \mathrm{Normal}(0, \sigma)$

• It can also be written in terms of mean and variance

$y_n \sim \mathrm{Normal}(\alpha + \beta x_n, \ \sigma)$

• What is missing?

## Linear Regression, Fully Specified

\begin{aligned} y_n &\sim \mathrm{Normal}(\alpha + \beta x_n, \ \sigma) \\ \alpha &\sim \mathrm{Normal}(0, \ 10) \\ \beta &\sim \mathrm{Normal}(0, \ 10) \\ \sigma &\sim \mathrm{Cauchy_+}(0, \ 2.5) \end{aligned}

• Why did we choose these priors?

## Simple Linear Regression in Stan

data {
  int<lower=1> N;
  vector[N] y;
  vector[N] x;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  alpha ~ normal(0, 10);    // not in the manual
  beta ~ normal(0, 10);     // but you should do it
  sigma ~ cauchy(0, 2.5);
  y ~ normal(alpha + beta * x, sigma);
}

## Linear Regression in Stan (Bonus!)

...
generated quantities {
  vector[N] y_rep;    // replicated data for posterior predictive checks
  for (n in 1:N) {
    y_rep[n] = normal_rng(alpha + beta * x[n], sigma);
  }
}