Processing math: 100%

Module 8 - Gaussian processes for time series analysis

Doug McNeall & Andrew Parnell

12th May 2016

Bayesian time series analysis

We can start by thinking of time series analysis as a regression problem, with

$y(x) = f(x) + \epsilon$

Where $y(x)$ is the output of interest, $f(x)$ is some function and $\epsilon$ is an additive white noise process.

We would like to:

Evaluate $f(x)$
Find the probability distribution of $y^*$ for some $x^*$

We make the assumption that $y$ is ordered by $x$ , we fit a curve to the points and extrapolate.

Bayesian time series analysis

But! Bayes' theorem naturally encodes Occam's razor:

"Among competing hypotheses, the one with the fewest assumptions should be selected."

The solutions with tighter curves are more complex and contain more assumptions.

Bayesian time series analysis

What is a Gaussian process?

A distribution conditioned on observed data
Formally, a Gaussian process generates data located throughout some domain such that any finite subset of the range follows a multivariate Gaussian distribution (Ebden 2008) http://www.robots.ox.ac.uk/~mebden/reports/GPtutorial.pdf

The multivariate Normal distribution for GPs

Start with a 2-dimensional normal distribution, with the shape defined by a 2x2 covariance matrix.

The multivariate Normal distribution for GPs

An observation on one dimension changes distribution of the other (and reduces uncertainty).
The conditional distribution $p(x_{2}|x_{1}=x)$ is different from the marginal.

The multivariate Normal distribution for GPs

Extend to a two-observation time series

An observation on $x_1$ changes the conditional ditribution for $x_2$

Or for a longer time series

We can extend to continuous time

The covariance function

The GP relies on the covariance kernel function, which has the general form $k(x_{1}, x_{2})$

We might choose something like

$k(x_{1}, x_{2}) = \tau^2 exp[- \rho(x_{1}-x_{2})^2]$

So that the covariance reduces as the distance between $x_1$ and $x_2$ increases, depending on length parameter $\rho$ .

The maths

$y$ is the vector of observations of $y_t$ , a response variable at time $t$ . We can model $y$ as drawn from a multivariate Normal distribution:

$y \sim MVN(\mu, \Sigma)$ $\mu$ is some mean function and $\Sigma$ is a covariance matrix where $\Sigma_{ij} = k(x_{1}, x_{2}) = \tau^2 e^{-\rho(t_{i} - t_{j})^{2}}$ if $i \neq j$

The covariance matrix

Choices of covariance function

The choice of covaraiance function encodes our prior knowledge about the process - smoothness, stationarity, periodicity.
We can build up a covariance function by parts, for example adding a linear or sinusoidal part.
There is some good guidance on choosing correlation functions in the MUCM toolkit

The nugget term

The diagonal $(i=j)$ of the covariance matrix contains the information about the uncertainty relationship between a point and itself!
If there is NO uncertainty at a point, the mean function is constrained to go through the point (this is useful for deterministic computer models).
We can add a nugget, so that the mean function isn't constrained.

if $i \neq j$ $\Sigma_{ij} = \tau^2 +\sigma^2$ if $i=j$ (i.e. on the diagonal).

The nugget term

Advantages of using GPs

Freer in form than some models
Highly flexible, general and can apply to many problems
Can get some very good predictions
Lets the data "speak for itself"

An example in regression

Disadvantages of using GPs

Can be dependent on good priors
Can be more difficlut to interpret parameters to understand the underlying model
Slow for large data sets (including in JAGS)

Fitting GPs in JAGS

Notation

# y(t): Response variable at time t, defined on continuous time
# y: vector of all observations
# alpha: Overall mean parameter
# sigma: residual standard deviation parameter (sometimes known in the GP world as the nugget)
# rho: decay parameter for the GP autocorrelation function
# tau: GP standard deviation parameter

Sampling from A GP

# Likelihood:
# y ~ MVN(Mu, Sigma)
# where MVN is the multivariate normal distribution and
# Mu[t] = alpha
# Sigma is a covariance matrix with:
# Sigma_{ij} = tau^2 * exp( -rho * (t_i - t_j)^2 ) if i != j
# Sigma_{ij} = tau^2 + sigma^2 if i=j
# The part exp( -rho * (t_i - t_j)^2 ) is known as the autocorrelation function

# Prior
# alpha ~ N(0,100)
# sigma ~ U(0,10)
# tau ~ U(0,10)
# rho ~ U(0.1, 5) # Need something reasonably informative here

Simulate from the GP

T = 20 # default is 20 can take to e.g T = 100 but fitting gets really slow ...
alpha = 0 # default is 0
sigma = 0.03 # default is 0.03
tau = 1 # default is 1
rho = 1# default is 1
set.seed(123) # ensure reproducablility
t = sort(runif(T))
Sigma = sigma^2 * diag(T) + tau^2 * exp( - rho * outer(t,t,'-')^2 )
y = mvrnorm(1,rep(alpha,T), Sigma)

Simulate from the GP

Fit a GP model to the data using JAGS

# Jags code to fit the model to the simulated data
model_code = '
model
{
  # Likelihood
  y ~ dmnorm(Mu, Sigma.inv)
  Sigma.inv <- inverse(Sigma)
  
  # Set up mean and covariance matrix
  for(i in 1:T) {
    Mu[i] <- alpha
    Sigma[i,i] <- pow(sigma, 2) + pow(tau, 2)
  
    for(j in (i+1):T) {
      Sigma[i,j] <- pow(tau, 2) * exp( - rho * pow(t[i] - t[j], 2) )
      Sigma[j,i] <- Sigma[i,j]
    }
  }
  
  alpha ~ dnorm(0, 0.01) # default dnorm(0, 0.01)
  sigma ~ dunif(0, 10) # default dunif(0,10)
  tau ~ dunif(0, 10) # default dunif(0, 10)
  rho ~ dunif(0.1, 5) # default dunif(0.1, 5)
  
} 
'

## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 1
##    Unobserved stochastic nodes: 4
##    Total graph size: 1411
## 
## Initializing model

Make predictions using the GP

Resources

This excellent primer on GPs for time series inspired many of the diagrams in this lecture. http://www.robots.ox.ac.uk/~sjrob/Pubs/philTransA_2012.pdf
A useful GP primer from the same group. http://www.robots.ox.ac.uk/~mebden/reports/GPtutorial.pdf
The MUCM (Managing Uncertainty in Complex Models) toolkit is a useful practical reference. http://mucm.aston.ac.uk/MUCM/MUCMToolkit/index.php?page=MetaHomePage.html
Rasmussen and Williams (2006) is a great GP book http://www.gaussianprocess.org/gpml/chapters/

Module 8 - Gaussian processes for time series analysis

Bayesian time series analysis

Bayesian time series analysis

Bayesian time series analysis

Bayesian time series analysis

Bayesian time series analysis

What is a Gaussian process?

The multivariate Normal distribution for GPs

The multivariate Normal distribution for GPs

The multivariate Normal distribution for GPs

The multivariate Normal distribution for GPs

Extend to a two-observation time series

An observation on x1x_1 changes the conditional ditribution for x2x_2

Or for a longer time series

We can extend to continuous time

The covariance function

The maths

The covariance matrix

Choices of covariance function

The nugget term

The nugget term

Advantages of using GPs

An example in regression

Disadvantages of using GPs

Fitting GPs in JAGS

Notation

Sampling from A GP

Simulate from the GP

Simulate from the GP

Fit a GP model to the data using JAGS

Make predictions using the GP

Resources

An observation on $x_1$ changes the conditional ditribution for $x_2$