Introduction

The idea of m-step-ahead predictions (m-SAP) for time-series models is to predict \(m\) observations ahead using only past observations of the time-series and no future observations. Doing exact m-SAP requires repeatedly refitting the model for each set of \(m\) observations to be predicted. This is usually very time-intensive, in particular for Bayesian models, and so we seek to approximate m-SAP using as few refits of the model as possible.

m-step-ahead predictions

Assume we have a time-series of observations \((y_1, y_2, ..., y_N)\). The goal of m-SAP is then to compute the predictive density \[ p(y_{i<m} | y_{<i}) = p(y_i, ..., y_{i + m - 1} | y_{1},...,y_{i-1}) \] for all \(i\) between \(L + 1\) and \(N - m + 1\), where \(L\) is the minimum number of values required to make predictions (e.g., \(L = 10\) if we want to start by predicting the \(11\)th observation). The quantities \(p(y_{i<m} | y_{<i})\) can be computed as follows with the help of the posterior distribution \(p(\theta | y_{<i})\) of the parameters \(\theta\) based on the first \(i-1\) observations of the time-series: \[ p(y_{i<m} \,| \, y_{<i}) = \int p(y_{i<m} \,| \, y_{<i}, \theta) \, p(\theta\,|\,y_{<i}) \,d\theta. \] Having obtained \(S\) random draws \(\theta_{<i}^{(s)}\) (\(s = 1,...,S\)) from the posterior distribution \(p(\theta\,|\,y_{<i})\), we can estimate \(p(y_{i<m} | y_{<i})\) by means of

\[ p(y_{i<m} \,|\, y_{<i}) \approx \frac{1}{S} \sum_{s=1}^S p(y_{i<m} \,|\, y_{<i}, \theta_{<i}^{(s)}). \]

In the following, we consider factorized models in which the response values are conditionally independent given the parameters and the likelihood can be written in the familiar form

\[ p(y \,|\, \theta) = \prod_{i=1}^N p(y_i \,|\, \theta). \] In this case, \(p(y_{i<m} \,|\, y_{<i}, \theta_{<i})\) reduces to \(\prod_{j = i}^{i + m -1} p(y_j \,|\, \theta_{<i})\).
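To make this estimator concrete, here is a minimal sketch in R, assuming a hypothetical \(S \times N\) matrix loglik_past of pointwise log-likelihood values whose posterior draws come from a model fitted to the first \(i-1\) observations only; the object names and toy dimensions are made up purely for illustration.

# hypothetical S x N matrix of log p(y_j | theta^(s)) with draws from p(theta | y_{<i});
# placeholder values are used here purely for illustration
S <- 4000; N <- 98; i <- 50; m <- 3
loglik_past <- matrix(rnorm(S * N, mean = -1), nrow = S, ncol = N)
# joint log-likelihood of the m observations to be predicted, per posterior draw
loglik_block <- rowSums(loglik_past[, i:(i + m - 1), drop = FALSE])
# average over draws on the log scale (log-sum-exp trick for numerical stability)
max_ll <- max(loglik_block)
log_pred_m_sap <- max_ll + log(mean(exp(loglik_block - max_ll)))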

Approximate m-SAP using importance sampling

To reduce the number of models that need to be fitted for the purpose of obtaining \(p(y_{i<m} \,|\, y_{<i})\), we propose the following algorithm. Starting at the end of the time-series (i.e. \(i = N - m + 1\)), we compute \(p(y_{i<m} \,|\, y_{<i})\) with approximate leave-m-out cross-validation using importance sampling:

\[ p(y_{i<m} \,|\, y_{<i}) \approx \frac{ \sum_{s=1}^S p(y_{i<m} \,|\, \theta^{(s)}) \, w_i^{(s)}}{ \sum_{s=1}^S w_i^{(s)}}, \]

where \(w_i^{(s)}\) are importance weights and \(\theta^{(s)}\) are draws from the posterior distribution based on all observations. To obtain \(w_i^{(s)}\), we first compute the raw importance ratios

\[ r_i^{(s)} \propto \frac{1}{\prod_{j = i}^N p(y_j \,|\, \theta^{(s)})} \]

and then stabilize them using Pareto smoothed importance sampling (PSIS; Vehtari et al., 2017a, 2017b). We then decrease \(i\) by \(1\) and repeat the process. At some observation \(i\), the variability of the importance ratios \(r_i^{(s)}\) becomes too large and importance sampling fails (diagnosed by Pareto-\(k\)-estimates greater than \(0.7\)). At that point, we refit the model using only the first \(i-1\) observations and restart the process from there, until we arrive at the \((L+1)\)th observation, where \(L\) is the minimum number of observations required to make predictions (\(L=0\) if predictions for all observations should be computed).
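To make the weighting step concrete for general \(m\), the following sketch computes the PSIS-weighted estimate of the predictive density for a single index \(i\), assuming no refit is needed at this \(i\). The log-likelihood matrix and toy dimensions are hypothetical placeholders; psis() and weights() are from the loo package.

library("loo")
# hypothetical S x N matrix of log p(y_j | theta^(s)) under the full-data posterior draws;
# placeholder values are used here purely for illustration
S <- 4000; N <- 98; i <- 50; m <- 3
loglik <- matrix(rnorm(S * N, mean = -1), nrow = S, ncol = N)
# raw log importance ratios: full-data posterior as proposal, p(theta | y_{<i}) as target
logr_i <- - rowSums(loglik[, i:N, drop = FALSE])
psis_i <- suppressWarnings(psis(logr_i))
# a Pareto-k estimate above 0.7 would signal that the model needs to be refitted here
psis_i$diagnostics$pareto_k
# normalized log weights and weighted estimate of log p(y_{i<m} | y_{<i})
lw_i <- weights(psis_i, normalize = TRUE)[, 1]
loglik_block <- rowSums(loglik[, i:(i + m - 1), drop = FALSE])
max_l <- max(lw_i + loglik_block)
log_pred_m_sap <- max_l + log(sum(exp(lw_i + loglik_block - max_l)))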

Autoregressive models

Autoregressive (AR) models are among the classic time-series models. An AR model of order \(p\) can be defined as

\[ Y_i = \eta_i + \sum_{k = 1}^p \varphi_k Y_{i - k} + \varepsilon_i, \] where \(\eta_i\) is the linear predictor of observation \(i\), \(\varphi_k\) are the autoregressive parameters, and \(\varepsilon_i\) are pairwise independent errors, which are usually assumed to be normally distributed with equal variance \(\sigma^2\). The model implies a recursive formula that allows computing the right-hand side of the above equation for observation \(i\) based on the equation's solution for all past observations \(1\) to \(i-1\).
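As a brief illustration of this recursion, the following sketch simulates an AR(2) series with a constant linear predictor; the parameter values are arbitrary choices made up for this example.

set.seed(1)
n <- 200
eta <- 1              # constant linear predictor (arbitrary value)
phi <- c(0.6, 0.3)    # AR(2) coefficients (arbitrary, chosen to keep the process stationary)
sigma <- 1
y <- numeric(n)
y[1:2] <- eta / (1 - sum(phi))  # start the recursion near the implied stationary mean
for (i in 3:n) {
  y[i] <- eta + phi[1] * y[i - 1] + phi[2] * y[i - 2] + rnorm(1, sd = sigma)
}
plot(ts(y))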

Case Study: Annual measurements of the level of Lake Huron

To illustrate the application of m-SAP, we will use the annual measurements of the level (in feet) of Lake Huron 1875–1972 as a case study, which is shipped natively with R.
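The code below assumes that the following packages are loaded and that a seed value SEED has been defined; the particular value used here is an arbitrary choice for reproducibility.

library("brms")
library("loo")
library("ggplot2")
SEED <- 1234  # arbitrary value, only used to make the results reproducible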

plot(LakeHuron)

The above plot shows rather strong autocorrelation of the time-series as well as some trend towards lower levels for later points in time. For model fitting, we turn the time-series into a data frame.

N <- length(LakeHuron)
df <- data.frame(y = as.numeric(LakeHuron))
df$time <- 1:N

and then specify an AR(2) model on these data using the brms package.

control <- list(adapt_delta = 0.95)
fit <- brm(
  y ~ 1, data = df, autocor = cor_ar(~time, p = 2), 
  control = control, seed = SEED
)

The model-implied predictions can be plotted along with the observed values, which reveals a rather good fit to the data.

pred <- cbind(df, predict(fit))
# rename the uncertainty interval columns for convenient use in ggplot
names(pred)[5:6] <- c("Q2.5", "Q97.5")
ggplot(pred, aes(time, Estimate)) +
  geom_smooth(
    aes(ymin = Q2.5, ymax = Q97.5),
    stat = "identity"
  ) +
  geom_point(aes(time, y), inherit.aes = FALSE)

To allow for reasonable predictions of future values, we require at least \(L = 15\) historical observations before making predictions.

L <- 15

We also perform approximate leave-one-out cross-validation (LOO-CV) for the purpose of later comparison with exact and approximate 1-SAP.

loo(log_lik(fit)[, (L+1):N])

Computed from 4000 by 83 log-likelihood matrix

         Estimate   SE
elpd_loo    -90.8  6.4
p_loo         2.5  0.6
looic       181.6 12.8
------
Monte Carlo SE of elpd_loo is 0.0.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.

1-step-ahead predictions leaving out all future values

The most basic version of m-SAP is 1-SAP, in which we predict only one step ahead. In this case, \(y_{i<m}\) simplifies to \(y_{i}\) and the algorithm described above becomes considerably simpler.
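Concretely, the importance sampling estimate from above then reads

\[ p(y_{i} \,|\, y_{<i}) \approx \frac{ \sum_{s=1}^S p(y_{i} \,|\, \theta^{(s)}) \, w_i^{(s)}}{ \sum_{s=1}^S w_i^{(s)}}, \]

with the raw importance ratios \(r_i^{(s)} \propto 1 / \prod_{j = i}^N p(y_j \,|\, \theta^{(s)})\) unchanged.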

Exact 1-step-ahead predictions

As a benchmark for approximate 1-SAP, we compute exact 1-SAP by first computing the exact log-likelihood values

loglik <- matrix(nrow = nsamples(fit), ncol = N)
for (i in N:max(L + 1, 2)) {
  # refit the model using only the first i-1 observations
  fit_i <- update(fit, newdata = df[-(i:N), ], recompile = FALSE)
  # exact log-likelihood of observation i given the first i-1 observations
  loglik[, i] <- log_lik(fit_i, newdata = df[1:i, ])[, i]
}

and then compute the exact expected log predictive density (elpd) on that basis.

log_mean_exp <- function(x) {
  # more stable than log(mean(exp(x)))
  max_x <- max(x)
  max_x + log(sum(exp(x - max_x))) - log(length(x))
}
exact_elpds_1sap <- apply(loglik, 2, log_mean_exp)
(exact_elpd_1sap <- sum(exact_elpds_1sap[-(1:L)]))
[1] -95.92756

We see that the elpd from 1-step-ahead predictions is lower than the LOO elpd, which is natural: it is more difficult to predict when only past observations are available, and most predictions are conditioned on fewer than the \(N-1\) observations available to LOO.

Approximate 1-step-ahead predictions

For illustrative purposes, we first compute approximate 1-SAP without any refitting. This will of course be a poor approximation of exact 1-SAP, in particular if the Pareto-\(k\)-estimates quickly increase beyond the threshold of \(0.7\), up to which PSIS tends to produce stable results.

approx_elpds_1sap_no_refit <- rep(NA, nrow(df))
loglik <- log_lik(fit)
logr <- matrix(nrow = nsamples(fit), ncol = nrow(df))
for (i in N:(L + 1)) {
  # raw log importance ratios: full-data posterior as proposal, p(theta | y_{<i}) as target
  logr[, i] <- - rowSums(loglik[, i:N, drop = FALSE])
  psis_part <- suppressWarnings(psis(logr[, i:N]))
  w_i <- exp(psis_part$log_weights[, 1])
  # importance weighted estimate of the predictive density of observation i
  approx_elpds_1sap_no_refit[i] <- 
    log(sum(exp(loglik[, i]) * w_i) / sum(w_i))
}

The plot below indicates that the Pareto-\(k\)-estimates indeed go through the roof when the model is never refitted.

is_na <- apply(logr, 2, anyNA)
plot(psis(logr[, !is_na]))

Consequently, it is not surprising that the resulting approximate 1-SAP elpd is far away from the exact 1-SAP elpd.

(approx_elpd_1sap_no_refit <- sum(approx_elpds_1sap_no_refit, na.rm = TRUE))
[1] -95.07433

Next, we compute approximate 1-SAP, refitting the model at observations where the Pareto-\(k\)-estimates exceed the threshold of \(0.7\).

k_thres <- 0.7

The code becomes a little more involved in order to handle the refitting procedure. Note that we compute exact 1-SAP at the refitting points. This comes at no additional computational cost since we had to refit the model anyway.

loglik <- logr <- matrix(nrow = nsamples(fit), ncol = nrow(df))
approx_elpds_1sap <- rep(NA, nrow(df))
fit_part <- fit
i_refit <- N  # number of observations the current fit is based on
refits <- NULL
for (i in N:(L + 1)) {
  loglik[, i] <- log_lik(fit_part)[, i]
  # raw log importance ratios relative to the current fit
  logr[, i] <- - rowSums(loglik[, i:i_refit, drop = FALSE])
  psis_part <- suppressWarnings(psis(logr[, i:i_refit]))
  if (any(psis_part$diagnostics$pareto_k > k_thres)) {
    # refit the model based on the first i-1 observations
    refits <- c(refits, i)
    fit_part <- update(
      fit_part, newdata = df[1:(i-1), ], 
      recompile = FALSE
    )
    i_refit <- i - 1
    # compute exact 1-SAP for observation i based on the new fit
    loglik[, i] <- log_lik(fit_part, newdata = df[1:i, ])[, i]
    approx_elpds_1sap[i] <- log_mean_exp(loglik[, i])
  } else {
    # importance weighted estimate based on the current fit
    w_i <- exp(psis_part$log_weights[, 1])
    approx_elpds_1sap[i] <- log(sum(exp(loglik[, i]) * w_i) / sum(w_i))
  }
}

We see that the final Pareto-\(k\)-estimates are mostly well below the threshold and that we only needed to refit the model 3 times, at observations 75, 53, and 38.

is_na <- apply(logr, 2, anyNA)
plot(psis(logr[, !is_na]))