## Introduction

The idea of m-step-ahead predictions (m-SAP) for time-series models is to predict $$m$$ observations ahead using only past observations of the time-series and no future ones. Computing exact m-SAP requires refitting the model for each set of $$m$$ observations to be predicted, which is usually very time-intensive, in particular for Bayesian models. We therefore seek to approximate m-SAP using as few refits of the model as possible.

Assume we have a time-series of observations $$(y_1, y_2, ..., y_N)$$. The goal of m-SAP is then to compute the predictive density

$p(y_{i<m} \,|\, y_{<i}) = p(y_i, ..., y_{i + m - 1} \,|\, y_{1},...,y_{i-1})$

for all $$i$$ between $$L + 1$$ and $$N - m + 1$$, where $$L$$ is the minimum number of observations required to make predictions (e.g., $$L = 10$$ if we want to start by predicting the $$11$$th observation). The quantities $$p(y_{i<m} \,|\, y_{<i})$$ can be computed with the help of the posterior distribution $$p(\theta \,|\, y_{<i})$$ of the parameters $$\theta$$ based on the $$i-1$$ earliest observations of the time-series:

$p(y_{i<m} \,| \, y_{<i}) = \int p(y_{i<m} \,| \, y_{<i}, \theta) \, p(\theta\,|\,y_{<i}) \,d\theta.$

Having obtained $$S$$ random draws $$\theta_{<i}^{(s)}$$ ($$s = 1,...,S$$) from the posterior distribution $$p(\theta\,|\,y_{<i})$$, we can estimate $$p(y_{i<m} \,|\, y_{<i})$$ by means of

$p(y_{i<m} \,|\, y_{<i}) \approx \frac{1}{S} \sum_{s=1}^S p(y_{i<m} \,|\, y_{<i}, \theta_{<i}^{(s)}).$

In the following, we consider factorized models in which the response values are conditionally independent given the parameters and the likelihood can be written in the familiar form

$p(y \,|\, \theta) = \prod_{i=1}^N p(y_i \,|\, \theta).$

In this case, $$p(y_{i<m} \,|\, y_{<i}, \theta_{<i})$$ reduces to $$\prod_{j = i}^{i + m -1} p(y_j \,|\, \theta_{<i})$$.
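To make the estimator concrete, here is a minimal R sketch of the Monte Carlo estimate above for a factorized model. The matrix `loglik_i` is a hypothetical input, not part of any package: an $$S \times m$$ matrix whose entry $$[s, j - i + 1]$$ holds $$\log p(y_j \,|\, \theta_{<i}^{(s)})$$ with $$\theta_{<i}^{(s)}$$ drawn from $$p(\theta \,|\, y_{<i})$$.

# log p(y_{i<m} | y_{<i}, theta^(s)) for each draw s: sum the pointwise
# log-densities over j = i, ..., i + m - 1
log_prod <- rowSums(loglik_i)
# log of the Monte Carlo average over draws, computed in a stable way
max_lp <- max(log_prod)
log_pred_dens <- max_lp + log(mean(exp(log_prod - max_lp)))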

## Approximate m-SAP using importance-sampling

To reduce the number of models that need to be fitted to obtain $$p(y_{i<m} \,|\, y_{<i})$$, we propose the following algorithm. Starting at the end of the time-series (i.e., $$i = N - m + 1$$), we compute $$p(y_{i<m} \,|\, y_{<i})$$ with approximate leave-m-out cross-validation using importance sampling:

$p(y_{i<m} \,|\, y_{<i}) \approx \frac{ \sum_{s=1}^S p(y_{i<m} \,|\, \theta^{(s)}) \, w_i^{(s)}}{ \sum_{s=1}^S w_i^{(s)}},$

where $$w_i^{(s)}$$ are importance weights and $$\theta^{(s)}$$ are draws from the posterior distribution based on all observations. To obtain $$w_i^{(s)}$$, we first compute the raw importance ratios

$r_i^{(s)} \propto \frac{1}{\prod_{j = i}^N p(y_j \,|\, \,\theta^{(s)})}$

and then stabilize them using Pareto smoothed importance sampling (PSIS; Vehtari et al., 2017a,b). We then decrease $$i$$ by $$1$$ and repeat the process. At some observation $$i$$, the variability of the importance ratios $$r_i^{(s)}$$ becomes too large and importance sampling fails (diagnosed by Pareto-$$k$$-estimates greater than $$0.7$$). At that point, we refit the model using only the observations up to the $$i$$th one and restart the process from there, until we arrive at observation $$L + 1$$, where $$L$$ is the minimum number of observations required to make predictions ($$L = 0$$ if predictions for all observations should be computed).
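As a sketch of a single step of this algorithm, assuming an $$S \times N$$ pointwise log-likelihood matrix `loglik` from the full-data fit, a current index `i`, and the loo package:

library(loo)
# log of the raw importance ratios: r_i is proportional to
# 1 / prod_{j = i}^{N} p(y_j | theta^(s)), i.e., minus the summed log-likelihoods
logr_i <- -rowSums(loglik[, i:N, drop = FALSE])
psis_i <- psis(logr_i)  # Pareto smoothed importance sampling
# refit the model on observations 1:(i-1) once the diagnostic exceeds 0.7
needs_refit <- any(psis_i$diagnostics$pareto_k > 0.7)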

## Autoregressive models

Autoregressive (AR) models are one of the classic time-series models. An AR-model of order $$p$$ can be defined as

$Y_i = \eta_i + \sum_{k = 1}^p \varphi_k Y_{i - k} + \varepsilon_i,$

where $$\eta_i$$ is the linear predictor of observation $$i$$, $$\varphi_k$$ are the autoregressive parameters, and $$\varepsilon_i$$ are pairwise independent errors, which are usually assumed to be normally distributed with equal variance $$\sigma^2$$. The model implies a recursive formula that allows us to compute the right-hand side of the above equation for observation $$i$$ based on the equation's solution for all past observations $$1$$ to $$i-1$$.
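To illustrate the recursion, the following minimal sketch simulates an AR(2) series directly from the above equation; all parameter values here are illustrative assumptions, not estimates from any data.

set.seed(1)
N <- 100; p <- 2
eta <- 0.5            # constant linear predictor eta_i (assumed)
phi <- c(0.6, -0.3)   # autoregressive parameters (assumed)
sigma <- 1            # error standard deviation (assumed)
y <- numeric(N)       # the first p values start at zero for simplicity
for (i in (p + 1):N) {
  # recursive formula: eta_i + sum_k phi_k * y_{i-k} + normal error
  y[i] <- eta + sum(phi * y[i - (1:p)]) + rnorm(1, 0, sigma)
}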

## Case Study: Annual measurements of the level of Lake Huron

To illustrate the application of m-SAP, we will use the annual measurements of the level (in feet) of Lake Huron 1875–1972 as a case study, which is shipped natively with R.

plot(LakeHuron)

The above plot shows rather strong autocorrelation of the time-series as well as some trend towards lower levels for later points in time. For model fitting, we turn the time-series into a data frame.

N <- length(LakeHuron)
df <- data.frame(y = as.numeric(LakeHuron))
df$time <- 1:N

and then specify an AR(2) model on these data using the brms package.

control <- list(adapt_delta = 0.95)
fit <- brm(
  y ~ 1, data = df,
  autocor = cor_ar(~time, p = 2),
  control = control, seed = SEED
)

The model-implied predictions can be plotted along with the observed values, which reveals a rather good fit to the data.

pred <- cbind(df, predict(fit))
names(pred)[5:6] <- c("Q2.5", "Q97.5")
ggplot(pred, aes(time, Estimate)) +
  geom_smooth(
    aes(ymin = Q2.5, ymax = Q97.5),
    stat = "identity"
  ) +
  geom_point(aes(time, y), inherit.aes = FALSE)

To allow for reasonable predictions of future values, we will require at least $$L = 15$$ observations before making predictions.

L <- 15

We also perform approximate leave-one-out cross-validation (LOO-CV) for later comparison with exact and approximate 1-SAP.

loo(log_lik(fit)[, (L+1):N])

Computed from 4000 by 83 log-likelihood matrix

         Estimate   SE
elpd_loo    -90.8  6.4
p_loo         2.5  0.6
looic       181.6 12.8
------
Monte Carlo SE of elpd_loo is 0.0.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.

## 1-step-ahead predictions leaving out all future values

The most basic version of m-SAP is 1-SAP, in which we predict only one step ahead. In this case, $$y_{i<m}$$ simplifies to $$y_{i}$$ and the algorithm described above becomes considerably simpler.

### Exact 1-step-ahead predictions

As a benchmark for approximate 1-SAP, we are going to compute exact 1-SAP, first computing the exact log-likelihood values

loglik <- matrix(nrow = nsamples(fit), ncol = N)
for (i in N:max(L + 1, 2)) {
  fit_i <- update(fit, newdata = df[-(i:N), ], recompile = FALSE, chains = 1)
  loglik[, i] <- log_lik(fit_i, newdata = df[1:i, ])[, i]
}

and then computing the exact expected log predictive density (elpd) on that basis.

log_mean_exp <- function(x) {
  # more stable than log(mean(exp(x)))
  max_x <- max(x)
  max_x + log(sum(exp(x - max_x))) - log(length(x))
}
exact_elpds_1sap <- apply(loglik, 2, log_mean_exp)
(exact_elpd_1sap <- sum(exact_elpds_1sap[-(1:L)]))

[1] -95.92756

We see that the elpd from 1-step-ahead predictions is lower than the LOO-CV elpd, which is natural: it is more difficult to predict when only past observations are available, and most 1-SAP predictions are conditioned on fewer than the $$N-1$$ observations available for LOO.

### Approximate 1-step-ahead predictions

For illustrative purposes, we first compute approximate 1-SAP without any refitting. This will of course be a poor approximation of exact 1-SAP, in particular if the Pareto-$$k$$-estimates increase rather quickly beyond the threshold of $$0.7$$ up to which PSIS tends to produce stable results.

approx_elpds_1sap_no_refit <- rep(NA, nrow(df))
loglik <- log_lik(fit)
logr <- matrix(nrow = nsamples(fit), ncol = nrow(df))
for (i in N:(L + 1)) {
  logr[, i] <- - rowSums(loglik[, i:N, drop = FALSE])
  psis_part <- suppressWarnings(psis(logr[, i:N]))
  w_i <- exp(psis_part$log_weights[, 1])
  approx_elpds_1sap_no_refit[i] <-
    log(sum(exp(loglik[, i]) * w_i) / sum(w_i))
}

The plot below indicates that the Pareto-$$k$$-estimates indeed go through the roof at some point when the model is never refitted.

is_na <- apply(logr, 2, anyNA)
plot(psis(logr[, !is_na]))
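One can also summarize the same diagnostics numerically. Here is a small sketch tabulating the Pareto-$$k$$-estimates against the usual thresholds:

pareto_k <- psis(logr[, !is_na])$diagnostics$pareto_k
# count the left-out points falling into the usual diagnostic ranges
table(cut(pareto_k, breaks = c(-Inf, 0.5, 0.7, 1, Inf)))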

Consequently, it is not surprising that the resulting approximate 1-SAP elpd is far away from the exact 1-SAP elpd.

(approx_elpd_1sap_no_refit <- sum(approx_elpds_1sap_no_refit, na.rm = TRUE))
[1] -95.07433

Next, we compute approximate 1-SAP with refit at observations where Pareto-$$k$$-estimates exceed the threshold of $$0.7$$.

k_thres <- 0.7

The code becomes a little more involved in order to handle the refitting procedure. Note that we compute exact 1-SAP at the refitting points, which comes at no additional computational cost since we had to refit the model there anyway.

loglik <- logr <- matrix(nrow = nsamples(fit), ncol = nrow(df))
approx_elpds_1sap <- rep(NA, nrow(df))
fit_part <- fit
i_refit <- N
refits <- NULL
for (i in N:(L + 1)) {
  loglik[, i] <- log_lik(fit_part)[, i]
  logr[, i] <- - rowSums(loglik[, i:i_refit, drop = FALSE])
  psis_part <- suppressWarnings(psis(logr[, i:i_refit]))
  if (any(psis_part$diagnostics$pareto_k > k_thres)) {
    # refit the model based on the first i-1 observations
    i_refit <- i
    refits <- c(refits, i)
    fit_part <- update(
      fit_part, newdata = df[1:(i-1), ],
      recompile = FALSE, chains = 1
    )
    loglik[, i] <- log_lik(fit_part, newdata = df[1:i, ])[, i]
    logr[, i] <- - rowSums(loglik[, i:i_refit, drop = FALSE])
    approx_elpds_1sap[i] <- log_mean_exp(loglik[, i])
  } else {
    w_i <- exp(psis_part$log_weights[, 1])
    approx_elpds_1sap[i] <- log(sum(exp(loglik[, i]) * w_i) / sum(w_i))
  }
}
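Summing the pointwise values yields the total approximate 1-SAP elpd, which can be compared directly with the exact value obtained earlier:

(approx_elpd_1sap <- sum(approx_elpds_1sap, na.rm = TRUE))
# difference to exact 1-SAP; this should be small if PSIS worked well
approx_elpd_1sap - exact_elpd_1sap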

We see that the final Pareto-$$k$$-estimates are mostly well below the threshold and that we only needed to refit the model three times, at observations 75, 53, and 38.

is_na <- apply(logr, 2, anyNA)
plot(psis(logr[, !is_na]))