We can outline three general principles guiding model-based inference in science:
Simplicity and parsimony (Occam’s razor). Model selection is a classic bias-variance trade-off.
Multiple working hypotheses (Chamberlin, 1890). At any point in time there must be several hypotheses (models) under consideration, but number of alternatives should be kept small.
Strength of evidence. Providing quantitative information to judge the strength of evidence is central to science (e.g., see Royall, 1997).
Steps of the model selection approach consist in establishing a set of \(R\) relevant models, then, based on data, ranking models, choosing the best model, and making inferences from this best model. To rank models, we need tools that account for the basic principles that makes a model a good model.
In 1951, Kullback and Leibler published a now-famous paper that quantified the meaning of information as related to Fisher’s concept of sufficient statistics (Kullback & Leibler, 1951). They developped the Kullback-Leibler divergence (or K-L information) that measures the information that is lost when approximating a distribution with another distribution.
This divergence can be used to quantifie the distance between our statistical models and the full reality (i.e., the process that generated the data) (Burnham & Anderson, 2004).
Akaike (1973) showed that this distance can be estimated by finding the parameters values that maximises the probability of the data given the model. He used this relationship to derive a criterion known as the Akaike information criterion (AIC):
\[\text{AIC} = -2\log(\mathcal{L}(\hat{\theta}|\text{data}))+2K\]
where \(\theta\) is the set of model parameters, \(\mathcal{L}(\hat{\theta}|\text{data})\) is the likelihood of the candidate model given the data when evaluated at the maximum likelihood estimate of \(\theta\), and \(K\) is the number of estimated parameters in the candidate model. The first component, \(-2log(\mathcal{L}(\hat{\theta}|\text{data}))\) is known as the deviance of the candidate model.
Basically, we can see that the AIC then accounts for the goodness-of-fit of a model (i.e., the strength of evidence for this model), but penalises it for having too much parameters, that is for not being parsimonious. Clearly, the smaller AIC, the better.
The AIC can also be seen as an approximation to the out-of-sample deviance, that is, of the deviance computed on a future datased issued from the same data generation process (the same population).
It is important to acknowledge that when \(n/K\) is superior to about 40, the “small sample AIC” (second-order bias correction), called AICc, should be used (Burnham & Anderson, 2004).
\[\text{AICc} = \text{AIC}+\dfrac{2K(K+1)}{n-K-1}\]
The individual AIC values are not interpretable in absolute terms as they contain arbitrary constants and are much affected by sample size. Then it is imperative to rescale these criteria. Usually, this is done by substracting to the AIC of each model the AIC of the model with the minimum one:
\[\Delta_{AIC} = AIC_{i} - AIC_{min}\]
where \(AIC_{min}\) is the minimum of the \(R\) different \(AIC_{i}\) values. This transformation forces the best model to have \(\Delta_{AIC}=0\), while the rest of the models have positive values.
For the AIC, the simple transformation \(exp(−\Delta_{i}/2)\), for \(i = 1, 2, ..., R\), provides the likelihood of the model given the data. This is a likelihood function over the model set in the sense that \(\mathcal{L}(\theta|data, g_i)\) is the likelihood over the parameter space (for model \(g_i\)) of the parameter \(\theta\), given the data (\(x\)) and the model (\(g_i\)).
It is convenient to normalise the model likelihoods such that they sum to 1 and treat them as probabilities. Hence, we use:
\[w_{i}=\dfrac{exp(-\Delta_{i}/2)}{\sum_{r=1}^{R}exp(-\Delta_{r}/2)}.\]
The \(w_i\), called Akaike weights, are useful as the weight of evidence in favor of model \(g_i(\cdot |\theta)\) as being the actual Kullback-Leibler best model in the set. The ratios \(w_i/w_j\) are identical to the original likelihood ratios, \(\mathcal{L}(g_i|data)/\mathcal{L}(g_j|data)\), and so they are invariant to the model set, but the \(w_i\) values depend on the full model set (because they sum to 1).
Evidence can be judged by the relative likelihood of model pairs as \(\mathcal{L}(g_i|x)/\mathcal{L}(g_j|x)\) or, equivalently, the ratio of Akaike weights \(w_i/w_j\). Such ratios are called evidence ratios and represent the evidence about fitted models as to which is better in a Kullback-Leibler information sense (i.e., the best predictive model).
The term sequential analysis simply refers to the process by which one collects data until to reach a predefined level of evidence. Implementations of this idea are known as the Probability Ratio Test (Wald & Wolfowitz, 1948), or the group sequential designs in the NHST framework (for an overview, see Lakens, 2014). More recently, Schönbrodt et al. (2017) and Schönbrodt & Wagenmakers (2017) have (re)introduced sequential testing procedures using Bayes Factors, in order to avoid the pitfalls of sequential testing in the NHST framework.
The Sequential Bayes Factor (SBF) procedure discussed above is useful in addressing the question of whether we have accumulated enough evidence for a model (relatively to another model), given the data we observed and the prior predictions of each model. Moreover, the SBF is a consistent procedure for model identification. In other words, if the true model is in the set of models, the Bayes Factor (or equivalently, the BIC) will select it when the sample size increases.
However, one could question whether it is sensible to assume a true model (an oxymoron) in “real life” (i.e., except in simulations), especially in the social sciences (e.g., Burnham & Anderson, 2002; 2004). As Findley (1985) notes: “[…] consistency can be an undesirable property in the context of selecting a model”. In other words, the goal of identifying the true model works fine… when there is a true model to be identified.
The AIC is not consistent for model identification, but it is an efficient estimator of the expected K-L information loss.
As a consequence, when using information criteria (e.g., AIC, WAIC) for sequential testing, we try to answer a different question. The question is not anymore when do I have enough evidence for this model over this other model being the true model. Instead, the question is: when do I have enough evidence for this model being the best model ?
The ictab
function allows to compare a set of models fitted to the same data. This function returns either the AIC or the BIC of each model, along with the number of parameters \(k\) and the Akaike weights. The ictab
function can also be used to compare brmsfit
models with pseudo-BMA weights, based on the brms::WAIC
and brms::LOO
functions (see Yao et al., 2017).
library(ESTER)
data(mtcars)
mod1 <- lm(mpg ~ cyl, mtcars)
mod2 <- lm(mpg ~ cyl + disp, mtcars)
mod3 <- lm(mpg ~ cyl * disp, mtcars)
mods <- list(m1 = mod1, m2 = mod2, m3 = mod3)
ictab(mods, aic)
## modnames ic k delta_ic ic_wt
## 3 m3 161.4571 5 0.0000 0.9610
## 2 m2 168.6271 4 7.1700 0.0267
## 1 m1 170.1636 3 8.7065 0.0124
Note that in these two functions, the information criterion has to be passed in lowercase, as aic
and bic
refer to functions from ESTER
.
On the other hand, ESTER
can be used to do sequential testing on your own data. You can study the evolution of sequential ERs using the seqtest
function.
data(mtcars)
mod1 <- lm(mpg ~ cyl, mtcars)
mod2 <- lm(mpg ~ cyl + disp, mtcars)
plot(seqtest(ic = aic, mod1, mod2, nmin = 10) )
In addition, seqtest
allows studying the behavior of sequential ERs computed on your own data, along with sequential ERs computed on permutation samples. This feature might be useful to study to what extent the evolution of evidence ratios you observed on the original sample is dependant to the order of incoming observations.
plot(seqtest(ic = aic, mod1, mod2, nmin = 10, nsims = 10) )
When data involve repeated measures (and so multiple lines by subject), you should help ESTER
to figure out how the data is structured by indicating in which column are stored subject or observations numbers. For instance, below we use data from lme4
, and indicate that the subject number is stored in the "Subject"
column of the dataframe, by passing it to the id
argument of seqtest
.
library(lme4)
data(sleepstudy)
mod1 <- lmer(Reaction ~ Days + (1|Subject), sleepstudy)
mod2 <- lmer(Reaction ~ Days + I(Days^2) + (1|Subject), sleepstudy)
plot(seqtest(ic = aic, mod1, mod2, nmin = 10, id = "Subject", nsims = 10) )
If no information is passed to the id
argument, seqtest
will suppose that there is only one observation (i.e., one line) per subject.
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov & F. Caski (Eds.), Proceedings of the Second International Symposium on Information Theory (pp. 267-281). Budapest: Akademiai Kiado.
Burnham, K. P., & Anderson, D. R. (2002). Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach (2nd ed). Ecological Modelling.
Burnham, K. P., & Anderson, D. R. (2004). Multimodel Inference: Understanding AIC and BIC in Model Selection. Sociological Methods & Research, 33(2), 261–304.
Burnham, K. P., Anderson, D. R., & Huyvaert, K. P. (2011). AIC model selection and multimodel inference in behavioral ecology: Some background, observations, and comparisons. Behavioral Ecology and Sociobiology, 65(1), 23–35.
Chamberlin, T. (1890). The Method of Multiple Working Hypotheses. Science 148:754-9.
Findley, D.F. (1985). On the unbiasedness property of AIC for exact or approximating linear stochastic time series models. Journal of Time Series Analysis 6, 229–252.
Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses. European Journal of Social Psychology, 44, 701–710.
Royall, R. (1997) Statistical Evidence: A likelihood paradigm, Chapman and Hall, CRC Press.
Schönbrodt, F. D., & Wagenmakers, E.-J. (2017). Bayes Factor Design Analysis: Planning for compelling evidence. Psychonomic Bulletin & Review.
Schönbrodt, F. D., Wagenmakers, E.-J., Zehetleitner, M., & Perugini, M. (2017). Sequential hypothesis testing with Bayes factors: Efficiently testing mean differences. Psychological Methods, 22, 322–339.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461-464.
Wagenmakers, E.-J., & Farrell, S. (2004). AIC model selection using Akaike weights. Psychonomic Bulletin & Review, 11(1), 192–196.
Wald, A., & Wolfowitz, J. (1948). Optimum character of the sequential probability ratio test. The Annals of Mathematical Statistics, 19, 326– 339.
Yao, Y., Simpson, D., & Gelman, A. (2017). Using stacking to average Bayesian predictive distributions. Retrieved from https://arxiv.org/abs/1704.02030v3