1 Introduction

Since Lindley (1968) there has been a long history of Bayesian model selection using predictive performance comparison (see the extensive review by Vehtari and Ojanen 2012). In this regime, we are interested in identifying a model from a collection of models whose out-of-sample predictive performance is best, either in the hope of generalising to future unseen data (Vehtari and Ojanen 2012), or as a proxy for parameter recoverability (Scholz and Bürkner 2022). When the number of predictors to consider is large, we might also be interested in finding a smaller model that is capable of replicating the predictive behaviour of a larger model and using it instead, either for its improved interpretability or to decrease data collection cost (Piironen et al. 2020). In finite data regimes, however, there can be significant uncertainty in differentiating between models of similar performance.

Following Piironen and Vehtari (2017a), consider a hypothetical utility for a model \(\textrm{M}_k\) estimated on data D which we denote \(u(\textrm{M}_k, D)\). This estimate can be decomposed into two components: the true generalisation utility of the model, and the error associated with the estimate. The authors note that for safe model selection, we do not require that the utility be unbiased. If these utility estimates have high variance, however, then for a single data realisation, choosing the maximum utility model may lead to choosing a sub-optimal model in terms of generalisation utility. When we select a model which performs non-negligibly worse than the oracle asymptotically, we say that we have over-fit in selection. On top of this, the utility estimate for the selected model, say \({\hat{k}}\), is typically optimistic: \(u(\text {M}_{{\hat{k}}}, D) \gg \mathbb {E}_D[{u(\text {M}_{{\hat{k}}})}]\), where we use \(\mathbb {E}_D[{u(\text {M}_{{\hat{k}}})}]\) to denote the expectation over datasets. It is this over-optimism we refer to as selection-induced bias (already discussed by Stone 1974).
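To make this concrete, the following small Monte Carlo sketch (our own illustration, not one of the paper's experiments) draws noisy but individually unbiased utility estimates for K models whose true generalisation utility is identical, and shows that reporting the estimate of the apparently best model is systematically optimistic.

```python
import numpy as np

rng = np.random.default_rng(1)
K, n_reps, sigma = 20, 5000, 1.0   # number of models, repetitions, estimate noise

# Every model has the same true utility (here 0), and each utility estimate is
# unbiased but noisy.
u_hat = rng.normal(loc=0.0, scale=sigma, size=(n_reps, K))

# Selecting the apparently best model and reporting its estimate is optimistic.
selected_estimate = u_hat.max(axis=1)
print("mean estimate of a single model:", u_hat[:, 0].mean())            # ~0 (unbiased)
print("mean estimate of the selected model:", selected_estimate.mean())  # > 0 (selection-induced bias)
```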

While in general this over-optimism appears across a range of utility estimates and criteria (including common information criteria; Burnham and Anderson 2002), we focus our discussion on Bayesian cross-validation (Geisser 1975; Geisser and Eddy 1979).

The kernel of this work is an efficient and actionable selection-induced bias estimation based on order statistics. Concretely, we:

  1. discuss over-optimism when making model selection decisions based on noisy cross-validated predictive metrics;

  2. propose a lightweight selection-induced bias estimation and correction tool based on order statistics;

  3. provide a diagnostic for our estimate to understand when it is liable to be unsafe, and more computationally expensive approaches may be warranted;

  4. show empirically that correcting selection-induced bias can help expose non-negligible over-fitting; and,

  5. apply our bias correction to real and simulated data examples where we are interested in choosing between multiple candidate models, and in forward search.

1.1 Relation to previous work

Predictive approaches to Bayesian model comparison have been qualitatively and quantitatively reviewed by Vehtari and Ojanen (2012) and Piironen and Vehtari (2017a) respectively. One popular method is choosing the maximum a posteriori (MAP) model, which reduces to choosing the maximum marginal likelihood model in the case of a uniform prior over the model space, and can be performed with Bayes factors (Kass and Raftery 1995; Jeffreys 1998). The computational instability of Bayes factors, and the difficulty they have sampling from the combinatorially exploding model space, however, are well-documented (Han and Carlin 2001; Oelrich et al. 2020). It has also been documented in the literature that, under the assumption of orthogonal predictors, the most predictive model usually lies closer to the median probability model (Barbieri and Berger 2004). In the context of variable selection, one might otherwise consider investigating the probability of inclusion of an individual predictor (Brown et al. 1998; Narisetty and He 2014). Probability of inclusion approaches, including the median probability approach, have problems in the case of collinearity, as inclusion probabilities are diluted across the collinear predictors. Most contemporary work in this field has centred on defining intelligent priors over model space and model parameters (O’Hara and Sillanpää 2009). However, Piironen et al. (2020) show how such approaches provide limited understanding of the predictive improvement expected by adding a predictor, and can fail when predictors are collinear. It has further been argued that priors and model selection exist as separate components of a Bayesian workflow (Vehtari and Ojanen 2012; Key et al. 1999; Bernardo and Smith 1994): a view we hold ourselves.

A fully decision theoretical Bayesian model selection procedure was first outlined by Lindley (1968), wherein the statistician is implored to build the largest, most complex model possible which encompasses all uncertainty relating to the modelling task, and then to identify smaller models capable of closely replicating its behaviour. Leamer (1979) motivated the use of the Kullback–Leibler (KL) divergence to quantify the distance between models’ predictive distributions. Goutis (1998) and Dupuis and Robert (2003) went further, and proposed that one should directly project the reference model onto parameter sub-spaces so as to minimise the distance between the induced projected posterior predictive distribution and the posterior predictive distribution of the reference model: essentially fitting sub-models to the fit of the reference model. McLatchie et al. (2023) present practical recommendations for projection predictive model selection, and a more complete overview of recent advances therein. We are interested in cases where either the number of models is too small to justify the added complexity of projection predictive inference, or when we don’t know how to perform the projection efficiently (e.g. in Student-t observational family models, and some spatial models).

As a computationally efficient, intuitive, and non-parametric alternative, and one of the most common in non-Bayesian literature, statisticians compare models based on their out-of-sample performance with either cross-validation (CV; Geisser and Eddy 1979; Shao 1993; Vehtari and Lampinen 2002), or various information criteria (Burnham and Anderson 2002; Spiegelhalter et al. 2002; Gelfand and Ghosh 1998; Marriott et al. 2001; Laud and Ibrahim 1995; Watanabe 2013). Such procedures are able to produce an almost unbiased estimate of a model’s predictive ability (Watanabe 2010), and have been reviewed extensively by Arlot and Celisse (2010), and by Gelman et al. (2014) in a Bayesian context. When using noisy predictive performance estimates such as cross-validation to select a best model, we may be over-optimistic in the utility estimate of the eventually selected model. This selection-induced bias has been discussed in depth by Ambroise and McLachlan (2002), Reunanen (2003), and Cawley and Talbot (2010). Cawley and Talbot (2010), in addition to demonstrating the potential risks of selection-induced bias, emphasised that this bias does not always lead to non-negligible over-fitting. In many cases, when one model is much better than others, or if there is sufficient regularisation (via priors in a Bayesian context), the selection-induced bias is negligible even if the risk of selecting the “wrong” model is not small. We do not look to eliminate bias, but to identify when it is non-negligible. In particular, we agree with past authors that selection-induced bias is not always a problem, and rather than a zero–one assessment of its existence, we endeavour to talk about negligible and non-negligible bias.

Very little work has dealt with bias correction in model selection, and those procedures that have been previously presented require intense computation. For instance, Stone (1974) proposed a procedure of nested CV (which they called the “double cross”) which is insensitive to sampling and partitioning schemes. This procedure operates with complexity \(\mathcal {O}(K^2)\) in terms of the number of folds K: often fiendishly expensive. An alternative was provided by Tibshirani and Tibshirani (2009), who made use of the bootstrap (Efron and Tibshirani 1993) to correct CV estimates. While less expensive, the complexity nonetheless scales as \(\mathcal {O}(K B)\), with B denoting the number of bootstrap samples. Our contribution is a lightweight alternative to these procedures, requiring only the sufficient statistics from the standard CV process and no additional computation; in other words, it has linear complexity \(\mathcal {O}(K)\). For the experiments seen later in this paper, one run of cross-validated forward search may require several hours of computation time, so nested cross-validation quickly becomes infeasibly expensive, whereas the use of order statistics requires only a matter of seconds or minutes. The bootstrap is an improvement on nested cross-validation, but as previously mentioned will not scale as well as order statistics.

While our proposal is simple, and thus cannot be expected to produce perfect results, it can serve as a useful demonstration of the dynamics of selection-induced bias: when one model is sufficiently better than other candidates then we can proceed more comfortably, and when models are close we must take care and perform a correction.

We provide the statistician with a diagnostic for our procedure, and should they estimate it to be unsafe for their task, they can fall back on safer albeit more computationally expensive approaches. While we don’t claim that CV estimates are the best way to do model selection, they can sometimes be computationally convenient and sufficiently good; our approach helps to understand when they are sufficiently good.

1.2 Alternatives to model selection

While predictive performance-based model selection can operate as a convenient non-parametric method of determining whether there is a significant difference between models, simply investigating parameter posteriors can sometimes provide sharper resolution to the problem (Gelman et al. 2014; Wang and Gelman 2015; Gelman et al. 1996). When the model parameters are interpretable, causally consistent, nested, and there is no collinearity between predictors, then we can often investigate their posteriors directly with more success. Since the predictive distribution integrates over the uncertainty in all parameters, it may introduce more variance in the estimator than is necessary. For instance, we could investigate only the posterior of the extra parameter in the larger model and decide whether or not to include this parameter in our analysis based on whether its posterior differs greatly from the prior. As we have already discussed, however, these procedures become more difficult under predictor collinearity.

When we are not in the nested case, we might follow the suggestions of Yates et al. (2021) and either: (1) perform Bayesian model averaging (BMA; Hoeting et al. 1999) or model stacking (Yao et al. 2018, 2021, 2023); (2) use the largest model with thoroughly-considered priors; or, (3) consider alternative model structures (Garthwaite and Mubwandarikwa 2010).

BMA is an attractive first consideration, and has been shown by Le and Clarke (2022) to even be better in terms of prediction than model selection under some conditions. The good performance of BMA in the predictive sense has been covered previously by Hoeting et al. (1999) and Raftery and Zheng (2003). Bayesian model averaging weights, however, can be expensive to compute, and Yao et al. (2018) introduce pseudo-BMA weights as a lightweight alternative (we discuss the connection between pseudo-BMA weights and model selection based on elpd in Appendix A). The authors discuss that when none of the models considered is the true model, this method can be flawed, and motivate stacking of the models’ posterior predictive distributions as a more robust alternative. This consists of identifying model weights such that the combined predictive distribution maximises a LOO-CV scoring rule (by default the log score). More recently, Yao et al. (2023) have extended this paradigm to include more exotic model combinations.

1.3 Structure of this paper

We begin by discussing how model selection decisions themselves are liable to induce over-confidence in the expected performance of a selected model in Sect. 2. We demonstrate how the magnitude of selection-induced bias grows with the number of candidate models in Sect. 3, wherein we present our first contribution: a novel order statistic-based tool to determine when a model selection decision is liable to induce non-negligible bias. In Sect. 4 we demonstrate how, while a single decision might introduce negligible bias, when we make multiple sequential decisions, and with specific reference to forward search, the bias is likely to compound. We then extend our first contribution to correct for selection-induced bias in forward search, and use it as a stopping rule. We show the reliability, computational tractability, and flexibility of our correction on both simulated and real-world data exercises, wherein we compare our stopping rule with traditional alternatives.

2 Predictive model assessment, selection, and comparison

The focus of our paper is on developing a framework for modelling the predictive difference estimates between models, and specifically identifying when models are sufficiently different to safely make a model selection decision based on sufficient statistics.

In the case of predictor collinearity the inclusion probabilities are diluted, and it is often safer to select models based on their predictive performance. We proceed considering the log score as the default utility best suited for evaluating the predictive density (Bernardo and Smith 1994), and compute it with cross-validation (Geisser and Eddy 1979).

2.1 Predictive model assessment

Vehtari et al. (2017) define the expected log point-wise predictive density (elpd) as a measure of predictive performance for a new dataset of n observations:

$$\begin{aligned} {\textrm{elpd}\bigr (\textrm{M}_k \mid y\bigl )} = \sum _{i=1}^n \int p_t(\tilde{y}_i) \log p_k(\tilde{y}_i\mid y)\,\textrm{d}\tilde{y}_i, \end{aligned}$$
(1)

where \(p_t(\tilde{y}_i)\) denotes the density of the true data-generating process for \(\tilde{y}_i\), and \(p_k(\tilde{y}_i\mid y) = \int p_k(\tilde{y}_i\mid \theta )p_k(\theta \mid y)\,\textrm{d}\theta \) is the posterior predictive distribution of model \(\textrm{M}_k\). In practice, we do not have access to \(p_t(\cdot )\) and approximate the \({\textrm{elpd}}\) with its leave-one-out cross-validated (LOO-CV) estimate (Geisser and Eddy 1979),

$$\begin{aligned} \textrm{elpd}_\mathrm {\scriptscriptstyle LOO}\left( \textrm{M}_k\mid y\right) := \sum _{i=1}^n \log p_k(y_i\mid y_{-i}) = \sum _{i=1}^n \log \int p_{k}(y_i \mid \theta )\, p_{k}(\theta \mid y_{-i}) \,\textrm{d}\theta . \end{aligned}$$
(2)

If the data y are conditionally independent then Gelfand et al. (1992), and later Gelfand (1996), demonstrate how this quantity can be efficiently estimated with importance sampling by re-weighting posterior samples \(\theta ^s\), for \(s = 1,\dotsc ,S\) indexing samples from the posterior, with weights

$$\begin{aligned} r_i^s = \frac{1}{ p_k(y_i\mid \theta ^s)} \propto \frac{p_k(\theta ^s \mid y_{-i})}{p_k(\theta ^s \mid y)} \end{aligned}$$
(3)

leading to the importance sampling estimate

$$\begin{aligned} {\widehat{\textrm{elpd}}_\textrm{LOO}\bigr (\textrm{M}_k{\mid }y\bigl )} = \sum _{i = 1}^n \log \left( \frac{\sum _{s = 1}^S r_i^s p_k(y_i\mid \theta ^s)}{\sum _{s = 1}^S r_i^s} \right) . \end{aligned}$$
(4)

Vehtari et al. (2017) further demonstrate how this estimator can be stabilised and self-diagnosed with Pareto-smoothed importance sampling (Vehtari et al. 2022). Here we focused on LOO-CV, but our approach can also be used in the context of other cross-validation approaches, for example leave-future-out (Bürkner et al. 2020; Cooper et al. 2023) and leave-one-group-out (Merkle et al. 2019).
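As a minimal sketch of Eqs. 2–4, the following function computes the plain importance-sampling LOO-CV estimate from a matrix of pointwise log-likelihood values evaluated at posterior draws; with the raw ratios of Eq. 3 the estimate reduces to a harmonic mean over draws. In practice one would use Pareto-smoothed weights, as the paper recommends, rather than this unstabilised version.

```python
import numpy as np
from scipy.special import logsumexp

def elpd_loo_is(log_lik):
    """Plain importance-sampling LOO-CV (Eqs. 2-4).

    log_lik : array of shape (S, n) holding log p_k(y_i | theta^s)
              for S posterior draws and n observations.
    Returns the pointwise LOO log predictive densities (length n).
    """
    S = log_lik.shape[0]
    # With r_i^s = 1 / p_k(y_i | theta^s), Eq. 4 simplifies to the harmonic
    # mean of the likelihood values; computed stably on the log scale.
    return np.log(S) - logsumexp(-log_lik, axis=0)

# elpd_hat = elpd_loo_is(log_lik).sum() gives the model-level estimate.
```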

As well as the LOO-CV elpd point estimate, we can estimate the uncertainty in the elpd estimate under the normal approximation with standard error (SE) given by

$$\begin{aligned} \widehat{\textrm{SE}}_{\textrm{LOO}}(\textrm{M}_k\mid y) = \left( \frac{n}{n - 1} \sum _{i=1}^n\left( \widehat{\textrm{elpd}}_{\textrm{LOO}, i}(\textrm{M}_k\mid y) - \frac{1}{n} \sum _{j=1}^n \widehat{\textrm{elpd}}_{\textrm{LOO}, j}(\textrm{M}_k\mid y)\right) ^2 \right) ^{1/2}, \end{aligned}$$
(5)

where we write \(\widehat{\textrm{elpd}}_{\textrm{LOO}, i}(\textrm{M}_k\mid y) = \log p_k(y_i\mid y_{-i})\). Sivula et al. (2022) discuss under which conditions this approximation is accurate.
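The standard error of Eq. 5 is then a direct function of the pointwise contributions; a short transcription, assuming a vector such as the output of the sketch above:

```python
import numpy as np

def se_elpd_loo(pointwise_elpd):
    """Standard error of the summed LOO-CV elpd under the normal approximation (Eq. 5)."""
    pointwise_elpd = np.asarray(pointwise_elpd)
    n = len(pointwise_elpd)
    # square root of n times the sample variance of the pointwise contributions
    return np.sqrt(n / (n - 1) * np.sum((pointwise_elpd - pointwise_elpd.mean()) ** 2))
```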

Watanabe (2010) demonstrated the equivalence between cross-validation and the widely applicable information criterion (WAIC). However, in finite data regimes the error in the WAIC can be non-negligible. As such, we don’t consider the latter any further.

2.2 Selection-induced bias

We can then use these predictive estimates to select one model from a pool of candidates. Even if each predictive estimate is almost unbiased for each model, however, when we use these for model selection the estimate for the selected model may incur some selection-induced bias.

Selection-induced bias emerges when we select a model from a collection of models based on a noisy CV predictive estimate, even though the true elpd could show that model to be sub-optimal with respect to the models considered. For the rest of this paper, we suppose that we choose the model with the highest LOO-CV elpd point estimate from a collection of K candidate models,

$$\begin{aligned} k^*= \mathop {\mathrm {arg\,max}}\limits _{k = 1,\dotsc ,K}\,{\widehat{\textrm{elpd}}_\textrm{LOO}\bigr (\textrm{M}_k{\mid }y\bigl )}, \end{aligned}$$
(6)

and define selection-induced bias concretely as

$$\begin{aligned} \textrm{bias}\bigl (\textrm{M}_1,\dots ,\textrm{M}_{K} \mid y\bigr ) = \widehat{\textrm{elpd}}_\textrm{LOO}\bigl (\textrm{M}_{k^*}\mid y\bigr ) - \textrm{elpd}\bigl (\textrm{M}_{k^*} \mid y\bigr ). \end{aligned}$$
(7)

We then wish to understand under which situations choosing the model with the best LOO-CV elpd estimate may lead to non-negligible over-optimism in expected LOO-CV elpd (selection-induced bias), and how bad this bias is as a function of model similarity and the number of candidate models.

If the bias is small compared to the SE of the normal approximation to the elpd LOO estimate (Eq. 5), we say that there is negligible selection-induced bias. However, Sivula et al. (2022) show that the standard deviation of the normal approximation may be under-estimated when the models are very similar. As such, extra care should be taken when the absolute difference in elpd between models is small (smaller than 4, say; see Appendix A). Depending on the application, there may be additional information pertaining to what is a practically relevant magnitude of the difference between model performances.

Rather than measure the performance of any model selection decision in terms of whether or not we choose the correct model (since the notion of a true model is difficult to justify in practice), we instead discuss how far we are from selecting the oracle model (in terms of minimal excess loss in expectation, as defined by Arlot and Celisse 2010; we will return to this discussion in Sect. 3). In a word, we interest ourselves in the expected selection-induced bias injected by a given model selection decision, aim to minimise this as much as possible, and correct it when it is non-negligible. When we expect a model selection decision to induce only negligible amounts of selection-induced bias, we consider it a “safe” decision.

2.3 When is one model clearly better than another?

We begin by illustrating the \(K = 2\) regime, in which we have only two candidate models. A typical question arises: is one of these models better than the other? These two models may be achieved with different priors, be nested (as in one iteration of forward search, say), or may be completely non-nested. Not only are we interested in discerning when it is safe to select one of the two models over the other, we would also like to be able to say something about the potential danger associated with simply selecting the better model in terms of LOO-CV elpd. In this section, we firstly present the probability of selecting the wrong model under some oracle knowledge, before discussing the selection-induced bias arising from this decision when the models are indistinguishable. We interest ourselves later on in this paper primarily in discerning when models are indistinguishable in terms of their predictive performance.

Consider two models, \(\textrm{M}_a\) and \(\textrm{M}_b\), and suppose we know that \(\textrm{M}_a\) is the oracle model. We follow the definition of Arlot and Celisse (2010, Section 2.2) in that the oracle model is the model, from a collection of models, having the minimum excess loss in expectation over multiple datasets. In particular, it is possible for the oracle model to differ from the model containing the truly relevant predictor set (for simplicity we call that the “true model”) in cases where that model can over-fit. In general, we don’t always need to choose the oracle model; the excess loss can just be negligible. It is often more useful to discuss this difference on a continuous scale, rather than focus on the (often unrepresentative) binary case of either selecting the oracle model or not.

The structure of these two models does not influence model selection since we only consider their elpds. Denote the difference between the models’ theoretical elpds as

$$\begin{aligned} \Delta {\textrm{elpd}\bigr (\textrm{M}_a,\textrm{M}_b \mid y\bigl )} = {\textrm{elpd}\bigr (\textrm{M}_a \mid y\bigl )} - {\textrm{elpd}\bigr (\textrm{M}_b \mid y\bigl )}. \end{aligned}$$
(8)

As previously mentioned, we are most interested in the situation where no model is truly better than any other, that is that \(\Delta {\textrm{elpd}\bigr (\textrm{M}_a,\textrm{M}_b \mid y\bigl )} = 0\). When this is the case, the estimated \(\Delta {\widehat{\textrm{elpd}}_\textrm{LOO}\bigr (\textrm{M}_a,\textrm{M}_b{\mid }y\bigl )}\) represents pure selection-induced bias, regardless of the model chosen.

Consider the following (purely illustrative) setting: assume the distribution of elpd point estimates, \(\Delta \widehat{\textrm{elpd}}_\textrm{LOO}\bigl (\textrm{M}_a,\textrm{M}_b\mid y\bigr )\), is a Gaussian centred on the mean \(\mu \) and having variance \(\sigma ^2 = 1\) (similar to the setup employed by Sivula et al. 2022). From this distribution, we “sample” one observation of the LOO-CV elpd difference, denoted

$$\begin{aligned} \Delta \widehat{\textrm{elpd}}_\textrm{LOO}\bigl (\textrm{M}_a,\textrm{M}_b\mid y\bigr ) := \Delta \widehat{\textrm{elpd}}_{a,b} \sim p(\Delta \widehat{\textrm{elpd}}_{a,b}) \equiv \textrm{normal}(\mu , 1^2). \end{aligned}$$
(9)

When this observed elpd difference \(\Delta \widehat{\textrm{elpd}}_{a,b}\) is negative and we subsequently select \(\textrm{M}_b\) over \(\textrm{M}_a\) based on the LOO-CV estimate then we are at risk of inducing non-negligible selection-induced bias. Naturally, since \(\textrm{M}_a\) is the oracle model, the true elpd difference is non-negative.

Fig. 1 The left plot shows illustrative LOO-CV elpd difference distributions between the true model \(\textrm{M}_a\) and the alternative model \(\textrm{M}_b\), as described in Eq. 9. If we observe a LOO-CV estimated elpd difference in the region coloured in red, then we are at risk of selecting the wrong model. The right plot shows this probability across different hypothetical mean values, as computed in Eq. 10. (Color figure online)

In this toy example, we can compute the theoretical probability of selecting the sub-optimal model, with the knowledge that \(\textrm{M}_a\) is the oracle model, as

$$\begin{aligned} \mathbb {P}(\text {select }\textrm{M}_b\text { over }\textrm{M}_a) = \int _{-\infty }^0 p(\Delta \widehat{\textrm{elpd}}_{a,b}) \,\textrm{d}(\Delta \widehat{\textrm{elpd}}_{a,b}), \end{aligned}$$
(10)

where \(p(\Delta \widehat{\textrm{elpd}}_{a,b})\) is the oracle distribution of the elpd difference between the two models (omitting the dependence on the data y for brevity). We visualise this theoretical illustration in Fig. 1, where we let the mean slowly grow away from zero, keeping the standard deviation the same. The probability of selecting the wrong model is then greatest for a given \(\sigma \) when the two models are most similar (\(\mu \) closest to zero), and decays only gradually as the mean of the elpd difference distribution increases.
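Under the illustrative Gaussian model of Eq. 9, the probability in Eq. 10 is just the mass of the difference distribution below zero; the short sketch below (with hypothetical mean values) reproduces the behaviour in the right panel of Fig. 1.

```python
import numpy as np
from scipy.stats import norm

mu = np.linspace(0.0, 4.0, 9)  # hypothetical means of the elpd difference distribution
sigma = 1.0                    # illustrative standard deviation, as in Eq. 9

# Eq. 10: probability that the observed difference falls below zero
p_wrong = norm.cdf(0.0, loc=mu, scale=sigma)
for m, p in zip(mu, p_wrong):
    print(f"mu = {m:3.1f}  P(select M_b over M_a) = {p:.3f}")
```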

The selection-induced bias is likely to be non-negligible when it is much larger than the variation in the LOO-CV point estimates. In terms of Fig. 1, if the selection-induced bias is less than the illustrative \(\sigma ^2 = 1\), then it is less critical to model selection than if it were much larger. In a word, selection-induced bias is said to be negligible if it is dominated by the variation in the LOO-CV point estimates.

Nonetheless, in our experience (outwith this paper) we are able in most cases to simply choose the more complex model fitted with reasonable priors, since the bias is negligible in the \(K=2\) case. However, as the number of effectively equivalent models grows (e.g., if the number of candidate models grows exponentially with the number of observations n, or if the number of predictors p is much larger than \(\log (n)\) as described by Gelman et al. 2013) the potential bias grows too. While selection-induced bias remains negligible in the two-model case, it can balloon as K grows. The specifics of the rate shown in Fig. 2 are described later in this paper. This many-model regime is the more interesting case.

Fig. 2 Theoretical illustration. The hypothetical expected selection-induced bias in the presence of K candidate models. While the bias is negligible for \(K=2\), it can grow rapidly with the number of candidate models. We discuss this trend later

3 Bias grows with the number of candidate models

In the regime of many candidate models, when none is truly more predictive than any other, the probability of one model appearing better than the rest grows with the number of models being compared. This is true even if independent test data would show that there is in fact no difference between the selected model and the true model (or that the selected model is in fact worse). As this probability grows, so too does the magnitude of selection-induced bias. In this section, we first show how these high-risk regions behave as the number of models grows, and propose a selection-induced bias estimation scheme based on order statistics to identify when the bias is no longer negligible.

3.1 Empirical demonstration

We fit a series of nested linear models to \(n = 100\) data points sampled from

$$\begin{aligned} y&= X\beta + \epsilon \end{aligned}$$
(11)
$$\begin{aligned} \epsilon&\sim \textrm{normal}(0,\,\sigma ^2 I), \end{aligned}$$
(12)

where I is the \(n\times n\) identity matrix, and \(0\in \mathbb {R}^n\) is the n-dimensional zero vector. The data \(X_i = (1,X_{i,1},\dotsc ,X_{i,K - 1})\) are K dimensional, with \(X_{i,1},\dotsc ,X_{i,K - 1}\overset{\text { iid}}{\sim }\textrm{normal}(0,\,1)\), and the true regression coefficients take the form \(\beta = (1,\beta _\Delta ,0,\dotsc ,0)\in \mathbb {R}^{K}\). The residual standard deviation \(\sigma \) is chosen so that the signal-to-noise ratio varies with \(\beta _\Delta \) while the marginal variance of y is kept constant. In particular, we only consider \(\beta _\Delta < 1\), and set \(\sigma ^2 = 1 - \beta _\Delta ^2\). We will then fit a baseline model comprising only the intercept term and denoted \(\textrm{M}_{\textrm{base}}:\,y_i\mid \beta _1,\tau \sim \textrm{normal}(\beta _1,\, \tau ^2)\). We compare this to the remaining \(K-1\) one-predictor models of the form

$$\begin{aligned} \textrm{M}_{k}:\,y_i\mid \beta _1,\beta _{k},\tau \sim \textrm{normal}(\beta _1 + X_{i,k}\beta _{k},\, \tau ^2), \end{aligned}$$
(13)

for \(k = 2,\dotsc ,K\) so that model \(\textrm{M}_2\) alone contains the true predictor set, and the scale factor \(\tau \) is treated as an unknown parameter across all models. Naturally, letting \(\beta _\Delta = 0\) collapses all models to be equivalent.
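The design of Eqs. 11–13 can be simulated as in the sketch below (design only, under our reading of the setup; the paper's experiments additionally fit each Bayesian candidate model and compute its LOO-CV elpd).

```python
import numpy as np

def simulate_nested_design(n=100, K=100, beta_delta=0.3, seed=0):
    """One data realisation from Eqs. 11-12.

    The true coefficients are beta = (1, beta_delta, 0, ..., 0) and the residual
    variance is sigma^2 = 1 - beta_delta^2, keeping the marginal variance of y constant.
    """
    rng = np.random.default_rng(seed)
    X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
    beta = np.zeros(K)
    beta[0], beta[1] = 1.0, beta_delta
    sigma = np.sqrt(1.0 - beta_delta**2)
    y = X @ beta + rng.normal(scale=sigma, size=n)
    return X, y

# Each one-predictor candidate model (Eq. 13) pairs the intercept with a single
# predictor column; the baseline model uses the intercept alone.
```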

We vary the true \(\beta _\Delta \) over a grid of values increasing from zero. For each value of \(\beta _\Delta \), we investigate the LOO-CV elpd difference point estimates of the \(K - 1\) models from the baseline model, and identify the model with the highest (best) cross-validated score. This is then compared to its test elpd, and to that of the model with the known true predictor set (which we call the “true model”, even if we do not use the true values of the predictors). We show this for different numbers of candidate models K and along \(\beta _\Delta \) in Fig. 3. As previously mentioned, the selection-induced bias (the difference between the LOO-CV elpd and the test elpd of the selected model) is negligible when \(K = 2\), but grows with K. For \(K=100\), it would be unsafe to make a decision based on LOO-CV elpd point estimates until the true \(\beta _\Delta \) grows above \(\approx 0.3\), as the selection-induced bias is non-negligible. There is no formal threshold for non-negligible bias, but in Appendix A we show how the difference between elpd estimates can be connected to the notion of non-zero weights in model averaging. The magnitude of the bias also grows with K, seen in the gap between the LOO-CV elpd and test elpd of the selected model.

Fig. 3 Simulated nested linear regression example. The red lines indicate the best LOO-CV elpd difference of K candidates from a baseline model (the horizontal zero line), and the black lines are the test elpd difference. In grey is the test elpd of the model fit with the true predictors. The region between the red and black curves is selection-induced bias, whereas the region between the oracle model (this is the zero line up to \(\beta _\Delta \approx 0.3\), after which it is the grey line) and black curves is what we refer to as over-fitting. The magnitude of the selection-induced bias grows with K. (Color figure online)

When the true value of \(\beta _\Delta \) is low, the model containing the known true predictors does not out-perform the baseline model in terms of test elpd. It is in these situations where the difference between the oracle model and so-called true model becomes apparent. In this case, the baseline model has the lowest excess loss in expectation over data realisations. When the true model performs worse than the oracle model, but we proceed to select the former, we have over-fitting. This excess loss from the oracle model is the difference from the baseline model to the selected model’s test elpd. As \(\beta _\Delta \) grows sufficiently large to consistently select the true model, this tends to zero, as too does selection-induced bias. The region between the LOO-CV and test elpd of the selected model is the selection-induced bias, while the region between the oracle elpd and the test elpd of the selected model is over-fitting. In this example, the magnitude of over-fitting is negligible across all values of \(\beta _\Delta \).

Since there exist situations where the true model performs worse than the oracle model, we find some justification for our previous discussion in Sect. 2.2. Namely, we should be more interested in identifying models which are close to the oracle model rather than retrieving the true model (since the two are not equivalent). Indeed, finding the true predictors is not useful if it worsens prediction. These true predictors we “fail” to select are likely to have a coefficient magnitude below the detection limit, as illustrated when we increase K for a small (but non-zero) \(\beta _\Delta \).

Knowing then that the risk of selection-induced bias exists, and not knowing the true value of \(\beta _\Delta \), we presently concern ourselves with producing a selection-induced bias estimation procedure geared towards identifying high risk regimes in the presence of many candidate models by modelling the distribution of LOO-CV elpd difference estimates.

3.2 Our proposal: modelling the elpd differences

Suppose now that we have K candidate models, none of which is better than any other. The best LOO-CV elpd difference point estimate from a baseline model may appear, purely by chance, to be greater than zero. By modelling the underlying distribution of the elpd difference point estimates as iid draws from a Gaussian distribution centred on zero (so that none is truly better than the baseline model) with some variance \(\sigma _K^2\), \(\Delta {\widehat{\textrm{elpd}}_\textrm{LOO}}^{(i)} \sim \textrm{normal}(0,\sigma _K^2),\,i = 1,\dotsc ,K\), we can leverage order statistics to produce a selection-induced bias estimate.

Blom (1960) posited an approximation to the expected maximum from K iid samples from a standard normal distribution, denoted \(S^{(K)}\) (later developed by Royston 1982; Harter 1961):

$$\begin{aligned} S^{(K)}=\Phi ^{-1}\left( \frac{K - \alpha }{K - 2\alpha + 1} \right) . \end{aligned}$$
(14)

The formulation allows for an approximation parameter, \(\alpha \). In the limit as \(K\rightarrow \infty \), the expectation of the maximum order statistic lies within the interval spanned by the approximations at \(\alpha = 0.39\) and \(\alpha = 0.5\). As such, for maximum safety we will use \(\alpha = 0.5\) herein.

In the presence of K candidate models, they are practically equivalent if the maximum LOO-CV elpd difference point estimate to the baseline model is less than

$$\begin{aligned} S^{(K)} \hat{\sigma }_K. \end{aligned}$$
(15)

We estimate \(\hat{\sigma }_K\) as the maximum likelihood estimator of the standard deviation of a half-normal distribution fit to the upper half-tail of elpd difference point estimates (greater than their median, \(\hat{m}_K\)),

$$\begin{aligned} \hat{\sigma }_K = \sqrt{\frac{2}{K}\sum _{k=1}^{K}\textbf{1}_{[\hat{m}_K,\,\infty )}\left( \Delta {\widehat{\textrm{elpd}}_\textrm{LOO}}^{(k)}\right) \left( \Delta {\widehat{\textrm{elpd}}_\textrm{LOO}}^{(k)} - \hat{m}_K\right) ^2}. \end{aligned}$$
(16)

We denote by \(\textbf{1}_A(x)\) the indicator function over a set \(A \subset X\) at \(x\in X\). This is not yet an estimate of the selection-induced bias, but the highest LOO-CV elpd difference expected from a collection of K models, none of them truly more predictive than any other. In particular, this is an estimate of the expected distance between the zero line and the red line in Fig. 3. We discuss later in Sect. 4.1 how one might use this as the basis for such an estimate of the selection-induced bias (the difference between red and black line in Fig. 3).

We fit a half-normal distribution to the upper half of the observed elpd difference point estimates instead of fitting a full normal distribution over all of them because (even if none of them is truly relevant) the distribution of elpd difference point estimates is likely to be slightly skewed. A similar result has previously been noted by Sivula et al. (2022), who showed the full elpd difference distribution (not just the point estimates) to be more skewed the more similar the models being compared are. We interest ourselves only in the right tail of the distribution of point estimates as it is the part most relevant to the maximum order statistic. We discuss the validity of this stabilisation technique and how to diagnose it in Sect. 3.3.
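A compact sketch of the proposal, combining Blom's approximation (Eq. 14) with \(\alpha = 0.5\), the half-normal scale estimate fit to the upper half of the elpd difference point estimates (Eq. 16), and the resulting threshold of Eq. 15; the function names are our own, not those of an existing package.

```python
import numpy as np
from scipy.stats import norm

def expected_max_order_stat(K, alpha=0.5):
    """Blom's approximation to the expected maximum of K iid standard normals (Eq. 14)."""
    return norm.ppf((K - alpha) / (K - 2 * alpha + 1))

def half_normal_scale(elpd_diffs):
    """Half-normal scale fit to the point estimates above their median (Eq. 16)."""
    elpd_diffs = np.asarray(elpd_diffs)
    K = len(elpd_diffs)
    m = np.median(elpd_diffs)
    upper = elpd_diffs[elpd_diffs >= m]
    return np.sqrt(2.0 / K * np.sum((upper - m) ** 2))

def equivalence_threshold(elpd_diffs, alpha=0.5):
    """Largest elpd difference expected by chance from K equivalent models (Eq. 15)."""
    return expected_max_order_stat(len(elpd_diffs), alpha) * half_normal_scale(elpd_diffs)

# If max(elpd_diffs) < equivalence_threshold(elpd_diffs), the K candidates are
# treated as practically equivalent to the baseline model.
```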

In Fig. 4, we fix \(\beta _\Delta = 0\) and vary K in the experiment design previously described by Eqs. 11–12. For each value of K, we investigate the best observed LOO-CV elpd difference to the baseline model. Since we have fixed \(\beta _\Delta = 0\), we know that no model is theoretically better than any other. As such, these empirical LOO-CV elpd differences are just random noise. We repeat the experiment 100 times for each value of K and show the mean of the empirically observed best LOO-CV elpd difference in each regime compared to the expected best observed LOO-CV elpd difference as computed according to Eq. 15. This theoretical boundary grows quickly for K less than 10, before slowing. Our order statistics-based equation closely matches empirical observations in this example.

Fig. 4 Simulated nested linear regression example. The mean best LOO-CV elpd difference of K candidate models from the baseline model (black dots) over 100 data realisations when \(\beta _\Delta = 0\), and the expected best observed LOO-CV elpd difference of K models as computed according to Eq. 15 (dashed grey line)

A second empirical demonstration of Eq. 15 is provided in Appendix B with reference to a real data study on the effect of a poverty reduction intervention on infant brain activity.

3.3 Diagnosing the theoretic assumptions

We have made a number of assumptions thus far in our analysis. In particular, we have assumed that the elpd difference distributions of models of equivalent predictive performance are:

  1. normally distributed around zero (or at least half-normal above zero); and,

  2. independent and identically distributed.

It is known that the empirical elpd difference distributions are slightly skewed, with heavier right tails (this was seen in empirical results not shown here, and has previously been discussed by Sivula et al. 2022). This behaviour becomes more pronounced as models become closer together (through the use of regularising priors, for example), and correlation between the estimates may appear as models make more similar predictions. It is possible then that the distribution of elpd difference point estimates (which we interest ourselves in) may also be slightly skewed. As such, the normal approximation may not be representative of the underlying distribution, and the half-normal approximation is better able to capture the characteristics of the right tail. Indeed, it is this right tail that is of most interest to us, as it contributes most to the maximum order statistic.

It would be possible to fit a heavier-tailed distribution to this tail with more parameters, e.g. a half-Student-t, but as shown empirically, the half-normal works sufficiently well. Since it has fewer parameters we expect it to be more robust, and with the half-normal we also use information from the bulk of the distribution. For example, we could also have fit a Pareto distribution to the tail, as done by Vehtari et al. (2022), or a Student-t distribution. We would, however, then have to estimate more moments of the distribution, which we found to be unstable in our experiments. If the utility distributions are truly skewed, we expect our normal approximation to inject some bias, but if we use a more flexible approximation there would be higher variance. In general there is no obvious choice of optimal approximation to the utility distributions, and indeed this represents an interesting direction of future investigation.

We are also able to diagnose when our half-normal approximation is likely to fail: when the right tail of the empirical distribution has infinite variance. We follow Vehtari et al. (2022) and first identify a tail cut-off point of the empirical elpd difference point estimate distribution in line with Scarrott and MacDonald (2012), and subsequently fit a generalised Pareto distribution to this tail. Zhang and Stephens (2009) present an efficient Bayesian procedure for estimating the parameters of the distribution. In particular, the \(\hat{k}\) parameter of the generalised Pareto distribution is related to its number of fractional moments. We follow the rule of thumb presented by Vehtari et al. (2022, Section 3.2.4, Equation 13) in establishing the threshold value of \(\hat{k}\) below which we can be confident that the right tail has finite variance when comparing 10 or more models.

When this Pareto \({\hat{k}}\) value is too large, it suggests that the elpd difference point estimates are likely from a distribution with infinite variance. It is at this stage that one should fall back on safer, albeit more computationally expensive approaches: the double-cross (Stone 1974) or the bootstrap (Tibshirani and Tibshirani 2009).
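A rough sketch of this diagnostic follows; it uses a fixed tail fraction and scipy's maximum-likelihood fit of the generalised Pareto distribution in place of the adaptive cut-off of Scarrott and MacDonald (2012) and the estimator of Zhang and Stephens (2009) referenced above, and simply flags \(\hat{k} \ge 0.5\) (the point at which the fitted tail no longer has finite variance).

```python
import numpy as np
from scipy.stats import genpareto

def tail_khat(elpd_diffs, tail_frac=0.2):
    """Fit a generalised Pareto distribution to the upper tail of the elpd
    difference point estimates and return the estimated shape (k-hat)."""
    x = np.sort(np.asarray(elpd_diffs))
    n_tail = max(5, int(np.ceil(tail_frac * len(x))))
    cutoff = x[-n_tail - 1]              # tail cut-off point
    exceedances = x[-n_tail:] - cutoff   # values beyond the cut-off
    khat, _, _ = genpareto.fit(exceedances, floc=0.0)
    return khat

# A value khat >= 0.5 suggests an infinite-variance right tail: fall back on
# nested CV or the bootstrap rather than the order-statistic correction.
```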

Elsewhere, we know that the distribution of the maximum order statistic (of which we interest ourselves only in the mean) is slightly skewed. Because of this skewness, it is possible that the mean alone (which we have denoted \(S^{(K)}\)) under-estimates the maximum of LOO-CV elpd difference estimates.

When the model space grows too large to consider each possible model at once, and if the models can be nested, then one might favour some efficient search through the model space. Such searches come with their own risks. We presently move from the realm of single decisions towards sequential model selection.

4 Compounding the bias through forward search

While selection-induced bias might remain negligible when we need only select one model from a group of models once, when we compound decisions, the magnitude of the bias is likely to compound too. For instance, in forward search, we begin with a baseline model (usually the intercept-only model) and then consider all models containing one predictor and the intercept term. Having fit all so-called “size-one” candidate models, we choose the one with the highest LOO-CV elpd point estimate. We now perform another forward step, and fit all models containing the intercept, the predictor chosen previously, and a second additional predictor. Once more we select the size-two model based on the highest LOO-CV elpd point estimate. We repeat this until either some stopping criterion is met, or we have exhausted all available predictors.
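Schematically, the forward search just described might be written as follows, where fit_and_elpd is a hypothetical helper that fits the model containing the given predictors (plus the intercept) and returns its LOO-CV elpd point estimate.

```python
def forward_search(candidate_predictors, fit_and_elpd, max_size=None):
    """Greedy forward search ranked by LOO-CV elpd point estimates.

    candidate_predictors : iterable of predictor indices.
    fit_and_elpd         : callable mapping a tuple of predictor indices to the
                           LOO-CV elpd point estimate of the fitted model.
    Returns the search path as a list of (selected_predictors, elpd) pairs.
    """
    remaining = set(candidate_predictors)
    selected = ()
    path = [(selected, fit_and_elpd(selected))]  # baseline: intercept-only model
    max_size = len(remaining) if max_size is None else max_size

    while remaining and len(selected) < max_size:
        # Fit and score every model that adds one remaining predictor.
        scores = {j: fit_and_elpd(selected + (j,)) for j in remaining}
        best = max(scores, key=scores.get)       # highest LOO-CV elpd point estimate
        selected += (best,)
        remaining.remove(best)
        path.append((selected, scores[best]))
    return path
```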

We can easily implement various other search heuristics, such as exhaustive search (Galatenko et al. 2015), backward search (Nilsson 1998), and stochastic search (George and McCulloch 1993; Ntzoufras et al. 2000). However, we proceed with forward search alone as it affords the possibility of an efficient stopping rule (Harrell 2001) and intuitive compounded bias estimation (which we present in Sect. 4.1).

At the beginning of this search, it is likely that there exist at least some predictors that improve on the predictive performance of the baseline model, but there comes a point after which the addition of further predictors affords no theoretical improvement in predictive performance. We will refer to this as the “point of predictive saturation” herein. We wish to identify this point, and estimate the bias induced by continuing the search beyond it. Under appropriate prior consideration, the full model is considered the oracle model. If one witnesses a sub-model having better test performance than the full model, then we recommend they reconsider their priors. In practice, we do not need to identify the oracle model size, only a sub-model having negligible excess loss compared to the oracle model.

When we observe a separation of the LOO-CV and test elpd along the search path, we can conclude that there is non-negligible selection-induced bias in the performance estimates: purely due to the predictor ordering achieved by LOO-CV based search, we are led to believe that some sub-models can outperform larger models. This is seen by the elpd estimate at the bulge being greater than that of the final full model. In practice, McLatchie et al. (2023) argue that should one observe such a bulge in the forward search path, then one should cross-validate over search paths (something akin to the double-cross), but only up to the model size inducing this bulge. This is because cross-validating over forward search paths can be expensive, and we understand that the model size of maximal selection-induced bias is likely greater than the model size of theoretical predictive saturation.

The dynamics of forward search can be seen in the relationship between the predictive criterion (the elpd, say) and the model sizes along the so-called “search path”. The plot typically shows an increasing elpd curve, indicating that the addition of further predictors improves predictive performance. We expect the curve to flatten after a given model size, indicating that the addition of further predictors does not improve the criterion any longer: the point of predictive saturation. However, it is possible that in the finite data regime, and under model misspecification, we instead witness a bulge along this search path, such as that shown in Fig. 5.

Fig. 5 Illustrative sketch. In red we show the LOO-CV elpd difference to the full (reference) model along growing model size, and in grey the out-of-sample elpd difference. Selection-induced bias is the divergence of the red line from the grey. Over-fitting is the drop in predictive performance with increased model size following the point of predictive saturation. (Color figure online)

The test curve in Fig. 5 reaches the point of predictive saturation and then dips down with increased model size. This is not necessarily selection-induced bias, but could also be over-fitting. Indeed, we aim to correct selection-induced bias to identify non-negligible over-fitting in forward search.

In order to correct selection-induced bias and estimate the location of the point of predictive saturation (and the bias induced by continuing beyond it) we introduce the following application of our proposal from Sect. 3.2.

4.1 Correcting for selection-induced bias in forward search

We can estimate the selection-induced bias at model size k in a forward search path, and correct the LOO-CV elpd difference estimate, denoted \(\Delta {\widehat{\textrm{elpd}}_\textrm{LOO}}^{(k)}\), accordingly by

$$\begin{aligned} \Delta {\widehat{\textrm{elpd}}_\textrm{corrected}}^{(k)} = \left\{ \begin{array}{ll} \Delta {\widehat{\textrm{elpd}}_\textrm{LOO}}^{(k)} - \widehat{\text {bias}}^{(k)}, &{} \quad \text {if }\,\vert \Delta {\widehat{\textrm{elpd}}_\textrm{LOO}}^{(k)}\vert < S^{(k)} \hat{\sigma }_k, \\ \Delta {\widehat{\textrm{elpd}}_\textrm{LOO}}^{(k)}, &{} \quad \text {otherwise}, \end{array}\right. \end{aligned}$$
(17)

where \(S^{(k)} \hat{\sigma }_k\) is as defined in Sect. 3.2. Intuitively, if we detect non-negligible selection-induced bias, then we correct the LOO-CV elpd difference by an estimate of the selection-induced bias induced by step k.

We produce this estimate of selection-induced bias, denoted \(\widehat{\text {bias}}^{(k)}\), building on Sect. 3.2 as \(\widehat{\text {bias}}^{(k)} = 1.5\times S^{(k)} \hat{\sigma }_k\). The nature of this estimate stems from Watanabe (2010, Table 3), who discussed how cross-validated scores and the average Bayes generalisation loss can exhibit strong negative correlation (albeit demonstrated with only one example). Theoretically this is due to the removal of an (influential) observation from the training set skewing the posterior predictive distribution away from the true data-generating distribution, and its re-appearance in the test set seeming like an outlier. As such, the correction of LOO-CV elpd estimates warrants not only a correction to the median (the difference between the red line and the zero line in Fig. 3), but a negatively-correlated correction (the difference between the red and black lines in Fig. 3). In a word, the corrected elpd difference value should be a reflection across the median of candidate elpd differences. The choice of magnitude of 1.5 is reflective of the magnitude of the negative correlation of approximately \(-0.8\) noted by Watanabe (2010). However, since this number was achieved only over one example, we understand that there can be instances where this correlation can be closer to 0 or \(-1\). As such, we show repetitions of some later experiments using values of \(\{1, 1.5, 2\}\) when estimating the bias in Appendix C.

Cumulatively summing these corrected difference estimates,

$$\begin{aligned} {\widehat{\textrm{elpd}}_\textrm{corrected}}^{(k)} = {\widehat{\textrm{elpd}}_\textrm{LOO}}^{(0)} + \sum _{i=1}^k \Delta {\widehat{\textrm{elpd}}_\textrm{corrected}}^{(i)}, \end{aligned}$$
(18)

starting at the base LOO-CV elpd estimate at model size 0, denoted \({\widehat{\textrm{elpd}}_\textrm{LOO}}^{(0)}\), allows us to estimate the bias induced by continuing beyond the point of predictive saturation. As such, our proposition comprises two parts: an order statistics-based estimation of selection-induced bias (from Sect. 3.2), and an associated correction term.
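Putting Sects. 3.2 and 4.1 together, a self-contained sketch of the corrected path follows. Here elpd_diffs_per_step is assumed to hold, for each forward-search step, the LOO-CV elpd difference point estimates of all candidate models considered at that step (with the maximum being the selected difference), and we read \(S^{(k)}\hat{\sigma}_k\) as the Eq. 15 threshold computed from those candidates; the factor 1.5 is the default discussed above.

```python
import numpy as np
from scipy.stats import norm

def corrected_search_path(elpd_loo_base, elpd_diffs_per_step,
                          correction_factor=1.5, alpha=0.5):
    """Bias-corrected forward-search path (Eqs. 17-18)."""
    path = [elpd_loo_base]
    for diffs in elpd_diffs_per_step:
        diffs = np.asarray(diffs)
        K = len(diffs)
        # Eq. 14: Blom's approximation to the expected maximum of K standard normals.
        s_K = norm.ppf((K - alpha) / (K - 2 * alpha + 1))
        # Eq. 16: half-normal scale fit to the upper half of the candidate differences.
        m = np.median(diffs)
        sigma_K = np.sqrt(2.0 / K * np.sum((diffs[diffs >= m] - m) ** 2))
        threshold = s_K * sigma_K  # Eq. 15

        selected = np.max(diffs)   # elpd difference of the selected model at this step
        if abs(selected) < threshold:
            # Eq. 17: subtract the bias estimate 1.5 * S^(k) * sigma_hat_k.
            selected -= correction_factor * threshold
        path.append(path[-1] + selected)  # Eq. 18: cumulative sum along the path
    return np.array(path)
```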

4.2 A simulated example

We can visualise the performance of our correction through a simple simulated data experiment: we generate data according to

$$\begin{aligned} x&\sim \textrm{normal}(0,R) \end{aligned}$$
(19)
$$\begin{aligned} y&\sim \textrm{normal}(w^Tx,\sigma ^2), \end{aligned}$$
(20)

where the number of predictors is set to \(p = 100\), \(\textrm{normal}(\mu , \Sigma )\) denotes a p-dimensional normal distribution with mean vector \(\mu \) and covariance matrix \(\Sigma \). The matrix \(R \in \mathbb {R}^{p\times p}\) is block diagonal, with each block having dimension \(5 \times 5\). Each predictor has mean zero and unit variance and is correlated with the other four predictors in its block with coefficient \(\rho \), and uncorrelated with the predictors in all other blocks. We will investigate regimes of zero and high correlation: \(\rho =\{0, 0.9\}\). Further, the weights w are such that only the first 15 predictors influence the target y with weights \((w_{1:5},w_{6:10},w_{11:15}) = (\xi ,0.5 \xi ,0.25 \xi )\), and zero otherwise. We set \(\xi = 0.59\) and \(\sigma ^2=1\) to fix \(R^2 = 0.7\). We simulate \(n=\{100, 200, 400\}\) data points according to this data-generating process (DGP). Across all six n and \(\rho \) regimes, we perform forward search based on the maximum LOO-CV elpd point estimate of the candidate model, fitting each model using independent Gaussian priors over the regression coefficients (which we understand to be liable to over-fit), and comparing them to a reference model fit with the sparsity-inducing R2D2 prior (which is discussed in more detail in Sect. 4.3 and Appendix D).
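The data-generating process of Eqs. 19–20 with the block-correlation structure described above can be sketched as follows (design only; the Bayesian model fits, the R2D2 reference model, and the forward search itself are not reproduced here).

```python
import numpy as np
from scipy.linalg import block_diag

def simulate_block_dgp(n=100, p=100, rho=0.9, block=5, xi=0.59, sigma2=1.0, seed=0):
    """Simulate data from Eqs. 19-20 with a block-diagonal correlation matrix R."""
    rng = np.random.default_rng(seed)
    # Each block has unit variances and pairwise correlation rho.
    B = (1 - rho) * np.eye(block) + rho * np.ones((block, block))
    R = block_diag(*[B] * (p // block))
    x = rng.multivariate_normal(mean=np.zeros(p), cov=R, size=n)

    # Only the first 15 predictors are relevant, with decaying weights.
    w = np.zeros(p)
    w[0:5], w[5:10], w[10:15] = xi, 0.5 * xi, 0.25 * xi
    y = rng.normal(loc=x @ w, scale=np.sqrt(sigma2))
    return x, y
```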

In Fig. 6a we visualise the performance of our bias correction across these six regimes in terms of the mean log predictive density (mlpd) computed on \(\tilde{n}\) independent test observations, \(\tilde{y}_{1:\tilde{n}}\), as

$$\begin{aligned} \textrm{mlpd}\left( \textrm{M}_k\mid \tilde{y}\right) = \frac{1}{\tilde{n}}\sum _{i=1}^{\tilde{n}} \log p_k(\tilde{y}_i\mid y), \end{aligned}$$
(21)

which admits a LOO-CV difference estimate similar to the elpd (Piironen and Vehtari 2017a). We do not correct the LOO-CV differences beyond their LOO-CV bulge since at this point it is already clear that we have over-fit the data.

We observe from these plots two positive results and one negative:

  1. the corrected mlpd does not greatly surpass the mlpd of the reference model, contrary to the LOO-CV mlpd;

  2. the corrected mlpd estimates the point of predictive saturation close to the true point of predictive saturation on test data; however,

  3. the corrected mlpd of the search path can under-estimate the true selection-induced bias slightly in cases of severe over-fitting (most notably when \(n = p\)).

This suggests that a simple order statistic model is able to represent selection-induced bias well enough to be useful. Naturally, the quality of our correction improves with data size and the number of candidate models.

The approximation performs best in the larger data settings. It would further appear that, in this example, once \(n > 4p\) the degree of selection-induced bias is negligible. This could be due in part to the negative correlation between Bayes cross-validation and the average Bayes generalisation loss (Watanabe 2010): even if we know the underlying distribution of LOO-CV elpd estimates, the asymptotic theoretical elpd distribution is likely negatively correlated with it (and worse in absolute value). Such correlation is naturally exaggerated in smaller data regimes, as the relative influence of data points grows, and our proposal under-estimates the true selection-induced bias. Despite under-estimating the bias slightly in the \(n = p\) case, it appears to correct enough of the selection-induced bias to accurately and efficiently identify over-fitting across all regimes.

As previously mentioned, it is not clear how one would motivate any particular selection-induced bias correction beyond what we have proposed (see Sect. 6 for more discussion, and Appendix C for a repetition of the experiments using different levels of correction), and we rather recommend that one endeavours instead to mitigate against non-negligible over-fitting in the first instance. One such way is with a more principled application of the prior.

4.3 Reducing selection-induced bias with priors

Gelman et al. (2017) reason that prior distributions in Bayesian inference should be considered generatively in terms of the potential observations consistent with the prior, and predictively in terms of how these observations align with the data we truly collect. Cawley and Talbot (2010) also discuss the benefits of using regularising techniques to avoid over-fitting in selection; generative priors are such a regulariser.

Fig. 6 Forward search simulation experiment. The red lines show the LOO-CV mlpd difference to the reference model over 20 iterations, the grey lines show the test mlpd difference along those same paths, and the black show the corrected mlpd difference, produced using our correction from Sect. 4.1. The vertical dashed red and black lines indicate the point of maximum LOO-CV and corrected LOO-CV mlpd difference respectively. (Color figure online)

Joint shrinkage priors are a modern class of such generative priors which encode the statistician’s beliefs about a predictive metric and filter this information through the model structure to induce priors on the model parameters. For instance, the R2D2(M2) prior (Zhang et al. 2022; Aguilar and Bürkner 2023) encodes a prior over the model’s \(R^2\) which is then mapped to a joint prior over the regression coefficients and the residual variance. By positing a lower expected predictive performance in terms of explained variance than what is implied by independent Gaussian priors, we shrink irrelevant regression coefficients more. This posterior shrinkage and reduced posterior uncertainty then naturally feeds through to the predictive distribution and thus the LOO-CV elpd estimate in Eq. 2. This means that candidate models make more similar predictions, the empirical distribution of elpd difference point estimates is more concentrated, and the magnitude of selection-induced bias is reduced. See Appendix D for more discussion on the specifics of the priors and the hyper-parameters used throughout this paper.

Repeating our example from Sect. 4.2, but using a more regularising generative prior, we can reduce the degree of selection-induced bias along the search path. In Fig. 6(b), we find that once more our heuristic identifies the correct model size after which no more predictors truly aid predictive performance, and recovers the shape of the test curve. We do, however, under-estimate the bias slightly in the \(n = p\) regimes. This is likely due to the models making more similar predictions, and thus our proposed Eq. 15 will likely under-estimate the magnitude of the bias. However, the identified points of inflection are similar under both priors, reassuring us that our selection-induced bias correction is stable across different bias regimes.

Priors are not the only way one might consider reducing selection-induced bias in forward search; projection predictive inference is one information-theoretic approach to the task that has been proven to be very effective.

4.4 Projection predictive inference

We will later compare the model sizes selected following our bias correction to projection predictive inference. The projection approach is known to greatly reduce the variance in the relative performance estimates, leading to much smaller selection-induced bias (see, e.g. Piironen and Vehtari 2017a).

Projection predictive inference was first introduced concretely by Goutis (1998) and Dupuis and Robert (2003), but follows the decision-theoretical idea presented previously by Lindley (1968). In this paradigm, one begins with a large reference model one believes to be well-suited to prediction and subsequently constructs small models capable of approximately replicating its predictive performance; essentially, fitting sub-models to the fit of the reference model.

We call the model we project the reference model onto the “restricted model” or “sub-model”, and denote its parameter space by \(\Theta ^\perp \) (which may be completely unrelated to the parameter space of the reference model, \(\Theta ^*\)). Further denote the predictor-conditional predictive distribution of the restricted model on \(\tilde{n}\) unobserved data points \(\tilde{y} = (\tilde{y}_1,\dotsc ,\tilde{y}_{\tilde{n}})^T\) by \(q(\tilde{y}_i\mid \theta ^\perp )\), for \(\theta ^\perp \in \Theta ^\perp \) and \(i \in \{1, \dotsc , n\}\), and that of the reference model by \(p(\tilde{y}_i\mid \theta ^*)\). Then we compute the projection draw-wise on \(s \in \{1, \dotsc , S\}\) posterior draws from the reference model as

$$\begin{aligned} \theta ^\perp _s = \mathop {\mathrm {arg\,min}}\limits _{\theta ^\perp \in \Theta ^\perp } \frac{1}{n} \sum _{i = 1}^n \mathbb{K}\mathbb{L}\left\{ \, p(\tilde{y}_i\mid \theta ^*_s) \,\Vert \, q(\tilde{y}_i\mid \theta ^\perp ) \,\right\} . \end{aligned}$$
(22)

For implementation details see the papers by Catalina et al. (2021, 2020), Weber and Vehtari (2023), and Piironen et al. (2020). McLatchie et al. (2023) outline further workflow steps to minimise the computational burden while achieving robust results.
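As an illustration, for a Gaussian observation family the minimisation in Eq. 22 has a well-known closed form: the draw-wise projection is a least-squares fit of the reference model’s linear predictor onto the restricted design, with the remaining discrepancy absorbed into the projected residual variance. A minimal NumPy sketch (the function and argument names are ours, not those of any package) is:

```python
import numpy as np

def project_draw_gaussian(X_sub, f_ref, sigma2_ref):
    """Project one posterior draw of a Gaussian reference model onto a sub-model.

    X_sub      : (n, k) design matrix of the restricted model
    f_ref      : (n,) reference-model linear predictor for this draw
    sigma2_ref : reference-model residual variance for this draw
    """
    # minimising the average KL in Eq. 22 reduces to least squares on the reference fit
    beta_sub, *_ = np.linalg.lstsq(X_sub, f_ref, rcond=None)
    resid = f_ref - X_sub @ beta_sub
    # the projected variance absorbs the discrepancy between the two fits
    sigma2_sub = sigma2_ref + np.mean(resid**2)
    return beta_sub, sigma2_sub
```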

Projective forward search begins by projecting the reference model onto the intercept-only model to establish the root of the search tree. Subsequently, the reference model is projected onto the size-one models, from which we select the model whose posterior predictive distribution is most similar to the reference model’s posterior predictive distribution by the KL divergence, in line with the work of Leamer (1979). The search then continues as before. Because the search is driven by the KL divergence rather than by LOO-CV, our order statistics-based bias estimation and correction can not be immediately translated to work with projection predictive inference. It is this disconnect between the KL divergence and predictive performance, noted by Robert (2014), however, which is partially responsible for the reduced selection-induced bias in projection predictive inference.
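To make the search loop concrete, the following sketch (again restricted to the Gaussian case, omitting the intercept-only root for brevity, and illustrative of the idea rather than of any package implementation) greedily adds, at each step, the predictor whose projection lies closest to the reference model in average KL divergence:

```python
import numpy as np

def projective_forward_search(X, f_ref, sigma2_ref, max_size):
    """Greedy KL forward search for a Gaussian reference model (illustrative sketch).

    X          : (n, p) design matrix of candidate predictors
    f_ref      : (S, n) reference-model linear predictor, one row per posterior draw
    sigma2_ref : (S,) reference-model residual variance per draw
    """
    selected, remaining, path = [], list(range(X.shape[1])), []
    for _ in range(max_size):
        kl_by_candidate = {}
        for j in remaining:
            X_sub = X[:, selected + [j]]
            kl = 0.0
            for f, s2 in zip(f_ref, sigma2_ref):
                beta, *_ = np.linalg.lstsq(X_sub, f, rcond=None)  # draw-wise projection
                mse = np.mean((f - X_sub @ beta) ** 2)
                s2_sub = s2 + mse                                  # projected residual variance
                # average KL between the Gaussian predictive densities of the two models
                kl += 0.5 * (np.log(s2_sub / s2) + (s2 + mse) / s2_sub - 1.0)
            kl_by_candidate[j] = kl / len(sigma2_ref)
        best = min(kl_by_candidate, key=kl_by_candidate.get)      # most reference-like addition
        selected.append(best)
        remaining.remove(best)
        path.append((tuple(selected), kl_by_candidate[best]))
    return path
```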

As such, if the reference model (and sub-models) can be constructed in a way that conforms with the requirements of projection predictive inference, then we recommend one begins with this. For observation families whose KL divergence can not be easily computed, or for model structures that can not be easily decomposed into additive components, cross-validation forward search (as described at the beginning of Sect. 4) can be a useful alternative.

4.5 Real world data

To demonstrate the performance of our selection-induced bias correction beyond simulated settings, we compare it with classical stopping criteria, including projection predictive inference, across four publicly available real world datasets: one regression task (Redmond and Baveja 2002) and three binary classification tasks (Gorman and Sejnowski 1988; Sigillito et al. 1989; Cios and Kurgan 2001), summarised in Table 1 (in Appendix D) after any necessary pre-processing. For the regression task we fit a linear regression reference model with a Gaussian observation family, and for the binary classification tasks a Bernoulli generalised linear model, as implemented in brms (Bürkner 2017), using the probit link function. For each dataset, we perform forward search based on LOO-CV elpd, fitting sub-models with independent standard Gaussian priors over the regression coefficients in the first instance, and then again with sparse priors (the R2D2 prior for the regression task, and the regularised horseshoe for the classification tasks; see Appendix D for details on the hyper-parameters used for each task). We split all data into 10 cross-validation folds, leaving \(10\%\) of the data for testing in each fold. In Fig. 7 we show the cross-validated and test search paths for each of these, along with the corrected search paths as computed using Sect. 4.1.

Fig. 7 Real world datasets. The red lines show the LOO-CV mlpd difference to the reference model over CV folds, while the grey lines show the test mlpd difference along those same folds. In black we see the corrected mlpd difference, produced using our correction from Sect. 4.1. (Color figure online)

Across all real data experiments, our corrected LOO-CV mlpd did not significantly exceed that of the reference model, except in the heart data. This indicates that our correction effectively addresses selection-induced bias without falsely suggesting better performance than the reference model. In some cases, our correction appeared to under-correct for small model sizes but eventually recovered the true test mlpd. In the sonar data we witnessed the opposite: our proposal over-corrected the bias. Even when our correction over- or under-estimates the bias, it provides valuable insight into the behaviour of the test mlpd.

We now extend our selection-induced bias correction to produce a stopping criterion: we select the model size inducing the maximum corrected mlpd along the search path, as computed in Sect. 4.1. We compare this to simply stopping at the point of maximum LOO-CV mlpd (which we call the “bulge” rule), and to projection predictive inference. Again, we do not look to identify the oracle model size, but rather hope to select a model size having negligible excess loss compared to the oracle model. Alongside the bulge, we consider the so-called \(2\sigma \) rule, which proposes that one select the smallest model that is essentially indistinguishable from the model with the best LOO-CV performance, as judged by the two standard error simultaneous confidence interval.
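The three LOO-CV-based heuristics can be summarised schematically as follows; as a simplification, the sketch uses the per-size standard errors in place of the simultaneous interval, and all names are ours rather than those of any package:

```python
import numpy as np

def stopping_rules(mlpd_diff, se_diff, mlpd_corrected):
    """Model sizes chosen by the bulge, 2-sigma, and corrected-mlpd rules (sketch).

    mlpd_diff      : (K+1,) LOO-CV mlpd difference to the reference model, by model size
    se_diff        : (K+1,) standard errors of those differences
    mlpd_corrected : (K+1,) corrected differences, e.g. from the rule of Sect. 4.1
    """
    bulge = int(np.argmax(mlpd_diff))           # size maximising the LOO-CV mlpd difference
    corrected = int(np.argmax(mlpd_corrected))  # size maximising the corrected difference
    # 2-sigma rule: smallest size whose interval reaches the bulge model's estimate
    indistinct = np.flatnonzero(mlpd_diff + 2.0 * se_diff >= mlpd_diff[bulge])
    two_sigma = int(indistinct.min())
    return {"bulge": bulge, "2sigma": two_sigma, "corrected": corrected}
```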

In projection predictive model selection, we first obtain a search path in line with the discussion in Sect. 4.4. We then choose the smallest model size \(k\) whose (projected sub-model) elpd difference point estimate to the reference model (for which we use the full model fitted by standard Bayesian inference) is less than four. This is in line with the recommendation of McLatchie et al. (2023), and is discussed further in Appendix A. We compare the model sizes chosen and their stability across forward search cross-validation folds in Fig. 8. We make three primary observations:

Fig. 8 Real world datasets. In the bottom row we show the selected model sizes in four real world datasets across different heuristics when using standard independent Gaussian priors (in black) and the sparsity-inducing priors (in red). The top row shows the test mlpd difference to the reference model of the model sizes selected by each heuristic. (Color figure online)

  1. projection predictive inference selects smaller model sizes than the most naïve approaches and does so stably;

  2. our correction achieves slightly smaller model sizes than the projective approach with similar mlpd, and requires fewer iterations of forward search than the bulge and \(2\sigma \) rules (while achieving better mlpd than the latter two); and,

  3. regardless of model size selection heuristic, the use of shrinkage priors results in smaller and more stable model size selection.

Across all data, the \(2\sigma \) rule chooses the smallest sub-model sizes and the bulge rule the largest. The model sizes selected by our correction are, by contrast, slightly smaller than those selected by projection predictive inference. There is more variation in the model size selected by the bulge rule. In fact, model size selection is generally more variable when using the independent Gaussian priors, irrespective of the heuristic. Our correction is similarly stable to projection predictive inference, which is known to be very stable under the KL forward search heuristic, lending confidence to our own tool. Indeed, the good performance of projection predictive inference has been extensively covered by Piironen and Vehtari (2017a). Our correction identifies models with better predictive performance than the more naïve \(2\sigma \) and bulge rules, and with better stability.

The model size and mlpd difference estimated by our correction do not seem to be sensitive to the prior, unlike the LOO-CV bulge model sizes; this is most apparent in the heart data, where the use of Gaussian priors results in vastly reduced predictive performance under the bulge rule. Sparsity-inducing priors result in smaller and more stable model sizes being chosen across all data examples. This is likely related to our previous discussion of generative predictive priors: since they regularise in favour of predictive performance, they result in forward search paths that are less variable and less likely to introduce selection-induced bias (although they may lead us to slightly under-estimate the bias). Their predictive regularisation means that we are likely to identify relevant predictors more quickly and more stably, and ultimately to identify more predictive subsets of predictors.

The importance of identifying small models is particularly clear in this forward search example: when we must fit all models up to the bulge model size, this requires approximately \(30^2 = 900\) model fits in the sonar and crime case studies. When these models are large, this can mean days of computation time. While the \(2\sigma \) heuristic can identify small models with comparable predictive performance, it still requires fitting up to the bulge model size and then working back. In a word, it does not reduce the computational cost. As such, our bias correction is particularly attractive to those who are able to implement online stopping rules.

It would also have been possible to construct a stopping rule based on the inclusion of zero in the two standard error range of the LOO-CV elpd difference estimate. However, such heuristics struggle in the case of many weakly-relevant predictors. In particular, they are liable to identify too-early stopping points and achieve sub-optimal out-of-sample predictive performance when many predictors have weak (but non-zero) predictive information (e.g. in the ionosphere example). This is further discussed in Appendix E.

While our heuristic was not motivated from the perspective of model size selection, we find that even correcting just some of the selection-induced bias along forward search paths can reveal the behaviour of test performance. Using this corrected LOO-CV mlpd as a proxy for test mlpd is reasonable across these real data examples. As such, model size selection likely does not depend greatly on the assumed asymptotic correlation between LOO-CV and Bayes generalisation loss which we discussed in Sect. 4.1.

5 Recommendations

Cross-validation of predictive metrics is a common non-parametric method of making model selection decisions. It is particularly well-suited to cases when robust alternatives such as projection predictive inference are unavailable. Given the uncertainty present in LOO-CV estimates, it is possible that they are not always sufficient to make model selection decisions, or that they might encourage an incorrect decision based purely on noise. We discussed the risk of selecting between two candidate models (Scenario 1), presented a selection-induced bias estimation scheme based on order statistics in the case of selecting many candidate models (Scenario 2), and leveraged it to build a lightweight bias correction method in the forward search setting (Scenario 3).

Scenario 1: the two model case

In the case where we compare two models only, we can safely make a model selection decision if the elpd difference point estimate between them is greater than four in absolute value (see Appendix A for more discussion). Should the elpd difference point estimate be less than four in absolute value, we have three options: (1) if the models are not nested, combine them by model averaging or stacking; (2) ensure the models’ respective priors are reasonable and select the more complex of the two; or, (3) keep them both as a set of best models.

This scenario is the least concerning of the three, and in most cases we can safely select the more predictive of the two provided its priors are reasonable.

Scenario 2: the many model case

In the many model case all previous recommendations apply, and we might further use our proposal from Sect. 3.2 to determine, using order statistics, whether all models are equivalent in terms of predictive performance. If we detect non-negligible selection-induced bias, then we do not make a model selection decision and instead perform model averaging or stacking (in the non-nested case). Our estimation procedure was shown to closely match empirical observations, and requires no further computation beyond the sufficient statistics of standard CV estimates. We further proposed a diagnostic to determine when the underlying assumptions of our proposition break down, and more computationally expensive approaches can be used instead.

Scenario 3: forward search

When the model space is large and we consider nested models, we should first try projection predictive inference for model size selection. If the reference model is too difficult to fit, or if it belongs to an observation family which can not easily be implemented in the projective framework, then our lightweight selection-induced bias correction has a direct application: by performing CV forward search and correcting the incremental elpd differences along the search path according to Sect. 4.1, we can recover an accurate estimate of the out-of-sample performance along the search path, at a significantly reduced cost compared to more computationally expensive procedures. In cases of severe over-fitting, and occasionally with shrinkage priors, the correction slightly under-estimates the true bias. However, we found that we did not need to correct the selection-induced bias perfectly to accurately identify the point of predictive saturation in forward search. Indeed, just recovering the shape of the test elpd was sufficient for stable model size selection. Our procedure effectively and reliably identified the point of over-fitting, and provided an accurate estimate of the bias, closely matching empirical observations.

6 Discussion

The use of order statistics in estimating selection-induced bias was introduced here in its simplest form, and is not without its limitations. We now discuss those aspects which could be developed, and motivate future work in this area.

6.1 Limitations

Naturally, our selection-induced bias correction hinges on a number of assumptions which may not always be met in practice. For instance, should models produce correlated predictions, the distributions from which the elpd point estimates are sampled may not be independent, leading to some error in the order statistic approximation. We already witnessed this to a degree when using shrinkage priors. Likewise, the use of \(\alpha = 0.5\) in the order statistic approximation is known to be conservative, although it seemed to fit the empirical data well.
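For intuition on this conservativeness, a generic Blom-type plotting-position approximation to the expected maximum of \(K\) independent standard Gaussians can be compared against a Monte Carlo estimate; with \(\alpha = 0.5\) the approximation sits slightly above the truth, erring towards a larger correction. The sketch below is illustrative only, and is not the exact expression of Sect. 3.2:

```python
import numpy as np
from scipy.stats import norm

def blom_expected_max(K, alpha=0.5):
    """Plotting-position approximation to E[max of K iid standard Gaussians]."""
    return norm.ppf((K - alpha) / (K - 2.0 * alpha + 1.0))

rng = np.random.default_rng(0)
for K in (5, 20, 100):
    mc = rng.standard_normal((100_000, K)).max(axis=1).mean()  # Monte Carlo reference
    print(K, round(blom_expected_max(K), 3), round(mc, 3))     # alpha = 0.5 sits slightly above
```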

Further, while the elpd difference distribution of models with no extra predictive information over the baseline may theoretically be centred on zero, this is unlikely to be the case in practice. Indeed, this centering may in fact be slightly below zero. The choice of selection-induced bias correction based on the negative correlation results of Watanabe (2010) is also heuristic, and may not be representative of the general case. It is not immediately clear how the correction could be alternatively motivated in a robust manner, and we invite future work on the subject given the positive results we were able to achieve using a very simple model. Despite these limitations, our correction performed well across all simulated and real data exercises.

6.2 Future directions

It might be possible to improve on our simple correction by modelling the candidate models’ LOO-CV elpd differences with a hierarchical model, and making decisions based on the posterior of the meta-model rather than the raw LOO-CV estimates (Gelman et al. 2012; Schmitt et al. 2023). As well as collapsing all models into one group of indistinguishable models, as is the aim of partially-pooled models, one might produce a correlation-aware hierarchical model capable of identifying clusters of models exhibiting similar predictive performance. For example, one might consider a multivariate model with a similarity-based covariance matrix.

A more attainable extension is perhaps soft-thresholding the point at which we correct the LOO-CV elpd along the search path. The Fisher-Tippett-Gnedenko theorem tells us that the limiting distribution of properly-normalised maxima from a sequence of iid standard Gaussian observations follows a generalised extreme value distribution (the Gumbel distribution, in this case). Using this, or posterior weights from some meta-model, we might compute the probability of having observed a truly predictive model from a set of candidates, as sketched below. This, as opposed to using the order statistic’s mean as a hard threshold, would allow us to weight the correction we apply. While we did not observe any issue with the hard threshold we proposed, it is possible that in instances of many weakly-relevant predictors it is harder to separate models that are only negligibly different, and our correction rule may be too rigid.
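As one purely illustrative possibility, under the simplifying assumption of \(K\) independent standardised elpd differences with no truly predictive candidate, the probability that the best of them exceeds an observed standardised value \(z\) is \(1 - \Phi (z)^K\) (well-approximated by its Gumbel limit for large \(K\)); such a probability could be used to weight, rather than hard-threshold, the correction:

```python
from scipy.stats import norm

def exceedance_weight(z, K):
    """P(max of K iid standard Gaussians exceeds z): a candidate soft-thresholding weight.

    z : standardised observed best elpd difference; K : number of candidate models.
    """
    return 1.0 - norm.cdf(z) ** K

# e.g. with 50 candidates, an observed 2-sigma "best" model is unremarkable under the null
print(exceedance_weight(2.0, 50))  # approximately 0.68
```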