Efficient estimation and correction of selection-induced bias with order statistics

Model selection aims to identify a sufficiently well performing model that is possibly simpler than the most complex model among a pool of candidates. However, the decision-making process itself can inadvertently introduce non-negligible bias when the cross-validation estimates of predictive performance are marred by excessive noise. In finite data regimes, cross-validated estimates can encourage the statistician to select one model over another when it is not actually better for future data. While this bias remains negligible in the case of few models, when the pool of candidates grows, and model selection decisions are compounded (as in step-wise selection), the expected magnitude of selection-induced bias is likely to grow too. This paper introduces an efficient approach to estimate and correct selection-induced bias based on order statistics. Numerical experiments demonstrate the reliability of our approach in estimating both selection-induced bias and over-fitting along compounded model selection decisions, with specific application to forward search. This work represents a light-weight alternative to more computationally expensive approaches to correcting selection-induced bias, such as nested cross-validation and the bootstrap. Our approach rests on several theoretic assumptions, and we provide a diagnostic to help understand when these may not be valid and when to fall back on safer, albeit more computationally expensive approaches. The accompanying code facilitates its practical implementation and fosters further exploration in this area.


Introduction
Since Lindley (1968) there has been a long history of Bayesian model selection using predictive performance comparison (see the extensive review by Vehtari and Ojanen, 2012).In this regime, we are interested in identifying a model from a collection of models whose out-of-sample predictive performance is best, either in the hope of generalising to future unseen data (Vehtari and Ojanen, 2012), or as a proxy for parameter recoverability (Scholz and Bürkner, 2022).When the number of predictors to consider is large, we might also be interested in achieving a smaller model that is capable of replicating the predictive behaviour of a larger model and use this instead for its improved interpretability, or to decrease data collection cost (Piironen et al., 2020).In finite data regimes, however, there can be significant uncertainty in differentiating between models of similar performance.
Following Piironen and Vehtari (2017a), consider a hypothetical utility for a model M  estimated on data  which we denote (M  , ).This estimate can be decomposed into two components: the true generalisation utility of the model, and the error associated with the estimate.The authors note that for safe model selection, we do not require that the utility be unbiased.If these utility estimates have high variance, however, then for a single data realisation, choosing the maximum utility model may lead to choosing a sub-optimal model in terms of generalisation utility.When we select a model which performs non-negligibly worse than the oracle asymptotically, we say that we have over-fit in selection.On top of this, the utility estimate for the selected model, say k, is typically optimistic: (M k , ) ≫ E  (M k ) , where we use E  (M k ) to denote the expectation over datasets.It is this over-optimism we refer to as selection-induced bias (already discussed by Stone, 1974).
While in general this over-optimism appears across a range of utility estimates and criteria (including common information criteria; Burnham and Anderson, 2002), we focus our discussion on Bayesian cross-validation (Geisser, 1975;Geisser and Eddy, 1979).
The kernel of this work is an efficient and actionable selection-induced bias estimation based on order statistics.Concretely, we: 1. discuss over-optimism when making model selection decisions based on noisy crossvalidated predictive metrics; 2. propose a lightweight selection-induced bias estimation and correction tool based on order statistics; 3. provide a diagnostic for our estimate to understand when it is liable to be unsafe, and more computationally expensive approaches may be warranted; 4. show empirically that correcting selection-induced bias can help expose non-negligible over-fitting; and, 5. apply our bias correction to real and simulated data examples where we are interested in choosing between multiple candidate models, and in forward search.

Relation to previous work
Predictive approaches to Bayesian model comparison have been qualitatively and quantitatively reviewed by Vehtari and Ojanen (2012) and Piironen and Vehtari (2017a) respectively.One popular method is choosing the maximum a posteriori (MAP) model, which reduces to choosing the maximum marginal likelihood model in the case of a uniform prior over the model space, and can be performed with Bayes factors (Jeffreys, 1998;Kass and Raftery, 1995).The computational instability of Bayes factors, and the difficulty they have sampling from the combinatorially exploding model space, however, is well-documented (Han and Carlin, 2001;Oelrich et al., 2020).It has also been documented in literature that under the assumption of orthogonal predictors, the most predictive model usually lies closer to the median probability model (Barbieri and Berger, 2004).In the context of variable selection, one might otherwise consider investigating the probability of inclusion of an individual predictor (Brown et al., 1998;Narisetty and He, 2014).Probability of inclusion approaches, including the median probability approach, have problems in case of collinearity as inclusion probabilities are diluted across the collinear predictors.Most contemporary work in this field has centred on defining intelligent priors over model space and model parameters (O'Hara and Sillanpää, 2009).However, Piironen et al. (2020) show how such approaches provide limited understanding of the predictive improvement expected by adding a predictor, and can fail when predictors are collinear.It has further been argued that priors and model selection exist as separate components to a Bayesian workflow (Bernardo and Smith, 1994;Key et al., 1999;Vehtari and Ojanen, 2012): a view we hold ourselves.
A fully decision theoretical Bayesian model selection procedure was first outlined by Lindley (1968), wherein the statistician is implored to build the largest, most complex model possible which encompasses all uncertainty relating to the modelling task, and then to identify smaller models capable of closely replicating its behaviour.Leamer (1979) motivated the use of the Kullback-Leibler (KL) divergence to quantify the distance between models' predictive distributions.Goutis (1998) and Dupuis and Robert (2003) went further, and proposed that one should directly project the reference model onto parameter sub-spaces so as to minimise the distance between the induced projected posterior predictive distribution and the posterior predictive distribution of the reference model: essentially fitting sub-models to the fit of the reference model.McLatchie et al. (2024) present practical recommendations for projection predictive model selection, and a more complete overview of recent advances therein.We are interested in cases where either the number of models is too small to justify the added complexity of projection predictive inference, or when we don't know how to perform the projection efficiently (e.g. in Student- observational family models, and some spatial models).
As a computationally efficient, intuitive, and non-parametric alternative, and one of the most common in non-Bayesian literature, statisticians compare models based on their out-of-sample performance with either cross-validation (CV; Geisser and Eddy, 1979;Shao, 1993;Vehtari and Lampinen, 2002), or various information criteria (Burnham and Anderson, 2002;Gelfand and Ghosh, 1998;Laud and Ibrahim, 1995;Marriott et al., 2001;Spiegelhalter et al., 2002;Watanabe, 2013).Such procedures are able to produce an almost unbiased estimate of a model's predictive ability (Watanabe, 2010), and have been reviewed extensively by Arlot and Celisse (2010), and by Gelman et al. (2014) in a Bayesian context.When using noisy predictive performance estimates such as cross-validation to select a best model, we may be over-optimistic in the utility estimate of the eventually selected model.This selection-induced bias has been discussed in depth by Ambroise and McLachlan (2002), Reunanen (2003), and Cawley and Talbot (2010).Cawley and Talbot (2010), in addition to demonstrating the potential risks of selection-induced bias, emphasised that this bias does not always lead to non-negligible over-fitting.In many cases, when one model is much better than others, or if there is sufficient regularisation (via priors in Bayesian context), the selection-induced bias is negligible even if the risk of selecting the "wrong" model is not small.We do not look to eliminate bias, but identify when it is non-negligible.In particular, we agree with past authors that selection-induced bias is not always a problem, and rather than a zero-one assessment of its existence, we endeavour to talk about negligible and non-negligible bias.
Very little work has dealt with bias correction in model selection, and those procedures that have been previously presented require intense computation.For instance, Stone (1974) proposed a procedure of nested CV (which they called the "double cross") which is insensitive to sampling and partitioning schemes.This procedure operates with complexity O ( 2 ) in terms of the number of folds : often fiendishly expensive.1An alternative was provided by Tibshirani and Tibshirani (2009) who made use of the bootstrap (Efron and Tibshirani, 1993) to correct CV estimates.While less expensive, the complexity nonetheless scales in O ( ) with  denoting the number of bootstrap samples.Our contribution is a lightweight alternative to these procedures, requiring only the sufficient statistics from the standard CV process and no additional computation.In other words, it has linear complexity O ().While one run of cross-validation forward search may require several hours of computation time for the experiments seen later in this paper, and thus nested cross-validation quickly becoming infeasibly expensive, the use of order statistics requires only a matter of seconds or minutes.The bootstrap is an improvement on nested cross-validation, but as previously mentioned will not scale as well as order statistics.
While our proposal is simple, and thus cannot be expected to produce perfect results, it can serve as a useful demonstration of the dynamics of selection-induced bias: when one model is sufficiently better than other candidates then we can proceed more comfortably, and when models are close we must take care and perform a correction.
We provide the statistician with a diagnostic for our procedure, and should one estimate it to not be safe for their task, then one can fall-back on safer albeit more computationally expensive approaches.While we don't claim that CV estimates are the best way to do the model selection, it can be sometimes computationally convenient and sufficiently good; our approach helps to understand when it is sufficiently good.

Alternatives to model selection While predictive performance-based model selection can
operate as a convenient non-parametric method of determining whether there is a significant difference between models, simply investigating parameter posteriors can sometimes provide sharper resolution to the problem (Gelman et al., 2014(Gelman et al., , 1996;;Wang and Gelman, 2015).When the model parameters are interpretable, causally consistent, nested, and there is no collinearity between predictors, then we can often investigate their posteriors directly with more success.Since the predictive distribution integrates over the uncertainty over all parameters, it may introduce more variance in the estimator than is necessary.For instance, we could investigate only the posterior of the extra parameter in the larger model and decide whether or not to include this parameter in our analysis based on whether its posterior differs greatly from the prior.These procedures, as we have already discussed, however, become more difficult given predictor collinearity.
BMA is an attractive initial first consideration, and has been shown by Le and Clarke (2022) to even be better in terms of prediction than model selection under some conditions.The good performance of BMA in the predictive sense has been covered previously by Hoeting et al. (1999) and Raftery and Zheng (2003).Bayesian model averaging weights, however, can be expensive to compute, and Yao et al. (2018) introduce pseudo-BMA weights as a lightweight alternative (we discuss the connection between pseudo-BMA weights and model selection based on elpd in Appendix A).The authors discuss that when none of the models considered is the true model, this method can be flawed, and motivate stacking of the models' posterior predictive distributions as more robust alternative.This consists of identifying model weights such that the combination of densities minimises the LOO-CV mean-squared error.More recently, Yao et al. (2023) have extended this paradigm to include more exotic model combinations.

Structure of this paper
We begin by discussing how model selection decisions themselves are liable to induce over-confidence in the expected performance of a selected model in Section 2. We demonstrate how the magnitude of selection-induced bias grows with the number of candidate models in Section 3, wherein we present our first contribution: a novel order statistic-based tool to determine when a model selection decision is liable to induce non-negligible bias.In Section 4 we demonstrate how, while a single decision might introduce negligible bias, when we make multiple sequential decisions, and with specific reference to forward search, the bias is likely to compound.We then extend our first contribution to correct for selection-induced bias in forward search, and use it as a stopping rule.We manifest the reliability, computational tractability, and flexibility of our correction on both simulated and real-world data exercises, wherein we compare our stopping rule with traditional alternatives.2

Predictive model assessment, selection, and comparison
The focus of our paper is on developing a framework for modelling the predictive difference estimates between models, and specifically identifying when models are sufficiently different to safely make a model selection decision based on sufficient statistics.
In the case of predictor collinearity the inclusion probabilities are diluted, and it is often safer to select models based on their predictive performance.We proceed considering the log score as the default utility best suited for evaluating the predictive density (Bernardo and Smith, 1994), and compute it with cross-validation (Geisser and Eddy, 1979).Vehtari et al. (2017) define the expected log point-wise predictive density (elpd) as a measure of predictive performance for a new dataset of  observations:

Predictive model assessment
where   ( ỹ ) denotes the density of the true data-generating process for ỹ , and   ( ỹ | ) = ∫ (LOO-CV) estimate (Geisser and Eddy, 1979), (2) If the data  are conditionally independent then Gelfand et al. (1992), andlater Gelfand (1996), demonstrate how this quantity can be efficiently estimated with importance sampling by reweighting posterior samples   , for  = 1, . . .,  indexing samples from the posterior, with weights leading to the importance sampling estimate (4) Vehtari et al. (2017) further demonstrate how this estimator can be stabilised and self-diagnosed with Pareto-smoothed importance sampling (Vehtari et al., 2022).Here we focused on LOO-CV, but our approach can also be used in the context of other cross-validation approaches, for example leave-future-out (Bürkner et al., 2020;Cooper et al., 2023) and leave-one-group-out (Merkle et al., 2019).
As well as the LOO-CV elpd point estimate, we can estimate the uncertainty in the elpd estimate under the normal approximation with standard error (SE) given by (5) where we write elpd LOO, (M  | ) = log   (  |  − ).Sivula et al. (2022) discuss under which conditions this approximation is accurate.Watanabe (2010) demonstrated the equivalence between cross-validation and the widely applicable information criterion (WAIC).However, in finite data regimes the error in the WAIC can be non-negligible.As such, we don't consider the latter any further.

Selection-induced bias
We can then use these predictive estimates to select one model from a pool of candidates.Even if each predictive estimate is almost unbiased for each model, however, when we use these for model selection the estimate for the selected model may incur some selection-induced bias.
Selection-induced bias emerges from selecting a model from a collection of models based on a noisy CV predictive estimate which could have been shown by the true elpd to be sub-optimal with respect to the models considered.4For the rest of this paper, we suppose that we choose the model with the highest LOO-CV elpd point estimate from a collection of  candidate models, and define selection-induced bias concretely as bias M 1 , . . ., We then wish to understand under which situations choosing the model with the best LOO-CV elpd estimate may lead to non-negligible over-optimism in expected LOO-CV elpd (selection-induced 4Most of the noisiness of these estimates is borne from finite data, and not the importance sampling scheme employed. bias), and how bad this bias is as a function of model similarity and the number of candidate models.
If the bias is small compared to the SE of the normal approximation to the elpd LOO estimate (Equation 5), we say that there is negligible selection-induced bias.However, Sivula et al. (2022) show that the standard deviation of the normal approximation may be under-estimated when the models are very similar.As such, extra care should be taken when the absolute difference in elpd between models is small (smaller than 4, say; see Appendix A).Depending on the application, there may be additional information pertaining to what is a practically relevant magnitude of the difference between model performances.
Rather than measure the performance of any model selection decision in terms of whether or not we choose the correct model (since the notion of a true model is difficult to justify in practice), we instead discuss how far we are from selecting the oracle model (in terms of minimal excess loss in expectation, as defined by Arlot and Celisse, 2010, we will return to this discussion in Section 3).In a word, we interest ourselves in the expected selection-induced bias injected by a given model selection decision, aim to minimise this as much as possible, and correct it when it is non-negligible.When we expect a model selection decision to induce only negligible amounts of selection-induced bias, we consider it a "safe" decision.

When is one model clearly better than another?
We begin by illustrating the  = 2 regime, in which we have only two candidate models.A typical question arises: is one of these models better than the other?These two models may be achieved with different priors, be nested (as in one iteration of forward search, say), or may be completely non-nested.Not only are we interested in discerning when it is safe to select one of the two models over the other, we would also like to be able to say something about the potential danger associated with simply selecting the better model in terms of LOO-CV elpd.In this section, we firstly present the probability of selecting the wrong model under some oracle knowledge, before discussing the selection-induced bias arising from this decision when the models are indistinguishable.We interest ourselves later on in this paper primarily in discerning when models are indistinguishable in terms of their predictive performance.
Consider two models, M  and M  , and suppose we know that M  is the oracle model.We follow the definition of Arlot and Celisse (Section 2.2; 2010) in that the oracle model is the model having the minimum excess loss in expectation over multiple datasets from a collection of models.In particular, it is possible for the oracle model to differ from the model containing the truly relevant predictor set (for simplicity we call that the "true model") in cases where that model can over-fit.In general, we don't always need to choose the oracle model; the excess loss can just be negligible.It is often more useful to discuss this difference in terms of a continuous scale, rather than focus on the (often unrepresentative) case of either selecting the oracle model or not.
The structure of these two models does not influence model selection since we only consider their elpds.Denote the difference between the models' theoretical elpds as As previously mentioned, we are most interested in the situation where no model is truly better than any other, that is that Δelpd M  , M  |  = 0.When this is the case, the estimated Consider the following (purely illustrative) setting: assume the distribution of elpd point estimates, Δ elpd LOO M  , M  |  , is a Gaussian centred on the mean  and having variance  2 = 1 (similar to the setup employed by Sivula et al., 2022).From this distribution, we "sample" one observation of the LOO-CV elpd difference, denoted5 The left plot shows illustrative LOO-CV elpd difference distributions betwee the true model M  and alternative model M  , as described in Equation 9.If we observe a LOO-CV estimated elpd difference the region coloured in red, then we are at risk of selecting the wrong model.The right plot shows this probability across different hypothetical mean values as computed in Equation 10.
When this observed elpd difference Δ elpd , is negative and we subsequently select M  over M  based on the LOO-CV estimate then we are at risk of inducing non-negligible selection-induced bias.Naturally, since M  is the oracle model, the true elpd difference is non-negative.
In this toy example, we can compute the theoretical probability of selecting the sub-optimal model, with the knowledge that M  is the oracle model, as where (Δ elpd , ) is the oracle distribution of the elpd difference between the two models (omitting the dependence on the data  for brevity).We visualise this theoretical illustration in Figure 1, where we let the mean slowly grow away from zero, keeping the standard deviation the same.The probability of selecting the wrong model then is greatest for a given  when the two models are most similar ( closest to zero), and decays more slowly with an increasing mean of the elpd difference distribution.
The selection-induced bias is likely to be non-negligible when it is much larger than the variation in the LOO-CV point estimates.In terms of Figure 1, if the selection-induced bias is less than the illustrative  2 = 1, then it is less critical to model selection than if it were much larger.In a word, selection-induced bias is said to be negligible if it is dominated by the variation in the LOO-CV point estimates.
Nonetheless, in our experience (outwith this paper) we are able in most cases to simply choose the more complex model fitted with reasonable priors since the bias is negligible in the  = 2 case.However, as the number of effectively equivalent models grows (e.g., if the number of candidate models grows exponentially with the number of observations , or if the number of predictors  is much larger than log() as described by Gelman et al., 2013) the potential bias grows too.While selection-induced bias remains negligible in the two model case, it can balloon as  grows.The specifics of the rate shown in Figure 2 is described later in this paper.This many model regime is the more interesting case.

Bias grows with the number of candidate models
In the regime of many candidate models, when none is truly more predictive than any other, the probability of one model appearing better than the rest grows with the number of models being compared.This is true even if independent test data would show that there is in fact no difference from it to the true model (or that it is in fact worse).As this probability grows, so too does the magnitude of selection-induced bias.In this section, we first show how these high risk regions behave as the number of models grows, and propose a selection-induced bias estimation scheme based on order statistics to identify when the bias is no longer negligible.

Empirical demonstration
We fit a series of nested linear models to  = 100 data points sampled from where  is the  ×  identity matrix, and 0 ∈ R  is the -dimensional zero vector.The data   = (1,  ,1 , . . .,  , −1 ) are  dimensional, with  ,1 , . . .,  , −1 iid ∼ normal(0, 1), and the true regression coefficients take the form  = (1,  Δ , 0, . . ., 0) ∈ R  .The variance of the residuals, , is chosen to vary to the signal-to-noise with  Δ while keeping the marginal variance constant.In particular, we only consider  Δ < 1, and set  2 = 1 −  2 Δ .We will then fit a baseline model comprising only the intercept term and denoted M base :   |  1 ,  ∼ normal( 1 ,  2 ).We compare this to the remaining  − 1 one-predictor models of the form for  = 2, . . .,  so that model M 2 alone contains the two predictor set, and the scale factor  is treated as an unknown parameter across all models.Naturally, letting  Δ = 0 collapses all models to be equivalent.6 We vary the true  Δ an  distance from zero.For each value of  Δ , we investigate the LOO-CV elpd difference point estimates of the  − 1 models from the baseline model, and identify the model with the highest (best) cross-validated score.This is then compared to its test elpd, and to that of the model with the known true predictor set (which we call the "true model", even if we do not use the true values of the predictors).We show this for different numbers of candidate models  and along  Δ in Figure 3.As previously mentioned, the selection-induced bias (difference between the LOO-CV elpd and the test elpd of the selected model) is negligible when  = 2, but grows with .For  = 100, it would be unsafe to make a decision based on LOO-CV elpd point estimates up until the theoretical  Δ grows above ≈ 0.3, as the selection-induced bias is 6We fit the model using independent normal(0, 10 2 ) priors over the regression coefficients, and a half-normal prior over  2 with mean zero and unit variance.non-negligible.There is no formal threshold for non-negligible bias, but in Appendix A we show how the difference between elpd estimates can be connected to the notion of non-zero weights in model averaging.The magnitude of the bias also grows with , seen in the gap between the LOO-CV elpd and test elpd of the selected model.
When the true value of  Δ is low, the model containing the known true predictors does not out-perform the baseline model in terms of test elpd.It is in these situations where the difference between the oracle model and so-called true model becomes apparent.In this case, the baseline model has the lowest excess loss in expectation over data realisations.When the true model performs worse than the oracle model, but we proceed to select the former, we have over-fitting.This excess loss from the oracle model is the difference from the baseline model to the selected model's test elpd.As  Δ grows sufficiently large to consistently select the true model, this tends to zero, as too does selection-induced bias.The region between the LOO-CV and test elpd of the selected model is the selection-induced bias, while the region between the oracle elpd and the test elpd of the selected model is over-fitting.In this example, the magnitude of over-fitting is negligible across all values of  Δ .
Since there exist situations where the true model performs worse than the oracle model, we find some justification for our previous discussion in Section 2.2.Namely, we should be more interested in identifying models which are close to the oracle model rather than retrieving the true model (since the two are not equivalent).Indeed, finding the true predictors is not useful if it worsens prediction.These true predictors we "fail" to select are likely to have a coefficient magnitude below the detection limit, as illustrated when we increase  for a small (but non-zero)  Δ .
Knowing then that the risk of selection-induced bias exists, and not knowing the true value of  Δ , we presently concern ourselves with producing a selection-induced bias estimation procedure geared towards identifying high risk regimes in the presence of many candidate models by modelling the distribution of LOO-CV elpd difference estimates.

Our proposal: modelling the elpd differences
Suppose now that we have  candidate models, none of which is better than any other.The best LOO-CV elpd difference point estimate from a baseline model may appear, purely by chance, to be greater than zero.By modelling the underlying distribution of the elpd difference point estimates as iid draws from a Gaussian distribution centred on zero (so that none is truly better than the baseline model) with some variance  2  , Δ elpd LOO ∼ normal(0,   ),  = 1, . . ., , we can leverage order statistics to produce a selection-induced bias estimate.Blom (1960) posited an approximation to the expected maximum from  iid samples from a standard normal distribution, denoted  ( ) (later developed by Harter, 1961;Royston, 1982): The formulation allows for an approximation parameter, .In the limit as  → ∞, the expectation of the maximum order statistic lies within the interval between levels  = 0.39 and  = 0.5.As such, for maximum safety we will use  = 0.5 herein.
In the presence of  candidate models, they are practically equivalent if the maximum LOO-CV elpd difference point estimate to the baseline model is less than We estimate σ as the maximum likelihood estimator of the standard deviation of a half-normal distribution fit to the upper half-tail of elpd difference point estimates (greater than their median, m ), σ = 2 We denote by 1  () the indicator function over a set  ⊂  at  ∈ .This is not yet an estimate of the selection-induced bias, but the highest LOO-CV elpd difference expected from a collection of  models, none of them truly more predictive than any other.In particular, this is an estimate of the expected distance between the zero line and the red line in Figure 3.We discuss later in Section 4.1 how one might use this as the basis for such an estimate of the selection-induced bias (the difference between red and black line in Figure 3).We fit a half-normal distribution to the upper half of the observed elpd difference point estimates instead of a fitting a full normal distribution over all of them because (even if none of them is truly relevant) the distribution of elpd difference point estimates is likely to be slightly skewed.A similar result has previously been noted by Sivula et al. (2022), who showed the full elpd difference distribution (not just the point estimates) to be more skewed the more similar the models being compared are.We interest ourselves only in the right tail of the distribution of point estimates as it is the more relevant to the maximum order statistic.We discuss the validity of this stabilisation technique and how to diagnose it in Section 3.3.
In Figure 4, we fix  Δ = 0 and vary  in the experiment design previously described by Equations 11-12.For each value of , we investigate the best observed LOO-CV elpd difference to the baseline model.Since we have fixed  Δ = 0, we know that no model is theoretically better than any other.As such, these empirical LOO-CV elpd differences are just random noise.We repeat the experiment 100 times for each value of  and show the mean of the empirically observed best LOO-CV elpd difference in each regime compared to the expected best observed LOO-CV elpd difference as computed according to Equation 15.This theoretical boundary grows quickly for  less than 10, before slowing.Our order statistics-based equation closely matches empirical observations in this example.
A second empirical demonstration of Equation 15 is provided in Appendix B with reference to a real data study on the effect of a poverty reduction intervention on infant brain activity.

Diagnosing the theoretic assumptions
We have made a number of assumptions thus far in our analysis.In particular, we have assumed that the elpd difference distributions of models of equivalent predictive performance are: 1. normally distributed around zero (or at least half-normal above zero); and, 2. independent and identically distributed.It is known that the empirical elpd difference distributions are slightly skewed, with heavier right tails (this was seen in empirical results not shown here, and has previously been discussed by Sivula et al., 2022).This behaviour becomes more pronounced as models become closer together (through the use of regularising priors for example), and such correlation may appear as models make more similar predictions.It is possible then that the distribution of elpd difference point estimates (which we interest ourselves in) may also be slightly skewed.As such, the normal approximation may not be representative of the underlying distribution, and the half-normal approximation is better able to capture the characteristics of the right tail.Indeed, it is this right tail that is of most interest to us, as it contributes most to the maximum order statistics.
It would be possible to fit a sup-Gaussian distribution to this tail with more parameters, e.g. a half-Student-, but as shown empirically, the half-normal works sufficiently well.Since it has fewer parameters we expect it to be more robust, and from the half-normal we get information from the bulk of the distribution.For example we could also have fit a Pareto distribution to the tail as done by Vehtari et al. (2022), or a Student- distribution.We would, however, then have to estimate more moments of the distribution, which we found to be unstable in our experiments.If the utility distributions are truly skewed, we expect our normal approximation to inject some bias, but if we use a more flexible approximation there would be higher variance.In general there is no obvious choice of optimal approximation to the utility distributions, and indeed this represents an interesting direction of future investigation.
We are also able to diagnose when our half-normal approximation is likely to fail: when the right tail of the empirical distribution has infinite variance.We follow Vehtari et al. (2022) and first identify a tail cut-off point of the empirical elpd difference point estimate distribution in line with Scarrott and MacDonald (2012), and subsequently fit a generalised Pareto distribution to this tail.Zhang and Stephens (2009) present an efficient Bayesian procedure for estimating the parameters of the distribution.In particular, the k parameter of the generalised Pareto distribution is related to its number of fractional moments.We follow the rule of thumb presented by Vehtari et al. (2022, Section 3.2.4., Equation 13) in establishing the minimum required value of k to be confident that the right tail has finite variance when comparing 10 or more models.7 When this Pareto k value is too large, it suggests that the elpd difference point estimates are likely from a distribution with infinite variance.It is at this stage that one should fall back on safer, albeit more computationally expensive approaches: the double-cross (Stone, 1974) or the bootstrap (Tibshirani and Tibshirani, 2009).
Elsewhere, we know that the distribution of the maximum order statistic (of which we interest ourselves only in the mean) is slightly skewed.Because of this skewed-ness, it is possible that the mean alone (which we have denoted  ( ) ) under-estimates the maximum of LOO-CV elpd difference estimates.
When the model space grows too large to consider each possible model at once, and if the models can be nested, then one might favour some efficient search through the model space.Such searches come with their own risks.We presently move from the realm of single decisions towards sequential model selection.

Compounding the bias through forward search
While selection-induced bias might remain negligible when we need only select one model from a group of models once, when we compound decisions, the magnitude of the bias is likely to compound too.For instance, in forward search, we begin with a baseline model (usually the intercept-only model) and then consider all models containing one predictor and the intercept term.Having fit all so-called "size-one" candidate models, we choose the one with the highest LOO-CV elpd point estimate.We now perform another forward step, and fit all models containing the intercept, the predictor chosen previously, and a second additional predictor.Once more we select the size-two model based on the highest LOO-CV elpd point estimate.We repeat this either until either some stopping criterion is met, or we have exhausted all available predictors.
We can easily implement various other search heuristics, such as exhaustive search (Galatenko et al., 2015), backward search (Nilsson, 1998), and stochastic search (George and McCulloch, 1993;Ntzoufras et al., 2000).However, we proceed with forward search alone as it affords the possibility of an efficient stopping rule (Harrell, 2001) and intuitive compounded bias estimation (which we present in Section 4.1).
At the beginning of this search, it is likely that there exist at least some predictors that improve on the predictive performance of the baseline model, but there comes a point after which the addition of further predictors affords the statistics no theoretical predictive performance.We will refer to this as the "point of predictive saturation" herein.We wish to identify this point, and estimate the bias induced by continuing the search beyond it.Under appropriate prior consideration, the full model is considered the oracle model.If one witnesses a sub-model having better test performance than the full model, then we recommend they reconsider their priors.In practice, we do not need to identify the oracle model size, only a sub-model having negligible excess loss compared to the oracle model.
When we observe a separation of the LOO-CV and test elpd along the search path, we can conclude that there is non-negligible selection induced bias in the performance estimates: purely due to the predictor ordering achieved by LOO-CV based search, we are led to believe that some sub-models can outperform larger models.This is seen by the elpd estimate at the bulge being greater than that of the final full model.In practice, McLatchie et al. (2024) argue that should one observe such a bulge in the forward search path, then one should cross-validate over search paths (something akin to the double-cross), but only up to the model size inducing this bulge.This is because cross-validating over forward search paths can be expensive, and we understand that the model size of maximal selection-induced bias is likely greater than the model size of theoretic predictive saturation.The dynamics of forward search can be seen in the relationship between the predictive criterion (the elpd, say) and the model sizes along the so-called "search path".The plot typically shows an increasing elpd curve, indicating that the addition of further predictors improves predictive performance.We expect the curve to flatten after a given model size, indicating that the addition of further predictors does not improve the criterion any longer: the point of predictive saturation.However, it is possible that in the finite data regime, and under model misspecification, we instead witness a bulge along this search path, such as that shown in Figure 5.The test curve in Figure 5 reaches the point of predictive equivalence and then dips down with increased model size.This is not necessarily selection-induced bias, but could also be over-fitting.Indeed, we aim to correct selection-induced bias to identify non-negligible over-fitting in forward search.
In order to correct selection-induced bias and estimate the location of the point of predictive saturation (and the bias induced by continuing beyond it) we introduce the following application of our proposal from Section 3.2.

Correcting for selection-induced bias in forward search
We can estimate the selectioninduced bias at model size  in a forward search path, and correct the LOO-CV elpd difference estimate, denoted Δ elpd where  ( )  σ is as defined in Section 3.2.Intuitively, if we detect non-negligible selectioninduced bias, then we correct the LOO-CV elpd difference by an estimate of the selection-induced bias induced by step .
We produce this estimate of selection induced bias, denoted bias ( ) , building on Section 3.2 as bias ( ) = 1.5 ×  ( )  σ .The nature of this estimate stems from Watanabe (2010, Table 3), who discussed how cross-validated scores and the average Bayes generalisation loss can exhibit strong negative correlation (albeit demonstrated with only one example).Theoretically this is due the removal of an (influential) observation from the training set skewing the posterior predictive distribution away from the true data-generating distribution, and its re-appearance in the test set seeming like an outlier.As such, the correction of LOO-CV elpd estimates warrants not only a correction to the median (the difference between red line and zero line in Figure 3), but a negatively-correlated correction (the difference between red and black lines in Figure 3).In a word, the corrected elpd difference value should be a reflection across the median of candidate elpd differences.The choice of magnitude of 1.5 is reflective of the magnitude of the negative correlation of approximately −0.8 noted by Watanabe (2010).However, since this number was achieved only over one example, we understand that there can be instances where this correlation can be closer to 0 or −1.As such, we show repetitions of some later experiments using values of {1, 1.5, 2} when estimating the bias in Appendix C.
Cumulatively summing these corrected difference estimates, elpd starting at the base LOO-CV elpd estimate at model size 0, denoted elpd LOO , allows us to estimate the bias induced by continuing beyond the point of predictive saturation.As such, our proposition comprises two parts: an order statistics-based estimation of selection-induced bias (from Section 3.2), and an associated correction term.

A simulated example
We can visualise the performance of our correction through a simple simulated data experiment: we generate data according to where the number of predictors is set to  = 100, normal(, Σ) denotes a -dimensional normal distribution with mean vector  and covariance matrix Σ.The matrix  ∈ R ×  is block diagonal, with each block having dimension 5 × 5.Each predictor has mean zero and unit variance and is correlated with the other four predictors in its block with coefficient , and uncorrelated with the predictors in all other blocks.We will investigate regimes of zero and high correlation:  = {0, 0.9}.Further, the weights  are such that only the first 15 predictors influence the target  with weights ( 1:5 ,  6:10 ,  11:15 ) = (, 0.5, 0.25), and zero otherwise.We set  = 0.59 and  2 = 1 to fix  2 = 0.7.We simulate  = {100, 200, 400} data points according to this data-generating process (DGP).Across all six  and  regimes, we perform forward search based on the maximum LOO-CV elpd point estimate of the candidate model, fitting each model using independent Gaussian priors over the regression coefficients (which we understand to be liable to over-fit), and comparing them to a reference model fit with the sparsity-inducing R2D2 prior (which is discussed in more detail in Section 4.3 and Appendix D).
In Figure 6(a) we visualise the performance of our bias correction across these six regimes in terms of the mean log predictive density (mlpd) computed on ñ independent test observations, ỹ1: ñ, as which admits a LOO-CV difference estimate similar to the elpd (Piironen and Vehtari, 2017a).We do not correct the LOO-CV differences beyond their LOO-CV bulge since at this point it is already clear that we have over-fit the data.We observe from these plots two positive results and one negative: 1. the corrected mlpd does not greatly surpass the mlpd of the of the reference model, contrary to the LOO-CV mlpd; 2. the corrected mlpd estimates the point of predictive saturation close to the true point of predictive saturation on test data; however, 3. the corrected mlpd of the search path can under-estimate the true selection-induced bias slightly in cases of severe over-fitting (most notably when  = ).This suggests that a simple order statistic model is able to represent selection-induced bias well enough to be useful.Naturally, the quality of our correction improves with data size and the number of candidate models.
The approximation performs best in the larger data settings.It would further appear that, in this example, once  > 4 the degree of selection-induced bias is negligible.This could be due in part to the negative correlation between Bayes cross-validation and the average Bayes generalisation loss (Watanabe, 2010): even if we know the underlying distribution of LOO-CV elpd estimates, the asymptotic theoretical elpd distribution is likely negatively correlated with it (and worse in absolute value).Such correlation is naturally exaggerated in smaller data regimes, as the relative influence of data points grows, and our proposal under-estimates the true selection-induced bias.Despite under-estimating the bias slightly in the  =  case, it appears to correct enough of the selection-induced bias to accurately and efficiently identify over-fitting across all regimes.
As previously mentioned, it is not clear how one would motivate any particular selectioninduced bias correction beyond what we have proposed (see Section 6 for more discussion, and Appendix C for a repetition of the experiments using different levels of correction), and we rather recommend that one endeavours instead to mitigate against non-negligible over-fitting in the first instance.One such way is with a more principled application of the prior.Gelman et al. (2017) reason that prior distributions in Bayesian inference should be considered generatively in terms of the potential observations consistent with the prior, and predictively in terms of how these observations align with the data we truly collect.Cawley and Talbot (2010) also discuss the benefits of using regularising techniques to avoid over-fitting in selection; generative priors are such a regulariser.

Reducing selection-induced bias with priors
Joint shrinkage priors are a modern class of such generative priors which encode the statistician's belief on a predictive metric and filter this information through the model structure to define prior definitions on the model parameters.For instance, the R2D2(M2) prior (Aguilar and Bürkner, 2023;Zhang et al., 2022) encodes a prior over the model's  2 which is then mapped to a joint prior over the regression coefficients and the residual variance.By positing a lower expected predictive performance in terms of explained variance than what is implied by independent Gaussian priors, we shrink irrelevant regression coefficients more.This posterior shrinkage and reduced posterior uncertainty then naturally feeds through to the predictive distribution and thus the LOO-CV elpd estimate in Equation 2. This means that candidate models make more similar predictions, the empirical distribution of elpd difference point estimates is more concentrated, and the magnitude of selection-induced bias is reduced.See Appendix D for more discussion on the specifics of the priors and the hyper-parameters used throughout this paper.
Repeating our example from Section 4.2, but using a more regularising generative prior, we can reduce the degree of selection-induced bias along the search path.In Figure 6(b), we find that once more our heuristic identifies the correct model size after which no more predictors truly aid predictive performance, and recovers the shape of the test curve.We do, however, under-estimate the bias slightly in the  =  regimes.This is likely due to the models making more similar predictions, and thus our proposed Equation 15 will likely under-estimate the magnitude of the bias.However, the identified points of inflection are similar under both priors, reassuring us that our selection-induced bias correction is stable across different bias regimes.
Priors are not the only way one might consider reducing selection-induced bias in forward search; projection predictive inference is one information-theoretic approach to the task that has been proven to be very effective.

Projection predictive inference
We will later compare the model sizes selected following our bias correction to projection predictive inference.The projection approach is known to  greatly reduce the variance in the relative performance estimates, leading to much smaller selection-induced bias (see, e.g.Piironen and Vehtari, 2017a).Projection predictive inference was first introduced concretely by Goutis (1998) and Dupuis and Robert (2003), but follows the decision-theoretical idea presented previously by Lindley (1968).In this paradigm, one begins with a large reference model one believes to be well-suited to prediction and subsequently constructs small models capable of approximately replicating its predictive performance; essentially, fitting sub-models to the fit of the reference model.
We call the model we project the reference model onto the "restricted model" or "sub-model", and denote its parameter space by Θ ⊥ (which may be completely unrelated to the parameter space of the reference model, Θ * ).Further denote the predictor-conditional predictive distribution of the restricted model on ñ unobserved data observations ỹ = ( ỹ1 , . . ., ỹ ñ)  by ( ỹ |  ⊥ ), for  ⊥ ∈ Θ ⊥ and  ∈ {1, . . ., }, and that of the reference model by ( ỹ |  * ).Then we compute the projection draw-wise on  ∈ {1, . . ., } posterior draws from the reference model as For implementation details see the papers by Catalina et al. (2021Catalina et al. ( , 2020)), Weber and Vehtari (2023), and Piironen et al. (2020).McLatchie et al. (2024) outline further workflow steps to minimise the computational burden while achieving robust results.Projective forward search begins by projecting the reference model onto the intercept-only model to establish the root of the search tree.Subsequently, the reference model is projected onto size-one models, from which we select the model whose posterior predictive distribution is most similar to the reference model's posterior predictive distribution by the KL divergence in line with the work of Leamer (1979).The search then continues as before.Because of the use of the KL divergence over LOO-CV, our order statistics-based bias estimation and correction can not be immediately translated to work with projection predictive inference.It is the disjoint between the KL divergence and the predictive performance noted by Robert (2014), however, which is partially responsible for the reduced selection-induced bias in projection predictive inference.
As such, if the reference model (and sub-models) can be constructed in such a way that conforms with the requirements of projection predictive inference, then we recommend one begins with this.For observational families whose KL divergence can not be easily computed, or for model structures that can not be easily decomposed into additive components, cross-validation forward search (as described in the beginning of Section 4) can be a useful alternative.

Real world data
To manifest the performance of our selection-induced bias correction beyond simulated settings, we compare it with classical stopping criteria, including projection predictive inference, across four publicly available real world datasets: one regression task (Redmond and Baveja, 2002) and three binary classification (Gorman and Sejnowski, 1988;Krzysztof Cios, 2001;Sigillito et al., 1989) summarised in Table 1 (in Appendix D) after any necessary pre-processing.8For the regression task we fit a linear regression reference model with a Gaussian observation family, and for the binary classification a Bernoulli generalised linear model as implemented in brms (Bürkner, 2017) using the probit link function.For each dataset, we perform forward search based on LOO-CV elpd, fitting sub-models with independent standard Gaussian priors over the regression coefficients in the first instance, and then again with sparse priors (the R2D2 for regression tasks, and the regularised horseshoe for the classification tasks; see Appendix D for details on the hyper-parameters used for each task).We split all data into 10 cross-validation folds, leaving 10% of the data for testing in each fold.In Figure 7 we show the cross-validated and test search paths for each of these, along with the corrected search paths as computed using Section 4.1.
Across all real data experiments, our corrected LOO-CV mlpd did not significantly exceed that of the reference model, except for the heart data.This indicates our correction effectively addresses selection-induced bias without falsely suggesting better performance than the reference model.In some cases, our correction appeared to under-correct for small model sizes but eventually recovered the true test mlpd.In the sonar data we witnessed the opposite: our proposal 8The crime, ionosphere, and sonar data are freely available from the UCI machine learning benchmarks database: https: //archive.ics.uci.edu/ml/index.php; the SPECT heart data are available from: http://odds.cs.stonybrook.edu/.Pre-processing is as described by Piironen and Vehtari (2017a).over-corrected bias.Even when our correction over-or under-estimates bias, it provides valuable insights into test mlpd behaviour.We presently extend our selection-induced bias correction to produce a stopping criterion: we select the model size inducing the maximum corrected mlpd along the search path as computed by Section 4.1.We compare this to simply stopping at the point of maximum LOO-CV (which we call the "bulge" rule), and projection predictive inference.Again, we do not look to identify the oracle model size, but rather we hope to select a model size having negligible excess loss compared to the oracle model.Alongside the bulge, we consider the so-called 2 rule, which proposes that one should select the smallest model that is essentially indistinguishable from the model with the best LOO-CV performance, as judged by the two standard error simultaneous confidence interval.
In projection predictive model selection, we first achieved a search path inline with the previous discussion in Section 4.4.We then choose the smallest model size  whose (projected sub-model) elpd difference point estimate to the reference model (for which we use the full model fitted by standard Bayesian inference) is less than four.This is in line with the recommendation of McLatchie et al. (2024, and discussed more in Appendix A).We compare the model sizes chosen and their stability across forward search cross-validation folds in Figure 8.We make three primary observations: 1. projection predictive inference selects smaller model sizes than the most naïve approaches and does so stably; 2. our correction achieves slightly smaller model sizes than the projective approach with similar mlpd, and requires fewer iterations of forward search than the bulge and 2 rules (while achieving better mlpd than the latter two); and, 3. regardless of model size selection heuristic, the use of shrinkage priors results in smaller and more stable model size selection.Across all data, the 2 rule chooses the smallest sub-model sizes and the bulge rule chooses the largest.The model sizes selected by our correction are, by contrast, slightly smaller to those selected by projection predictive inference.There is more variation in the model size selected by the bulge rule.In fact, model size selection is generally more variable when using the independent Gaussian priors irrespective of the heuristic.Our correction is similarly stable to projection predictive, which is known to be very stable under the KL forward search heuristic, providing confidence in our own tool.Indeed, the good performance of projection predictive inference has been extensively covered by Piironen and Vehtari (2017a).Our correction identifies models with better predictive performance than the more naïve 2 and bulge rules, and with better stability.The model size and mlpd difference estimated by our correction do not seem to be sensitive to the prior, unlike LOO-CV bulge model sizes; this is most prevalent in the heart data where the use of Gaussian priors results in vastly reduced predictive performance when using the bulge rule.Sparsity-inducing priors result in smaller and more stable model sizes being chosen across all data examples.This is likely related to our previous discussion on generative predictive priors: since they regularise in favour of predictive performance, they result in forward search paths that are less variable and less likely to introduce selection-induced bias (although may lead us to slightly under-estimate bias).Their predictive regularisation means that we are likely to more quickly and more stably identify predictive predictors, and ultimately identify more predictive predictor subsets.
The importance of identifying small models is particularly clear in this forward search example: when we must fit all models up to the bulge model size, this requires approximately 30 2 = 900 model fits in the sonar and crime case studies.When these models are large, this can mean days of computation time.While the 2 heuristic can identify small models with comparable predictive performance, it still requires fitting up to the bulge model size and then working back.In a word, it does not reduce the computational cost.As such, our bias correction is particularly attractive to those who are able to implement online stopping rules.
It would also have been possible to construct a stopping rule based on the inclusion of zero in the two standard error range of the LOO-CV elpd difference estimate.However, such heuristics struggle in the case of many weakly-relevant predictors.In particular, they are liable to identify too-early stopping points and achieve sub-optimal out-of-sample predictive performance when many predictors have weak (but non-zero) predictive information (e.g. in the ionosphere example).This is further discussed in Appendix E.
While our heuristic was not motivated from the perspective of model size selection, we find that even correcting just some of the selection-induced bias along forward search paths can reveal the behaviour of test performance.Using this corrected LOO-CV mlpd as a proxy for test mlpd is reasonable across these real data examples.As such, model size selection likely does not depend greatly on the assumed asymptotic correlation between LOO-CV and Bayes generalisation loss which we discussed in Section 4.1.

Recommendations
Cross-validation of predictive metrics is a common non-parametric method of making model selection decisions.It is particularly well-suited to cases when robust alternatives such as projection predictive inference are unavailable.Given the uncertainty present in LOO-CV estimates, it is possible that they are not always sufficient to make model selection decisions, or that they might encourage an incorrect decision based purely on noise.We discussed the risk of selecting between two candidate models (Scenario 1), presented a selection-induced bias estimation scheme based on order statistics in the case of selecting many candidate models (Scenario 2), and leveraged it to build a lightweight bias correction method in the forward search setting (Scenario 3).
Scenario 1: the two model case In the case where we compare two models only, we can safely make a model selection decision if the elpd difference point estimate between them is greater than four in absolute value (see Appendix A for more discussion).Should the elpd difference point estimate be less than four in absolute value we have three options: (1) if the models are not nested, combine them by model averaging or stacking; (2) ensure the models' respective priors are reasonable and select the more complex of the two; or, (3) keep them both as a set of best models.This scenario is the least concerning of the three, and in most cases we can safely select the most predictive of the two provided its priors are reasonable.
Scenario 2: the many model case In the many model case all previous recommendations apply, and we might further use our proposal from Section 3.2 to determine whether all models are equivalent in terms of predictive performance using order statistics.If we detect non-negligible selection-induced bias, then we do not make a model selection decision and instead perform model averaging or stacking (in the non-nested case).Our estimation procedure was shown to closely match empirical observations, and requires no further computation beyond the sufficient statistics of standard CV estimates.We further proposed a diagnostic to determine when the underlying assumptions of our proposition break down, and more computationally expensive approaches can be used instead.
Scenario 3: forward search When the model space is large and we consider nested models, then we should first try projection predictive inference for model size selection.If the reference model is too difficult to fit, or if it belongs to an observation family which can not easily be implemented in the projective framework, then our lightweight selection-induced bias correction can have a direct application: by performing CV forward search and correcting the incremental elpd differences along the search path according to Section 4.1, we can recover an accurate estimate of the out-of-sample performance along the search path.And we do so at a significantly reduced cost to more computationally expensive procedures.In cases of severe over-fitting, and occasionally with shrinkage priors, it under-estimates the true bias slightly.However, we found that we did not need to correct the selection-induced bias perfectly to be able to accurately identify the point of predictive saturation in forward search.Indeed, just recovering the shape of the test elpd was sufficient for stable model size selection.Our procedure was found to effectively and reliably identify the point of over-fitting, and provides an accurate estimate of the bias, closely matching empirical observations.

Discussion
The use of order statistics in estimating selection-induced bias was introduced here in its most simple form, and is not without its limitations.We presently discuss those aspects which could be developed, and motivate the importance of future work in this area.
6.1.Limitations Naturally, our selection-induced bias correction hinges on a number of assumptions which may not always be met in practice.For instance, should models produce correlated predictions then the distributions the elpd point estimates are sampled from may not be independent leading to some error in the order statistic.We already witnessed this to a degree when using shrinkage priors.Likewise, the use of  = 0.5 in the order statistic approximation is known to be conservative, although seemed to fit empirical data well.
Further, while the elpd difference distribution of models with no extra predictive information from the baseline may theoretically be centred on zero, this is unlikely to be the case in practice.Indeed, this centering may in fact be slightly below zero.The choice of selection-induced bias correction based on the negative correlation results from Watanabe (2010) is also heuristic, and may not be representative of the general case.It is not immediately clear how the correction could be alternatively motivated in a robust manner, and we invite future work on the subject given the positive results we were able to achieve using a very simple model.Despite these limitations, our correction performed well across all simulated and real data exercise.

Future directions
It might be possible to improve on our simple correction by modelling the candidate models' LOO-CV elpd differences with a hierarchical model, and making decisions based on the posterior of the meta-model rather than the raw LOO-CV estimates (Gelman et al., 2012;Schmitt et al., 2023).As well as trying to collapse all models into one group of in-differentiable models, as is the aim of partially-pooled models, one might produce a correlationaware hierarchical model capable of identifying clusters of models exhibiting similar predictive performance.For example, one might consider a multivariate model with a similarity-based covariance matrix.
A more attainable extension is perhaps soft-thresholding the point at which we correct the LOO-CV elpd along the search path.The Fisher-Tippett-Gnedenko theorem tells us that the limiting distribution of properly-normalised maxima from a sequence of iid standard Gaussian observations follows a generalised extreme value distribution.Using this, or posterior weights from some meta-model, we might achieve some probability of having observed a truly predictive model from a set of candidates.This, as opposed to using the order statistic's mean as a hard threshold, would allow us to weight the correction we apply.While we did not observe any issue with the hard threshold we proposed, it is possible that in instances of many weakly-relevant predictors it is harder to separate between only negligibly relevant models, and our correction rule may be too rigid.Δ elpd LOO Evidence for more complex model 3), the pseudo-BMA+ weights (Equation A.4), and the probability of one model being better than another based on the normal approximation of the elpd difference distribution (Equation A.2). We fix the standard error of the elpd difference normal approximation to be se(Δ elpd LOO ) = 2 for demonstration purposes.For this choice of standard error, Equation A.2 acts very similarly to the probit function (naturally, as it is equivalent to the Gaussian cumulative density function computed up to zero).This probit is precisely the pseudo-BMA weight of the more complex model, and mildly altered becomes the pseudo-BMA+ weights.At the point Δ elpd LOO ≈ 4, and the probability of it being better than the less complex model is approximately 98%.

A. Connection between elpd difference and model averaging
Choosing a model based on elpd point estimate, as we have proposed throughout this paper, has a deeper connection to Bayesian model averaging.Suppose we are in the  = 2 case, we might say that one model is sufficiently better than another to feel safe making a model selection decision if This was initially considered by Sivula et al. (2022) in terms of the reliability of the standard error estimate: when the elpd difference point estimate is this low, it is likely that the elpd difference standard error estimate is too small.McLatchie et al. (2024) later used this flavour of selection heuristic in the context of projection predictive inference.Suppose we have once more the situation where we have two models M  and M  .If we consider a normal approximation of the LOO-CV elpd difference estimate, then the probability that the more predictive model, say M  , is truly better than the other is Suppose the standard error is two for demonstration purposes, then when the point estimate is 4, the probability of one model being better than another is approximately 98%.This is shown in EEG cash incentive data.We show in the point ranges the LOO-CV elpd differences and difference standard errors of 10 models with respect to the baseline model (chosen as the model with the median LOO-CV elpd, and located at index 6).We plot the expected highest elpd difference under the assumption that no model is better than the baseline with order statistics in the red dashed line.This identifies at least one of the models (the tenth) to be significantly better than the baseline model.

C. Bias estimation under different assumed asymptotic cross-validated correlation regimes
As previously mentioned, the origin of our bias correction lies in the work of Watanabe (2010, Table 3).They discuss a notable negative correlation between cross-validated scores and average Bayes generalisation loss, observed in just one data instance.This arises due to excluding an influential observation from training, leading to a skewed posterior predictive distribution and the observation's seemingly anomalous appearance in the test set.The correction of LOO-CV elpd estimates thus necessitates a negatively-correlated correction.In essence, the adjusted elpd difference should be mirrored across the median of candidate differences.The choice of magnitude 1.5 in our bias correction reflects a negative correlation around −0.8 noted by Watanabe (2010), albeit from a single data instance.Recognising this limitation, we repeat the correction in the simulated forward search experiments with values {1, 1.5, 2}, corresponding to correlations {0, −0.5, −1}.
In Figure C.1 we see how these different values perform across different  and  regimes within our simulated forward search experiment.In the case of independent Gaussian priors over the regression coefficients, and in this specific case, it seems the asymptotic correlation may be closer to −1, as using a bias correction with magnitude 2 fits the data best across all  and  regimes.Using smaller values (including the compromise value 1.5 we eventually chose) are liable to under-estimate the bias in the case  = .The difference between magnitudes lessens as  grows, and we could already be comfortable using any of them once  > 4 (again, in this lone data example).When the R2D2 prior is used the results are similar.The main difference we observe is that the difference between the three magnitudes is much smaller, regardless of the  regime.The use of sparsity-inducing priors remedies selection-induced bias, and we do not have to be as concerned as in the high-risk regimes of independent Gaussian priors.
It is not clear how the magnitude in our bias estimate can be constructed generally in a principled manner.Equivalently, it is not clear how to estimate the asymptotic correlation between

D. Sparsity-inducing priors
Fitting high-dimensional models to sparse data is generally challenging.When using joint shrinkage priors, practitioners explicit a prior belief that applies to both the parameter space and the predictive space.One such prior is the R2D2 prior (Aguilar and Bürkner, 2023;Yanchenko et al., 2023a,b;Zhang et al., 2022), which allows statisticians to express prior beliefs about the model's  2 , and the number of predictors required to achieve it.By doing so, the risk of over-fitting the data is reduced compared to using simpler independent Gaussians for regression coefficients, which implicitly encode a (surprisingly) high a priori  2 .The R2D2 model for a linear regression is: We employ a mean and a pseudo-precision parameterisation of the beta distribution as opposed to shape parameters ,  > 0. The Dirichlet concentration parameter , which is constant across all predictors, controls the sparsity in regression coefficients.Larger values of  result in more uniformly distributed regression coefficients (in terms of magnitude), usually with more predictors explaining the response's variance.Alongside the R2D2 prior, Piironen and Vehtari (2017b) propose the regularised horseshoe (RHS) prior.Here, we define the model as 2  ∼ Cauchy + (0, 1).(D.10) Across all binary classification case studies in this paper, we set the degrees of freedom of Student- prior of the global and local shrinkage parameters to be one, and the degrees of freedom of the slab to be 100 so as to approach the horseshoe prior of Carvalho et al. (2009).Rather than define our prior over  and , we instead prefer to consider the number of effective non-zero parameters, denoted  eff , or rather the proportion of effectively non-zero parameters,  eff /.Table 1 presents the hyper-parameters (for both the R2D2 and RHS priors) used in the real world case studies.9

E. Alternative stopping rules
The previously presented stopping rules which involve choosing the "bulge" model, its associated 2 rule, and projection predictive inference all suffer from the same flaw: they require running the forward search up to (at least) the bulge model.A simple alternative would be to run the forward search until none of the candidates improves predictive performance as judged by the The selected model sizes in three real world datasets across different heuristics when using standard independent Gaussian priors (in black) and the R2D2 prior (in red).Our correction identifies larger model sizes that the 2 Δ and 3 Δ , but with higher mlpd point estimates.The incremental stopping rules are liable to select too-small models, resulting in sub-optimal out-of-sample performance and reduced intuition.where M previous denotes the model chosen in the previous step.Or using a magnitude of 3 rather than 2 in Equation E.1 to be even more conservative.Let us call these two rules 2 Δ and 3 Δ respectively.
In Figure E.1 we show the model sizes they select, along with the test mlpd of those models selected.We find that these rules identify much smaller model sizes than our proposal, although do so at the cost of predictive performance.Our proposal finds models with higher mlpd point estimates across all datasets, and with sizes only slightly larger.Further, we find that the 2 Δ and 3 Δ rules can be volatile, and less stable than our proposal.
It is known that there types of heuristics (the 2 Δ and 3 Δ rules) struggle under collinearity of predictors, and in the presence of weakly (but truly) relevant predictors.This is seen in these real data cases, where the heuristics choose model sizes that, while being computationally beneficial, may be overly-sparse.For these reasons, we do not consider them to sufficiently robust (in a qualitative sense) for use in real data experiments where the structure of the predictors is unknown.

Figure 2 .
Figure2.Theoretical illustration.The hypothetical expected selection-induced bias in the presence of  candidate models.While the bias is negligible for  = 2, it can grow rapidly with the number of candidate models.We discuss this trend later.

Figure 3 .
Figure3.Simulated nested linear regression example.The red lines indicate the best LOO-CV elpd difference of  candidates from a baseline model (the horizontal zero line), and the black lines are the test elpd difference.In grey is the test elpd of the model fit with the true predictors.The region between the red and black curves is selection-induced bias, whereas the region between the oracle model (this is the zero line up to  Δ ≈ 0.3, after which it is the grey line) and black curves is what we refer to as over-fitting.The magnitude of the selection-induced bias grows with .

Figure 4 .
Figure 4. Simulated nested linear regression example.The mean best LOO-CV elpd difference of  candidate models from the baseline model (black dots) over 100 data realisations when  Δ = 0, and the expected best observed LOO-CV elpd difference of  models as computed according to Equation 15 (dashed grey line).

Figure 5 .
Figure 5. Illustrative sketch.In red we show the LOO-CV elpd difference to the full (reference) model along growing model size, and in grey the out-of-sample elpd difference.Selection-induced bias is the divergence of the red line from the grey.Over-fitting is the drop in predictive performance with increased model size following the point of predictive saturation.
Using independent Gaussian priors over regression coefficients.Using the R2D2 prior over regression coefficients.

Figure 6 .
Figure 6.Forward search simulation experiment.The red lines show the LOO-CV mlpd difference to the reference model over 20 iterations, the grey lines show the test mlpd difference along those same paths, and the black show the corrected mlpd difference, produced using our correction from Section 4.1.The vertical dashed red and black lines indicate the point of maximum LOO-CV and corrected LOO-CV mlpd difference respectively.

Figure 7 .
Figure7.Real world datasets.The red lines show the LOO-CV mlpd difference to the reference model over CV folds, while the grey lines show the test mlpd difference along those same folds.In black we see the corrected mlpd difference, produced using our correction from Section 4.1.

Figure 8 .
Figure8.Real world datasets.In the bottom row we show the selected model sizes in four real world datasets across different heuristics when using standard independent Gaussian priors (in black), the sparsity-inducing priors (in red).The top row shows the test mlpd difference to the reference model of the model sizes selected by each heuristic.

Figure A. 1 .
Figure A.1.Theoretical connection between elpd difference and BMA.The relationship between pseudo-BMA weights (Equation A.3), the pseudo-BMA+ weights (Equation A.4), and the probability of one model being better than another based on the normal approximation of the elpd difference distribution (Equation A.2). We fix the standard error of the elpd difference normal approximation to be se(Δ elpd LOO ) = 2 for demonstration purposes.For this choice of standard error, Equation A.2 acts very similarly to the probit function (naturally, as it is equivalent to the Gaussian cumulative density function computed up to zero).This probit is precisely the pseudo-BMA weight of the more complex model, and mildly altered becomes the pseudo-BMA+ weights.At the point Δ elpd LOO ≈ 4, and the probability of it being better than the less complex model is approximately 98%.
Figure B.1.EEG cash incentive data.We show in the point ranges the LOO-CV elpd differences and difference standard errors of 10 models with respect to the baseline model (chosen as the model with the median LOO-CV elpd, and located at index 6).We plot the expected highest elpd difference under the assumption that no model is better than the baseline with order statistics in the red dashed line.This identifies at least one of the models (the tenth) to be significantly better than the baseline model.