1 Introduction

Model selection is one of the fundamental problems in statistical modeling. The often adopted view for model selection is not to identify the true underlying model but rather to find a model which is useful. Typically the usefulness of a model is measured by its ability to make predictions about future or yet unseen observations. Even if prediction is not the most important goal of the modelling problem at hand, predictive ability is still a natural measure for judging whether the model makes sense: if the model gives poor predictions, there is little point in trying to interpret the model parameters. We refer to model selection based on assessing the predictive ability of the candidate models as predictive model selection.

Numerous methods for Bayesian predictive model selection and assessment have been proposed, and the various approaches and their theoretical properties have been extensively reviewed by Vehtari and Ojanen (2012). This paper is a follow-up to their work. Since the review of Vehtari and Ojanen (2012) is qualitative, our contribution is to compare many of the different methods quantitatively in practical model selection problems, discuss the differences, and give recommendations about the preferred approaches. We believe this study gives useful insights, because the comparisons to existing techniques are often inadequate in the original articles presenting new methods. Our experiments focus on variable subset selection in linear models for regression and classification, but the discussion is general and the concepts apply also to other model selection problems.

A fairly popular method in the Bayesian literature is to select the maximum a posteriori (MAP) model which, in the case of a uniform prior on the model space, reduces to maximizing the marginal likelihood, so that the selection can be performed using Bayes factors (e.g., Kass and Raftery 1995; Han and Carlin 2001). In the input variable selection context, the marginal posterior probabilities of the variables have also been used (e.g., Brown et al. 1998; Barbieri and Berger 2004; Narisetty and He 2014). In fact, the Bayesian variable selection literature often assumes that the selection is performed using one of these approaches, and the studies focus on choosing priors for the model space and for the parameters of the individual models (see the review by O’Hara and Sillanpää 2009). These studies are motivated by the fact that sampling the high dimensional model space is highly nontrivial, and by the well-known sensitivity of both the Bayes factors and the marginal probabilities to the prior choices (e.g., Kass and Raftery 1995; O’Hara and Sillanpää 2009).

However, we want to make a distinction between prior choice and model selection. More specifically, we want to address the question: given our prior beliefs, how should we choose the model? In our view, the prior should reflect our genuine beliefs about the unknown quantities, such as the number of relevant input variables. For this reason our goal is not to compare and review the vast literature on priors and computation strategies, which has already been done by O’Hara and Sillanpää (2009). Still, we consider this literature essential, as it provides tools for formulating different prior beliefs and performing the necessary posterior computations.

In contexts other than variable selection, a common approach is to choose the model based on its estimated predictive ability, preferably using Bayesian leave-one-out cross-validation (LOO-CV) (Geisser and Eddy 1979) or the widely applicable information criterion (WAIC) (Watanabe 2009), both of which are known to give a nearly unbiased estimate of the predictive ability of a given model (Watanabe 2010). Several other predictive criteria with different loss functions have also been proposed, for instance the deviance information criterion (DIC) by Spiegelhalter et al. (2002) and the various squared error measures by Laud and Ibrahim (1995), Gelfand and Ghosh (1998), and Marriott et al. (2001), none of which, however, are unbiased estimates of the true generalization utility of the model.

Table 1 Categorization of the different model selection methods discussed in this paper

Yet another approach is to construct a full encompassing reference model, believed to best represent the uncertainties regarding the modelling task, and then to choose a simpler submodel that gives predictions close to those of the reference model. This approach was pioneered by Lindley (1968), who considered input variable selection for linear regression and used the model with all the variables as the reference model. The idea was extended by Brown et al. (1999, 2002). A related method is due to San Martini and Spezzaferri (1984), who used the Bayesian model averaging (BMA) solution as the reference model and measured the difference between the predictions of the reference model and the candidate model by the Kullback–Leibler divergence instead of the squared error loss used by Lindley. Another related method is the reference model projection by Goutis and Robert (1998) and Dupuis and Robert (2003), in which the information contained in the posterior of the reference model is projected onto the candidate models. Variations of this method include heuristic Lasso-type searching (Nott and Leng 2010) and approximate projection with different cost functions (Tran et al. 2012; Hahn and Carvalho 2015).

Although LOO-CV and WAIC can be used to obtain a nearly unbiased estimate of the predictive ability of a given model, both of these estimates contain a stochastic error term whose variance can be substantial when the dataset is not very large. This variance in the estimate may lead to overfitting in the selection process causing nonoptimal model selection and inducing bias in the performance estimate for the selected model (e.g., Ambroise and McLachlan 2002; Reunanen 2003; Cawley and Talbot 2010). The overfitting in the selection may be negligible if only a few models are being compared but, as we will demonstrate, may become a problem for a larger number of candidate models, such as in variable selection.

Our results show that selection based on CV or WAIC may yield highly varying results and lead to choosing a model with predictive performance far from optimal. From the predictive point of view, the best results are generally obtained by accounting for the model uncertainty and forming the full BMA solution over the candidate models, and one should not expect to do better by selection. This agrees with what is known about the good performance of the BMA (Hoeting et al. 1999; Raftery and Zheng 2003). If the full model is too complex or the cost of observing all the variables is too high, the model can be simplified most robustly by the projection method, which is considerably less vulnerable to overfitting in the selection. The advantage of the projection approach comes from taking the model uncertainty into account when forming the full encompassing model, and from the reduced variance in the performance evaluation of the candidate models. Overall, the projection framework also outperforms the selection of the most probable inputs or the MAP model, while both of these typically perform better than optimization based on CV or WAIC.

Despite its advantages, the projection approach has suffered from the difficulty of deciding how many variables should be selected in order to get predictions close to the reference model (Peltola et al. 2014; Robert 2014). Our study shows that model selection can benefit greatly from using cross-validation outside the variable searching process, both for guiding the choice of model size and for assessing the predictive performance of the finally selected model. Using cross-validation for choosing only the model size, rather than the variable combination, introduces substantially less overfitting due to the greatly reduced number of model comparisons (see Sect. 4.4 for discussion).

The paper is organized as follows. In Sect. 2 we briefly go through the model selection methods discussed in this paper. This part of the paper overlaps somewhat with previous studies (especially with Vehtari and Ojanen 2012), but is included to keep the paper readable as a standalone piece. Section 3 discusses and illustrates overfitting in model selection and the consequent selection induced bias in the performance evaluation of the chosen model. Section 4 is devoted to the numerical experiments and forms the core of the paper. The discussion in Sect. 5 concludes the paper.

2 Approaches for Bayesian model selection

This section briefly discusses the proposed methods for Bayesian model selection relevant for this paper. We do not attempt anything close to a comprehensive review but summarize the methods under comparison. See the review by Vehtari and Ojanen (2012) for a more complete discussion.

The section is organized as follows. Section 2.1 discusses how the predictive ability of a model is defined in terms of an expected utility, and Sects. 2.2–2.5 briefly review the methods. Following Bernardo and Smith (1994) and Vehtari and Ojanen (2012), we have categorized the methods into \( \mathscr {M} \)-closed, \( \mathscr {M} \)-completed, and \( \mathscr {M} \)-open views, see Table 1. \( \mathscr {M} \)-closed means assuming that one of the candidate models is the true data generating model. Under this assumption, one can set prior probabilities for each model and form the Bayesian model averaging solution (see Sect. 2.5). The \( \mathscr {M} \)-completed view abandons the idea of a true model, but still forms a reference model which is believed to be the best available description of the future observations. In the \( \mathscr {M} \)-open view one neither assumes that any of the models is true nor constructs a reference model.

Throughout Sect. 2, the notation assumes a model M which predicts an output y given an input variable x. The same notation is used both for scalars and vectors. We denote the future observations by \( \tilde{y} \) and the model parameters by \( \theta \). To make the notation simpler, we denote the training data as \(D= \{(x_i,y_i)\}_{i=1}^n\).

2.1 Predictive ability as an expected utility

The predictive performance of a model is typically defined in terms of a utility function that describes the quality of the predictions. An often used utility function measuring the quality of the predictive distribution of the candidate model M is the logarithmic score (Good 1952)

$$\begin{aligned} u(M, \tilde{y} ) = \log p( \tilde{y} \,|\, D, M) . \end{aligned}$$
(1)

Here we have left out the future input variables \( \tilde{x} \) to simplify the notation. The logarithmic score is a widely accepted utility function due to its information theoretic grounds and will be used in this paper. However, we point out that in principle any other utility function could be considered, and the choice of a suitable utility function might also be application specific.

Since the future observations \( \tilde{y} \) are unknown, the utility function \(u(M, \tilde{y} )\) cannot be evaluated beforehand. Therefore one usually works with the expected utilities instead

$$\begin{aligned} \bar{u}(M) = \mathrm {E}_{} {\left[ \log p( \tilde{y} \,|\, D, M)\right] } = \int p_ \text {t} ( \tilde{y} ) \log p( \tilde{y} \,|\, D, M) \mathrm d \tilde{y} , \end{aligned}$$
(2)

where \(p_ \text {t} ( \tilde{y} )\) denotes the true data generating distribution. This expression will be referred to as the generalization utility or more loosely as the predictive performance of model M. Maximizing (2) is equivalent to minimizing the Kullback–Leibler (KL) divergence from the true data generating distribution \(p_ \text {t} ( \tilde{y} )\) to the predictive distribution of the candidate model M.
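To make the last statement explicit, a one-line rearrangement of (2) gives

$$\begin{aligned} \bar{u}(M) = \int p_ \text {t} ( \tilde{y} ) \log \frac{p( \tilde{y} \,|\, D, M)}{p_ \text {t} ( \tilde{y} )} \mathrm d \tilde{y} + \int p_ \text {t} ( \tilde{y} ) \log p_ \text {t} ( \tilde{y} ) \mathrm d \tilde{y} = -\mathrm {KL} {\left( p_ \text {t} ( \tilde{y} ) \, \Vert \, p( \tilde{y} \,|\, D, M)\right) } + \text {const}, \end{aligned}$$

where the constant is the negative entropy of \(p_ \text {t} ( \tilde{y} )\) and does not depend on M, so maximizing \(\bar{u}(M)\) over the candidate models indeed amounts to minimizing the KL divergence.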

2.2 Generalization utility estimation

2.2.1 Cross-validation

The generalization utility (2) can be estimated by using the already obtained sample \(D\) as a proxy for the true data generating distribution \(p_ \text {t} ( \tilde{y} )\). However, estimating the expected utility using the same data \(D\) that were used to train the model leads to an optimistic estimate of the generalization performance. A better estimate is obtained by dividing the data into K subsets \(I_1,\ldots ,I_K\) and using each of these sets in turn for validation while training the model on the remaining \(K-1\) sets. This gives the Bayesian K-fold cross-validation (CV) utility (Geisser and Eddy 1979)

$$\begin{aligned} K \text {-fold-CV} = \frac{1}{n} \sum _{i=1}^n \log p(y_i \,|\, x _i, D_{\setminus I_{s(i)}}, M), \end{aligned}$$
(3)

where \(I_{s(i)}\) denotes the validation set that contains the ith point and \(D_{\setminus I_{s(i)}}\) the training data from which this subset has been removed. Conditioning the predictions on fewer than n data points introduces bias in the utility estimate. This bias can be corrected (Burman 1989), but a small K increases the variance of the estimate. One would prefer to set \(K=n\), computing the leave-one-out utility (LOO-CV), but without computational shortcuts this is often infeasible, as the model would need to be fitted n times. An often used compromise is \(K = 10\). Analytical approximations for LOO are discussed by Vehtari et al. (2014) and computations using posterior samples by Vehtari et al. (2016).
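As a concrete illustration of (3), the following sketch computes the K-fold CV utility given user-supplied `fit` and `log_pred_dens` routines; these function names, as well as the random fold assignment, are our own placeholders rather than part of any particular software package.

```python
import numpy as np

def kfold_cv_utility(x, y, fit, log_pred_dens, K=10, seed=0):
    """Bayesian K-fold CV utility, Eq. (3): mean log predictive density of each
    point under a model fitted without the fold containing that point.

    fit(x_train, y_train)      -> fitted model object           (assumed interface)
    log_pred_dens(model, x, y) -> log p(y_i | x_i, D_train, M)  (assumed interface)
    """
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % K              # random fold assignment s(i)
    lpd = np.empty(n)
    for k in range(K):
        val = folds == k                        # validation set I_k
        model = fit(x[~val], y[~val])           # train on D with I_k removed
        lpd[val] = log_pred_dens(model, x[val], y[val])
    return lpd.mean()                           # (1/n) sum_i log p(y_i | x_i, D_{\ I_s(i)}, M)
```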

2.2.2 Information criteria

Information criteria offer a computationally appealing way of estimating the generalization performance of the model. A fully Bayesian criterion is the widely applicable information criterion (WAIC) by Watanabe (2009, 2010). WAIC can be calculated as

$$\begin{aligned} \text {WAIC} = \frac{1}{n} \sum _{i=1}^n \log p(y_i \,|\, x _i, D, M) - \frac{V}{n}, \end{aligned}$$
(4)

where the first term is the training utility and V is the functional variance given by

$$\begin{aligned} V = \sum _{i=1}^n \Bigg \lbrace \mathrm {E}_{} {\left[ (\log p(y_i \,|\, x _i, \theta , M))^2\right] } - \mathrm {E}_{} {\left[ \log p(y_i \,|\, x _i, \theta , M)\right] } ^2 \Bigg \rbrace \,. \end{aligned}$$
(5)

Here both of the expectations are taken over the posterior \(p( \theta \,|\, D,M)\). Watanabe (2010) proved that WAIC is asymptotically equal to the Bayesian LOO-CV and to the generalization utility (2), with error of order \(o(1/n)\). Watanabe’s proof gives Bayesian LOO and WAIC a solid theoretical justification in the sense that they measure the predictive ability of the model including the uncertainty in the parameters, and can be used also for singular models, that is, models for which the set of the “true parameters” consists of more than one point.
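For concreteness, WAIC (4)–(5) can be computed from an \(S \times n\) matrix of pointwise log-likelihood values evaluated at posterior draws; a minimal sketch follows (the matrix layout is our assumption).

```python
import numpy as np
from scipy.special import logsumexp

def waic(loglik):
    """WAIC as in Eq. (4), with loglik[s, i] = log p(y_i | x_i, theta_s, M)
    for posterior draws theta_1, ..., theta_S."""
    S, n = loglik.shape
    # training utility: (1/n) sum_i log p(y_i | x_i, D, M), with the posterior
    # predictive density approximated by the average over the draws
    train_utility = np.mean(logsumexp(loglik, axis=0) - np.log(S))
    # functional variance V of Eq. (5): sum of posterior variances of the log-likelihood
    V = np.sum(np.var(loglik, axis=0, ddof=1))
    return train_utility - V / n
```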

Another still popular method is the deviance information criterion (DIC) proposed by Spiegelhalter et al. (2002). DIC estimates the generalization performance of the model with parameters fixed to the posterior mean \( \bar{\theta } = \mathrm {E}_{} {\left[ \theta \,|\, D, M\right] } \). DIC can be written as

$$\begin{aligned} \text {DIC} = \frac{1}{n} \sum _{i=1}^n \log p(y_i \,|\, x _i, \bar{\theta } , M) - \frac{p_ \text {eff} }{n}, \end{aligned}$$
(6)

where \(p_ \text {eff} \) is the effective number of parameters which can be estimated as

$$\begin{aligned} p_ \text {eff} = 2 \sum _{i=1}^n \left( \log p(y_i \,|\, x _i, \bar{\theta } ,M) - \mathrm {E}_{} {\left[ \log p(y_i \,|\, x _i, \theta , M) \right] } \right) \!, \end{aligned}$$
(7)

where the expectation is taken over the posterior. From a Bayesian perspective, DIC is not theoretically justified, since it measures the fit of the model with the parameters fixed to the posterior expectation and is therefore not an unbiased estimate of the true generalization utility (2). The use of a point estimate is questionable not only on grounds of Bayesian principles, but also from a practical point of view, especially when the model is singular.
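A corresponding sketch for DIC (6)–(7), assuming a user-supplied function returning the pointwise log-likelihood for a given parameter value (the function name and array layout are our placeholders):

```python
import numpy as np

def dic(theta_draws, loglik_fn):
    """DIC as in Eqs. (6)-(7).

    theta_draws : posterior draws, shape (S, d)                       (assumed layout)
    loglik_fn(theta) -> length-n array of log p(y_i | x_i, theta, M)  (assumed interface)
    """
    theta_bar = theta_draws.mean(axis=0)                      # posterior mean of the parameters
    ll_bar = loglik_fn(theta_bar)                             # log-likelihood at the posterior mean
    ll_draws = np.array([loglik_fn(t) for t in theta_draws])  # (S, n) pointwise log-likelihoods
    p_eff = 2.0 * np.sum(ll_bar - ll_draws.mean(axis=0))      # effective number of parameters, Eq. (7)
    return ll_bar.mean() - p_eff / len(ll_bar)                # Eq. (6)
```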

2.3 Mixed self and posterior predictive criteria

There exist a few criteria that are not unbiased estimates of the true generalization utility (2) but have nevertheless been proposed for model selection. These criteria do not fit the \( \mathscr {M} \)-open view, since the candidate models are partially assessed based on their own predictive properties, and therefore they resemble the \( \mathscr {M} \)-closed/\( \mathscr {M} \)-completed views (for a detailed discussion, see Vehtari and Ojanen 2012).

Ibrahim and Laud (1994) and Laud and Ibrahim (1995) proposed a selection criterion for regression derived by considering replicated measurements \( \tilde{ y } \) at the training inputs. The criterion measures the expected squared error between the new observations and the old ones \( y \) over the posterior predictive distribution of the candidate model M. The error measure can be decomposed as

$$\begin{aligned} L^2&= \sum _{i=1}^n \big (y_i - \mathrm {E}_{} {\left[ \tilde{y} \,|\, x _i, D, M\right] } \big )^2 + \sum _{i=1}^n \mathrm {Var} {\left[ \tilde{y} \,|\, x _i, D, M\right] } , \end{aligned}$$
(8)

that is, as the sum of the squared errors of the mean predictions plus the sum of the predictive variances. The preferred model is then the one that minimizes (8). The \(L^2\)-criterion is not an unbiased estimate of (2) due to the different form of the utility function, but Ibrahim and Laud (1994) showed that, in model comparison, the criterion asymptotically penalizes more complex models with a penalty halfway between the posterior predictive approach (i.e., fit at the training points) and the cross-validation approach.

Marriott et al. (2001) proposed a closely related criterion which is a cross-validated version of (8)

$$\begin{aligned} L^2_ \text {cv}&= \sum _{i=1}^n \big (y_i - \mathrm {E}_{} {\left[ \tilde{y} \,|\, x _i, D_{\setminus I_{s(i)}}, M\right] } \big )^2\nonumber \\&\quad + \sum _{i=1}^n \mathrm {Var} {\left[ \tilde{y} \,|\, x _i, D_{\setminus I_{s(i)}}, M\right] } \,. \end{aligned}$$
(9)

This is intuitively more appealing than the \(L^2\)-criterion because it does not use the same data for training and testing. However, little is known about the properties of the estimate (9), as the authors do not provide a theoretical treatment. Empirically, like \(L^2\), the \(L^2_ \text {cv} \)-criterion is found to have a relatively high variance, which may cause significant overfitting in model selection, as discussed in Sect. 3 and demonstrated experimentally in Sect. 4.

Yet another related criterion based on replicated measurements was proposed by Gelfand and Ghosh (1998). The authors considered an optimal point prediction designed to be close to both the observed and the future data, with the relative importance of the two adjusted by a free parameter k. Omitting the derivation, the loss function becomes

$$\begin{aligned} L^2_k&= \frac{k}{k+1} \sum _{i=1}^n \big (y_i - \mathrm {E}_{} {\left[ \tilde{y} \,|\, x _i, D, M\right] } \big )^2\nonumber \\&\quad + \sum _{i=1}^n \mathrm {Var} {\left[ \tilde{y} \,|\, x _i, D, M\right] } \,. \end{aligned}$$
(10)

When \(k \rightarrow \infty \), this is the same as the \(L^2\)-criterion (8). On the other hand, when \(k=0\), the criterion reduces to the sum of the predictive variances; in this case there is no inherent safeguard against poor fit, as the model with the narrowest predictive distribution is chosen. In their experiments, the authors reported that the results were not very sensitive to the choice of k.
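Since the criteria (8)–(10) differ only in the weight of the fit term, they can all be computed from the posterior predictive means and variances at the training points; below is a sketch under the assumption that these summaries have already been obtained (e.g., from posterior draws).

```python
import numpy as np

def l2_family(y, pred_mean, pred_var, k=np.inf):
    """Squared error criteria at the training points.

    k = inf     : L^2 criterion, Eq. (8)
    0 < k < inf : Gelfand-Ghosh criterion L^2_k, Eq. (10)
    k = 0       : sum of predictive variances only

    pred_mean[i] = E[y_tilde | x_i, D, M], pred_var[i] = Var[y_tilde | x_i, D, M];
    for the cross-validated version (9) these would instead be computed with the
    fold containing the i-th point left out.
    """
    weight = 1.0 if np.isinf(k) else k / (k + 1.0)
    fit_term = weight * np.sum((np.asarray(y) - pred_mean) ** 2)  # squared errors of the mean predictions
    var_term = np.sum(pred_var)                                   # sum of predictive variances
    return fit_term + var_term
```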

2.4 Reference model approach

Section 2.2 reviewed methods for utility estimation that are based on sample reuse without any assumptions about the true model (\( \mathscr {M} \)-open view). An alternative is to construct a full encompassing reference model \(M_*\), which is believed to best describe our knowledge about the future observations, and to perform the utility estimation almost as if it were the true data generating distribution (\( \mathscr {M} \)-completed view). We refer to this as the reference model approach. There are basically two somewhat different but related approaches that fit into this category, namely the reference and projection predictive methods, which will be discussed separately.

2.4.1 Reference predictive method

Assuming we have constructed a reference model \(M_*\) which we believe best describes our knowledge about the future observations, the utilities of the candidate models M can be estimated by replacing the true distribution \(p_ \text {t} ( \tilde{y} )\) in (2) by the predictive distribution of the reference model \(p( \tilde{y} \,|\, D, M_*)\). Averaging this over the training inputs \(\{ x _i\}_{i=1}^n\) gives the reference utility

$$\begin{aligned} \bar{u}_ \text {ref} = \frac{1}{n}\sum _{i=1}^n \int p( \tilde{y} \,|\, x _i, D, M_*) \log p( \tilde{y} \,|\, x _i, D, M) \mathrm d \tilde{y} \,. \end{aligned}$$
(11)

Depending on the reference model, this integral may not be available analytically, and numerical integration may be required. However, if the output \( \tilde{y} \) is only one dimensional, simple quadrature is often adequate.
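As an example of such a quadrature, the sketch below evaluates (11) on a grid when the predictive distributions of both models at each training input are (approximately) Gaussian; the Gaussian form and the grid limits are our illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def reference_utility(ref_mean, ref_std, cand_mean, cand_std, n_grid=400):
    """Reference utility, Eq. (11), by simple quadrature over a one-dimensional output.

    ref_mean[i], ref_std[i]   : predictive mean/std of the reference model at x_i
    cand_mean[i], cand_std[i] : predictive mean/std of the candidate model at x_i
    (Gaussian predictive distributions are assumed here purely for illustration.)
    """
    n, total = len(ref_mean), 0.0
    for i in range(n):
        # grid covering essentially all of the reference predictive mass
        y = np.linspace(ref_mean[i] - 8 * ref_std[i], ref_mean[i] + 8 * ref_std[i], n_grid)
        p_ref = norm.pdf(y, ref_mean[i], ref_std[i])
        log_p_cand = norm.logpdf(y, cand_mean[i], cand_std[i])
        total += np.trapz(p_ref * log_p_cand, y)   # int p(y | x_i, D, M_*) log p(y | x_i, D, M) dy
    return total / n
```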

As the reference model is in practice different from the true data generating model, the reference utility is a biased estimate of the true generalization utility (2). The maximization of the reference utility is equivalent to minimizing the predictive KL-divergence between the reference model \(M_*\) and the candidate model M at the training inputs

$$\begin{aligned} \delta (M_* \Vert M)&= \frac{1}{n} \sum _{i=1}^n \mathrm {KL} {\left( p( \tilde{y} \,|\, x _i, D, M_*) \, \Vert \, p( \tilde{y} \,|\, x _i, D, M)\right) } . \end{aligned}$$
(12)

The model choice can then be based on the strict minimization of the discrepancy measure (12), or choosing the simplest model that has an acceptable discrepancy. What is meant by “acceptable” may be somewhat arbitrary and depend on the context. For more discussion, see the concept of relative explanatory power in the next section, Eq. (17).

The reference predictive approach is inherently a less straightforward approach to model selection than the methods presented in Sect. 2.2, because it requires the construction of a reference model, and it is not obvious how this should be done. San Martini and Spezzaferri (1984) proposed using the Bayesian model average (20) as the reference model (see Sect. 2.5). In the variable selection context, model averaging corresponds to a spike-and-slab type prior (Mitchell and Beauchamp 1988), which is often considered the “gold standard” and has been extensively used for linear models (see, e.g., George and McCulloch 1993, 1997; Raftery et al. 1997; Brown et al. 2002; Lee et al. 2003; O’Hara and Sillanpää 2009; Narisetty and He 2014) and extended and applied to regression with over one million variables (Peltola et al. 2012a, b). However, we emphasize that any other model or prior could be used, as long as we believe it reflects our best knowledge of the problem and allows convenient computation. For instance, the horseshoe prior (Carvalho et al. 2009, 2010) has been shown, both empirically and theoretically, to have desirable properties given a properly chosen shrinkage factor (Datta and Ghosh 2013; Pas et al. 2014).

2.4.2 Projection predictive method

A related but somewhat different alternative to the reference predictive method (previous section) is the projection approach. The idea is to project the information in the posterior of the reference model \(M_*\) onto the candidate models M so that the predictive distribution of the candidate model remains as close to the reference model as possible. Thus the key aspect is that the candidate model parameters are determined by the fit of the reference model, not by the data. Therefore also the prior needs to be specified only for the reference model. The idea behind the projection is quite generic and Vehtari and Ojanen (2012) discuss the general framework in more detail.

A practical means for doing the projection was proposed by Goutis and Robert (1998) and further discussed by Dupuis and Robert (2003). Given the parameter of the reference model \( \theta ^*\), the projected parameter \( \theta ^\perp \) in the parameter space of model M is defined via

$$\begin{aligned} \theta ^\perp&= \arg \min _{ \theta } \frac{1}{n}\sum _{i=1}^n \mathrm {KL} {\left( p( \tilde{y} \,|\, x _i, \theta ^*, M_*) \, \Vert \, p( \tilde{y} \,|\, x _i, \theta , M)\right) } \,. \end{aligned}$$
(13)

The discrepancy between the reference model \(M_*\) and the candidate model M is then defined to be the expectation of the divergence over the posterior of the reference model

$$\begin{aligned} \delta (M_* \Vert M) = \frac{1}{n}\sum _{i=1}^n \mathrm {E}_{} {\left[ \mathrm {KL} {\left( p( \tilde{y} \,|\, x _i, \theta ^*, M_*) \, \Vert \, p( \tilde{y} \,|\, x _i, \theta ^\perp , M)\right) } \right] } \,. \end{aligned}$$
(14)

The posterior expectation in (14) is in general not available analytically. Dupuis and Robert (2003) proposed calculating the discrepancy by drawing samples \(\{ \theta _s^*\}_{s=1}^S\) from the posterior of the reference model, calculating the projected parameters \(\{ \theta _s^\perp \}_{s=1}^S\) individually according to (13), and then approximating (14) as

$$\begin{aligned} \delta (M_* \Vert M) \approx \frac{1}{n S}\sum _{i=1}^n \sum _{s=1}^S \mathrm {KL} {\left( p( \tilde{y} \,|\, x _i, \theta _s^*, M_*) \, \Vert \, p( \tilde{y} \,|\, x _i, \theta _s^\perp , M)\right) } \,. \end{aligned}$$
(15)

The optimization problem in (13) can typically not be solved analytically either, and a numerical optimization routine may be needed. However, for the simplest models, such as the Gaussian linear model, an analytical solution is available, see Appendix 1. Moreover, even when an analytical solution does not exist, solving the optimization problem (13) for generalized linear models is equivalent to finding the maximum likelihood parameters of the candidate model M with the observations replaced by the fit of the reference model (Goutis and Robert 1998).
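To illustrate, here is a sketch of the per-draw projection for the Gaussian linear model, following the maximum likelihood interpretation above: the observations are replaced by the reference fit, which in the Gaussian case is an ordinary least-squares problem. The noise-variance update below is our reading of the Gaussian-case solution and should be checked against Appendix 1.

```python
import numpy as np

def project_draw(X_sub, f_ref, sigma2_ref):
    """Project one reference-model draw onto a Gaussian linear submodel, Eq. (13).

    X_sub      : (n, q) design matrix of the candidate model (selected columns only)
    f_ref      : length-n fit of the reference model at the training inputs for this
                 posterior draw (e.g. X_full @ w_star)
    sigma2_ref : noise variance sigma_*^2 of this reference draw
    """
    w_perp, *_ = np.linalg.lstsq(X_sub, f_ref, rcond=None)  # observations replaced by the reference fit
    resid = f_ref - X_sub @ w_perp
    sigma2_perp = sigma2_ref + np.mean(resid ** 2)          # extra variance from discarded variables (assumed form)
    return w_perp, sigma2_perp

# Projecting each reference draw (w_s^*, sigma_s^*2) in turn, as in Eq. (15),
# yields the projected sample used for the predictions in Eq. (16).
```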

The projected samples \(\{ \theta _s^\perp \}_{s=1}^S\) are used for posterior inference as usual. For example, the predictions of the candidate model M can be computed as

$$\begin{aligned} p( \tilde{y} \,|\, \tilde{x} , D, M) = \frac{1}{S} \sum _{s=1}^S p( \tilde{y} \,|\, \tilde{x} , \theta _s^\perp , M), \end{aligned}$$
(16)

which is the usual Monte Carlo approximation to the predictive distribution; we simply use the projected samples as the posterior approximation.

For deciding which model to choose, Dupuis and Robert (2003) introduced a measure called the relative explanatory power

$$\begin{aligned} \phi (M) = 1- \frac{\delta (M_* \Vert M)}{\delta (M_* \Vert M_0)} \,, \end{aligned}$$
(17)

where \(M_0\) denotes the empty model, that is, the model that has the largest discrepancy to the reference model. In terms of variable selection, \(M_0\) is the model with no variables. By definition, the relative explanatory power takes values between 0 and 1, and Dupuis and Robert (2003) proposed choosing the simplest model with enough explanatory power, for example 90 %, but did not discuss the effect of this threshold on the predictive performance of the selected models. We note that, in general, the relative explanatory power is an unreliable indicator of the predictive performance of the submodel. This is because the reference model is typically different from the true data generating model \(M_ \text {t} \), and therefore \(M_*\) and M may have the same discrepancy to \(M_ \text {t} \) (that is, the same predictive ability) even though the discrepancy between \(M_*\) and M is nonzero.

Peltola et al. (2014) proposed an alternative way of deciding the model size, based on cross-validation outside the searching process. In other words, in a K-fold setting the search is repeated K times, each time leaving 1/K of the data out for testing, and the performance of the found models is evaluated on this left-out data. Note that the reference model is also trained K times, and each time its performance is evaluated on the left-out data. Thus one can compare the utility of both the found models and the reference model on independent data and estimate, for instance, how many variables are needed to get predictions statistically indistinguishable from those of the reference model. More precisely, if \(u_m\) denotes the estimated expected utility of choosing m variables and \(u_*\) the estimated utility of the reference model, the models can be compared by estimating the probability

$$\begin{aligned} \Pr {\left[ u_* - u_m \le \Delta u\right] } , \end{aligned}$$
(18)

that is, the probability that the utility difference compared to the reference model is less than \(\Delta u \ge 0\). Peltola et al. (2014) suggested estimating these probabilities using the Bayesian bootstrap (Rubin 1981) and reported results for all model sizes with \(\Delta u = 0\).
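A sketch of the Bayesian bootstrap estimate of (18), given the pointwise left-out log predictive densities of the submodel and the reference model collected over the K folds (the array names are our placeholders):

```python
import numpy as np

def prob_utility_diff(lpd_sub, lpd_ref, delta_u=0.0, n_boot=4000, seed=0):
    """Estimate Pr[u_* - u_m <= delta_u], Eq. (18), by the Bayesian bootstrap.

    lpd_sub[i], lpd_ref[i] : log predictive density of the i-th left-out point
                             under the submodel and the reference model.
    """
    n = len(lpd_sub)
    rng = np.random.default_rng(seed)
    # Bayesian bootstrap (Rubin 1981): Dirichlet(1,...,1) weights over the n points
    w = rng.dirichlet(np.ones(n), size=n_boot)
    diff = w @ (np.asarray(lpd_ref) - np.asarray(lpd_sub))   # draws of u_* - u_m
    return np.mean(diff <= delta_u)
```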

The obvious drawback of this approach is the increased computation (as the selection and the reference model fitting are repeated K times), but in Sect. 4.3 we demonstrate that it may be very useful when choosing the final model size.

2.5 Model space approach

The Bayesian formalism has a natural way of describing the uncertainty with respect to the model specification, given an exhaustive list of candidate models \(\{M_\ell \}_{\ell =1}^L\). The posterior distribution over the model space is given by

$$\begin{aligned} p(M \,|\, D) \propto p(D \,|\, M) \, p(M) \,. \end{aligned}$$
(19)

The predictions are then obtained from the Bayesian model averaging (BMA) solution

$$\begin{aligned} p( \tilde{y} \,|\, D) = \sum _{\ell =1}^L p( \tilde{y} \,|\, D, M_\ell )\, p(M_\ell \,|\, D) \,. \end{aligned}$$
(20)

Strictly speaking, forming the model average means adopting the \( \mathscr {M} \)-closed view, that is, assuming that one of the candidate models is the true data generating model. In practice, however, averaging over the discrete model space does not differ in any fundamental way from integrating over continuous parameters, which is the standard procedure in Bayesian modeling. Moreover, BMA has been shown to have good predictive performance both theoretically and empirically (Raftery and Zheng 2003), and especially in the variable selection context the integration over the different variable combinations is widely accepted. See the review by Hoeting et al. (1999) for a thorough discussion of Bayesian model averaging.

From a model selection point of view, one may choose the model maximizing (19), ending up with the maximum a posteriori (MAP) model. Assuming the true data generating model belongs to the set of candidate models, the MAP model can be shown to be the optimal choice under the zero-one utility function (the utility being one if the true model is found, and zero otherwise). If the models are given equal prior probabilities, \(p(M)\propto 1\), finding the MAP model reduces to maximizing the marginal likelihood, also referred to as the type-II maximum likelihood.

Barbieri and Berger (2004) proposed a related variable selection method for the Gaussian linear model and named it the Median probability model (which we abbreviate simply as the Median model). The Median model is defined as the model containing all the variables with marginal posterior inclusion probability of at least \(\frac{1}{2}\). Let the binary vector \( \gamma =(\gamma ^1,\dots ,\gamma ^p)\) denote which of the variables are included in the model (\(\gamma ^j=1\) meaning that variable j is included). The marginal posterior inclusion probability of variable j is then

$$\begin{aligned} \pi _j = \sum _{M \,:\, \gamma ^j = 1} p(M \,|\, D), \end{aligned}$$
(21)

that is, the sum of the posterior probabilities of the models which include variable j. The Median model \( \gamma _{\text {med}}\) is then defined componentwise as

$$\begin{aligned} \gamma ^j_{\text {med}} = {\left\{ \begin{array}{ll} 1, \quad \text {if}\, \pi _j \ge \frac{1}{2}, \\ 0, \quad \text {otherwise} \,. \end{array}\right. } \end{aligned}$$
(22)

The authors showed that when the predictors \( x = (x^1,\dots ,x^p)\) are orthogonal, that is, when \( Q = \mathrm {E}_{} {\left[ x x ^\mathsf {T}\right] } \) is diagonal, the Median model is the optimal choice. By optimal the authors mean the model whose predictions for a future \( \tilde{y} \) are closest to the Bayesian model averaging prediction (20) in the squared error sense. The authors admit that the assumption of orthogonal predictors is a strong condition that rarely holds. The Median model also assumes that optimality is defined in terms of the mean predictions, meaning that the uncertainty in the predictive distributions is ignored. Moreover, the Median model is derived assuming Gaussian noise, so the theory does not apply, for instance, to classification problems.
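Given a sample of inclusion vectors \( \gamma \) from an RJMCMC run, the marginal inclusion probabilities (21) and the Median model (22) reduce to simple averaging; a minimal sketch:

```python
import numpy as np

def median_model(gamma_draws):
    """Median probability model, Eqs. (21)-(22).

    gamma_draws : boolean array of shape (S, p); gamma_draws[s, j] is True if
                  variable j is included in the model visited at RJMCMC draw s.
    """
    pi = gamma_draws.mean(axis=0)   # marginal posterior inclusion probabilities pi_j, Eq. (21)
    return pi >= 0.5                # include variable j iff pi_j >= 1/2, Eq. (22)
```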

3 Overfitting and selection induced bias

As discussed in Sect. 2.1, the performance of a model is usually defined in terms of the expected utility (2). Many of the proposed selection criteria reviewed in Sects. 2.2–2.5 can be thought of as estimates of this quantity even if not designed directly for this purpose.

Consider a hypothetical utility estimation method. For a fixed training dataset \(D\), its utility estimate \(g_\ell = g(M_\ell ,D)\) for model \(M_\ell \) can be decomposed as

$$\begin{aligned} g_\ell = u_\ell + e_\ell , \end{aligned}$$
(23)

where \(u_\ell = u(M_\ell , D)\) represents the true generalization utility of the model and \(e_\ell = e(M_\ell , D)\) is the error in the utility estimate. Note that \(u_\ell \) also depends on the observed dataset \(D\), because favourable datasets lead to better generalization performance. If the utility estimate is correct on average over different datasets, \( \mathrm {E}_{D} {\left[ g_\ell \right] } = \mathrm {E}_{D} {\left[ u_\ell \right] } \), or equivalently \( \mathrm {E}_{D} {\left[ e_\ell \right] } = 0\), we say the estimate g is unbiased; otherwise it is biased. The unbiasedness of the utility estimate is often considered beneficial for model selection. However, unbiasedness is intrinsically unimportant for model selection, and successful model selection does not necessarily require unbiased utility estimates. To see this, note that the only requirement for a perfect model selection criterion is that a higher utility estimate implies higher generalization performance, that is, \(g_\ell > g_k\) implies \(u_\ell > u_k\) for all models \(M_\ell \) and \(M_k\). This condition can be satisfied even if \( \mathrm {E}_{D} {\left[ g_\ell \right] } \ne \mathrm {E}_{D} {\left[ u_\ell \right] } \).

To get an idea of how the bias and variance properties of a utility estimate affect model selection, see Fig. 1. The left plot shows an imaginary prototype of an unbiased but high variance utility estimation method. The grey lines represent the estimated utilities for each model M for different data realizations. On average (black) these curves coincide with the true expected utility over all datasets (red). However, due to the high variance, maximizing the utility estimate may lead to choosing a model with nonoptimal expected true utility (the maxima are scattered relatively far from the true optimum). We refer to this phenomenon of choosing a nonoptimal model due to the variance in the utility estimates as overfitting in model selection. In other words, the selection procedure fits to the noise in the utility estimates, and therefore the chosen model is expected to have a nonoptimal true utility. The left plot also demonstrates that, even though the utility estimates are unbiased for each model before the selection, the utility estimate for the selected model is no longer unbiased and is typically optimistic (the maxima of the grey lines tend to lie above the average curve). We refer to this as the selection induced bias.

Fig. 1

Schematic illustration of an unbiased (left) and a biased (right) utility estimation method. Grey lines denote the utility estimates for different datasets \(D_i\), black is the average, and red the true expected utility. In this case, the biased method is likely to choose better models (dashed lines) due to better tradeoff in bias and variance

The right plot shows a biased utility estimation method that either under- or overestimates the ability of most of the models. However, due to the smaller variance, the probability of choosing a model with better true performance is significantly increased (the maxima of the estimates concentrate closer to the true optimum). This example demonstrates that even though unbiasedness is beneficial for the performance evaluation of a particular model, it is not necessarily important for model selection. For selection, it is more important to be able to rank the competing models in an approximately correct order with low variability.

Overfitting in model selection and the selection induced bias are important concepts that have received relatively little attention compared to the vast literature on model selection in general. The topic has, however, been discussed for example by Rencher and Pun (1980), Ambroise and McLachlan (2002), Reunanen (2003), Varma and Simon (2006), and Cawley and Talbot (2010). These authors discuss mainly model selection using cross-validation, but the ideas apply also to other utility estimation methods. As discussed in Sect. 2.2, cross-validation gives a nearly unbiased estimate of the generalization performance of any given model, but the selection process may overfit when the variance in the utility estimates is high (as depicted in the left plot of Fig. 1). This will be demonstrated empirically in Sect. 4. The variance of the utility estimate differs between estimation methods but may generally be considerable for small datasets. The amount of overfitting in the selection increases with the number of models being compared, and may become a problem for example in variable selection, where the number of candidate models grows quickly with the number of variables.
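The phenomenon can be reproduced with a toy simulation of the decomposition (23): draw true utilities and noisy but unbiased estimates for L candidate models, pick the maximizer of the estimate, and compare its estimated and true utility. The noise levels below are arbitrary values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_rep, L = 2000, 100             # number of simulated datasets and candidate models
sigma_u, sigma_e = 0.1, 0.3      # spread of the true utilities and of the estimation error

sel_bias, sel_loss = [], []
for _ in range(n_rep):
    u = rng.normal(0.0, sigma_u, L)       # true generalization utilities u_l
    g = u + rng.normal(0.0, sigma_e, L)   # noisy (but unbiased) utility estimates g_l
    best = np.argmax(g)                   # model chosen by maximizing the estimate
    sel_bias.append(g[best] - u[best])    # selection induced bias of the utility estimate
    sel_loss.append(u.max() - u[best])    # loss in true utility due to overfitting in selection

print("mean selection induced bias:", np.mean(sel_bias))  # clearly positive
print("mean loss vs. oracle choice:", np.mean(sel_loss))  # also positive, grows with L
```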

4 Numerical experiments

This section compares the methods presented in Sect. 2 in practical variable selection problems. Section 4.1 discusses the models used, and Sects. 4.2 and 4.3 show illustrative examples using simulated and real world data, respectively. The reader is encouraged to go through the simulated examples first, as they illustrate most of the important concepts with more detailed discussion. In Sect. 4.4 we then discuss the use of cross-validation for guiding the model size selection and for the performance evaluation of the finally selected model. Finally, Sect. 4.5 provides a short note on the computational aspects.

4.1 Models

We will consider both regression and binary classification problems. To reduce the computational burden involved in the experiments, we consider only linear models. For regression, we apply the standard Gaussian model

$$\begin{aligned} \begin{aligned}&y \,|\, x , w , \sigma ^2 \sim \mathrm {N} {\left( w ^\mathsf {T} x , \sigma ^2 \right) } , \\&w \,|\, \sigma ^2, \tau ^2 \sim \mathrm {N} {\left( 0, \tau ^2\sigma ^2 I \right) } , \\&\sigma ^2 \sim \mathrm {Inv\text {-}Gamma} {\left( \alpha _\sigma , \beta _\sigma \right) } , \\&\tau ^{2} \sim \mathrm {Inv\text {-}Gamma} {\left( \alpha _\tau , \beta _\tau \right) } \,, \end{aligned} \end{aligned}$$
(24)

where \( x \) is the p-dimensional vector of inputs, \( w \) contains the corresponding weights, and \(\sigma ^2\) is the noise variance. For this model, most of the computations can be carried out analytically because, for a given hyperparameter \(\tau ^2\), the prior is conjugate. Since it is difficult to come up with a suitable value for the weight regularizing variance \(\tau ^2\), it is given a weakly informative inverse-gamma prior and integrated over numerically. For binary classification, we use the probit model

$$\begin{aligned} \begin{aligned}&y \,|\, x , w \sim \mathrm {Ber}{\left( \varPhi ( w ^\mathsf {T} x ) \right) } , \\&w \,|\, \tau ^2 \sim \mathrm {N} {\left( 0,\tau ^2 I \right) } , \\&\tau ^2 \sim \mathrm {Inv\text {-}Gamma} {\left( \alpha _\tau , \beta _\tau \right) } , \end{aligned} \end{aligned}$$
(25)

where \(\varPhi (\cdot )\) denotes the cumulative distribution function of the standard normal distribution. Again, a weakly informative prior for \(\tau ^2\) is used. For this model, we use Markov chain Monte Carlo (MCMC) methods to obtain samples from the posterior of the weights and thereby the predictions. For both models (24) and (25) we include the intercept term by augmenting the input vector with a constant, \( x = (1,\, x^1,\dots , x^p)\), and adding a corresponding term in the weight vector \( w = (w^0,w^1,\dots ,w^p)\). The exact values used for the hyperparameters \(\alpha _\tau , \beta _\tau , \alpha _\sigma , \beta _\sigma \) will be given together with the dataset descriptions in Sects. 4.2 and 4.3.
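For the Gaussian model (24), the posterior of \((w, \sigma ^2)\) conditional on \(\tau ^2\) is available in closed form (normal–inverse-gamma). The sketch below shows this conditional update; averaging the results over a grid of \(\tau ^2\) values is our simplified stand-in for the numerical integration over \(\tau ^2\) mentioned above, and the intercept is assumed to be included as a constant column of X.

```python
import numpy as np

def conditional_posterior(X, y, tau2, alpha_sigma=0.5, beta_sigma=0.5):
    """Normal-inverse-gamma posterior of (w, sigma^2) for model (24), conditional on tau^2.

    Prior: w | sigma^2, tau^2 ~ N(0, tau^2 sigma^2 I),  sigma^2 ~ Inv-Gamma(alpha, beta).
    Returns mu_n, V_n, alpha_n, beta_n such that
      w | sigma^2, D ~ N(mu_n, sigma^2 V_n)  and  sigma^2 | D ~ Inv-Gamma(alpha_n, beta_n).
    """
    n, p = X.shape
    Lambda_n = X.T @ X + np.eye(p) / tau2        # posterior precision of w (up to 1/sigma^2)
    V_n = np.linalg.inv(Lambda_n)
    mu_n = V_n @ X.T @ y                         # posterior mean of w
    alpha_n = alpha_sigma + 0.5 * n
    beta_n = beta_sigma + 0.5 * (y @ y - mu_n @ Lambda_n @ mu_n)
    return mu_n, V_n, alpha_n, beta_n
```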

Since we are considering a variable selection problem, the submodels have different numbers of input variables and therefore different dimensionalities for \( x \) and \( w \). For notational convenience, the binary vector \( \gamma = (\gamma ^0,\gamma ^1,\dots ,\gamma ^p)\) denoting which of the variables are included in the model is omitted in the above formulas. Both in (24) and (25) the model specification is the same for each submodel \( \gamma \); only the dimensionality of \( x \) and \( w \) changes. The reference model \(M_*\) is constructed as the BMA (20) over the submodels using reversible jump MCMC (RJMCMC) (Green 1995), which corresponds to a spike-and-slab prior for the full model. For an additional illustration using a hierarchical shrinkage prior, see Appendix 2. For the model space we use the prior

$$\begin{aligned} \begin{aligned}&\gamma ^j \,|\, \pi \sim \mathrm {Ber}{\left( \pi \right) } , \quad j=1,\dots ,p \,, \\&\pi \sim \mathrm {Beta}{\left( a, b\right) } . \end{aligned} \end{aligned}$$
(26)

Here the parameters a and b adjust the prior beliefs about the number of included variables. We set \(\gamma ^0 = 1\), that is, the intercept term \(w^0\) is included in all the submodels. The exact values used for a and b will also be given together with the dataset descriptions in Sects. 4.2 and 4.3.

4.2 Simulated data

We first introduce a simulated variable selection experiment which illustrates a number of important concepts and the main differences between the different methods. The data is distributed as follows

$$\begin{aligned} \begin{aligned}&x \sim \mathrm {N} {\left( 0, R \right) } , \quad R \in \mathbb {R}^{p\times p}, \\&y \,|\, x \sim \mathrm {N} {\left( w ^\mathsf {T} x , \sigma ^2\right) } , \quad \sigma ^2 = 1. \end{aligned} \end{aligned}$$
(27)

We set the total number of variables to \(p=100\). The variables are divided into groups of five variables. Each variable \(x^j\) has a zero mean and unit variance and is correlated with other variables in the same group with coefficient \(\rho \) but uncorrelated with variables in the other groups (the correlation matrix \( R \) is block diagonal). The variables in the first three groups have weights \((w^{1:5},w^{6:10},w^{11:15})=(\xi ,\, 0.5\,\xi ,\, 0.25\,\xi )\) while the rest of the variables have zero weight. Thus there are 15 relevant and 85 irrelevant variables in the data. The constant \(\xi \) adjusts the signal-to-noise ratio of the data. To get comparable results for different levels of correlation \(\rho \), we set \(\xi \) so that \(\sigma ^2 / \mathrm {Var} {\left[ y\right] } = 0.3\). For \(\rho =0, 0.5, 0.9\) this is satisfied by setting approximately \(\xi = 0.59, 0.34, 0.28\), respectively.
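For reference, here is a sketch of how one dataset from (27) with the block-diagonal correlation structure could be generated (the \(\xi \) value for \(\rho =0.5\) is the one quoted above):

```python
import numpy as np

def make_data(n, rho, xi, p=100, group_size=5, seed=0):
    """Generate one dataset from the simulation model (27)."""
    rng = np.random.default_rng(seed)
    # block-diagonal correlation matrix: correlation rho within groups of five variables
    block = np.full((group_size, group_size), rho) + (1 - rho) * np.eye(group_size)
    R = np.kron(np.eye(p // group_size), block)
    # weights: the first three groups get (xi, 0.5*xi, 0.25*xi), the remaining variables zero
    w = np.zeros(p)
    w[0:5], w[5:10], w[10:15] = xi, 0.5 * xi, 0.25 * xi
    X = rng.multivariate_normal(np.zeros(p), R, size=n)
    y = X @ w + rng.normal(0.0, 1.0, size=n)   # noise variance sigma^2 = 1
    return X, y

X, y = make_data(n=200, rho=0.5, xi=0.34)
```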

Table 2 Compared model selection methods for the experiments

The experiments were carried out by varying the training set size \(n = 100,200,400\) and the correlation coefficient \(\rho =0,0.5,0.9\). We used the regression model (24) with prior parameters \(\alpha _\tau = \beta _\tau = \alpha _\sigma = \beta _\sigma = 0.5\); the posterior inference did not seem to be sensitive to these choices. As the reference model \(M_*\), we used the BMA (20) over the different input combinations with prior \(a=1\), \(b = 10\) for the number of inputs (26). For each combination of \((n,\rho )\), we performed the variable selection with each method listed in Table 2 for 50 different data realizations. As a search heuristic, we used the standard forward search, also known as stepwise regression: starting from the empty model, at each step we add the variable that increases the utility estimate (CV, WAIC, DIC, etc.) the most. The Median and MAP models were estimated from the RJMCMC samples that were drawn to form the BMA (as discussed in Sect. 4.1).

The found models were then tested on an independent test set of size \( \tilde{n} = 1000\). As a proxy for the generalization utility (2), we use the mean log predictive density (MLPD) on the test set

$$\begin{aligned} \text {MLPD} (M) = \frac{1}{ \tilde{n} } \sum _{i=1}^{ \tilde{n} } \log p( \tilde{y} _i \,|\, \tilde{x} _i, D, M). \end{aligned}$$
(28)

To reduce variance over the different data realizations and to better compare the relative performance of the different methods, we report the utilities of the selected submodels M with respect to the gold standard BMA solution \(M_*\)

$$\begin{aligned} \Delta \text {MLPD} (M) = \text {MLPD} (M) - \text {MLPD} (M_*). \end{aligned}$$
(29)

On this relative scale, zero indicates the same predictive performance as the BMA, negative values worse performance, and positive values better performance. A motivation for looking at the relative performance (29) is that, as we shall see shortly, selection typically incurs a loss in predictive accuracy, and we want to assess which of the selection methods are able to find the simplest models with performance close to the BMA.

Fig. 2

Simulated data Average number of variables chosen (left column) and the average test utilities of the chosen models (right column) with 95 % credible intervals for the different training set sizes n. The utilities are shown with respect to the BMA (29) and the dotted lines denote the performance of the empty model (intercept term only). The colours denote the correlation level between the variables (see the legend). The true number of relevant variables is 15 (grey line in left column plots) and the results are averaged over 50 data realizations. See Table 2 for the used methods

Fig. 3

Simulated data Average forward search paths for some of the selection methods for different training set sizes n when \(\rho =0.5\). Red shows the CV utility (tenfold) and black the test utility with respect to the BMA (29) after sorting the variables, as a function of number of variables selected averaged over the 50 different data realizations. The difference between these two curves illustrates the selection induced bias. The dotted vertical lines denote the average number of variables chosen with each of the methods (see Table 2)

Figure 2 shows the average number of selected variables and the test utilities of the selected models relative to the BMA in each data setting. A striking first observation is that none of the methods finds a model with better predictive performance than the BMA. From the predictive point of view, model averaging generally yields the best results in expectation, and one should not expect to do better by selection. This result is in perfect accordance with what is known about the good performance of the BMA (Hoeting et al. 1999; Raftery and Zheng 2003). Thus, the primary motivation for selection should be to simplify the model without substantially compromising the predictive accuracy, rather than to improve on the predictions obtained by taking the model uncertainty into account.

Second, for the smallest dataset size many of the methods perform poorly and choose models with bad predictive performance, comparable to or even worse than the model with no variables at all (the dotted lines). This holds for CV-10, WAIC, DIC, L2, L2-CV, and L2-k, and the conclusion covers all the levels of correlation between the variables (blue, red and green circles), although high dependency between the variables somewhat improves the results. The observed behaviour is due to overfitting in the selection process (as we show below): due to scarce data, the high variance in the utility estimates leads to selecting overfitted models, as discussed in Sect. 3. These methods perform reasonably only for the largest dataset size \(n=400\). MAP, Median, BMA-ref, and BMA-proj perform significantly better, choosing smaller models with predictive ability closer to the BMA. A closer inspection reveals that, out of these four, BMA-proj performs best in terms of predictive ability, especially for the smallest dataset size \(n=100\), but its better predictive accuracy is partially due to selecting more variables than the other three methods. Note also that for BMA-proj the predictions are computed using the projected samples (Eq. (16)), whereas for all the other methods the parameter estimation is done by fitting the submodels to the data. Later in this section we show that the parameter estimation using the projection can play a considerable role in achieving improved predictive accuracy for the submodels, and that the good performance of BMA-proj is not simply due to a superior ordering of the variables; in fact, MPP may perform even better in this respect (see the discussion related to Figs. 5 and 6).

Fig. 4

Simulated data Variability in the predictive performance of the found submodels with respect to the BMA (29) along the forward search path as a function of number of variables selected for the same methods as in Fig. 3 for different training set sizes n when \(\rho =0.5\). The grey lines show the test utilities for the different data realizations and the black line denotes the average (the black lines are the same as in Fig. 3). The dotted vertical lines denote the average number of variables chosen

Fig. 5

Simulated data Proportion of relevant (vertical-axis) versus proportion of irrelevant variables chosen (horizontal axis) for the different training set sizes n. The data had 100 variables in total with 15 relevant and 85 irrelevant variables, relevant being defined as a variable that was used to generate the output y. The colours denote the correlation level between the variables (see the legend). The curves are averaged over the 50 data realizations

Fig. 6

Simulated data The average test utility with respect to the BMA (29) as a function of number of variables selected when the submodel parameters are learned by projection from the BMA (solid line) and by standard fitting to the data (dotted line). The projection improves the performance of the submodels regardless of whether the variables are sorted by their marginal posterior probabilities (top row) or by a forward search minimizing the discrepancy to the BMA (bottom row)

To get more insight into the problem, let us examine more closely how the predictive performance of the submodels changes as variables are selected. Figure 3 shows the CV and test utilities after sorting the variables, as a function of the number of variables selected along the forward search path, when \(\rho =0.5\). The CV utility (tenfold) is computed within the data used for selection (n points), and the test utility on independent data (note that computing the CV curve for BMA-proj requires cross-validating the BMA and performing the projection onto the submodels separately for each fold). The search paths for CV-10 (top row) demonstrate the overfitting in model selection: starting from the empty model and adding variables one at a time, one finds models that have high CV utility but much worse test utility. In other words, the performance of the models on the search path is dramatically overestimated, and the gap between the two curves represents the selection induced bias; after the selection (sorting of the variables), the CV utility is an optimistic estimate for the selected models. Note, however, that for the empty model and the model with all the variables the CV utility and the test utility are on average almost the same, because these models do not involve any selection. The overfitting in the selection process decreases when the size of the training set grows, because the variance of the error term in decomposition (23) becomes smaller, but the effect is still visible for \(n=400\). The behaviour is very similar for WAIC, DIC, L2, L2-CV and L2-k (the results for the last three are omitted to save space).

Ordering the variables according to their marginal posterior probabilities (MPP) and selecting the Median model works much better than CV-10, leading to the selection of smaller models with good predictive performance. However, even better results are obtained by using the reference model approach, especially the projection (BMA-proj). The results clearly show that the projection approach is much less vulnerable to overfitting than CV, WAIC and DIC, even though the CV utility is still a biased estimate of the true predictive ability of the chosen models. Even for the smallest dataset size, the projection is able to find models with predictive ability very close to the BMA with about 10–15 variables on average. Moreover, the projection has an inherent advantage over the other reasonably well performing methods (like MPP/Median): as more variables are added, the submodels get ever closer to the reference model (BMA), thus avoiding the dip in predictive accuracy apparent with the other methods at around 10 variables. This is simply because the submodels are constructed to be similar to the model averaging solution, which yields the best results (this point is discussed further below; see the discussion related to Fig. 6).

Figure 4 shows the variability in the performance of the selected models for the same selection methods as in Fig. 3. The grey lines denote the test utilities of the selected models as a function of the number of selected variables for the different data realizations, and the black line denotes the average (same as in Fig. 3). For small training set sizes the variability in the predictive performance of the selected submodels is very high for CV-10, WAIC, DIC, and MPP. The reference model approach, especially the projection, reduces the variability substantially, finding sparse submodels with predictive performance close to the BMA in all the data realizations. This is another property that makes the projection approach appealing. Figure 4 further emphasizes how difficult it is to improve over the BMA in predictive terms; most of the time the model averaging yields better predictions than any of the found submodels. Moreover, even when better submodels are found (the cases where the grey lines exceed the zero level), the difference in predictive performance is relatively small.

Although our main focus is on the predictive ability of the chosen models, we also studied how well the different methods choose the truly relevant variables over the irrelevant ones. Here by “relevant” we mean the 15 variables that were used to generate the output y, even though it may not be completely clear how relevance should be defined when there are correlations between the variables (for example, should we select one or both of two correlated variables that both correlate with the output?). Figure 5 shows the proportion of relevant variables chosen (vertical axis) versus the proportion of irrelevant variables chosen (horizontal axis) on average (the larger the area under the curve, the better). In this respect, ordering the variables according to their marginal probabilities works best, slightly better than the projection, while the other methods perform somewhat worse. Interestingly, although the projection does not necessarily order the variables any better than the marginal posterior probability order, the predictive ability of the projected submodels is on average better and varies less, as Fig. 4 demonstrates.

To explain this behaviour, we performed one more experiment and studied the difference between learning the submodel parameters by projection from the BMA and fitting the submodels to the data. We performed this analysis both when the selection was done by the marginal probabilities and when it was done by forward search minimizing the discrepancy to the BMA, see Fig. 6. The results show that constructing the submodels by projection improves the results regardless of which of the two methods is used to sort the variables, and in this example there is little difference in the final results as long as the projection is used to construct the submodels.

This effect can be explained by the transmission of information from the reference model. Recall that the BMA corresponds to setting the sparsifying spike-and-slab prior for the full model. Because this prior information is transmitted also to the submodels in the projection, it is natural that the submodels benefit from it [compared to the Gaussian prior in (24)], especially when some irrelevant variables have been included. Furthermore, the noise level is also learned from the full model [see Eq. (31)], which reduces overfitting of the submodels, as the full model best represents the uncertainties related to the problem and best captures the correct noise level. A more detailed analysis of these effects would be useful, but we do not pursue it further in this paper and leave it for future research.
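As an illustration of these two mechanisms (the fit and the noise level both inherited from the reference model), the following is a minimal sketch of a draw-by-draw projection for the Gaussian linear model: each reference fit is projected onto the column space of the submodel by least squares, and the part of the fit the submodel cannot express is absorbed into the submodel noise. The names are ours, and the exact formulas used in the paper should be read from the referenced equations.

```python
import numpy as np

def project_draws(X, X_sub, beta_draws, sigma2_draws):
    """Project posterior draws of the full Gaussian linear model onto a
    submodel containing only the columns in X_sub.

    X            : (n, p)  full design matrix
    X_sub        : (n, q)  design matrix of the selected variables
    beta_draws   : (S, p)  posterior draws of the full-model weights
    sigma2_draws : (S,)    posterior draws of the full-model noise variance
    """
    n = X.shape[0]
    mu = beta_draws @ X.T                            # (S, n) reference model fits
    G = np.linalg.solve(X_sub.T @ X_sub, X_sub.T)    # least-squares projector, (q, n)
    beta_proj = mu @ G.T                             # (S, q) projected weights
    resid = mu - beta_proj @ X_sub.T                 # part of the fit lost in the submodel
    # The projected noise level is never below the reference model's noise level:
    # the discrepancy between the fits is absorbed into it.
    sigma2_proj = sigma2_draws + np.sum(resid**2, axis=1) / n
    return beta_proj, sigma2_proj
```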

To summarize, it clearly appears that the full model averaging solution produces the best predictive results, and the projection appears to be the most robust method for simplifying the full model without losing much predictive accuracy. However, the number of variables actually selected depends on the arbitrary 95 % explanatory power rule, and although this rule seems to work quite well for the examples above, it does not always lead to optimal results (see the real world examples in the next section). We discuss this problem and a possible solution further in Sect. 4.4.

Table 3 Summary of the real world datasets and the priors used

4.3 Real world datasets

We also studied the performance of the different methods on several real world datasets. Five publicly available\(^{1}\) datasets were used; they are summarized in Table 3. One of the datasets deals with regression and the rest with binary classification. As preprocessing, we normalized all the input variables to have zero mean and unit variance. For the Crime dataset we also log-transformed the original non-negative target variable (crimes per population) to get a real-valued and more Gaussian output. From this dataset we also removed some input variables and observations with missing values (the p and n given in Table 3 are after removing the missing values). We will not discuss the datasets in detail but refer to the sources for more information.

For the regression problem we applied the Gaussian regression model (24) and for the classification problems the probit model (25). The prior parameters for each case are listed in Table 3. For all the problems we used relatively uninformative priors for the input weights (and the measurement noise). As the reference model, we again used the BMA solution estimated using reversible jump MCMC. For the first three problems (Crime, Ionosphere, Sonar) we used a very uninformative prior for the number of input variables (i.e., \(a=b=2\)) because there was basically no prior information about the sparsity level. For the last two datasets (Ovarian and Colon), for which \(p \gg n\), we had to use priors that favor models with only a few variables to avoid overfitting. Figure 7 shows the estimated posterior probabilities for the different numbers of variables (top row) and the marginal posterior probabilities of the different inputs (bottom row) for all the datasets. Although these kinds of plots may give some idea about the variable relevancies, it is still often difficult to decide which variables should be included in the model and what the effect on the predictive performance would be.

Fig. 7 Real datasets: prior and posterior probabilities for the different numbers of variables (top row) and marginal posterior probabilities for the different variables, sorted from the most probable to the least probable (bottom row). The posterior probabilities are given with 95 % credible intervals estimated from the variability between different RJMCMC chains. The results are calculated using the full datasets (not leaving any data out for testing). For the Ovarian and Colon datasets the plots are truncated at 30 variables

We then performed the variable selection using the methods in Table 2, except that the ones based on the squared error (L2, L2-CV, L2-k) were not used for the classification problems. For the Ovarian and Colon datasets, due to the large number of variables, we also replaced tenfold-CV by importance sampling LOO-CV (IS-LOO-CV) to reduce the computation time. For these two datasets we also performed the forward search only up to 10 variables.

To estimate the predictive ability of the chosen models, we repeated the selection several times, each time leaving part of the data out and then measuring the out-of-sample performance on these observations. The Crime dataset was sufficiently large (\(n=1992\)) to be split into training and test sets. We repeated the selection for 50 random splits, each time using \(n=100,200,400\) points for training and the rest for testing. This also allowed us to study the effect of the training set size. For Ionosphere and Sonar we used tenfold cross-validation, that is, the selection was performed 10 times, each time using 9/10 of the data and estimating the out-of-sample performance with the remaining 1/10. For the Ovarian and Colon datasets, due to the few observations, we used leave-one-out cross-validation for the performance evaluation (the selection was performed n times, each time leaving one point out for testing). Again, we report the results as the mean log predictive density on the independent data with respect to the BMA (29).
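For reference, the sketch below shows how this relative utility could be computed for the Gaussian regression case: the predictive density of each model is averaged over its posterior draws, the log densities are averaged over the held-out points, and the submodel value is reported relative to the BMA. The names are ours; for the classification problems the Gaussian density would be replaced by the Bernoulli predictive probabilities of the probit model.

```python
import numpy as np
from scipy.stats import norm

def mlpd(y_test, mu_draws, sigma2_draws):
    """Mean log predictive density on held-out data (Gaussian observation model).

    y_test       : (n_test,)   held-out targets
    mu_draws     : (S, n_test) predictive means, one row per posterior draw
    sigma2_draws : (S,)        noise variance per draw
    """
    sd = np.sqrt(np.asarray(sigma2_draws))[:, None]   # (S, 1), broadcasts over test points
    dens = norm.pdf(y_test, loc=mu_draws, scale=sd)   # (S, n_test)
    return float(np.mean(np.log(dens.mean(axis=0))))  # average over draws, then over points

# Utility relative to the BMA, as reported in the figures:
# delta_mlpd = mlpd(y_test, mu_sub, s2_sub) - mlpd(y_test, mu_bma, s2_bma)
```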

Fig. 8 Real datasets: the number of selected variables (left column) and the estimated out-of-sample utilities of the selected models (right column) on average and with 95 % credible intervals for the different datasets. The out-of-sample utilities are estimated using hold-out samples not used for selection (see text) and are shown with respect to the BMA (29). The dotted line denotes the performance of the empty model (the intercept term only). For the Ovarian and Colon datasets the search was performed only up to 10 variables although both of these datasets contain many more variables

Fig. 9 Classification datasets: CV (red) and out-of-sample (black) utilities on average for the selected submodels with respect to the BMA (29) along the forward search path, as a function of the number of variables selected. The CV utilities (tenfold) are computed within the same data used for selection, and the out-of-sample utilities are estimated on hold-out samples not used for selection (see text) and are given with 95 % credible intervals. The dotted vertical lines denote the average number of variables chosen. The CV optimization (top row) is carried out using tenfold-CV for Ionosphere and Sonar, and IS-LOO-CV for Ovarian and Colon

Fig. 10 Crime dataset: variability in the test utility of the selected submodels with respect to the BMA (29) along the forward search path, as a function of the number of variables selected. The selection is performed using \(n=100,200,400\) points and the test utility is computed using the remaining data. The grey lines show the test utilities for the 50 different splits into training and test sets and the black line denotes the average. The dotted vertical lines denote the average number of variables chosen

Figure 8 summarizes the results. The left column shows the average number of variables selected and the right column the estimated out-of-sample utilities of the chosen models (the out-of-sample utilities are estimated using hold-out samples not used for selection, as explained above). The results are qualitatively very similar to those obtained in the simulated experiments (Sect. 4.2). Again we conclude that the BMA solution gives better predictions than any of the selection methods when measured on independent data. Moreover, the results demonstrate again that model selection using CV, WAIC, DIC, L2, L2-CV, or L2-k is liable to overfitting, especially when the dataset is small compared to the number of variables. Overall, the MAP and Median models tend to perform better but show undesirable performance on some of the datasets. In particular, the Median model performs badly for the Ovarian and Colon datasets, where almost all the variables have marginal posterior probability less than 0.5 (depending on the split into training and validation sets). The projection (BMA-proj) shows the most robust performance, choosing models with predictive ability close to the BMA for all the datasets.

Figure 9 shows the CV (red) and out-of-sample (black) utilities of the chosen models as a function of the number of chosen variables for the classification problems. The CV utilities (tenfold) are computed after sorting the variables within the same data used for selection, and the out-of-sample utilities are estimated on hold-out samples not used for selection, as explained earlier. This figure is analogous to Fig. 3, showing the magnitude of the selection induced bias (the difference between the red and black lines). Especially for the last three datasets (Sonar, Ovarian, Colon) the selection induced bias is considerable for all the methods, which emphasizes the importance of validating the variable search process in order to avoid bias in the performance evaluation of the found models. Overall, the projection (BMA-proj) appears to find the models with the best out-of-sample accuracy for a given model complexity, although for the Ovarian dataset choosing about the five most probable inputs (MPP) would perform even better. Moreover, the uncertainty in the out-of-sample performance for a given number of variables is also the smallest for the projection across all the datasets.

The same applies to the Crime dataset, see Fig. 10. For any given number of variables the projection is able to find models with predictive ability closest to the BMA and also with the least variability, the difference from the other methods being largest when the dataset size is small. For the Crime data, some additional results using a hierarchical shrinkage prior for the full model are presented in Appendix 2.2.

4.4 On choosing the final model size

Although the search paths for the projection method (BMA-proj) seem overall better than those of the other methods (Figs. 3, 4, 9, 10), the results also demonstrate the difficulty of deciding the final model size; for instance, for the Ionosphere and Sonar datasets the somewhat arbitrary 95 % explanatory power rule chooses rather too few variables, but for Ovarian and Colon unnecessarily many (an out-of-sample utility close to the BMA can be obtained with fewer variables). The same applies to the Crime dataset with the smallest number of training points (\(n=100\)). As discussed in Sect. 2.4, a natural idea would be to decide the final model size based on the estimated out-of-sample utility (the black lines in Figs. 9, 10), which can be done by cross-validation outside the search process. This raises the question: does this induce a significant amount of bias in the utility estimate for the finally selected model?

Fig. 11 Real datasets: vertical axis shows the final expected utility on independent data with respect to the BMA (29) for the selected submodels when the search is done using the projection (BMA-proj), selecting the smallest number of variables m satisfying \( \Pr \left[ \Delta \text{MLPD}(m) \ge U\right] \ge \alpha \), where \(\Delta \text{MLPD}(m)\) denotes the out-of-sample utility for m variables estimated using CV (tenfold) outside the search process (same as the black lines in Fig. 9). The final utility is estimated using another layer of validation (see text). The dotted line denotes the utility of the empty model. When \(\alpha =0.95\), the final utility remains equal to or larger than U (the dots stay above the diagonal line), indicating that the applied selection rule does not induce bias in the performance evaluation for the finally selected model

To address this question, we performed one more experiment on the real world datasets by adding another layer of cross-validation to assess the performance of the finally selected models on independent data. In other words, the variable search was performed using the projection, the inner layer of cross-validation (tenfold) was used to decide the model size, and the outer layer was used to measure the performance of the finally selected models (tenfold for Ionosphere and Sonar, LOO-CV for Ovarian and Colon, and hold-out with different training set sizes for Crime). As the rule for deciding the model size, we selected the smallest number of variables satisfying

$$\begin{aligned} \Pr \left[ \Delta \text{MLPD}(m) \ge U \right] \ge \alpha \end{aligned}$$

for different thresholds U and \(\alpha \). Here \(\Delta \text{MLPD}(m)\) denotes the out-of-sample utility for m variables estimated in the inner validation. This probability is the same as (18); the inequality is merely arranged differently. The threshold U denotes how much predictive accuracy one is willing to sacrifice in order to reduce the number of variables, and \(\alpha \) denotes how certain we want to be about not going below U. We estimate the above probability using the Bayesian bootstrap.
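This rule can be implemented in a few lines once the pointwise log predictive density differences from the inner cross-validation are available; the sketch below (with hypothetical array and function names) draws Dirichlet(1, ..., 1) weights over the validation points to form Bayesian bootstrap replicates of \(\Delta \text{MLPD}(m)\) and returns the smallest model size satisfying the rule.

```python
import numpy as np

def smallest_size(lpd_diff_by_size, U, alpha, n_boot=4000, seed=0):
    """Smallest m with Pr[Delta MLPD(m) >= U] >= alpha (Bayesian bootstrap).

    lpd_diff_by_size : (p + 1, n) array; row m holds the pointwise log
                       predictive density differences (submodel with m
                       variables minus reference) for the n validation points.
    """
    rng = np.random.default_rng(seed)
    n = lpd_diff_by_size.shape[1]
    w = rng.dirichlet(np.ones(n), size=n_boot)   # (n_boot, n) Bayesian bootstrap weights
    for m, diffs in enumerate(lpd_diff_by_size):
        boot = w @ diffs                         # bootstrap replicates of Delta MLPD(m)
        if np.mean(boot >= U) >= alpha:
            return m
    return lpd_diff_by_size.shape[0] - 1         # fall back to all variables
```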

Figure 11 shows the final expected utility on independent data for different values of U and \(\alpha \) for the different datasets. The results show that for a large enough credible level, \(\alpha =0.95\), the applied selection rule appears to be safe in the sense that the final expected utility always remains above the level U (the dots stay above the diagonal line). This means that there is no substantial amount of selection induced bias at this stage, and the second level of validation is not necessarily needed. However, this does not always hold for smaller values of \(\alpha \) (\(\alpha =0.5\) and 0.05).

We can make these results more intuitive by looking at an example. Consider the projection search path for the Sonar dataset in Fig. 9 (bottom row, second column). Assume now that \(U = -0.01\), meaning that we are willing to lose 0.01 in the MLPD compared to the BMA, which is about 5 % of the difference between the BMA and the empty model. Since the grey bars denote the central 95 % credible intervals, setting \(\alpha =0.975\) would correspond to choosing the smallest model for which the lower limit of the grey bar falls above \(U=-0.01\) (because then the performance of the submodel exceeds this level with probability 0.975). This would lead to choosing about 35 variables, and we could be quite confident that the final expected utility would be within 0.01 of the BMA. On the other hand, setting \(\alpha =0.025\) corresponds to choosing the smallest model for which the upper limit of the grey bar falls above \(U=-0.01\) (because then the performance of the submodel exceeds this level with probability only 0.025). This would lead to choosing only three variables, but in this case we cannot confidently expect the final expected utility to be within 0.01 of the BMA, and in fact it would be lower with high probability.

Following the example above, one can get an idea of the effect of \(\alpha \) and U on the final model size and the final expected utility. Generally speaking, U determines how much we are willing to compromise the predictive accuracy and \(\alpha \) determines how confident we want to be about the final utility estimate. It is application specific whether it is more important to have high predictive accuracy or to reduce the number of variables at the cost of losing predictive accuracy, but a reasonable recommendation might be to use \(\alpha \ge 0.95\) and to set U to 5 % of the difference between the reference model and the empty model. Based on Fig. 11, this combination would appear to give predictive accuracy close to the BMA, and based on Fig. 9 it would yield a quite effective reduction in the number of variables (choosing about 20 variables for Ionosphere, 35 for Sonar, and 5–10 variables for Ovarian and Colon by visual inspection).

As a final remark, one might wonder why cross-validation works well when used only to decide the model size but poorly when used directly to optimize the variable combination (as depicted, e.g., in Fig. 3). As discussed in Sect. 3, the amount of overfitting in the selection and the consequent selection bias depend on the variance of the utility estimate for a given model (over the data realizations) and on the number of candidate models being compared. For the latter reason, the overfitting is considerably smaller when cross-validation is used only to decide the model size by comparing only \(p+1\) models (given the variable ordering), in contrast to selecting the variable combination among \(O(p^2)\) models in the forward search phase. Moreover, the utilities of two consecutive models on the search path are likely to be highly correlated, which further reduces the freedom of selection and therefore the overfitting at this point. For the same reason cross-validation may yield reasonable results when used to choose a single hyperparameter among a small set of values, in which case essentially only a few different models are being compared.
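The effect of the number of compared models can be illustrated with a toy simulation (not from the paper): if many models with identical true utility are compared using noisy utility estimates, the estimate for the apparently best model is optimistically biased, and the bias grows with the number of comparisons.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.1   # assumed standard deviation of the utility estimate (e.g. CV noise)
for n_models in (1, 10, 100, 1000):
    # All candidate models have true utility 0; we pick the one with the
    # highest noisy estimate and look at how much that estimate overshoots.
    estimates = rng.normal(0.0, sigma, size=(10000, n_models))
    bias = estimates.max(axis=1).mean()
    print(f"{n_models:5d} models compared -> selection-induced bias ~ {bias:.3f}")
```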

Based on the results presented in this section, despite the increased computational effort, we believe the use of cross-validation on top of the variable search is highly advisable, both for choosing the final model size and for giving a nearly unbiased estimate of the out-of-sample performance of the selected model. We emphasize the importance of this regardless of the method used for searching the variables, but in general we recommend using the projection given its overall superior performance in our experiments.

4.5 Computational considerations

We conclude with a remark on the computational aspects. The results in Sects. 4.2 and 4.3 emphasized that, from the predictive point of view, model averaging over the different submodels often yields better results than the selection of any single model. Forming the model averaging solution over the variable combinations may be computationally challenging, but it is quite feasible for problems with up to a few thousand variables with, for instance, a straightforward implementation of the RJMCMC algorithm with simple proposal distributions. Scalability up to a million variables can be obtained with more sophisticated and efficient proposals (Peltola et al. 2012b). The computations could also be sped up by using a hierarchical shrinkage prior such as the horseshoe (see Appendix 2.1) for the regression weights in the full model, instead of the spike-and-slab prior (which is equivalent to the model averaging over the variable combinations).

After forming the full reference model (either the BMA or using some alternative prior), the subsequent computations needed for the actual selection are typically less burdensome. Computing the MAP and the Median models from the MCMC samples over the model space is easy, but these cannot be computed with alternative priors. Constructing the submodels by projection and performing a forward search through the variable combinations is somewhat more laborious, but it still usually takes considerably less time than forming the reference model. The projection approach can also be used with any prior for the full model, as the posterior samples are all that is needed (see Appendix 2).

The methods that do not rely on the construction of the full reference model (CV, WAIC, DIC, and the \(L^2\)-variants) are typically computationally easier, as they avoid the burden of forming the reference model in the first place. However, if one has to go through a lot of models in the search, the procedure is fast only if the fitting of the submodels is fast. This is the case, for instance, for the Gaussian linear model, for which the posterior computations can be carried out analytically, but in other problems, such as classification, sampling the parameters separately for each submodel may be very expensive, and one may be forced to use faster approximations (such as the Laplace method or expectation propagation). On the other hand, it must be kept in mind that the possible computational savings from avoiding the construction of the full model may come at the risk of overfitting and reduced predictive accuracy, as our results show.

5 Conclusions

In this paper we have briefly reviewed many of the proposed methods for Bayesian predictive model selection and illustrated their use and performance in practical variable selection problems for regression and binary classification, where the goal is to select a minimal subset of input variables with good predictive accuracy. The experiments were carried out using both simulated and several real world datasets.

The numerical experiments show that overfitting in the selection is a potential problem that may hinder model selection considerably. This is the case especially when the dataset is small (high variance in the utility estimates) and the number of models under comparison is large (large number of variables). Methods especially vulnerable to this type of overfitting are CV, WAIC, DIC, and other methods that rely on data reuse and therefore have relatively high variance in their utility estimates. From the predictive point of view, better results are generally obtained by accounting for the model uncertainty and forming the full encompassing (reference) model with all the variables and the best possible prior information on the sparsity level. Our results showed that Bayesian model averaging (BMA) over the candidate models often yields the best results in expectation, and one should not expect to do better by selection. This agrees with what is known about the good performance of the BMA (Hoeting et al. 1999; Raftery and Zheng 2003).

If the full model is too complex or the cost of observing all the variables is too high, the model can be simplified most robustly by the projection method, which is considerably less vulnerable to overfitting in the selection. The advantage of the projection approach comes from taking into account the uncertainty when forming the full encompassing model and then finding a simpler model that gives similar answers to the full model. Overall, the projection framework also outperforms the selection of the most probable variables or variable combination (the Median and MAP models), being best able to retain the predictive ability of the full model while effectively reducing the model complexity. The results also demonstrated that the projection does not only outperform the other methods on average, but its variability over the different data realizations is also considerably smaller. In addition, the numerical experiments showed that constructing the submodels by projection from the full model may improve the predictive accuracy even when some other strategy, such as the marginal probabilities, is used to rank the variables.

Despite its advantages, the projection method has the inherent challenge of forming the reference model in the first place. There is no automated way of coming up with a good reference model, which emphasizes the importance of model criticism. However, as already stressed, incorporating the best possible prior information into the full encompassing model is formally the correct Bayesian way of dealing with the model uncertainty, and it often also seems to provide the best predictions in practice. In this study we used the model averaging over the variable combinations as the reference model, but similar results are obtained also with a hierarchical shrinkage prior (see Appendix 2).

Another issue is that, even though the projection method seems the most robust way of searching for good submodels, the estimated discrepancy between the reference model and a submodel is in general an unreliable indicator of the predictive performance of the submodel. In variable selection, this property makes it problematic to decide how many variables should be selected to obtain predictive performance close to the reference model, even though minimizing the discrepancy from the reference model typically finds a good search path through the model space.

However, the results show that this problem can be solved by using cross-validation outside the search process, as this allows studying the tradeoff between the number of included variables and the predictive performance, which we believe is highly informative. Moreover, we demonstrated that selecting the number of variables this way does not produce considerable overfitting or selection induced bias in the utility estimate for the finally selected model, because the selection is conditioned on a greatly reduced number of models (see Sect. 4.4). While this still leaves the user with the responsibility of deciding the final model size, we emphasize that this decision depends on the application and the costs of the inputs. Without any costs for the variables, we would simply recommend using them all and carrying out the full Bayesian inference on the model.