Computation for intrinsic variable selection in normal regression models via expected-posterior prior
Abstract
In this paper, we focus on the variable selection problem in normal regression models using the expected-posterior prior methodology. We provide a straightforward MCMC scheme for deriving the posterior distribution, as well as Monte Carlo estimates for computing the marginal likelihood and posterior model probabilities. Additionally, for large model spaces, a model search algorithm based on \(\mathit{MC}^{3}\) is constructed. The proposed methodology is applied to two real-life examples already used in the literature on objective variable selection. In both examples, uncertainty over different training samples is taken into consideration.
Keywords
Bayesian variable selection · Expected-posterior priors · Imaginary data · Intrinsic priors · Jeffreys prior · Objective model selection methods · Normal regression models

1 Introduction
When using improper prior distributions to express prior ignorance about the model parameters, Bayes factors cannot be evaluated because of the presence of unknown normalizing constants. This has prompted the Bayesian community to develop various methodologies to overcome the problem of prior specification in variable selection, including approaches based on Zellner’s (1986) g-priors, amongst others; see Fernandez et al. (2001), Liang et al. (2008), Celeux et al. (2012) and Dellaportas et al. (2012) for some recent advances and comparisons.
One of the proposed approaches is also the intrinsic Bayes factors (IBFs), introduced by Berger and Pericchi (1996). In order to provide a full Bayesian interpretation of IBFs, they have also defined intrinsic prior (IP) distributions. The intrinsic prior methodology has been applied for objective variable selection problems in normal regression models, by Casella and Moreno (2006), Moreno and Girón (2008), Girón et al. (2006) and Casella et al. (2009).
Intrinsic priors are closely related to the expected-posterior prior distributions of Pérez (1998) and Pérez and Berger (2002), which have a nice interpretation based on imaginary training data coming from prior predictive distributions. The expected-posterior priors overcome some of the difficulties that appear in Bayesian model comparison and variable selection when using improper priors, such as the indeterminacy of the Bayes factors, since the unknown normalizing constants cancel out in the marginal likelihood ratios. Moreover, all prior distributions are calculated automatically and carry a notion of compatibility, since they are based on averaging posterior distributions over the same imaginary training data. Another advantage is that these priors take into account the different interpretation of the coefficients in each model. They are also connected not only to the intrinsic priors, but also to Zellner’s (1986) g-priors, which use a specific “imaginary” dataset instead of averaging across a predictive distribution. For a complete and more detailed list of the advantages of the expected-posterior priors, see Pérez (1998) and Pérez and Berger (2002).
In this paper we implement the expected-posterior prior methodology on variable selection problems in normal regression models. We construct a straightforward MCMC scheme for the derivation of the posterior distribution, as well as a Monte Carlo estimate for the computation of the Bayes factors and posterior model probabilities under the intrinsic prior. The proposed methodology is applied to a variety of random training samples and therefore the uncertainty over different training samples is considered.
2 Expected-posterior priors
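For reference, the definition discussed below (the expected-posterior prior of Pérez and Berger 2002, written in the notation of this section) takes the form

```latex
\pi_{\ell}^{E}(\boldsymbol{\beta}_{\ell}, \sigma^{2})
  = \int \pi_{\ell}^{N}\bigl(\boldsymbol{\beta}_{\ell}, \sigma^{2} \mid
        \boldsymbol{y}^{*}, \mathrm{X}_{\ell}^{*}\bigr)\,
    m_{0}^{N}\bigl(\boldsymbol{y}^{*} \mid \mathrm{X}_{0}^{*}\bigr)\,
    d\boldsymbol{y}^{*},
```

where \(\pi_{\ell}^{N}\) is the posterior of the parameters of model \(m_{\ell}\) under the baseline prior given imaginary data \(\boldsymbol{y}^{*}\), and \(m_{0}^{N}\) is the prior predictive distribution of the reference model \(m_{0}\).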
In the above equation, if we use Bayes’ theorem to replace \(\pi_{\ell}^{N} ( \boldsymbol{\beta}_{\ell}, \sigma^{2} | \boldsymbol{y}^{*}, \mathrm{X}_{\ell}^{*})\) by the corresponding likelihood-prior product and write the marginal likelihood \(m_{0}^{N}(\boldsymbol{y}^{*} | \mathrm{X}_{0}^{*})\) as an integral of the likelihood over the prior of the parameters of the reference model, then we end up with the intrinsic prior as defined in Berger and Pericchi (1996).
A question that naturally arises is which model should be selected as the reference model. In order for (1) to coincide with the intrinsic prior, m_{0} must be nested in all models m_{ℓ} under consideration. Therefore, in variable selection problems, a natural choice for the reference model is the constant model.
3 Prior specification
4 Computation of the posterior distribution
- 1. Generate y^{∗} from its full conditional posterior distribution.
- 2. Generate σ^{2} from \(\mathit{IG}( \widetilde{a}_{\ell}^{N}, \widetilde{b}_{\ell}^{N} )\).
- 3. Generate β_{ℓ} from \(N_{d_{\ell}} ( \widetilde{\boldsymbol{\beta}}^{N}, \widetilde{\mathrm{\Sigma}}^{N} \sigma^{2} )\).
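The three steps above form a standard data-augmentation Gibbs sampler. The following sketch illustrates the idea with generic conjugate normal-inverse-gamma full conditionals under a Jeffreys-style baseline prior; the function name and the hyperparameter updates are illustrative stand-ins, not the paper's exact expressions for \(\widetilde{a}_{\ell}^{N}\), \(\widetilde{b}_{\ell}^{N}\), \(\widetilde{\boldsymbol{\beta}}^{N}\) and \(\widetilde{\mathrm{\Sigma}}^{N}\).

```python
import numpy as np

def gibbs_epp(y, X, X_star, n_iter=2000, seed=0):
    """Illustrative data-augmentation Gibbs sampler for a single model.

    Sketch only: the full-conditional hyperparameters below are generic
    conjugate normal-inverse-gamma updates, not the paper's exact ones.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    n_star = X_star.shape[0]
    beta, sigma2 = np.zeros(d), 1.0
    draws = []
    for _ in range(n_iter):
        # 1. generate imaginary data y* from its full conditional
        #    (here: the model's sampling distribution at the design X*)
        y_star = X_star @ beta + rng.normal(0.0, np.sqrt(sigma2), n_star)
        # stack real and imaginary data for the remaining updates
        X_all = np.vstack([X, X_star])
        y_all = np.concatenate([y, y_star])
        V = np.linalg.inv(X_all.T @ X_all)
        b_hat = V @ X_all.T @ y_all
        resid = y_all - X_all @ b_hat
        # 2. generate sigma^2 from its inverse-gamma full conditional
        a_post = 0.5 * (n + n_star - d)
        b_post = 0.5 * resid @ resid
        sigma2 = 1.0 / rng.gamma(a_post, 1.0 / b_post)
        # 3. generate beta from its multivariate-normal full conditional
        beta = rng.multivariate_normal(b_hat, sigma2 * V)
        draws.append((beta, sigma2))
    return draws
```

In practice one would plug in the exact full-conditional hyperparameters of the paper; the structure of the loop is unchanged.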
5 Variable selection computation
In this section we provide two alternative approaches for the evaluation of the models under consideration. In Sect. 5.1 we construct an efficient Monte Carlo scheme for the estimation of the marginal likelihood for any given training sample \(\mathrm{X} ^{*}\), while in Sect. 5.2 we introduce an MCMC algorithm, more appropriate for large model spaces, which directly estimates the posterior model probabilities over all possible training subsamples.
5.1 Monte Carlo estimation of the marginal likelihood
For small model spaces it is easy to estimate the unnormalized marginal likelihoods (11) for all models under consideration using the above sampling scheme. For large model spaces, it is possible to implement an \(\mathit{MC}^{3}\) algorithm (Madigan and York 1995; Kass and Raftery 1995) by estimating (11) for each model that is evaluated for the first time within the iterative scheme. The estimator presented in this section is compatible with the importance sampling estimator of Pérez (1998, Sect. 3.4.1); here we use the conditional posterior predictive distributions which under the baseline prior (2) are multivariate Student distributions and therefore can be calculated directly.
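As a rough illustration of such a Monte Carlo estimator: draw imaginary data \(\boldsymbol{y}^{*}\), evaluate the conditional predictive density of the observed data in closed form (a multivariate Student density under a Jeffreys-style baseline prior), and average. Everything below is a hedged sketch: the stand-in draw of \(\boldsymbol{y}^{*}\) from a unit-variance normal replaces the reference model's predictive, and the hyperparameter expressions are generic rather than those of (2) and (11).

```python
import numpy as np
from math import lgamma, log, pi

def mvt_logpdf(x, mu, S, df):
    """Log-density of a d-variate Student-t_df(mu, S) distribution at x."""
    d = len(x)
    L = np.linalg.cholesky(S)
    z = np.linalg.solve(L, x - mu)
    logdet = 2.0 * np.log(np.diag(L)).sum()
    return (lgamma((df + d) / 2.0) - lgamma(df / 2.0)
            - 0.5 * (d * log(df * pi) + logdet)
            - 0.5 * (df + d) * log(1.0 + z @ z / df))

def log_marginal_mc(y, X, X_star, S=500, seed=0):
    """Monte Carlo estimate of the log (unnormalized) marginal likelihood.

    Sketch: y* is drawn from a unit-variance stand-in for the reference
    model's predictive (an assumption); the conditional predictive of y
    is a multivariate Student density, evaluated directly.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    n_star = X_star.shape[0]
    V_star = np.linalg.inv(X_star.T @ X_star)
    logs = np.empty(S)
    for s in range(S):
        y_star = rng.normal(size=n_star)            # stand-in for an m_0 draw
        b_star = V_star @ X_star.T @ y_star
        r = y_star - X_star @ b_star
        a, b = 0.5 * (n_star - d), 0.5 * r @ r
        # conditional predictive of y given y*: multivariate Student, 2a df
        scale = (b / a) * (np.eye(n) + X @ V_star @ X.T)
        logs[s] = mvt_logpdf(y, X @ b_star, scale, 2.0 * a)
    m = logs.max()
    return m + log(np.mean(np.exp(logs - m)))       # log of the MC average
```

The log-sum-exp step at the end keeps the average numerically stable when the individual predictive densities are very small.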
5.2 Computation of the posterior model weights over different training samples
- For k=1,…,K (training samples):
- 1. Randomly select a submatrix \(\mathrm{X}^{*}\) of \(\mathrm{X}\) with dimension (n^{∗}×d).
- 2. For t=1,…,T (iterations):
  - (a) …
  - (b) For j=1,…,p, propose with probability one a move to model m_{ℓ′} by changing the status of the j-th covariate, and accept the proposed model with probability α=min{1,A}, where \(A = \frac{ m_{\ell'}( \boldsymbol{y} | \mathrm{X}, \mathrm{X}^{*} )\, f(m_{\ell'}) }{ m_{\ell}( \boldsymbol{y} | \mathrm{X}, \mathrm{X}^{*} )\, f(m_{\ell}) }\) and f(m_{ℓ}) is the prior probability of model m_{ℓ}.
- 3. Calculate the posterior weights for each training sample k.
From the above \(\mathit{MC}^{3}\) scheme we can produce summaries of the posterior model weights over the K different training samples. This might be more efficient in large model spaces since we avoid implementing the Monte Carlo computation presented in Sect. 5.1 for each newly visited model. Nevertheless, in such cases, the number of iterations T within each training sample must be increased to ensure that the model space is satisfactorily explored for each training sample.
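For a single training sample, the \(\mathit{MC}^{3}\) loop can be sketched as follows. Here `log_marginal` is a user-supplied estimate of the log marginal likelihood (in the paper, the Monte Carlo estimate of Sect. 5.1, cached per newly visited model), and a uniform prior on the model space is assumed so that the prior ratio cancels; the function name and signature are illustrative.

```python
import numpy as np

def mc3(p, log_marginal, T=1000, seed=0):
    """MC^3 over inclusion indicators gamma in {0,1}^p (one training sample).

    log_marginal(gamma) -> log marginal likelihood of the model whose
    covariates are flagged in gamma; results are cached so each model
    is estimated only once.
    """
    rng = np.random.default_rng(seed)
    gamma = np.zeros(p, dtype=bool)
    cache = {}

    def lm(g):
        key = tuple(bool(v) for v in g)
        if key not in cache:
            cache[key] = log_marginal(g)
        return cache[key]

    visits = {}
    for _ in range(T):
        for j in range(p):               # propose flipping covariate j
            prop = gamma.copy()
            prop[j] = not prop[j]
            # uniform model prior: acceptance ratio is the ML ratio
            if np.log(rng.uniform()) < lm(prop) - lm(gamma):
                gamma = prop
        key = tuple(bool(v) for v in gamma)
        visits[key] = visits.get(key, 0) + 1
    return {k: v / T for k, v in visits.items()}   # posterior model weights
```

Plugging in even a crude stand-in for `log_marginal` (e.g. a BIC-type approximation) recovers the highest-weight model on simulated data, which makes the scheme easy to sanity-check before using the full EPP estimator.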
6 Experimental results
In this section the proposed methodology is illustrated on two real-life examples. In both examples we use a uniform prior on the model space.
6.1 Hald’s data
We consider Hald’s cement data (Montgomery and Peck 1982) to illustrate the proposed approach. This dataset consists of n=13 observations and p=4 covariates and has previously been used by Girón et al. (2006) to illustrate objective variable selection methods. The response variable Y is the heat evolved in a cement mix, and the explanatory variables are the tricalcium aluminate (X_{1}), the tricalcium silicate (X_{2}), the tetracalcium alumino ferrite (X_{3}) and the dicalcium silicate (X_{4}). An important feature of Hald’s cement data is that variables X_{1} and X_{3} and variables X_{2} and X_{4} are highly correlated (\(\mathit{corr}(X_{1},X_{3})=-0.824\) and \(\mathit{corr}(X_{2},X_{4})=-0.975\)).
Summaries of posterior model probabilities (mean, median, SD and 2.5 %/97.5 % percentiles over 100 different training sub-samples) under the expected-posterior prior for the best models, together with median-based posterior odds of the MAP model (m_{1}) vs. m_{j} (j=2,…,5) and the corresponding g-prior results, for Hald’s cement data (Example 6.1)

| m_{j} | Model formula | Mean | Median | SD | 2.5 % | 97.5 % | Median-based PO^{c} | g-Prior^{a} f(m_{j}\|y)^{b} | g-Prior^{a} PO^{c} |
|---|---|---|---|---|---|---|---|---|---|
| 1 | X_{1}+X_{2} | 0.411 | 0.405 | 0.074 | 0.309 | 0.554 | 1.000 | 0.325 | 1.000 |
| 2 | X_{1}+X_{4} | 0.167 | 0.160 | 0.040 | 0.107 | 0.278 | 2.529 | 0.225 | 1.444 |
| 3 | X_{1}+X_{2}+X_{4} | 0.132 | 0.138 | 0.034 | 0.061 | 0.183 | 2.930 | 0.109 | 2.980 |
| 4 | X_{1}+X_{2}+X_{3} | 0.128 | 0.132 | 0.033 | 0.062 | 0.185 | 3.061 | 0.109 | 2.990 |
| 5 | X_{1}+X_{3}+X_{4} | 0.105 | 0.106 | 0.027 | 0.051 | 0.148 | 3.807 | 0.102 | 3.185 |
We also performed the same task using 1000 different training samples, instead of 100; results were almost identical. Furthermore, for illustrative reasons and in order to evaluate the efficiency of our approach, we implemented the proposed \(\mathit{MC}^{3}\) scheme of Sect. 5.2 for 1000 iterations, considering 100 different training samples. Results were very similar to the ones from the full-enumeration run, with some increased variability across different samples that could be eliminated by increasing the number of iterations. Graphical comparisons of the \(\mathit{MC}^{3}\) results with the Monte Carlo full-enumeration results are presented in Figs. 1 and 2.
Posterior marginal inclusion probabilities for the Hald’s cement data (Example 6.1)
Method | X_{1} | X_{2} | X_{3} | X_{4} |
---|---|---|---|---|
Monte Carlo EPP^{a} | 0.958 | 0.721 | 0.302 | 0.465 |
\(\mathit{MC}^{3}\) EPP^{a} | 0.946 | 0.708 | 0.326 | 0.492 |
Liang et al. (2008) g-prior^{b} | 0.900 | 0.636 | 0.340 | 0.564 |
g-Prior^{c} with standardized data | 0.908 | 0.644 | 0.338 | 0.558 |
g-Prior^{c} with unstandardized data | 0.319 | 0.340 | 0.265 | 0.350 |
6.2 Prostate cancer data
In this section, we present results of our methodology for the prostate cancer data (Stamey et al. 1989). This dataset has also been used by Girón et al. (2006) and Moreno and Girón (2008) to illustrate their approach. It consists of n=97 observations and p=8 covariates. The response variable Y is the level of prostate-specific antigen, and the covariates are the logarithm of cancer volume (X_{1}), the logarithm of prostate weight (X_{2}), the age of the patient (X_{3}), the logarithm of the amount of benign prostatic hyperplasia (X_{4}), the seminal vesicle invasion (X_{5}), the logarithm of capsular penetration (X_{6}), the Gleason score (X_{7}) and the percent of Gleason scores 4 and 5 (X_{8}).
Summaries of posterior model probabilities (mean, median, SD and 2.5 %/97.5 % percentiles over 100 different training sub-samples) under the expected-posterior prior for the best models, together with median-based posterior odds of the MAP model (m_{1}) vs. m_{j} (j=2,…,4) and the corresponding g-prior results, for the prostate cancer data (Example 6.2)

| m_{j} | Model formula | Mean | Median | SD | 2.5 % | 97.5 % | Median-based PO^{c} | g-Prior^{a} f(m_{j}\|y)^{b} | g-Prior^{a} PO^{c} |
|---|---|---|---|---|---|---|---|---|---|
| 1 | X_{1}+X_{2}+X_{5} | 0.299 | 0.296 | 0.062 | 0.199 | 0.446 | 1.000 | 0.374 | 1.000 |
| 2 | X_{1}+X_{2}+X_{4}+X_{5} | 0.107 | 0.104 | 0.036 | 0.054 | 0.177 | 2.845 | 0.101 | 3.696 |
| 3 | X_{1}+X_{2}+X_{3}+X_{5} | 0.076 | 0.069 | 0.032 | 0.035 | 0.148 | 4.300 | 0.071 | 5.278 |
| 4 | X_{1}+X_{2}+X_{5}+X_{8} | 0.067 | 0.066 | 0.021 | 0.033 | 0.110 | 4.472 | 0.062 | 5.981 |
Posterior marginal inclusion probabilities for the prostate cancer data (Example 6.2)
Method | X_{1} | X_{2} | X_{3} | X_{4} | X_{5} | X_{6} | X_{7} | X_{8} |
---|---|---|---|---|---|---|---|---|
Monte Carlo EPP^{a} | 1.000 | 0.939 | 0.232 | 0.293 | 0.911 | 0.132 | 0.145 | 0.196 |
\(\mathit{MC}^{3}\) EPP^{a} | 1.000 | 0.936 | 0.234 | 0.309 | 0.910 | 0.152 | 0.153 | 0.215 |
Liang et al. (2008) g-prior^{b} | 1.000 | 0.946 | 0.193 | 0.254 | 0.917 | 0.110 | 0.125 | 0.162 |
g-Prior^{c} with standardized data | 1.000 | 0.948 | 0.195 | 0.255 | 0.920 | 0.110 | 0.125 | 0.163 |
g-Prior^{c} with unstandardized data | 1.000 | 0.924 | 0.175 | 0.241 | 0.875 | 0.109 | 0.121 | 0.159 |
Finally, the BMA leave-one-out cross-validatory log-scores for EPP, over 30 different training samples, and averaged over models with posterior probabilities higher than 0.01, were calculated. They ranged between −138.8 and −71.9 with mean −105.8 and standard deviation 13.3. The corresponding log-scores under the three different g-prior setups (modified version of g-prior as in Liang et al. (2008), original g-prior with standardized data and original g-prior with unstandardized data) were found equal to −108.6, −108.1 and −113.8 respectively, indicating, for this illustration, a better predictive performance on average for the EPP approach.
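For concreteness, the leave-one-out cross-validatory BMA log-score sums, over observations, the log of the model-averaged one-step-ahead predictive density. The sketch below uses ordinary least-squares Student predictives as a stand-in for the EPP posterior predictive; the function name and arguments are illustrative.

```python
import numpy as np
from math import lgamma, log, pi

def loo_bma_log_score(y, X, models, weights):
    """Leave-one-out cross-validatory BMA log-score (illustrative).

    `models` is a list of boolean inclusion vectors with posterior
    `weights`; each leave-one-out predictive is a univariate Student
    density from an OLS fit, a stand-in for the EPP predictive.
    """
    n = len(y)
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i          # leave observation i out
        dens = 0.0
        for g, w in zip(models, weights):
            Xg = X[:, g]
            Xt, yt = Xg[keep], y[keep]
            V = np.linalg.inv(Xt.T @ Xt)
            b = V @ Xt.T @ yt
            df = int(keep.sum() - g.sum())
            s2 = float(((yt - Xt @ b) ** 2).sum()) / df
            x0 = Xg[i]
            scale = np.sqrt(s2 * (1.0 + x0 @ V @ x0))
            z = (y[i] - x0 @ b) / scale
            # univariate Student-t log-density with df degrees of freedom
            log_t = (lgamma((df + 1) / 2.0) - lgamma(df / 2.0)
                     - 0.5 * log(df * pi) - log(scale)
                     - 0.5 * (df + 1) * log(1.0 + z * z / df))
            dens += w * np.exp(log_t)     # model-averaged density at y_i
        total += log(dens)
    return total
```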
7 Discussion
We have presented a computational approach for variable selection in normal regression models, based on the expected-posterior prior methodology. We have constructed efficient MCMC schemes for the estimation of the parameters within each model, based on data-augmentation of the imaginary data, coming from the prior predictive distribution of a reference model. Exploiting this data-augmentation scheme, we have also constructed an efficient Monte Carlo estimate of the marginal likelihood of each competing model. Variable selection is then attained by estimating posterior model weights in the full space, or by considering an alternative \(\mathit{MC}^{3}\) scheme. The proposed methodology has been implemented on two real life examples.
All results have been presented over different training samples, in contrast to relevant research work, where uncertainty due to the training-sample selection has been ignored. For large model spaces, where accurate estimation of posterior model probabilities is computationally demanding (if not infeasible), selection of “good” models can be based on posterior marginal inclusion probabilities (Barbieri and Berger 2004), which can be estimated more easily and more accurately from an MCMC output, as suggested by Berger and Molina (2005); see also Clyde et al. (2011) for an efficient variable selection method based on posterior marginal inclusion probabilities.
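Estimating posterior marginal inclusion probabilities from an MCMC output is straightforward: average the sampled inclusion indicators for each covariate. A minimal sketch (function name hypothetical):

```python
import numpy as np

def inclusion_probabilities(gamma_draws):
    """Posterior marginal inclusion probabilities from sampled inclusion
    indicators (T draws x p covariates).

    The median-probability model of Barbieri and Berger (2004) keeps the
    covariates whose inclusion probability exceeds 1/2.
    """
    probs = np.asarray(gamma_draws, dtype=float).mean(axis=0)
    return probs, probs > 0.5
```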
References
- Barbieri, M., Berger, J.: Optimal predictive model selection. Ann. Stat. 32, 870–897 (2004)
- Berger, J., Molina, G.: Posterior model probabilities via path-based pairwise priors. Stat. Neerl. 59, 3–15 (2005)
- Berger, J., Pericchi, L.: The intrinsic Bayes factor for model selection and prediction. J. Am. Stat. Assoc. 91, 109–122 (1996)
- Casella, G., Girón, F., Martínez, M., Moreno, E.: Consistency of Bayesian procedures for variable selection. Ann. Stat. 37, 1207–1228 (2009)
- Casella, G., Moreno, E.: Objective Bayesian variable selection. J. Am. Stat. Assoc. 101, 157–167 (2006)
- Celeux, G., El Anbari, M., Marin, J.-M., Robert, C.P.: Regularization in regression: comparing Bayesian and frequentist methods in a poorly informative situation. Bayesian Anal. (forthcoming), arXiv:1010.0300
- Clyde, M., Ghosh, J., Littman, M.: Bayesian adaptive sampling for variable selection and model averaging. J. Comput. Graph. Stat. 20, 80–101 (2011)
- Dellaportas, P., Forster, J., Ntzoufras, I.: Joint specification of model space and parameter space prior distributions. Stat. Sci. (2012, forthcoming). Currently available at http://www.stat-athens.aueb.gr/~jbn/papers/paper24.htm
- Fernandez, C., Ley, E., Steel, M.: Benchmark priors for Bayesian model averaging. J. Econom. 100, 381–427 (2001)
- Girón, F., Moreno, E., Martínez, M.: An objective Bayesian procedure for variable selection in regression. In: Balakrishnan, N., Castillo, E., Sarabia, J.M. (eds.) Advances on Distribution Theory, Order Statistics and Inference, pp. 393–408. Birkhäuser, Boston (2006)
- Kass, R., Raftery, A.: Bayes factors. J. Am. Stat. Assoc. 90, 773–795 (1995)
- Leng, C., Tran, M.N., Nott, D.: Bayesian adaptive lasso (2010), arXiv:1009.2300
- Liang, F., Paulo, R., Molina, G., Clyde, M., Berger, J.: Mixtures of g priors for Bayesian variable selection. J. Am. Stat. Assoc. 103, 410–423 (2008)
- Madigan, D., York, J.: Bayesian graphical models for discrete data. Int. Stat. Rev. 63, 215–232 (1995)
- Montgomery, D., Peck, E.: Introduction to Linear Regression Analysis. Wiley, New York (1982)
- Moreno, E., Girón, F.: Comparison of Bayesian objective procedures for variable selection in linear regression. Test 17, 472–490 (2008)
- Ntzoufras, I.: Bayesian analysis of the normal regression model. In: Bocker, K. (ed.) Rethinking Risk Measurement and Reporting: Uncertainty, Bayesian Analysis and Expert Judgment, vol. I, pp. 69–106. Risk Books (2010). ISBN-10: 1-906348-40-5, ISBN-13: 978-1-906348-40-3
- Pérez, J.: Development of expected posterior prior distribution for model comparisons. Ph.D. thesis, Department of Statistics, Purdue University, USA (1998)
- Pérez, J., Berger, J.: Expected-posterior prior distributions for model selection. Biometrika 89, 491–511 (2002)
- Stamey, T., Kabakin, J., McNeal, J., Johnstone, I., Freiha, F., Redwine, E., Yang, N.: Prostate-specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate II: radical prostatectomy treated patients. J. Urol. 141, 1076–1083 (1989)
- Zellner, A.: On assessing prior distributions and Bayesian regression analysis using g-prior distributions. In: Goel, P., Zellner, A. (eds.) Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti, pp. 233–243. North-Holland, Amsterdam (1986)
- Zellner, A., Siow, A.: Posterior odds ratios for selected regression hypotheses (with discussion). In: Bernardo, J.M., DeGroot, M.H., Lindley, D.V., Smith, A.F.M. (eds.) Bayesian Statistics, vol. 1, pp. 585–606 & 618–647 (discussion). Oxford University Press, Oxford (1980)