Abstract
When a linear model is chosen by searching for the best subset among a set of candidate predictors, a fixed penalty such as that imposed by the Akaike information criterion may penalize model complexity inadequately, leading to biased model selection. We study resampling-based information criteria that aim to overcome this problem through improved estimation of the effective model dimension. The first approach builds upon previous work on bootstrap-based model selection. We then propose a second, more novel approach based on cross-validation. Simulations and analyses of a functional neuroimaging data set illustrate the strong performance of our resampling-based methods, which are implemented in a new R package.
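The selection bias described above — a fixed AIC penalty under-penalizing complexity once the "best" subset has been searched for — is easy to reproduce. The sketch below (in Python rather than the paper's R; the function name `best_subset_aic` is ours, not from the authors' package) runs an exhaustive best-subset search scored by AIC on data in which only one of eight candidate predictors carries signal. Because the minimum of AIC over many candidate subsets is taken, the winning model tends to absorb spurious predictors along with the real one, which is exactly the effect a resampling-based estimate of effective dimension is meant to correct.

```python
import itertools
import numpy as np

def best_subset_aic(X, y):
    """Exhaustive best-subset linear regression scored by Gaussian AIC.

    Returns (best_aic, best_subset), where best_subset is a tuple of
    column indices of X. An intercept is always included.
    """
    n, p = X.shape
    best_aic, best_subset = np.inf, ()
    for k in range(p + 1):
        for subset in itertools.combinations(range(p), k):
            # Design matrix: intercept plus the chosen columns.
            Xs = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
            beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            rss = np.sum((y - Xs @ beta) ** 2)
            # Gaussian AIC: n*log(RSS/n) + 2 * (#slopes + intercept + variance).
            aic = n * np.log(rss / n) + 2 * (k + 2)
            if aic < best_aic:
                best_aic, best_subset = aic, subset
    return best_aic, best_subset

rng = np.random.default_rng(0)
n, p = 50, 8
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] + rng.standard_normal(n)  # only predictor 0 is real

aic, subset = best_subset_aic(X, y)
print(subset)  # the true predictor 0, possibly joined by noise columns
```

With a strong true signal the search reliably recovers predictor 0, but repeating the experiment over many seeds shows noise predictors entering the selected subset far more often than the nominal AIC penalty would suggest — the fixed penalty does not account for the search itself.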
Cite this article
Reiss, P.T., Huang, L., Cavanaugh, J.E. et al. Resampling-based information criteria for best-subset regression. Ann Inst Stat Math 64, 1161–1186 (2012). https://doi.org/10.1007/s10463-012-0353-1