Best subset selection via cross-validation criterion

Abstract

This paper is concerned with the cross-validation criterion for selecting the best subset of explanatory variables in a linear regression model. In contrast with statistical criteria (e.g., Mallows’ \(C_p\), the Akaike information criterion, and the Bayesian information criterion), cross-validation requires only mild assumptions, namely, that samples are identically distributed and that training and validation samples are independent. For this reason, the cross-validation criterion is expected to work well in most situations involving predictive methods. The purpose of this paper is to establish a mixed-integer optimization (MIO) approach to selecting the best subset of explanatory variables via the cross-validation criterion. This subset-selection problem can be formulated as a bilevel MIO problem, which we then reduce to a single-level mixed-integer quadratic optimization problem that can be solved to optimality with standard optimization software. The efficacy of our method is evaluated through simulation experiments in comparison with exhaustive search based on statistical criteria and with \(L_1\)-regularized regression. The simulation results demonstrate that, when the signal-to-noise ratio is low, our method delivers good accuracy for both subset selection and prediction.
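To make the selection criterion concrete, the sketch below scores each candidate subset by its K-fold cross-validation error and returns the subset with the smallest error via brute-force enumeration. This is only an illustration of the criterion, not the paper's bilevel-MIO reduction or its MIQO reformulation; the function names (`cv_mse`, `best_subset_cv`), the contiguous fold construction, and the small ridge term used to stabilize the least-squares solve are all illustrative assumptions.

```python
import itertools
import numpy as np

def cv_mse(X, y, subset, n_folds=5, ridge=1e-6):
    """Mean squared validation error of one subset under K-fold CV.

    For each fold, a least-squares model is fit on the remaining
    folds (with a tiny ridge term for numerical stability) and
    evaluated on the held-out fold.
    """
    n = len(y)
    idx = np.arange(n)
    folds = np.array_split(idx, n_folds)
    Xs = X[:, list(subset)]
    errs = []
    for k in range(n_folds):
        val = folds[k]
        trn = np.setdiff1d(idx, val)
        A = Xs[trn]
        beta = np.linalg.solve(A.T @ A + ridge * np.eye(A.shape[1]),
                               A.T @ y[trn])
        resid = y[val] - Xs[val] @ beta
        errs.append(np.mean(resid ** 2))
    return float(np.mean(errs))

def best_subset_cv(X, y, max_size=None, n_folds=5):
    """Enumerate all nonempty subsets up to max_size and return
    (cv_error, subset) for the subset minimizing the CV error."""
    p = X.shape[1]
    max_size = max_size or p
    best = (np.inf, ())
    for size in range(1, max_size + 1):
        for subset in itertools.combinations(range(p), size):
            err = cv_mse(X, y, subset, n_folds)
            if err < best[0]:
                best = (err, subset)
    return best
```

Enumeration visits \(2^p - 1\) subsets, which is exactly the combinatorial explosion that motivates an exact MIO formulation once \(p\) grows beyond a couple of dozen variables.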



Acknowledgements

The authors would like to thank two anonymous reviewers for their helpful comments. This work was partially supported by JSPS KAKENHI Grant Numbers JP17K01246 and JP17K12983.

Corresponding author

Correspondence to Yuichi Takano.


About this article


Cite this article

Takano, Y., Miyashiro, R. Best subset selection via cross-validation criterion. TOP 28, 475–488 (2020). https://doi.org/10.1007/s11750-020-00538-1


Keywords

  • Integer programming
  • Subset selection
  • Cross-validation
  • Ridge regression
  • Statistics

Mathematics Subject Classification

  • 62F07
  • 62J05
  • 90C11
  • 90C90