
Mixed integer quadratic optimization formulations for eliminating multicollinearity based on variance inflation factor

Journal of Global Optimization

Abstract

Multicollinearity exists when some explanatory variables of a multiple linear regression model are highly correlated. High correlation among explanatory variables reduces the reliability of the analysis. To eliminate multicollinearity from a linear regression model, we consider how to select a subset of significant variables by means of the variance inflation factor (VIF), the most common indicator for detecting multicollinearity. In particular, we adopt the mixed integer optimization (MIO) approach to subset selection. The MIO approach was proposed in the 1970s, and it has recently received renewed attention owing to advances in algorithms and hardware. However, none of the existing studies has developed a computationally tractable MIO formulation for eliminating multicollinearity on the basis of VIF. In this paper, we propose mixed integer quadratic optimization (MIQO) formulations for selecting the best subset of explanatory variables subject to upper bounds on the VIFs of the selected variables. Our two MIQO formulations are based on two equivalent definitions of VIF. Computational results illustrate the effectiveness of our MIQO formulations in comparison with conventional local search algorithms and MIO-based cutting-plane algorithms.
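For reference, the two equivalent definitions of VIF alluded to above are standard textbook background (not a reproduction of the paper's MIQO formulations). For \(p\) standardized explanatory variables \(x_1, \dots, x_p\) with correlation matrix \(R\), the VIF of variable \(j\) is

\[ \mathrm{VIF}_j \;=\; \frac{1}{1 - R_j^2}, \]

where \(R_j^2\) is the coefficient of determination obtained by regressing \(x_j\) on the remaining explanatory variables, and, equivalently,

\[ \mathrm{VIF}_j \;=\; \bigl(R^{-1}\bigr)_{jj}, \]

the \(j\)-th diagonal element of the inverse correlation matrix. A VIF-based selection constraint then requires \(\mathrm{VIF}_j \le U\) for every selected variable \(j\), where \(U\) is a user-specified upper bound (values such as \(U = 5\) or \(U = 10\) are common rules of thumb).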



Acknowledgements

This work was partially supported by JSPS KAKENHI Grant Nos. JP17K01246 and JP17K12983.

Author information

Corresponding author

Correspondence to Ryuhei Miyashiro.


About this article


Cite this article

Tamura, R., Kobayashi, K., Takano, Y. et al. Mixed integer quadratic optimization formulations for eliminating multicollinearity based on variance inflation factor. J Glob Optim 73, 431–446 (2019). https://doi.org/10.1007/s10898-018-0713-3

