Sparse regression over clusters: SparClur

Abstract

Prediction tasks in personalized medicine require models that combine accuracy and interpretability. We propose an integer optimization approach for building sparse regression models with enforced coordination, using data partitioned among leaves in a prediction tree. We show that the method recovers the true underlying relationship between observations and target variables in large-scale synthetic data in seconds. We apply our method to several real-world medical prediction problems and observe that the additional structure imposed provides a substantial gain in interpretability, at a low cost to accuracy.
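To make the idea of coordinated sparsity concrete, the following is a minimal sketch and not the method of the paper. It assumes that "enforced coordination" means every leaf (cluster) fits its own regression coefficients but all leaves must use the same set of k features. Instead of integer optimization it simply enumerates every candidate support, which is only feasible for a handful of features. The function name `fit_shared_support`, the ridge parameter `gamma`, and the toy data are all hypothetical and chosen for illustration.

```python
from itertools import combinations

import numpy as np


def fit_shared_support(clusters, k, gamma=1e-6):
    """Brute-force search for one feature set of size k shared by all clusters.

    clusters: list of (X, y) pairs; every X has the same number of columns.
    gamma: small ridge penalty (an assumption of this sketch) that keeps
           each per-cluster least-squares problem well conditioned.
    Returns (support, per-cluster coefficients, total penalized loss).
    """
    p = clusters[0][0].shape[1]
    best_support, best_coefs, best_loss = None, None, np.inf
    for support in combinations(range(p), k):
        cols = list(support)
        total, coefs = 0.0, []
        for X, y in clusters:
            Xs = X[:, cols]
            # Ridge regression restricted to the candidate support.
            beta = np.linalg.solve(Xs.T @ Xs + gamma * np.eye(k), Xs.T @ y)
            resid = y - Xs @ beta
            total += resid @ resid + gamma * (beta @ beta)
            coefs.append(beta)
        # Keep the support with the lowest loss summed over all clusters.
        if total < best_loss:
            best_support, best_coefs, best_loss = support, coefs, total
    return best_support, best_coefs, best_loss


# Toy data: three clusters share the relevant features {1, 4}
# but have different coefficients on them.
rng = np.random.default_rng(0)
true_support = [1, 4]
true_coefs = [np.array([2.0, -1.0]), np.array([-3.0, 0.5]), np.array([1.0, 4.0])]
clusters = []
for beta in true_coefs:
    X = rng.normal(size=(50, 6))
    y = X[:, true_support] @ beta + 0.01 * rng.normal(size=50)
    clusters.append((X, y))

support, coefs, loss = fit_shared_support(clusters, k=2)
print("selected features:", support)
for i, beta in enumerate(coefs):
    print(f"cluster {i} coefficients:", np.round(beta, 2))
```

The shared support is what makes the fitted models easy to compare across clusters: each cluster reports coefficients for the same features, so differences between clusters show up as differences in coefficient values rather than in which variables appear.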

Data availability

All synthetic and publicly available datasets used in this work are available to interested readers. The medical data are protected under privacy rules and are not available.

Acknowledgements

This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. 1122374.

Author information

Corresponding author

Correspondence to Dimitris Bertsimas.

About this article

Cite this article

Bertsimas, D., Dunn, J., Kapelevich, L. et al. Sparse regression over clusters: SparClur. Optim Lett 16, 433–448 (2022). https://doi.org/10.1007/s11590-021-01770-9

  • DOI: https://doi.org/10.1007/s11590-021-01770-9

Keywords

  • Prediction trees
  • Regression
  • Integer optimization
  • Personalized medicine