Sparse regression over clusters: SparClur


Prediction tasks in personalized medicine require models that combine accuracy and interpretability. We propose an integer optimization approach for building sparse regression models with enforced coordination, using data partitioned among leaves in a prediction tree. We show that the method recovers the true underlying relationship between observations and target variables in large-scale synthetic data in seconds. We apply our method to several real-world medical prediction problems and observe that the additional structure imposed provides a substantial gain in interpretability, at a low cost to accuracy.
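To make the idea of coordinated sparse regression concrete, here is a minimal sketch. It is not the authors' implementation. It assumes that "enforced coordination" means every leaf's regression uses the same set of k features, while the coefficients may differ from leaf to leaf. It also swaps integer optimization for exhaustive enumeration of candidate feature sets, which is only practical for a handful of features. All function names, the small ridge term, and the toy data are hypothetical choices made for illustration.

```python
import random
from itertools import combinations


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_on_support(X, y, support, ridge=1e-6):
    """Ridge least squares restricted to the columns in `support`.

    Returns (coefficients, sum of squared errors). The small ridge term
    (an assumption of this sketch) keeps the normal equations nonsingular.
    """
    A = [[sum(row[p] * row[q] for row in X) + (ridge if i == j else 0.0)
          for j, q in enumerate(support)]
         for i, p in enumerate(support)]
    b = [sum(row[p] * t for row, t in zip(X, y)) for p in support]
    coef = solve_linear(A, b)
    sse = sum((t - sum(c * row[p] for c, p in zip(coef, support))) ** 2
              for row, t in zip(X, y))
    return coef, sse


def coordinated_sparse_fit(clusters, k):
    """Pick one shared k-feature support minimizing total error over clusters.

    `clusters` is a list of (X, y) pairs, one per leaf. Each leaf gets its
    own coefficients, but all leaves use the same features.
    """
    n_features = len(clusters[0][0][0])
    best = None
    for support in combinations(range(n_features), k):
        fits = [fit_on_support(X, y, support) for X, y in clusters]
        total = sum(sse for _, sse in fits)
        if best is None or total < best[0]:
            best = (total, support, [coef for coef, _ in fits])
    return best[1], best[2]


# Toy data: two leaves share the relevant features (0 and 2) but weight
# them differently; features 1 and 3 are irrelevant in both leaves.
rng = random.Random(0)


def make_cluster(true_coef, n_rows=30):
    X = [[rng.gauss(0, 1) for _ in true_coef] for _ in range(n_rows)]
    y = [sum(c * v for c, v in zip(true_coef, row)) for row in X]
    return X, y


clusters = [make_cluster([2.0, 0.0, -1.0, 0.0]),
            make_cluster([-3.0, 0.0, 0.5, 0.0])]
support, coefs = coordinated_sparse_fit(clusters, k=2)
print("shared support:", support)
print("per-leaf coefficients:", coefs)
```

On this noiseless toy data the search should select the two features that actually generate the targets in both leaves. Enumeration grows combinatorially with the number of features, which is why a formulation that scales, such as the integer optimization approach described above, is needed for realistic problems.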


Data availability

All synthetic datasets and all publicly available datasets are available to interested readers. Medical data are protected under privacy rules and are not available.


This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. 1122374.

Keywords

  • Prediction trees
  • Regression
  • Integer optimization
  • Personalized medicine