# Learning customized and optimized lists of rules with mathematical programming

## Abstract

We introduce a mathematical programming approach to building rule lists, which are a type of interpretable, nonlinear, and logical machine learning classifier involving IF-THEN rules. Unlike traditional decision tree algorithms such as CART and C5.0, this method does not use greedy splitting and pruning. Instead, it aims to fully optimize a combination of accuracy and sparsity while obeying user-defined constraints. This method is useful for producing non-black-box predictive models, and it offers a clear user-defined tradeoff between training accuracy and sparsity. The flexible framework of mathematical programming allows users to create customized models with a provable guarantee of optimality. The software reviewed as part of this submission was given the DOI (Digital Object Identifier) https://doi.org/10.5281/zenodo.1344142.
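To make the idea of a rule list and the accuracy/sparsity tradeoff concrete, the following is a minimal sketch rather than the method of the paper. It assumes an objective of the form "training accuracy minus `c` times the number of rules", and it replaces the mixed-integer program with exhaustive enumeration over a tiny set of candidate rules. The data, the rules, and the names `predict`, `objective`, and `best_rule_list` are all hypothetical.

```python
from itertools import permutations

# Hypothetical toy data: (binary features, label) pairs.
DATA = [
    ({"a": 1, "b": 0}, 1),
    ({"a": 1, "b": 1}, 1),
    ({"a": 0, "b": 1}, 0),
    ({"a": 0, "b": 0}, 0),
    ({"a": 0, "b": 1}, 0),
]

# Hypothetical candidate rules: (description, IF-condition, THEN-label).
RULES = [
    ("a=1 -> 1", lambda x: x["a"] == 1, 1),
    ("b=1 -> 0", lambda x: x["b"] == 1, 0),
    ("b=0 -> 1", lambda x: x["b"] == 0, 1),
]


def predict(rule_list, x, default):
    """Return the label of the first rule whose condition holds, else the default."""
    for _, condition, label in rule_list:
        if condition(x):
            return label
    return default


def objective(rule_list, default, data, c):
    """Assumed objective: training accuracy minus c per rule in the list."""
    correct = sum(predict(rule_list, x, default) == y for x, y in data)
    return correct / len(data) - c * len(rule_list)


def best_rule_list(rules, data, c):
    """Enumerate every ordered subset of rules and every default label.

    Brute force is only a stand-in for the mathematical program; it is
    feasible here solely because the candidate set is tiny.
    """
    best = None
    for k in range(len(rules) + 1):
        for ordered in permutations(rules, k):
            for default in (0, 1):
                score = objective(ordered, default, data, c)
                if best is None or score > best[0]:
                    best = (score, ordered, default)
    return best


if __name__ == "__main__":
    score, rule_list, default = best_rule_list(RULES, DATA, c=0.05)
    for description, _, _ in rule_list:
        print("IF", description)
    print("ELSE predict", default, "| objective =", round(score, 3))
```

Raising `c` in this sketch makes each extra rule more costly, so the enumeration favors shorter lists. With a large enough penalty it returns the empty list, which predicts the default label for every observation.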

## Keywords

Mixed-integer programming · Decision trees · Decision lists · Sparsity · Interpretable modeling · Associative classification · 68T05 (Computer Science: Artificial intelligence, Learning and adaptive systems)

## Mathematics Subject Classification

- 68T05 Learning and adaptive systems
- 90C11 Mixed integer programming
- 62-04 Explicit machine computation and programs (not the theory of computation or programming)

## Notes

### Acknowledgements

We gratefully acknowledge funding from the MIT Big Data Initiative, and the National Science Foundation under grant IIS-1053407. Thanks to Daniel Bienstock and anonymous reviewers for encouragement and for helping us to improve the readability of the manuscript.
