Machine Learning, Volume 106, Issue 7, pp 1039–1082

Optimal classification trees

Abstract

State-of-the-art decision tree methods apply heuristics recursively to create each split in isolation, which may not capture well the underlying characteristics of the dataset. The optimal decision tree problem attempts to resolve this by creating the entire decision tree at once to achieve global optimality. In the last 25 years, algorithmic advances in integer optimization coupled with hardware improvements have resulted in an astonishing 800 billion factor speedup in mixed-integer optimization (MIO). Motivated by this speedup, we present optimal classification trees, a novel formulation of the decision tree problem using modern MIO techniques that yields the optimal decision tree for axis-aligned splits. We also show the richness of this MIO formulation by adapting it to give optimal classification trees with hyperplanes, which generates optimal decision trees with multivariate splits. Synthetic tests demonstrate that these methods recover the true decision tree more closely than heuristics, refuting the notion that optimal methods overfit the training data. We comprehensively benchmark these methods on a sample of 53 datasets from the UCI machine learning repository. We establish that these MIO methods are practically solvable on real-world datasets with sizes in the thousands, and give average absolute improvements in out-of-sample accuracy over CART of 1–2% and 3–5% for the univariate and multivariate cases, respectively. Furthermore, we identify that optimal classification trees are likely to outperform CART by 1.2–1.3% in situations where the CART accuracy is high and we have sufficient training data, while the multivariate version outperforms CART by 4–7% when the CART accuracy or dimension of the dataset is low.
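
To make the flavor of the MIO formulation concrete, the following is a minimal sketch in Julia/JuMP (the modeling stack cited in references 12 and 26) of the simplest special case: a depth-one tree with a single axis-aligned split, learned to optimality by minimizing training misclassifications. The toy data, variable names, big-M constants, and the use of the open-source HiGHS solver in place of Gurobi are illustrative assumptions; the paper's actual formulation extends this idea to trees of arbitrary depth with constraints tying splits together and penalties on tree complexity.

    # Sketch: depth-1 optimal classification tree ("optimal stump") as a MIO.
    # Assumptions: features scaled to [0, 1], labels in 1..K, HiGHS as the solver.
    using JuMP, HiGHS

    X = [0.1 0.9; 0.2 0.8; 0.8 0.3; 0.9 0.1]   # toy data: 4 points, 2 features
    y = [1, 1, 2, 2]                            # class labels
    n, p = size(X)
    K = maximum(y)
    epsilon = 1e-4                              # enforces strictness of the "<" branch
    M = 1 + epsilon                             # valid big-M because features lie in [0, 1]

    model = Model(HiGHS.Optimizer)
    set_silent(model)

    @variable(model, a[1:p], Bin)        # a[j] = 1 if feature j is chosen for the split
    @variable(model, 0 <= b <= 1)        # split threshold
    @variable(model, z[1:n], Bin)        # z[i] = 1 if point i is routed to the right leaf
    @variable(model, c[1:2, 1:K], Bin)   # c[t,k] = 1 if leaf t (1 = left, 2 = right) predicts class k
    @variable(model, e[1:n] >= 0)        # e[i] = 1 if point i is misclassified

    @constraint(model, sum(a) == 1)                    # exactly one feature is split on
    @constraint(model, [t in 1:2], sum(c[t, :]) == 1)  # each leaf predicts exactly one class

    for i in 1:n
        ax = sum(a[j] * X[i, j] for j in 1:p)               # value of the chosen feature for point i
        @constraint(model, ax >= b - M * (1 - z[i]))        # if z[i] = 1: a'x >= b (right leaf)
        @constraint(model, ax <= b - epsilon + M * z[i])    # if z[i] = 0: a'x <  b (left leaf)
        @constraint(model, e[i] >= z[i] - c[2, y[i]])       # error if right leaf's class != y[i]
        @constraint(model, e[i] >= (1 - z[i]) - c[1, y[i]]) # error if left leaf's class != y[i]
    end

    @objective(model, Min, sum(e))   # minimize training misclassifications
    optimize!(model)

    println("split feature: ", findfirst(j -> value(a[j]) > 0.5, 1:p),
            ", threshold: ", round(value(b), digits = 3),
            ", training errors: ", objective_value(model))

On this toy data any threshold separating the two classes on feature 1 achieves zero training error. The single-split subproblem shown here is what greedy heuristics such as CART solve one node at a time; the formulation in the paper instead optimizes all splits of the tree jointly.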

Keywords

Classification · Decision trees · Mixed-integer optimization

References

  1. Arthanari, T., & Dodge, Y. (1981). Mathematical programming in statistics (Vol. 341). New York: Wiley.
  2. Auer, P., Holte, R. C., & Maass, W. (1995). Theory and applications of agnostic PAC-learning with small decision trees. In Proceedings of the 12th international conference on machine learning (pp. 21–29).
  3. Bennett, K. P. (1992). Decision tree construction via linear programming. In M. Evans (Ed.), Proceedings of the 4th midwest artificial intelligence and cognitive science society conference (pp. 97–101).
  4. Bennett, K. P., & Blue, J. (1996). Optimal decision trees. Rensselaer Polytechnic Institute Math Report No. 214.
  5. Bennett, K. P., & Blue, J. A. (1998). A support vector machine approach to decision trees. In IEEE international joint conference on neural networks proceedings. IEEE world congress on computational intelligence (Vol. 3, pp. 2396–2401).
  6. Bertsimas, D., & King, A. (2015). An algorithmic approach to linear regression. Operations Research, 64(1), 2–16.
  7. Bertsimas, D., & King, A. (2017). Logistic regression: From art to science. Statistical Science (to appear).
  8. Bertsimas, D., & Mazumder, R. (2014). Least quantile regression via modern optimization. The Annals of Statistics, 42(6), 2494–2525.
  9. Bertsimas, D., & Shioda, R. (2007). Classification and regression via integer optimization. Operations Research, 55(2), 252–271.
  10. Bertsimas, D., & Weismantel, R. (2005). Optimization over integers. Belmont, MA: Dynamic Ideas.
  11. Bertsimas, D., King, A., & Mazumder, R. (2016). Best subset selection via a modern optimization lens. Annals of Statistics, 44(2), 813–852.
  12. Bezanson, J., Edelman, A., Karpinski, S., & Shah, V. B. (2014). Julia: A fresh approach to numerical computing. arXiv preprint arXiv:1411.1607.
  13. Bixby, R. E. (2012). A brief history of linear and mixed-integer programming computation. Documenta Mathematica, Extra Volume: Optimization Stories, 107–121.
  14. Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
  15. Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and regression trees. Monterey, CA: Wadsworth and Brooks.
  16. Cox, L. A., Jr., Yuping, Q., & Kuehner, W. (1989). Heuristic least-cost computation of discrete classification functions with uncertain argument values. Annals of Operations Research, 21(1), 1–29.
  17. Esmeir, S., & Markovitch, S. (2007). Anytime learning of decision trees. The Journal of Machine Learning Research, 8, 891–933.
  18. Gurobi Optimization Inc. (2015a). Gurobi 6.0 performance benchmarks. http://www.gurobi.com/pdfs/benchmarks.pdf. Accessed September 5, 2015.
  19. Gurobi Optimization Inc. (2015b). Gurobi optimizer reference manual. http://www.gurobi.com.
  20. Heath, D., Kasif, S., & Salzberg, S. (1993). Induction of oblique decision trees. In IJCAI (pp. 1002–1007).
  21. Hyafil, L., & Rivest, R. L. (1976). Constructing optimal binary decision trees is NP-complete. Information Processing Letters, 5(1), 15–17.
  22. Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R News, 2(3), 18–22. http://CRAN.R-project.org/doc/Rnews/.
  23. Lichman, M. (2013). UCI machine learning repository. http://archive.ics.uci.edu/ml.
  24. Loh, W. Y., & Shih, Y. S. (1997). Split selection methods for classification trees. Statistica Sinica, 7(4), 815–840.
  25. López-Chau, A., Cervantes, J., López-García, L., & Lamont, F. G. (2013). Fisher's decision tree. Expert Systems with Applications, 40(16), 6283–6291.
  26. Lubin, M., & Dunning, I. (2015). Computing in operations research using Julia. INFORMS Journal on Computing, 27(2), 238–248.
  27. Murthy, S., & Salzberg, S. (1995a). Lookahead and pathology in decision tree induction. In IJCAI (pp. 1025–1033).
  28. Murthy, S. K., & Salzberg, S. (1995b). Decision tree induction: How effective is the greedy heuristic? In KDD (pp. 222–227).
  29. Murthy, S. K., Kasif, S., & Salzberg, S. (1994). A system for induction of oblique decision trees. Journal of Artificial Intelligence Research, 2, 1–32.
  30. Nemhauser, G. L. (2013). Integer programming: The global impact. Presented at EURO-INFORMS, Rome, Italy, 2013. http://euro-informs2013.org/data/http_/euro2013.org/wp-content/uploads/nemhauser.pdf. Accessed September 9, 2015.
  31. Norouzi, M., Collins, M. D., Johnson, M. A., Fleet, D. J., & Kohli, P. (2015). Efficient non-greedy optimization of decision trees. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in neural information processing systems 28 (pp. 1729–1737). Montreal, QC.
  32. Norton, S. W. (1989). Generating better decision trees. In IJCAI (Vol. 89, pp. 800–805).
  33. Payne, H. J., & Meisel, W. S. (1977). An algorithm for constructing optimal binary decision trees. IEEE Transactions on Computers, 100(9), 905–916.
  34. Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106.
  35. Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Francisco, CA: Morgan Kaufmann.
  36. R Core Team. (2015). R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing. http://www.R-project.org/.
  37. Son, N. H. (1998). From optimal hyperplanes to optimal decision trees. Fundamenta Informaticae, 34(1–2), 145–174.
  38. Therneau, T., Atkinson, B., & Ripley, B. (2015). rpart: Recursive partitioning and regression trees. R package version 4.1-9. http://CRAN.R-project.org/package=rpart.
  39. Tjortjis, C., & Keane, J. (2002). T3: A classification algorithm for data mining. Lecture Notes in Computer Science (Vol. 2412, pp. 50–55). Berlin: Springer.
  40. Top500 Supercomputer Sites. (2015). Performance development. http://www.top500.org/statistics/perfdevel/. Accessed September 4, 2015.
  41. Truong, A. (2009). Fast growing and interpretable oblique trees via logistic regression models. Ph.D. thesis, University of Oxford.
  42. Tzirakis, P., & Tjortjis, C. (2016). T3C: Improving a decision tree classification algorithm's interval splits on continuous attributes. Advances in Data Analysis and Classification, 1–18.
  43. Wickramarachchi, D., Robertson, B., Reale, M., Price, C., & Brown, J. (2016). HHCART: An oblique decision tree. Computational Statistics & Data Analysis, 96, 12–23.

Copyright information

© The Author(s) 2017

Authors and Affiliations

  1. Operations Research Center, Sloan School of Management, Massachusetts Institute of Technology, Cambridge, USA
  2. Operations Research Center, Massachusetts Institute of Technology, Cambridge, USA
