Optimal classification trees
State-of-the-art decision tree methods apply heuristics recursively to create each split in isolation, which may fail to capture well the underlying characteristics of the dataset. The optimal decision tree problem attempts to resolve this by creating the entire decision tree at once to achieve global optimality. In the last 25 years, algorithmic advances in integer optimization coupled with hardware improvements have resulted in an astonishing 800 billion factor speedup in mixed-integer optimization (MIO). Motivated by this speedup, we present optimal classification trees, a novel formulation of the decision tree problem using modern MIO techniques that yields the optimal decision tree for axes-aligned splits. We also show the richness of this MIO formulation by adapting it to give optimal classification trees with hyperplanes, which generate optimal decision trees with multivariate splits. Synthetic tests demonstrate that these methods recover the true decision tree more closely than heuristics, refuting the notion that optimal methods overfit the training data. We comprehensively benchmark these methods on a sample of 53 datasets from the UCI machine learning repository. We establish that these MIO methods are practically solvable on real-world datasets with sizes in the 1000s, and give average absolute improvements in out-of-sample accuracy over CART of 1–2% and 3–5% for the univariate and multivariate cases, respectively. Furthermore, we identify that optimal classification trees are likely to outperform CART by 1.2–1.3% in situations where the CART accuracy is high and we have sufficient training data, while the multivariate version outperforms CART by 4–7% when the CART accuracy or dimension of the dataset is low.
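The gap between greedy recursive splitting and globally optimal trees can be seen on a toy XOR dataset. The sketch below is a pure-Python illustration, not the paper's MIO formulation: a CART-style greedy splitter that only accepts a split when it reduces training error stalls immediately (every single axis-aligned split leaves the error at 0.5), while an exhaustive search over depth-2 trees, which chooses the root and both child splits jointly, finds a perfect classifier. All function and variable names here are hypothetical, chosen for the example.

```python
from itertools import product

# XOR dataset: label is 1 iff exactly one coordinate exceeds 0.5.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)] * 5
y = [0, 1, 1, 0] * 5

def majority_error(labels):
    """Number of points misclassified when predicting the majority label."""
    return min(sum(labels), len(labels) - sum(labels)) if labels else 0

def split(data, feat, thr):
    """Partition (point, label) pairs by an axis-aligned threshold."""
    left  = [(x, lab) for x, lab in data if x[feat] <= thr]
    right = [(x, lab) for x, lab in data if x[feat] >  thr]
    return left, right

data = list(zip(X, y))
candidates = [(f, 0.5) for f in (0, 1)]  # the only informative thresholds here

# Greedy (CART-style): accept a root split only if it lowers training error.
base_err = majority_error(y)
best_single = min(
    majority_error([lab for _, lab in l]) + majority_error([lab for _, lab in r])
    for f, t in candidates for l, r in [split(data, f, t)]
)
# On XOR no single split improves on the root, so greedy never splits at all.
greedy_err = base_err if best_single >= base_err else best_single

# "Optimal" depth-2 tree: exhaustively choose root + both child splits jointly.
opt_err = base_err
for (rf, rt), (lf, lt), (gf, gt) in product(candidates, repeat=3):
    l, r = split(data, rf, rt)
    ll, lr = split(l, lf, lt)
    rl, rr = split(r, gf, gt)
    err = sum(majority_error([lab for _, lab in leaf]) for leaf in (ll, lr, rl, rr))
    opt_err = min(opt_err, err)

print(greedy_err / len(data), opt_err / len(data))  # -> 0.5 0.0
```

The exhaustive search here scales exponentially in depth; the paper's contribution is an MIO formulation that makes this joint optimization tractable for real dataset sizes.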
Keywords: Classification, Decision trees, Mixed integer optimization
We thank the three reviewers and the action editor for their comments that improved the paper substantially. We also thank Prof. Andrew Mason for comments on our earlier MIO formulation, and Jenny McLean for editorial comments that improved the exposition of the paper.