Tree Induction for Probability-Based Ranking
 Foster Provost,
 Pedro Domingos
Abstract
Tree induction is one of the most effective and widely used methods for building classification models. However, many applications require cases to be ranked by the probability of class membership. Probability estimation trees (PETs) have the same attractive features as classification trees (e.g., comprehensibility, accuracy and efficiency in high dimensions and on large data sets). Unfortunately, decision trees have been found to provide poor probability estimates. Several techniques have been proposed to build more accurate PETs, but, to our knowledge, there has not been a systematic experimental analysis of which techniques actually improve the probability-based rankings, and by how much. In this paper we first discuss why the decision-tree representation is not intrinsically inadequate for probability estimation. Inaccurate probabilities are partially the result of decision-tree induction algorithms that focus on maximizing classification accuracy and minimizing tree size (for example via reduced-error pruning). Larger trees can be better for probability estimation, even if the extra size is superfluous for accuracy maximization. We then present the results of a comprehensive set of experiments, testing some straightforward methods for improving probability-based rankings. We show that using a simple, common smoothing method—the Laplace correction—uniformly improves probability-based rankings. In addition, bagging substantially improves the rankings, and is even more effective for this purpose than for improving accuracy. We conclude that PETs, with these simple modifications, should be considered when rankings based on class-membership probability are required.
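The two modifications the abstract names, the Laplace correction at the leaves and bagging of probability estimates, can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function names and the example leaf counts are ours, and a real PET would derive the counts from the training cases reaching each leaf.

```python
def laplace_estimate(pos, neg):
    """Laplace-corrected probability of the positive class at a leaf.

    A leaf with p positives and n negatives estimates
    P(+) = (p + 1) / (p + n + 2) rather than the raw frequency
    p / (p + n), which smooths the extreme 0/1 estimates that
    small, pure leaves would otherwise produce.
    """
    return (pos + 1) / (pos + neg + 2)


def bagged_estimate(leaf_counts):
    """Average the Laplace-corrected estimates across several trees,
    as bagging does: each (pos, neg) pair is the leaf a test case
    falls into in one bootstrap-trained tree.
    """
    probs = [laplace_estimate(p, n) for p, n in leaf_counts]
    return sum(probs) / len(probs)


# A pure leaf holding 3 positive examples: the raw frequency would
# claim P(+) = 1.0, but the Laplace correction tempers it to 4/6... no:
# (3 + 1) / (3 + 0 + 2) = 0.8.
print(laplace_estimate(3, 0))  # 0.8

# Averaging over the leaves of three hypothetical bagged trees:
print(bagged_estimate([(3, 0), (5, 2), (1, 1)]))
```

Because the smoothed estimates differ across leaves that raw frequencies would score identically (e.g., all pure leaves), they also break ties in the ranking, which is part of why the correction helps probability-based ranking even when it barely changes classification accuracy.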
 Apte, C., Grossman, E., Pednault, E., Rosen, B., Tipu, F., White, B. (1999) Probabilistic estimation-based data mining for discovering insurance risks. IEEE Intelligent Systems 14: pp. 49–58
 Bahl, L. R., Brown, P. F., de Souza, P. V., & Mercer, R. L. (1989). A tree-based statistical language model for natural language speech recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37:7, 1001–1008.
 Bauer, E., Kohavi, R. (1999) An empirical comparison of voting classification algorithms: Bagging, boosting and variants. Machine Learning 36: pp. 105–142
 Bennett, P. (2002). Using asymmetric distributions to improve classifier probabilities: A comparison of new and standard parametric methods. Technical report CMU-CS-02-126, School of Computer Science, Carnegie Mellon University.
 Blake, C., & Merz, C. J. (2000). UCI repository of machine learning databases. Machine-readable data repository, Department of Information and Computer Science, University of California at Irvine, Irvine, CA. Available at http://www.ics.uci.edu/~mlearn/MLRepository.html.
 Bradford, J. P., Kunz, C., Kohavi, R., Brunk, C., Brodley, C. E. (1998) Pruning decision trees with misclassification costs. Proceedings of the Tenth European Conference on Machine Learning. Springer-Verlag, Berlin, pp. 131–136
 Bradley, A. P. (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition 30: pp. 1145–1159
 Breiman, L. (1996) Bagging predictors. Machine Learning 24: pp. 123–140
 Breiman, L. (1998). Out-of-bag estimation. Unpublished manuscript.
 Breiman, L. (2000). Private communication.
 Breiman, L., Friedman, J. H., Olshen, R. A.,& Stone, C. J. (1984). Classification and Regression Trees. Wadsworth International Group.
 Buntine, W. (1991) A theory of learning classification rules. Doctoral dissertation, School of Computer Science, University of Technology, Sydney, Australia
 Cestnik, B. (1990). Estimating probabilities: A crucial task in machine learning. Proceedings of the Ninth European Conference on Artificial Intelligence (pp. 147–149). Pitman.
 Clark, P., Boswell, R. (1991) Rule induction with CN2: Some recent improvements. Proceedings of the Sixth European Working Session on Learning. Springer, Berlin, pp. 151–163
 Danyluk, A., & Provost, F. (2002). Telecommunications network diagnosis. In W. Kloesgen, & J. Zytkow (Eds.), Handbook of Knowledge Discovery and Data Mining, 897–902.
 Domingos, P. (1997) Why does bagging work? A Bayesian account and its implications. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining. AAAI Press, Menlo Park, CA, pp. 155–158
 Domingos, P. (1999) MetaCost: A general method for making classifiers cost-sensitive. Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, New York, pp. 155–164
 Domingos, P. Knowledge acquisition from examples via multiple models. In: Fisher, D. H. eds. (1997) Proceedings of the Fourteenth International Conference on Machine Learning (ICML-97). Morgan Kaufmann, San Francisco, CA, pp. 98–106
 Drummond, C., Holte, R. (2000) Exploiting the cost (in)sensitivity of decision tree splitting criteria. Proceedings of the Seventeenth International Conference on Machine Learning. Morgan Kaufmann, San Francisco, pp. 239–246
 Dzeroski, S., Cestnik, B., Petrovski, I. (1993) Using the m-estimate in rule induction. Journal of Computing and Information Technology 1: pp. 37–46
 Friedman, N., Goldszmidt, M. (1996) Learning Bayesian networks with local structure. Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, San Francisco, pp. 252–262
 Good, I. J. (1965) The Estimation of Probabilities: An Essay on Modern Bayesian Methods. MIT Press, Cambridge, MA
 Gordon, L., Olshen, R. A. (1984) Almost sure consistent nonparametric regression from recursive partitioning schemes. Journal of Multivariate Analysis 15: pp. 147–163
 Hand, D. J. (1997) Construction and Assessment of Classification Rules. John Wiley and Sons, Chichester
 Hand, D. J., Till, R. J. (2001) A simple generalization of the area under the ROC curve for multiple class classification problems. Machine Learning 45: pp. 171–186
 Hanley, J. A., McNeil, B. J. (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143: pp. 29–36
 Hastie, T. J., & Pregibon, D. (1990). Shrinking trees. Technical report, AT&T Laboratories.
 Heckerman, D., Chickering, M., Meek, C., Rounthwaite, R., Kadie, C. (2000) Dependency networks for density estimation, collaborative filtering, and data visualization. Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, San Francisco
 Holte, R., Acker, L., Porter, B. (1989) Concept learning and the problem of small disjuncts. Proceedings of the Eleventh International Joint Conference on Artificial Intelligence. Morgan Kaufmann, San Francisco, pp. 813–818
 Jelinek, F. (1997) Statistical Methods for Speech Recognition. MIT Press, Cambridge, MA
 Kohavi, R., Becker, B., & Sommerfield, D. (1997). Improving simple Bayes. The Ninth European Conference on Machine Learning (pp. 78–87).
 Lim, T.-S., Loh, W.-Y., Shih, Y.-S. (2000) A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning 40: pp. 203–228
 Margineantu, D. D., Dietterich, T. G. Improved class probability estimates from decision tree models. In: Holmes, C. eds. (2001) Nonlinear Estimation and Classification. The Mathematical Sciences Research Institute, University of California, Berkeley
 McCallum, A., Rosenfeld, R., Mitchell, T., Ng, A. Y. (1998) Improving text classification by shrinkage in a hierarchy of classes. Proceedings of the Fifteenth International Conference on Machine Learning. Morgan Kaufmann, San Francisco, pp. 359–367
 Niblett, T. (1987) Constructing decision trees in noisy domains. Proceedings of the Second European Working Session on Learning. Sigma Press, Wilmslow, England, pp. 67–78
 Pazzani, M., Merz, C., Murphy, P., Ali, K., Hume, T., Brunk, C. (1994) Reducing misclassification costs. Proceedings of the Eleventh International Conference on Machine Learning. Morgan Kaufmann, San Francisco, pp. 217–225
 Perlich, C., Provost, F., & Simonoff, J. S. (2003). Tree induction versus logistic regression: A learning-curve analysis. Journal of Machine Learning Research. (In press).
 Provost, F., Domingos, P. (2000) Well-trained PETs: Improving probability estimation trees. Stern School of Business, New York University, NY
 Provost, F., Fawcett, T. (1997) Analysis and visualization of classifier performance: Comparison under imprecise class and cost distributions. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (KDD-97). AAAI Press, Menlo Park, CA, pp. 43–48
 Provost, F., Fawcett, T. (2001) Robust classification for imprecise environments. Machine Learning 42: pp. 203–231
 Provost, F., Fawcett, T., Kohavi, R. (1998) The case against accuracy estimation for comparing induction algorithms. Proceedings of the Fifteenth International Conference on Machine Learning. Morgan Kaufmann, San Francisco, pp. 445–453
 Provost, F., Kolluri, V. (1999) A survey of methods for scaling up inductive algorithms. Data Mining and Knowledge Discovery 3: pp. 131–169
 Quinlan, J. R. (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco
 Simonoff, J. S. (1995) Smoothing categorical data. Journal of Statistical Planning and Inference 47: pp. 41–69
 Smyth, P., Gray, A., Fayyad, U. (1995) Retrofitting decision tree classifiers using kernel density estimation. Proceedings of the Twelfth International Conference on Machine Learning. Morgan Kaufmann, San Francisco, pp. 506–514
 Sobehart, J. R., Stein, R. M., Mikityanskaya, V., & Li, L. (2000). Moody's public firm risk model: A hybrid approach to modeling short term default risk. Technical report, Moody's Investors Service, Global Credit Research. Available: http://www.moodysqra.com/research/crm/53853.asp.
 Swets, J. (1988) Measuring the accuracy of diagnostic systems. Science 240: pp. 1285–1293
 Zadrozny, B., Elkan, C. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In: Brodley, C., Danyluk, A. eds. (2001) Proceedings of the Eighteenth International Conference on Machine Learning. Morgan Kaufmann, San Francisco, pp. 609–616
 Title
 Tree Induction for Probability-Based Ranking
 Journal

Machine Learning
Volume 52, Issue 3, pp. 199–215
 Cover Date
 2003-09-01
 DOI
 10.1023/A:1024099825458
 Print ISSN
 0885-6125
 Online ISSN
 1573-0565
 Publisher
 Kluwer Academic Publishers
 Keywords

 ranking
 probability estimation
 classification
 cost-sensitive learning
 decision trees
 Laplace correction
 bagging
 Authors

 Foster Provost ^{(1)}
 Pedro Domingos ^{(2)}
 Author Affiliations

 1. New York University, New York, NY, USA
 2. University of Washington, Seattle, WA, USA