Machine Learning, Volume 52, Issue 3, pp. 199–215

Tree Induction for Probability-Based Ranking

  • Foster Provost
  • Pedro Domingos

Abstract

Tree induction is one of the most effective and widely used methods for building classification models. However, many applications require cases to be ranked by the probability of class membership. Probability estimation trees (PETs) have the same attractive features as classification trees (e.g., comprehensibility, accuracy and efficiency in high dimensions and on large data sets). Unfortunately, decision trees have been found to provide poor probability estimates. Several techniques have been proposed to build more accurate PETs, but, to our knowledge, there has not been a systematic experimental analysis of which techniques actually improve the probability-based rankings, and by how much. In this paper we first discuss why the decision-tree representation is not intrinsically inadequate for probability estimation. Inaccurate probabilities are partially the result of decision-tree induction algorithms that focus on maximizing classification accuracy and minimizing tree size (for example via reduced-error pruning). Larger trees can be better for probability estimation, even if the extra size is superfluous for accuracy maximization. We then present the results of a comprehensive set of experiments, testing some straightforward methods for improving probability-based rankings. We show that using a simple, common smoothing method—the Laplace correction—uniformly improves probability-based rankings. In addition, bagging substantially improves the rankings, and is even more effective for this purpose than for improving accuracy. We conclude that PETs, with these simple modifications, should be considered when rankings based on class-membership probability are required.

Keywords: ranking, probability estimation, classification, cost-sensitive learning, decision trees, Laplace correction, bagging
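The two modifications the abstract recommends can be made concrete. Below is a minimal sketch (not the authors' code) of a Laplace-corrected, bagged probability estimation tree, written against scikit-learn and assuming binary 0/1 labels; the function names `laplace_scores` and `bagged_pet_scores` are illustrative. The Laplace correction replaces a leaf's raw frequency estimate k/n with (k + 1)/(n + C), where k of the n training examples reaching the leaf belong to the class of interest and C is the number of classes; bagging then averages these estimates over trees grown on bootstrap samples.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

def laplace_scores(tree, X_train, y_train, X_test):
    """Laplace-corrected leaf estimates: (k + 1) / (n + C).
    Leaf counts are recomputed from the training data rather than read
    from scikit-learn internals, so the sketch is version-independent."""
    y_train = np.asarray(y_train)
    train_leaves = tree.apply(X_train)          # leaf index per training example
    test_leaves = tree.apply(X_test)            # leaf index per test example
    n_classes = len(np.unique(y_train))         # C in the correction
    scores = np.empty(len(X_test))
    for i, leaf in enumerate(test_leaves):
        in_leaf = train_leaves == leaf
        n = in_leaf.sum()
        k = np.sum(y_train[in_leaf] == 1)       # assumes binary 0/1 labels
        scores[i] = (k + 1.0) / (n + n_classes)
    return scores

def bagged_pet_scores(X_train, y_train, X_test, n_trees=25, seed=0):
    """Average Laplace-corrected scores over trees grown on bootstrap
    samples. Trees are grown to full depth (scikit-learn's default, i.e.
    no pruning), in the spirit of the paper's argument that larger trees
    can rank better even when the extra size does not help accuracy."""
    rng = np.random.RandomState(seed)
    total = np.zeros(len(X_test))
    for _ in range(n_trees):
        Xb, yb = resample(X_train, y_train, random_state=rng)  # bootstrap
        t = DecisionTreeClassifier(random_state=rng).fit(Xb, yb)
        total += laplace_scores(t, Xb, yb, X_test)
    return total / n_trees
```

At small leaves the raw frequency estimate is extreme (0 or 1); the correction shrinks it toward 1/C, and averaging over bootstrap replicates smooths the resulting ranking further.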

Copyright information

© Kluwer Academic Publishers 2003

Authors and Affiliations

  • Foster Provost, New York University, New York, USA
  • Pedro Domingos, University of Washington, Seattle, USA
