Machine Learning, Volume 52, Issue 3, pp 199–215
Tree Induction for Probability-Based Ranking
 Foster Provost, New York University
 Pedro Domingos, University of Washington
Abstract
Tree induction is one of the most effective and widely used methods for building classification models. However, many applications require cases to be ranked by the probability of class membership. Probability estimation trees (PETs) have the same attractive features as classification trees (e.g., comprehensibility, accuracy and efficiency in high dimensions and on large data sets). Unfortunately, decision trees have been found to provide poor probability estimates. Several techniques have been proposed to build more accurate PETs, but, to our knowledge, there has not been a systematic experimental analysis of which techniques actually improve the probability-based rankings, and by how much. In this paper we first discuss why the decision-tree representation is not intrinsically inadequate for probability estimation. Inaccurate probabilities are partially the result of decision-tree induction algorithms that focus on maximizing classification accuracy and minimizing tree size (for example via reduced-error pruning). Larger trees can be better for probability estimation, even if the extra size is superfluous for accuracy maximization. We then present the results of a comprehensive set of experiments, testing some straightforward methods for improving probability-based rankings. We show that using a simple, common smoothing method, the Laplace correction, uniformly improves probability-based rankings. In addition, bagging substantially improves the rankings, and is even more effective for this purpose than for improving accuracy. We conclude that PETs, with these simple modifications, should be considered when rankings based on class-membership probability are required.
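For a concrete view of the smoothing method the abstract refers to, here is a minimal sketch of the Laplace correction as commonly applied to class-probability estimates at decision-tree leaves. The function names and the comparison with the uncorrected estimate are illustrative, not taken from the paper:

```python
def laplace_estimate(k: int, n: int, num_classes: int = 2) -> float:
    """Laplace-corrected probability of a class with k of the n
    training examples at a leaf: (k + 1) / (n + C) for C classes.
    Small, pure leaves are smoothed toward the uniform prior 1/C."""
    return (k + 1) / (n + num_classes)

def ml_estimate(k: int, n: int) -> float:
    """Uncorrected maximum-likelihood estimate k/n (0 or 1 at pure leaves)."""
    return k / n if n else 0.0

# A pure leaf with only three examples no longer claims certainty:
print(ml_estimate(3, 3))         # 1.0
print(laplace_estimate(3, 3))    # 0.8
# A pure leaf with many examples stays close to 1:
print(laplace_estimate(99, 99))  # ~0.990
```

The correction matters for ranking because two pure leaves with very different support (3/3 vs. 99/99) receive different scores rather than tying at probability 1. Bagged PETs go further by averaging such leaf estimates over trees grown on bootstrap samples.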
 Title
 Tree Induction for Probability-Based Ranking
 Journal

Machine Learning
Volume 52, Issue 3, pp 199–215
 Cover Date
 2003-09
 DOI
 10.1023/A:1024099825458
 Print ISSN
 0885-6125
 Online ISSN
 1573-0565
 Publisher
 Kluwer Academic Publishers
 Keywords

 ranking
 probability estimation
 classification
 cost-sensitive learning
 decision trees
 Laplace correction
 bagging
 Authors

 Foster Provost (1)
 Pedro Domingos (2)
 Author Affiliations

 1. New York University, New York, NY, USA
 2. University of Washington, Seattle, WA, USA