Abstract
Binary classification algorithms are often used in situations where one of the two classes is extremely rare. A common practice is to oversample units of the rare class when forming the training set. For some classification algorithms, such as logistic classification, there are theoretical results that justify this approach. Similar results are not available for other popular classification algorithms, such as classification trees. This paper discusses the use of balanced datasets for tree classifiers and boosting algorithms when dealing with rare classes, and reports results from analyzing a real dataset and a simulated dataset.
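The oversampling practice mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the function name `balance_by_oversampling`, its parameters, and the toy data are all assumptions made for illustration, and the rare class is simply resampled with replacement until it matches the common class in size.

```python
import random
from collections import Counter


def balance_by_oversampling(rows, labels, rare_label=1, seed=0):
    """Hypothetical helper: resample the rare class with replacement
    until it is as frequent as the common class."""
    rng = random.Random(seed)
    rare = [i for i, y in enumerate(labels) if y == rare_label]
    common = [i for i, y in enumerate(labels) if y != rare_label]
    if not rare or len(rare) >= len(common):
        # Nothing to balance: no rare units, or the class is not rare.
        return list(rows), list(labels)
    # Draw extra rare units with replacement to match the common class size.
    extra = rng.choices(rare, k=len(common) - len(rare))
    idx = common + rare + extra
    rng.shuffle(idx)
    return [rows[i] for i in idx], [labels[i] for i in idx]


# Toy data, made up for illustration: 95 common units and 5 rare ones.
rows = [[float(i)] for i in range(100)]
labels = [1 if i < 5 else 0 for i in range(100)]

bal_rows, bal_labels = balance_by_oversampling(rows, labels)
counts = Counter(bal_labels)
```

Every original unit is kept and only rare units are duplicated, so the balanced set is larger than the original one. A classifier fitted on such a set sees class proportions that differ from those of the population, which is the setting the abstract discusses.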
Copyright information
© 2005 Springer-Verlag Berlin · Heidelberg
About this paper
Cite this paper
Scarpa, B., Torelli, N. (2005). Selecting the Training Set in Classification Problems with Rare Events. In: Bock, HH., et al. New Developments in Classification and Data Analysis. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-27373-5_5
DOI: https://doi.org/10.1007/3-540-27373-5_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23809-6
Online ISBN: 978-3-540-27373-8
eBook Packages: Mathematics and Statistics (R0)