Abstract
We studied three data reduction methods that balance an imbalanced class distribution in order to improve the identification of difficult small classes. The new method, the neighborhood cleaning rule (NCL), outperformed simple random selection and one-sided selection in experiments with ten data sets. All reduction methods improved identification of small classes (20–30%), but the differences between the methods were not significant. However, significant differences in the accuracies, true-positive rates, and true-negative rates obtained with the 3-nearest neighbor method and C4.5 from the reduced data favored NCL. The results suggest that NCL is a useful method for improving the modeling of difficult small classes and for building classifiers that identify these classes from real-world data.
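To make the terms in the abstract concrete, the following is a minimal Python sketch of the simplest of the three approaches, random data reduction, together with the true-positive and true-negative rates used as evaluation measures. It is an illustration only and not the paper's procedure: the function names `random_undersample` and `tpr_tnr`, the choice to shrink every class to the size of the smallest one, and the toy data are all assumptions. NCL and one-sided selection are not shown, since the abstract does not spell out their steps.

```python
import random
from collections import Counter


def random_undersample(rows, labels, seed=0):
    """Randomly shrink every class to the size of the smallest class.

    Assumption: balancing to the smallest class size; the paper may use
    a different target size.
    """
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = min(len(members) for members in by_class.values())
    reduced_rows, reduced_labels = [], []
    for label, members in by_class.items():
        for row in rng.sample(members, target):
            reduced_rows.append(row)
            reduced_labels.append(label)
    return reduced_rows, reduced_labels


def tpr_tnr(y_true, y_pred, positive):
    """Return (true-positive rate, true-negative rate) for class `positive`.

    Assumes y_true contains at least one positive and one negative case.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    # Hypothetical toy data: eight majority cases and two minority cases.
    rows = list(range(10))
    labels = ["large"] * 8 + ["small"] * 2
    _, balanced_labels = random_undersample(rows, labels)
    print(Counter(balanced_labels))
    print(tpr_tnr(["small", "small", "large", "large"],
                  ["small", "large", "large", "large"],
                  positive="small"))
```

Treating the small class as the positive class, the true-positive rate measures how well that class is identified, while the true-negative rate shows how much performance on the remaining classes is preserved after reduction.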
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Laurikkala, J. (2001). Improving Identification of Difficult Small Classes by Balancing Class Distribution. In: Quaglini, S., Barahona, P., Andreassen, S. (eds) Artificial Intelligence in Medicine. AIME 2001. Lecture Notes in Computer Science, vol 2101. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-48229-6_9
DOI: https://doi.org/10.1007/3-540-48229-6_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42294-5
Online ISBN: 978-3-540-48229-1
eBook Packages: Springer Book Archive