Learning from Imbalanced Data in Presence of Noisy and Borderline Examples

Napierała, Krystyna; Stefanowski, Jerzy; Wilk, Szymon

doi:10.1007/978-3-642-13529-3_18

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6086)

Included in the following conference series:

International Conference on Rough Sets and Current Trends in Computing

Abstract

In this paper we studied re-sampling methods for learning classifiers from imbalanced data. We carried out a series of experiments on artificial data sets to explore the impact of noisy and borderline examples from the minority class on classifier performance. The results showed that when the data were sufficiently disturbed by these factors, the focused re-sampling methods, NCR and our SPIDER2, strongly outperformed the oversampling methods. They also performed better on real-life data, where PCA visualizations suggested the possible existence of noisy examples and large overlapping areas between classes.

Author information

Authors and Affiliations

Institute of Computing Science, Poznań University of Technology, ul. Piotrowo 2, 60–965, Poznań, Poland
Krystyna Napierała, Jerzy Stefanowski & Szymon Wilk

Editor information

Editors and Affiliations

Institute of Mathematics, Warsaw University, Banacha 2, 02-097, Warsaw, Poland
Marcin Szczuka
ICS, Warsaw University of Technology
Marzena Kryszkiewicz
Department of Applied Computer Science, University of Winnipeg, R3B 2E9, Winnipeg, Manitoba, Canada
Sheela Ramanna
Dept. of Computer Science, The University of Wales, Aberystwyth, UK
Richard Jensen
Harbin Institute of Technology, PO Box 458, 150006, Harbin, China
Qinghua Hu

About this paper

Cite this paper

Napierała, K., Stefanowski, J., Wilk, S. (2010). Learning from Imbalanced Data in Presence of Noisy and Borderline Examples. In: Szczuka, M., Kryszkiewicz, M., Ramanna, S., Jensen, R., Hu, Q. (eds) Rough Sets and Current Trends in Computing. RSCTC 2010. Lecture Notes in Computer Science, vol 6086. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13529-3_18

DOI: https://doi.org/10.1007/978-3-642-13529-3_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13528-6
Online ISBN: 978-3-642-13529-3
eBook Packages: Computer Science, Computer Science (R0)
