Abstract
Classification is a central task in data-mining research, and a classifier's performance depends heavily on preprocessing. A drawback of most classifiers is that, on imbalanced data, they favour the class with many samples and neglect the class with few samples. This problem is revealed by state-of-the-art evaluation metrics. To overcome it, imbalanced data are usually converted into balanced form before classification. In the proposed work, instead of fully balancing the classes, the samples are re-sampled with the cluster-based technique CURE: it under-samples by reducing the majority-class samples, without equalizing them with the minority class. The experimental results show that data re-sampled through CURE yields better classification performance.
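The idea of cluster-based under-sampling of the majority class can be sketched as follows. Since the chapter's exact procedure and parameters are not reproduced here, this is a simplified stand-in, not the authors' implementation: the majority class is partitioned with a basic k-means step, and each cluster is then summarized by CURE-style representatives (well-scattered points shrunk toward the cluster centroid, as in Guha et al.'s CURE). All function names and parameter values (`n_clusters`, `n_rep`, `shrink`) are illustrative assumptions.

```python
import numpy as np

def kmeans_labels(X, k, iters=20, seed=0):
    """Minimal k-means partition of X into k clusters (stand-in for CURE's
    hierarchical clustering stage)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # assign each point to its nearest center
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return labels

def cure_representatives(cluster_pts, n_rep=5, shrink=0.5):
    """CURE-style summary of one cluster: pick n_rep well-scattered points
    (farthest-point heuristic), then shrink them toward the centroid."""
    centroid = cluster_pts.mean(axis=0)
    reps = [cluster_pts[np.argmax(((cluster_pts - centroid) ** 2).sum(-1))]]
    while len(reps) < min(n_rep, len(cluster_pts)):
        # next representative: point farthest from all chosen so far
        d = np.min([((cluster_pts - r) ** 2).sum(-1) for r in reps], axis=0)
        reps.append(cluster_pts[np.argmax(d)])
    reps = np.asarray(reps, dtype=float)
    return reps + shrink * (centroid - reps)

def undersample_majority(X_maj, n_clusters=4, n_rep=5, shrink=0.5):
    """Replace the majority-class samples by per-cluster representatives,
    reducing their count without balancing against the minority class."""
    labels = kmeans_labels(X_maj, n_clusters)
    kept = [cure_representatives(X_maj[labels == j], n_rep, shrink)
            for j in range(n_clusters) if np.any(labels == j)]
    return np.vstack(kept)

# Illustrative use: a 200-sample majority class is reduced to at most
# n_clusters * n_rep representatives; the minority class is left untouched.
rng = np.random.default_rng(1)
X_maj = rng.normal(size=(200, 2))
X_min = rng.normal(loc=3.0, size=(20, 2))
X_maj_reduced = undersample_majority(X_maj)
```

Note the contrast with balancing methods such as SMOTE: here the reduced majority set may still outnumber the minority class, which is the point of the proposed re-sampling.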
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Kathirvalavakumar, T., Karthikeyan, S., Prasath, R. (2020). Under-Sample Binary Data Using CURE for Classification. In: B. R., P., Thenkanidiyoor, V., Prasath, R., Vanga, O. (eds) Mining Intelligence and Knowledge Exploration. MIKE 2019. Lecture Notes in Computer Science, vol 11987. Springer, Cham. https://doi.org/10.1007/978-3-030-66187-8_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-66186-1
Online ISBN: 978-3-030-66187-8
eBook Packages: Computer Science, Computer Science (R0)