Abstract
Classification is a central task in data-mining research, and a classifier's performance depends heavily on preprocessing. A drawback of most classifiers is that, on imbalanced data, they favour the class with many samples and neglect the class with few samples. This problem is revealed by state-of-the-art evaluation metrics. To overcome it, imbalanced data are usually converted into balanced form before classification. In the proposed work, instead of fully balancing the classes, the samples are re-sampled with the cluster-based technique CURE: it under-samples by reducing the majority-class samples, without equalizing them with the minority class. The experimental results show that data re-sampled through CURE yields better classification performance.
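The idea of cluster-based under-sampling of the majority class can be sketched as follows. Since the chapter's exact procedure and parameters are not reproduced here, this is a simplified stand-in, not the authors' implementation: the majority class is partitioned with a basic k-means step, and each cluster is then summarized by CURE-style representatives (well-scattered points shrunk toward the cluster centroid, as in Guha et al.'s CURE). All function names and parameter values (`n_clusters`, `n_rep`, `shrink`) are illustrative assumptions.

```python
import numpy as np

def kmeans_labels(X, k, iters=20, seed=0):
    """Minimal k-means partition of X into k clusters (stand-in for CURE's
    hierarchical clustering stage)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # assign each point to its nearest center
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return labels

def cure_representatives(cluster_pts, n_rep=5, shrink=0.5):
    """CURE-style summary of one cluster: pick n_rep well-scattered points
    (farthest-point heuristic), then shrink them toward the centroid."""
    centroid = cluster_pts.mean(axis=0)
    reps = [cluster_pts[np.argmax(((cluster_pts - centroid) ** 2).sum(-1))]]
    while len(reps) < min(n_rep, len(cluster_pts)):
        # next representative: point farthest from all chosen so far
        d = np.min([((cluster_pts - r) ** 2).sum(-1) for r in reps], axis=0)
        reps.append(cluster_pts[np.argmax(d)])
    reps = np.asarray(reps, dtype=float)
    return reps + shrink * (centroid - reps)

def undersample_majority(X_maj, n_clusters=4, n_rep=5, shrink=0.5):
    """Replace the majority-class samples by per-cluster representatives,
    reducing their count without balancing against the minority class."""
    labels = kmeans_labels(X_maj, n_clusters)
    kept = [cure_representatives(X_maj[labels == j], n_rep, shrink)
            for j in range(n_clusters) if np.any(labels == j)]
    return np.vstack(kept)

# Illustrative use: a 200-sample majority class is reduced to at most
# n_clusters * n_rep representatives; the minority class is left untouched.
rng = np.random.default_rng(1)
X_maj = rng.normal(size=(200, 2))
X_min = rng.normal(loc=3.0, size=(20, 2))
X_maj_reduced = undersample_majority(X_maj)
```

Note the contrast with balancing methods such as SMOTE: here the reduced majority set may still outnumber the minority class, which is the point of the proposed re-sampling.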
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Kathirvalavakumar, T., Karthikeyan, S., Prasath, R. (2020). Under-Sample Binary Data Using CURE for Classification. In: B. R., P., Thenkanidiyoor, V., Prasath, R., Vanga, O. (eds) Mining Intelligence and Knowledge Exploration. MIKE 2019. Lecture Notes in Computer Science, vol 11987. Springer, Cham. https://doi.org/10.1007/978-3-030-66187-8_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-66186-1
Online ISBN: 978-3-030-66187-8
eBook Packages: Computer Science, Computer Science (R0)