Under-Sample Binary Data Using CURE for Classification

Conference paper in: Mining Intelligence and Knowledge Exploration (MIKE 2019)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11987)

Abstract

Classification is a major area of research, and a classifier's performance depends heavily on preprocessing. A drawback of most classifiers is that they favour the class with many samples and ignore the class with few samples; this problem is revealed by state-of-the-art evaluation metrics. The usual remedy is to convert the imbalanced data into balanced form before classification. In the proposed work, instead of balancing, the samples are re-sampled with the cluster-based technique CURE: it under-samples by reducing the majority-class samples without forcing them down to the size of the minority class. Experimental results show that data re-sampled through CURE leads to better classification performance.
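
The abstract describes the approach only at a high level. As an illustration of the general idea, the following is a minimal sketch of cluster-based under-sampling of the majority class in Python. It rests on stated assumptions: CURE itself is not available in scikit-learn, so AgglomerativeClustering (also a hierarchical method) stands in for it, and the number of clusters and the one-representative-per-cluster rule are illustrative choices, not the paper's exact procedure.

# Minimal sketch: cluster-based under-sampling of the majority class.
# Assumption: AgglomerativeClustering stands in for CURE, which is not
# available in scikit-learn; n_clusters and the "nearest-to-centre
# representative" rule are illustrative, not the paper's exact settings.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def undersample_majority(X, y, majority_label, n_clusters=50):
    """Cluster the majority class and keep one representative per cluster,
    reducing it without forcing it down to the minority-class size."""
    maj_mask = (y == majority_label)
    X_maj, X_min, y_min = X[maj_mask], X[~maj_mask], y[~maj_mask]

    n_clusters = min(n_clusters, len(X_maj))
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X_maj)

    reps = []
    for c in range(n_clusters):
        members = X_maj[labels == c]
        centre = members.mean(axis=0)
        # keep the member closest to the cluster centre as its representative
        reps.append(members[np.argmin(np.linalg.norm(members - centre, axis=1))])

    X_res = np.vstack([np.asarray(reps), X_min])
    y_res = np.concatenate([np.full(len(reps), majority_label), y_min])
    return X_res, y_res

The resampled (X_res, y_res) can then be fed to any standard classifier; the majority class shrinks to roughly n_clusters samples while the minority class is left untouched.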

Author information

Corresponding author

Correspondence to T. Kathirvalavakumar.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Kathirvalavakumar, T., Karthikeyan, S., Prasath, R. (2020). Under-Sample Binary Data Using CURE for Classification. In: B. R., P., Thenkanidiyoor, V., Prasath, R., Vanga, O. (eds) Mining Intelligence and Knowledge Exploration. MIKE 2019. Lecture Notes in Computer Science (LNAI), vol 11987. Springer, Cham. https://doi.org/10.1007/978-3-030-66187-8_18

  • DOI: https://doi.org/10.1007/978-3-030-66187-8_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-66186-1

  • Online ISBN: 978-3-030-66187-8

  • eBook Packages: Computer Science, Computer Science (R0)
