Investigation of Imbalanced Big Data Set Classification: Clustering Minority Samples Over Sampling Technique

  • Conference paper
Next Generation Information Processing System

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1162)

Abstract

Most real-world data sets exhibit a skewed class distribution, in contrast to well-established benchmark data sets: the number of instances of one class far exceeds that of the other classes. This uneven distribution of classes yields imbalanced data sets, which pose extreme difficulty for learning procedures. Additionally, because of their intrinsically complex data features, the analysis of such imbalanced data sets has opened an avenue for focused research. Imbalanced class distribution is effectively handled by oversampling the minority class data, an approach that is usually independent of the classifier. An oversampling technique, the Clustering Minority Samples Over Sampling Technique (CMSOT), is proposed to enhance the classification of imbalanced data sets. The proposed technique is implemented on Apache Hadoop in a MapReduce environment. The data sets are mainly drawn from the UCI repository. The effect on true positive rates in relation to the imbalance ratio is studied, along with the improvement in classification obtained from the generated pool of samples. The experimental results, together with the corresponding statistical analysis of the oversampled data sets, clearly indicate the superiority of the proposed technique over the selected benchmark techniques.
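The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' CMSOT, whose details the abstract does not give, and it runs on a single machine rather than on Hadoop MapReduce. It assumes the imbalance ratio is the majority count divided by the minority count, takes the true positive rate as recall on the minority class, and swaps in a generic "cluster the minority samples, then interpolate within a cluster" scheme in the spirit of SMOTE. All function names and parameters (`imbalance_ratio`, `true_positive_rate`, `kmeans`, `cluster_oversample`, `k`, `n_new`) are hypothetical.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count (assumed definition)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def true_positive_rate(y_true, y_pred, positive):
    """Fraction of actual positives that were predicted positive (recall)."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


def _dist2(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on a list of tuples; returns the non-empty clusters."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # requires k <= len(points)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: _dist2(p, centers[i]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its cluster; keep it if empty.
        centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return [cl for cl in clusters if cl]


def cluster_oversample(minority, n_new, k=2, seed=0):
    """Generic sketch, NOT the paper's CMSOT: cluster the minority samples,
    then create each synthetic point by interpolating between two samples
    drawn from the same cluster."""
    rng = random.Random(seed)
    clusters = kmeans(minority, k, seed=seed)
    synthetic = []
    for _ in range(n_new):
        cluster = rng.choice(clusters)
        a, b = rng.choice(cluster), rng.choice(cluster)
        t = rng.random()
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic
```

Interpolating only between members of the same cluster keeps synthetic points inside regions already occupied by minority samples, which is one plausible motivation for clustering before oversampling. Whether CMSOT selects clusters or generates samples this way is not stated in the abstract.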


Corresponding author

Correspondence to Sachin Patil.

Copyright information

© 2021 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Patil, S., Sonavane, S. (2021). Investigation of Imbalanced Big Data Set Classification: Clustering Minority Samples Over Sampling Technique. In: Deshpande, P., Abraham, A., Iyer, B., Ma, K. (eds) Next Generation Information Processing System. Advances in Intelligent Systems and Computing, vol 1162. Springer, Singapore. https://doi.org/10.1007/978-981-15-4851-2_32
