A Preliminary Study of SMOTE on Imbalanced Big Datasets When Dealing with Sparse and Dense High Dimensionality

Part of the Lecture Notes in Computer Science book series (LNCS, volume 13264)

Abstract

Interest in exploiting big datasets with machine learning has led to the adaptation of classic strategies to this new paradigm, which is characterized by volume, velocity, and variety. Because data quality is a determining factor in constructing a classifier, it has also been necessary to adapt existing data preprocessing techniques or develop new ones. One of the most significant challenges is the class imbalance problem, in which the class of interest (the minority class) has fewer examples than another class, called the majority class. One of the most widely recognized techniques for alleviating this problem is SMOTE, which generates synthetic instances of the minority class through a process based on the nearest neighbor rule and the Euclidean distance. Various articles have shown that SMOTE is not appropriate for datasets with high dimensionality. In big data, however, high-dimensional datasets often contain many zeros. In this article, therefore, our objective is to analyze the behavior of SMOTE-BD on imbalanced big datasets with sparse and dense high dimensionality. Experimental results using two classifiers and big datasets with different dimensionalities suggest that sparsity is a more predominant factor than dimensionality in the behavior of SMOTE-BD.
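
The generation step that the abstract attributes to SMOTE (nearest neighbor rule plus Euclidean distance) can be illustrated with a minimal sketch. This is not the SMOTE-BD implementation studied in the paper; it is a small, single-machine NumPy approximation of the classic idea. The function name `smote_sketch`, its parameters, and the choice to pick base samples at random are assumptions made for illustration only.

```python
import numpy as np


def smote_sketch(X_min, n_synthetic, k=5, seed=0):
    """Illustrative SMOTE-style oversampling (hypothetical helper, not SMOTE-BD).

    Each synthetic sample is placed on the line segment between a minority
    sample and one of its k nearest minority neighbors (Euclidean distance).
    """
    X_min = np.asarray(X_min, dtype=float)
    n = X_min.shape[0]
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)
    rng = np.random.default_rng(seed)

    # Brute-force pairwise Euclidean distances (only viable for small data).
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a sample is never its own neighbor

    # Indices of the k nearest minority neighbors of every minority sample.
    neighbors = np.argsort(dist, axis=1)[:, :k]

    # Assumption: base samples are drawn at random rather than visited in turn.
    base = rng.integers(0, n, size=n_synthetic)
    pick = rng.integers(0, k, size=n_synthetic)
    nn = neighbors[base, pick]

    # Random interpolation factor in [0, 1) for each synthetic sample.
    gap = rng.random((n_synthetic, 1))
    return X_min[base] + gap * (X_min[nn] - X_min[base])
```

Calling `smote_sketch(X_min, n_synthetic=100, k=5)` on an array of minority-class rows returns 100 new rows, each a convex combination of two existing minority samples. The brute-force distance matrix grows quadratically with the number of minority samples, which is one reason a distributed design is needed at big data scale.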

Keywords

  • Big data
  • SMOTE
  • Class imbalance
  • High dimensionality
  • Dense dataset
  • Sparse dataset

Notes

  1. Although this medium-high dimensional dataset may not represent a big data problem in terms of volume, we believe it can be treated as such because it may not be possible to process and analyze it on standard hardware.

Corresponding author

Correspondence to A. Bolívar.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Bolívar, A., García, V., Florencia, R., Alejo, R., Rivera, G., Sánchez-Solís, J.P. (2022). A Preliminary Study of SMOTE on Imbalanced Big Datasets When Dealing with Sparse and Dense High Dimensionality. In: Vergara-Villegas, O.O., Cruz-Sánchez, V.G., Sossa-Azuela, J.H., Carrasco-Ochoa, J.A., Martínez-Trinidad, J.F., Olvera-López, J.A. (eds) Pattern Recognition. MCPR 2022. Lecture Notes in Computer Science, vol 13264. Springer, Cham. https://doi.org/10.1007/978-3-031-07750-0_5

  • DOI: https://doi.org/10.1007/978-3-031-07750-0_5

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-07749-4

  • Online ISBN: 978-3-031-07750-0

  • eBook Packages: Computer Science, Computer Science (R0)