An Analysis of Local and Global Solutions to Address Big Data Imbalanced Classification: A Case Study with SMOTE Preprocessing

  • María José Basgall
  • Waldo Hasperué
  • Marcelo Naiouf
  • Alberto Fernández
  • Francisco Herrera
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 1050)


Handling the huge amount of data that is continuously generated is an important challenge in the Machine Learning field. Traditional techniques must be adapted, or new ones created, and distributed technologies are needed to cope with the significant scalability constraints imposed by the Big Data context.

In many Big Data applications for classification, there are some classes that are highly underrepresented, leading to what is known as the imbalanced classification problem. In this scenario, learning algorithms are often biased towards the majority classes, treating minority ones as outliers or noise.

Consequently, preprocessing techniques have been developed to balance the class distribution, either by removing majority instances (undersampling) or by creating minority examples (oversampling). Among the oversampling methods, one of the most widespread is the SMOTE algorithm, which creates artificial examples according to the neighborhood of each minority class instance.
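
The neighborhood-based generation described above can be illustrated with a minimal, single-machine sketch. It assumes the usual SMOTE formulation, in which a synthetic point is placed at a random position on the segment between a minority instance and one of its k nearest minority neighbors. The function name `smote` and the parameters `k`, `n_new` and `seed` are illustrative choices, not taken from the paper, and the round-robin choice of base instances is a simplification.

```python
import math
import random


def smote(minority, k=5, n_new=10, seed=0):
    """Return n_new synthetic points built from minority-class neighbors.

    Illustrative sketch only: base instances are visited round-robin and
    neighbors are found by brute force over the minority class.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority instances")
    if k < 1:
        raise ValueError("k must be at least 1")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbors than other points
    synthetic = []
    for i in range(n_new):
        base_idx = i % len(minority)
        base = minority[base_idx]
        # k nearest minority neighbors of the base point, excluding itself
        others = [p for j, p in enumerate(minority) if j != base_idx]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


# Four minority points at the corners of the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, k=2, n_new=8)
```

Here `n_new` plays the role of the oversampling degree and `k` the neighborhood value. Every generated point lies on a segment between two existing minority points, so it stays inside the region the minority class already occupies.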

In this work, our objective is to analyze the behavior of SMOTE in Big Data as a function of some key aspects such as the oversampling degree, the neighborhood value and, especially, the type of distributed design (local vs. global).
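
To make the local vs. global distinction concrete, the following sketch simulates data partitions as plain Python lists. It assumes that "local" means each partition searches for neighbors only among its own minority instances, while "global" means neighbors are searched across all partitions. This reading, together with the helper names `nearest_neighbors`, `local_neighbors` and `global_neighbors`, is an assumption made for illustration and not the authors' implementation.

```python
import math


def nearest_neighbors(point, candidates, k):
    """Brute-force k nearest neighbors of point among candidates."""
    others = [c for c in candidates if c != point]
    return sorted(others, key=lambda c: math.dist(point, c))[:k]


def local_neighbors(partitions, k):
    """Local design: each partition only sees its own minority instances."""
    return {
        p: nearest_neighbors(p, part, k)
        for part in partitions
        for p in part
    }


def global_neighbors(partitions, k):
    """Global design: neighbors are searched over all minority instances."""
    everything = [p for part in partitions for p in part]
    return {p: nearest_neighbors(p, everything, k) for p in everything}


# Two simulated partitions; close points happen to be split across them.
partitions = [
    [(0.0, 0.0), (5.0, 5.0)],
    [(0.1, 0.0), (5.1, 5.0)],
]
local = local_neighbors(partitions, k=1)
glob = global_neighbors(partitions, k=1)
```

In this toy example the local design pairs `(0.0, 0.0)` with the distant `(5.0, 5.0)`, because its true nearest neighbor `(0.1, 0.0)` sits in the other partition, whereas the global design finds `(0.1, 0.0)`. Synthetic examples interpolated along those two segments would therefore fall in very different regions.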


Keywords: Big Data, Imbalanced classification, Preprocessing techniques, SMOTE, Scalability



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. UNLP, CONICET, III-LIDI, La Plata, Argentina
  2. Instituto de Investigación en Informática (III-LIDI), CIC-PBA Facultad de Informática - Universidad Nacional de La Plata, La Plata, Argentina
  3. University of Granada, Granada, Spain
  4. DaSCI Andalusian Institute of Data Science and Computational Intelligence, University of Granada, Granada, Spain
