Skip to main content

An Analysis of Local and Global Solutions to Address Big Data Imbalanced Classification: A Case Study with SMOTE Preprocessing

  • Conference paper
  • First Online:
Cloud Computing and Big Data (JCC&BD 2019)

Abstract

Addressing the huge amount of data continuously generated is an important challenge in the Machine Learning field. The need to adapt the traditional techniques or create new ones is evident. To do so, distributed technologies have to be used to deal with the significant scalability constraints due to the Big Data context.

In many Big Data applications for classification, there are some classes that are highly underrepresented, leading to what is known as the imbalanced classification problem. In this scenario, learning algorithms are often biased towards the majority classes, treating minority ones as outliers or noise.

Consequently, preprocessing techniques to balance the class distribution were developed. This can be achieved by suppressing majority instances (undersampling) or by creating minority examples (oversampling). Regarding the oversampling methods, one of the most widespread is the SMOTE algorithm, which creates artificial examples according to the neighborhood of each minority class instance.

In this work, our objective is to analyze the SMOTE behavior in Big Data as a function of some key aspects such as the oversampling degree, the neighborhood value and, specially, the type of distributed design (local vs. global).

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

  1. 1.

    The SMOTE variants are abbreviated as “SMT-BD” or “SMT-MR” in all tables.

References

  1. Chen, C.L.P., Zhang, C.-Y.: Data-intensive applications, challenges, techniques and technologies: a survey on big data. Inf. Sci. 275, 314–347 (2014)

    Article  Google Scholar 

  2. Prati, R.C., Batista, G.E.A.P.A., Silva, D.F.: Class imbalance revisited: a new experimental setup to assess the performance of treatment methods. Knowl. Inf. Syst. 45(1), 247–270 (2015)

    Article  Google Scholar 

  3. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: Proceedings of the 6th Conference on Symposium on Operating Systems Design and Implementation, OSDI 2004, vol. 6, p. 10. USENIX Association, Berkeley (2004)

    Google Scholar 

  4. Ramírez-Gallego, S., Fernández, A., García, S., Chen, M., Herrera, F.: Big data: tutorial and guidelines on information and process fusion for analytics algorithms with mapreduce. Inf. Fusion 42, 51–61 (2018)

    Article  Google Scholar 

  5. García-Gil, D., Luengo, J., García, S., Herrera, F.: Enabling smart data: noise filtering in big data classification. Inf. Sci. 479, 135–152 (2019)

    Article  Google Scholar 

  6. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)

    Article  Google Scholar 

  7. Fernandez, A., Garcia, S., Herrera, F., Chawla, N.V.: Smote for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. J. Artif. Intell. Res. 61, 863–905 (2018)

    Article  MathSciNet  Google Scholar 

  8. White, T.: Hadoop: The Definitive Guide, 4th edn. O’Reilly Media, Sebastopol (2015)

    Google Scholar 

  9. Zaharia, M., et al.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. Presented as part of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2012), pp. 15–28. USENIX, San Jose (2012)

    Google Scholar 

  10. Karau, H., Konwinski, A., Wendell, P., Zaharia, M.: Learning Spark: Lightning-Fast Big Data Analytics, 1st edn. O’Reilly Media, Sebastopol (2015)

    Google Scholar 

  11. Meng, X., et al.: MLlib: machine learning in apache spark. J. Mach. Learn. Res. 17(34), 1–7 (2016)

    MathSciNet  MATH  Google Scholar 

  12. Zaharia, M., et al.: Discretized streams: fault-tolerant streaming computation at scale. In: Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP 2013, pp. 423–438. ACM, New York (2013)

    Google Scholar 

  13. Batista, G.E.A.P.A., Prati, R.C., Monard, M.C.: A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor. Newsl. 6(1), 20–29 (2004)

    Article  Google Scholar 

  14. López, V., Fernández, A., García, S., Palade, V., Herrera, F.: An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Inf. Sci. 250(20), 113–141 (2013)

    Article  Google Scholar 

  15. Fernandez, A., del Rio, S., Chawla, N.V., Herrera, F.: An insight into imbalanced big data classification: Outcomes and challenges. Complex Intell. Syst. 3(2), 105–120 (2017)

    Article  Google Scholar 

  16. Basgall, M.J., Hasperué, W., Naiouf, M., Fernández, A., Herrera, F.: SMOTE-BD: an exact and scalable oversampling method for imbalanced classification in big data. J. Comput. Sci. Technol. 18(03), e23 (2018)

    Article  Google Scholar 

  17. SMOTE-BD Spark Package (2018). https://spark-packages.org/package/majobasgall/smote-bd

  18. Maillo, J., Ramírez-Gallego, S., Triguero, I., Herrera, F.: kNN-IS: an iterative spark-based design of the k-nearest neighbors classifier for big data. Knowl.-Based Syst. 117, 3–15 (2017)

    Article  Google Scholar 

  19. SMOTE-MR source code (2018). https://github.com/majobasgall/smote-mr

  20. Fernandez, A., Herrera, F., Cordon, O., Jose del Jesus, M., Marcelloni, F.: Evolutionary fuzzy systems for explainable artificial intelligence: why, when, what for, and where to? IEEE Comput. Intell. Mag. 14(1), 69–81 (2019)

    Article  Google Scholar 

  21. Lichman, M.: UCI machine learning repository (2013)

    Google Scholar 

  22. Gutierrez, P.D., Lastra, M., Benitez, J.M., Herrera, F.: SMOTE-GPU: big data preprocessing on commodity hardware for imbalanced classification. Prog. Artif. Intell. 6(4), 347–354 (2017)

    Article  Google Scholar 

  23. Barandela, R., Sánchez, J.S., García, V., Rangel, E.: Strategies for learning in class imbalance problems. Pattern Recognit. 36(3), 849–851 (2003)

    Article  Google Scholar 

  24. Huang, J., Ling, C.X.: Using AUC and accuracy in evaluating learning algorithms. IEEE Trans. Knowl. Data Eng. 17(3), 299–310 (2005)

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to María José Basgall .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Basgall, M.J., Hasperué, W., Naiouf, M., Fernández, A., Herrera, F. (2019). An Analysis of Local and Global Solutions to Address Big Data Imbalanced Classification: A Case Study with SMOTE Preprocessing. In: Naiouf, M., Chichizola, F., Rucci, E. (eds) Cloud Computing and Big Data. JCC&BD 2019. Communications in Computer and Information Science, vol 1050. Springer, Cham. https://doi.org/10.1007/978-3-030-27713-0_7

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-27713-0_7

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-27712-3

  • Online ISBN: 978-3-030-27713-0

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics