An Analysis of Local and Global Solutions to Address Big Data Imbalanced Classification: A Case Study with SMOTE Preprocessing

Basgall, María José; Hasperué, Waldo; Naiouf, Marcelo; Fernández, Alberto; Herrera, Francisco

doi:10.1007/978-3-030-27713-0_7

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1050)

Included in the following conference series:

Conference on Cloud Computing and Big Data

Abstract

Handling the huge volume of continuously generated data is an important challenge in Machine Learning. Traditional techniques must be adapted, or new ones created, and distributed technologies are needed to overcome the significant scalability constraints of the Big Data context.

In many Big Data applications for classification, there are some classes that are highly underrepresented, leading to what is known as the imbalanced classification problem. In this scenario, learning algorithms are often biased towards the majority classes, treating minority ones as outliers or noise.

Consequently, preprocessing techniques have been developed to balance the class distribution, either by removing majority instances (undersampling) or by creating minority examples (oversampling). Among oversampling methods, one of the most widespread is the SMOTE algorithm, which creates artificial examples according to the neighborhood of each minority class instance.
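As an illustration of the neighborhood-based oversampling described above, the following is a minimal, sequential sketch of the SMOTE idea in plain Python. It is not the authors' distributed implementation: the function name `smote`, its parameters `k` and `n_per_instance`, and the toy data are assumptions made for this example only.

```python
import math
import random


def smote(minority, k=5, n_per_instance=1, seed=0):
    """Create synthetic minority points by interpolating toward near neighbors.

    Hypothetical helper for illustration; `minority` is a list of numeric tuples.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority instances")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbors than other points
    synthetic = []
    for i, base in enumerate(minority):
        # k nearest minority neighbors of `base`, excluding `base` itself
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        for _ in range(n_per_instance):
            neighbor = rng.choice(neighbors)
            gap = rng.random()  # position along the segment base -> neighbor
            synthetic.append(
                tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
            )
    return synthetic


minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, k=2, n_per_instance=3)
```

Each synthetic point lies on the segment between a minority instance and one of its `k` nearest minority neighbors, so here `n_per_instance` plays the role of the oversampling degree and `k` that of the neighborhood value.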

In this work, our objective is to analyze the behavior of SMOTE in Big Data as a function of several key aspects: the oversampling degree, the neighborhood value and, especially, the type of distributed design (local vs. global).
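The local vs. global distinction can be sketched on a toy example. The code below assumes that "local" means each data partition searches for neighbors only among its own instances, while "global" means every instance searches across all partitions. That reading, the helper names and the data are assumptions for illustration, not the authors' code, and the sketch further assumes all points are distinct.

```python
import math


def k_nearest(point, candidates, k):
    """Return the k candidates closest to `point`, excluding `point` itself."""
    others = [c for c in candidates if c != point]
    return sorted(others, key=lambda c: math.dist(point, c))[:k]


def local_neighbors(partitions, k):
    """Local design: each partition only sees its own instances."""
    return {p: k_nearest(p, part, k) for part in partitions for p in part}


def global_neighbors(partitions, k):
    """Global design: every instance searches the union of all partitions."""
    everything = [p for part in partitions for p in part]
    return {p: k_nearest(p, everything, k) for p in everything}


# Two partitions, each holding one point from each of two distant clusters.
partitions = [
    [(0.0, 0.0), (5.0, 5.0)],
    [(0.1, 0.0), (5.0, 5.1)],
]
local = local_neighbors(partitions, k=1)
glob = global_neighbors(partitions, k=1)
```

In this toy case the local search pairs `(0.0, 0.0)` with the distant `(5.0, 5.0)` because its true nearest neighbor sits in the other partition, whereas the global search finds `(0.1, 0.0)`. A synthetic example interpolated from the local pair would fall between the two clusters, which shows why the choice of design can matter.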

Notes

1. The SMOTE variants are abbreviated as “SMT-BD” or “SMT-MR” in all tables.

Author information

Authors and Affiliations

UNLP, CONICET, III-LIDI, La Plata, Argentina
María José Basgall
Instituto de Investigación en Informática (III-LIDI), CIC-PBA Facultad de Informática - Universidad Nacional de La Plata, La Plata, Argentina
María José Basgall, Waldo Hasperué & Marcelo Naiouf
University of Granada, Granada, Spain
María José Basgall
DaSCI Andalusian Institute of Data Science and Computational Intelligence, University of Granada, Granada, Spain
Alberto Fernández & Francisco Herrera

Corresponding author

Correspondence to María José Basgall.

Editor information

Editors and Affiliations

III-LIDI, Facultad de Informatica, Universidad Nacional de La Plata, La Plata, Argentina
Marcelo Naiouf
III-LIDI, Facultad de Informatica, Universidad Nacional de La Plata, La Plata, Argentina
Franco Chichizola
III-LIDI, Facultad de Informatica, Universidad Nacional de La Plata, La Plata, Argentina
Enzo Rucci

About this paper

Cite this paper

Basgall, M.J., Hasperué, W., Naiouf, M., Fernández, A., Herrera, F. (2019). An Analysis of Local and Global Solutions to Address Big Data Imbalanced Classification: A Case Study with SMOTE Preprocessing. In: Naiouf, M., Chichizola, F., Rucci, E. (eds) Cloud Computing and Big Data. JCC&BD 2019. Communications in Computer and Information Science, vol 1050. Springer, Cham. https://doi.org/10.1007/978-3-030-27713-0_7

DOI: https://doi.org/10.1007/978-3-030-27713-0_7
Published: 27 July 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-27712-3
Online ISBN: 978-3-030-27713-0
eBook Packages: Computer Science, Computer Science (R0)
