Imbalanced Data Preprocessing for Big Data

Luengo, Julián; García-Gil, Diego; Ramírez-Gallego, Sergio; García, Salvador; Herrera, Francisco

doi:10.1007/978-3-030-39105-8_8

Abstract

The negative impact on learning of an imbalanced class distribution has grown sharply with the exponential growth of “cheap” data. In many real-world problems one class contains only a small number of instances, whereas the other classes are larger by several factors. Current techniques for large-scale imbalanced data focus on fast, scalable, parallel sampling methods that follow the standard MapReduce procedure. These generate a locally balanced solution in each map, and the local solutions are eventually combined into a final set. Nevertheless, as we will see later, this divide-and-conquer strategy entails several problems, such as small disjuncts and lack of data. In this chapter we review the latest proposals on imbalanced Big Data preprocessing and present a MapReduce framework for imbalanced preprocessing that includes several state-of-the-art sampling techniques.
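
The divide-and-conquer scheme described in the abstract can be illustrated with a minimal sketch. It is not the chapter's framework: it simulates the map tasks with plain Python lists, uses random oversampling as a stand-in for the sampling technique, and all function names (`split_into_partitions`, `local_random_oversample`, `mapreduce_oversample`) are hypothetical.

```python
import random
from collections import Counter

def split_into_partitions(data, n_partitions):
    """Split the dataset into disjoint chunks, one per simulated map task."""
    return [data[i::n_partitions] for i in range(n_partitions)]

def local_random_oversample(partition, rng):
    """Map step: balance one partition by replicating its smaller classes at random.

    A partition that contains a single class is returned unchanged, because
    there is nothing local to replicate (the lack-of-data problem).
    """
    by_class = {}
    for features, label in partition:
        by_class.setdefault(label, []).append((features, label))
    target = max(len(rows) for rows in by_class.values())
    balanced = []
    for rows in by_class.values():
        balanced.extend(rows)
        # Draw replicas with replacement until the class reaches the target size.
        balanced.extend(rng.choice(rows) for _ in range(target - len(rows)))
    return balanced

def mapreduce_oversample(data, n_partitions, seed=0):
    """Balance each partition locally, then concatenate the local results."""
    rng = random.Random(seed)
    partitions = split_into_partitions(data, n_partitions)
    local_results = [local_random_oversample(p, rng) for p in partitions if p]
    # Reduce step: combine the locally balanced sets into the final set.
    return [row for part in local_results for row in part]

# Toy data (an assumption for illustration): 96 majority and 4 minority instances.
data = [((i,), "neg") for i in range(96)] + [((i,), "pos") for i in range(4)]
balanced = mapreduce_oversample(data, n_partitions=4)
counts = Counter(label for _, label in balanced)
print(counts)
```

In this toy run each of the four partitions receives a single minority instance, so every local solution is balanced only by copying that one instance many times. This is a small-scale picture of the lack-of-data issue the abstract attributes to the strategy.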

Author information

Authors and Affiliations

Department of Computer Science and AI, University of Granada, Granada, Spain
Julián Luengo, Diego García-Gil, Salvador García & Francisco Herrera
DOCOMO Digital España, Madrid, Madrid, Spain
Sergio Ramírez-Gallego

About this chapter

Cite this chapter

Luengo, J., García-Gil, D., Ramírez-Gallego, S., García, S., Herrera, F. (2020). Imbalanced Data Preprocessing for Big Data. In: Big Data Preprocessing. Springer, Cham. https://doi.org/10.1007/978-3-030-39105-8_8

DOI: https://doi.org/10.1007/978-3-030-39105-8_8
Published: 17 March 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-39104-1
Online ISBN: 978-3-030-39105-8
eBook Packages: Computer Science, Computer Science (R0)