
Imbalanced Data Preprocessing for Big Data

Big Data Preprocessing

Abstract

The negative impact of imbalanced class proportions on learning has become far more prominent with the exponential growth of “cheap” data. Many real-world problems have very few instances of one class, whereas the other classes are several times larger. Current techniques for large-scale imbalanced data focus on fast, scalable, parallel sampling that follows the standard MapReduce procedure: each map generates a locally balanced solution, and these solutions are eventually combined into a final set. Nevertheless, as we will see later, this divide-and-conquer strategy entails several problems, such as small disjuncts and lack of data. In this chapter we review the latest proposals on imbalanced Big Data preprocessing and present a MapReduce framework for imbalanced preprocessing that includes several state-of-the-art sampling techniques.
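
The map-then-combine procedure described in the abstract can be illustrated with a minimal, single-process sketch. It is not the chapter's framework. It assumes plain random oversampling as the local sampling technique, round-robin partitioning in place of a real MapReduce runtime, and hypothetical function names (`balance_partition`, `mapreduce_oversample`, `class_counts`) chosen only for this example.

```python
import random
from collections import Counter, defaultdict


def balance_partition(rows, rng):
    """Map step (assumed technique: random oversampling).

    Duplicates instances of each class, drawn with replacement, until every
    class present in this partition matches the partition's largest class.
    """
    by_class = defaultdict(list)
    for features, label in rows:
        by_class[label].append((features, label))
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced


def mapreduce_oversample(data, num_partitions, seed=0):
    """Simulates split, map, and combine in one process (illustration only)."""
    rng = random.Random(seed)
    # Split: deal rows round-robin into disjoint partitions (an assumption).
    partitions = [data[i::num_partitions] for i in range(num_partitions)]
    # Map: balance each non-empty partition independently of the others.
    mapped = [balance_partition(part, rng) for part in partitions if part]
    # Combine: concatenate the locally balanced results into the final set.
    return [row for part in mapped for row in part]


def class_counts(rows):
    """Counts instances per class label."""
    return Counter(label for _, label in rows)


# Toy data: 96 majority instances (label 0) and 4 minority instances (label 1).
data = [((i,), 0) for i in range(96)] + [((i,), 1) for i in range(96, 100)]

counts_4 = class_counts(mapreduce_oversample(data, num_partitions=4))
counts_8 = class_counts(mapreduce_oversample(data, num_partitions=8))
print(counts_4, counts_8)
```

With four partitions, every partition in this toy data receives one minority instance and the combined set is balanced. With eight partitions, half of them receive no minority instance at all, so they cannot be balanced locally and the combined set remains skewed. This is a small-scale picture of the lack-of-data problem that the divide-and-conquer strategy entails.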




Copyright information

© 2020 Springer Nature Switzerland AG

About this chapter


Cite this chapter

Luengo, J., García-Gil, D., Ramírez-Gallego, S., García, S., Herrera, F. (2020). Imbalanced Data Preprocessing for Big Data. In: Big Data Preprocessing. Springer, Cham. https://doi.org/10.1007/978-3-030-39105-8_8


  • DOI: https://doi.org/10.1007/978-3-030-39105-8_8


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-39104-1

  • Online ISBN: 978-3-030-39105-8

  • eBook Packages: Computer Science (R0)
