Software and Libraries for Imbalanced Classification

Learning from Imbalanced Data Sets

Abstract

Over the years, researchers in imbalanced classification have proposed a large number of approaches to address this problem. For the area to keep developing, it is essential that these methods be available to the research community. Availability brings a double advantage: (1) the features and capabilities of the algorithms can be analyzed in depth; and (2) any novel proposal can be compared fairly against them. With this in mind, several open source libraries and software packages for imbalanced classification have been built on different platforms. In this chapter, we compile the most significant ones, focusing on their main characteristics and the methods they include, from standard data mining (DM) to Big Data applications. Our intention is to offer researchers, practitioners and corporations a non-exhaustive list of alternatives for applying diverse algorithms to their problems, so that they can achieve the most accurate results with the lowest effort.

This chapter is organized as follows. First, Sect. 14.1 stresses the significance of software implementations for imbalanced classification. Then, Sect. 14.2 introduces the Java tools, i.e. KEEL [2] and WEKA [17]. Next, Sect. 14.3 focuses on different R packages. The “imbalanced-learn” Python toolbox [29], compatible with “scikit-learn” [39], is described in Sect. 14.4. Big Data solutions under Spark [26] are summarized in Sect. 14.5. Finally, Sect. 14.6 provides some concluding remarks.
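
As a rough illustration of the kind of resampling method that the packages surveyed in this chapter provide, the following minimal sketch balances a toy data set by random oversampling. It uses only the Python standard library; the function name `random_oversample`, its parameters and the toy data are assumptions made for this example and are not the API of KEEL, WEKA, any R package or imbalanced-learn.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest one (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_res, y_res = list(X), list(y)
    for label, count in counts.items():
        # Original rows belonging to the current class.
        pool = [row for row, lab in zip(X, y) if lab == label]
        # Append random duplicates until the class reaches the target size.
        for _ in range(target - count):
            X_res.append(rng.choice(pool))
            y_res.append(label)
    return X_res, y_res


# Toy data: five majority-class rows (label 0), one minority-class row (label 1).
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.9]]
y = [0, 0, 0, 0, 0, 1]
X_res, y_res = random_oversample(X, y)
print(Counter(y_res))
```

The dedicated libraries go well beyond this baseline, offering informed oversampling, undersampling and ensemble methods, which is why a ready-made implementation is usually preferable to hand-written code.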

Notes

  1. https://www.java.com/
  2. https://www.r-project.org/
  3. https://www.python.org/
  4. http://www.keel.es
  5. http://www.keel.es/datasets.php
  6. https://cran.r-project.org/web/packages/unbalanced/index.html
  7. https://cran.r-project.org/web/packages/smotefamily/index.html
  8. https://cran.r-project.org/web/packages/ROSE/index.html
  9. https://cran.r-project.org/web/packages/DMwR/index.html
  10. https://cran.r-project.org/web/packages/imbalance/index.html
  11. https://mlr-org.github.io/mlr-tutorial/release/html/
  12. https://cran.r-project.org/web/packages/mlr/index.html
  13. https://www.python.org/
  14. http://contrib.scikit-learn.org/imbalanced-learn/auto_examples/index.html
  15. http://spark.apache.org/

References

  1. Alcalá-Fdez, J., Sánchez, L., García, S., Jesus, M.J.D., Ventura, S., Garrell, J.M., Otero, J., Bacardit, J., Rivas, V.M., Fernández, J.C., Herrera, F.: KEEL: a software tool to assess evolutionary algorithms for data mining problems. Soft Comput. 13(3), 307–318 (2009)

  2. Alcalá-Fdez, J., Fernández, A., Luengo, J., Derrac, J., García, S.: Keel data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. J. Mult.-Valued Log. Soft Comput. 17(2–3), 255–287 (2011)

  3. Almogahed, B.A., Kakadiaris, I.A.: NEATER: filtering of over-sampled data using non-cooperative game theory. Soft Comput. 19(11), 3301–3322 (2014)

  4. Barua, S., Islam, M.M., Yao, X., Murase, K.: MWMOTE–majority weighted minority oversampling technique for imbalanced data set learning. IEEE Trans. Knowl. Data Eng. 26(2), 405–425 (2014)

  5. Batista, G.E.A.P.A., Bazzan, A.L.C., Monard, M.C.: Balancing training data for automated annotation of keywords: a case study. In: Lifschitz, S., Almeida Nalvo Jr. F., Joannis Pappas Jr. G., Linden, R. (eds.) Second Workshop Brasileiro de Bioinformática (WOB), pp. 10–18 (2003)

  6. Batista, G.E.A.P.A., Prati, R.C., Monard, M.C.: A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor. 6(1), 20–29 (2004)

  7. Bischl, B., Lang, M., Kotthoff, L., Schiffner, J., Richter, J., Studerus, E., Casalicchio, G., Jones, Z.M.: Mlr: machine learning in R. J. Mach. Learn. Res. 17(170), 1–5 (2016)

  8. Chapelle, O., Schölkopf, B., Zien, A.: Semi-supervised learning, 1st edn. The MIT Press, Cambridge (2010)

  9. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: Smote: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)

  10. Cordón, I., Fernández, A., García, S., Herrera, F.: Imbalance: oversampling algorithms for imbalanced classification in R. Knowl.-Based Syst. (2018, in press). https://doi.org/10.1016/j.knosys.2018.07.035

  11. Crowston, K., Wei, K., Howison, J., Wiggins, A.: Free/libre open-source software development: what we know and what we do not know. ACM Comput. Surv. 44(2), 7 (2012)

  12. Dal Pozzolo, A., Caelen, O., Waterschoot, S., Bontempi, G.: Racing for unbalanced methods selection. In: Yin, H., Tang, K., Gao, Y., Klawonn, F., Lee, M., Weise, T., Li, B., Yao, X. (eds.) IDEAL, Hefei, China. Lecture Notes in Computer Science, vol. 8206, pp. 24–31. Springer (2013)

  13. Das, B., Krishnan, N.C., Cook, D.J.: RACOG and wRACOG: two probabilistic oversampling techniques. IEEE Trans. Knowl. Data Eng. 27(1), 222–234 (2015)

  14. Fernandez, A., del Río, S., López, V., Bawakid, A., del Jesús, M.J., Benítez, J.M., Herrera, F.: Big data with cloud computing: an insight on the computing environment, mapreduce, and programming frameworks. Wiley Interdisciplinary Rev. Data Min. Knowl. Discov. 4(5), 380–409 (2014)

  15. Fernandez, A., del Rio, S., Chawla, N.V., Herrera, F.: An insight into imbalanced big data classification: outcomes and challenges. Complex Intell. Syst. 3(2), 105–120 (2017)

  16. Fernandez, A., Garcia, S., Herrera, F., Chawla, N.: Smote for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. J. Artif. Intell. Res. 61, 863–905 (2018)

  17. Frank, E., Hall, M.A., Witten, I.H.: The WEKA Workbench. Online Appendix for “Data Mining: Practical Machine Learning Tools and Techniques”, 4th edn. Morgan Kaufmann, Burlington (2016)

  18. Galar, M., Fernández, A., Barrenechea, E., Bustince, H., Herrera, F.: A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 42(4), 463–484 (2012)

  19. Gao, M., Hong, X., Chen, S., Harris, C.J., Khalaf, E.: PDFOS: PDF estimation based over-sampling for imbalanced two-class problems. Neurocomputing 138, 248–259 (2014)

  20. García, S., Luengo, J., Herrera, F.: Data Preprocessing in Data Mining. Intelligent Systems Reference Library, vol. 72. Springer, Cham (2015)

  21. Han, H., Wang, W., Mao, B.: Borderline-smote: a new over-sampling method in imbalanced data sets learning. In: Huang, D.S., Zhang, X.P., Huang, G.B. (eds.) ICIC, Hefei, China. Lecture Notes in Computer Science, vol. 3644, pp. 878–887. Springer (2005)

  22. Han, J., Kamber, M., Pei, J.: Data Mining: Concepts and Techniques, 3rd edn. Morgan Kaufmann Publishers Inc., Amsterdam (2011)

  23. Hart, P.: The condensed nearest neighbor rule. IEEE Trans. Inf. Theory 14(3), 515–516 (1967)

  24. He, H., Bai, Y., Garcia, E.A., Li, S.: Adasyn: adaptive synthetic sampling approach for imbalanced learning. In: IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, pp. 1322–1328. IEEE (2008)

  25. Hornik, K.: R CRAN (2018). https://CRAN.R-project.org/

  26. Karau, H., Konwinski, A., Wendell, P., Zaharia, M.: Learning Spark: lightning-fast big data analytics, 1st edn. O’Reilly Media, Sebastopol (2015)

  27. Kubat, M., Matwin, S.: Addressing the curse of imbalanced training sets: one-sided selection. In: Fisher, D.H. (ed.) ICML, vol. 97, pp. 179–186. Morgan Kaufmann, San Mateo (1997)

  28. Laurikkala, J.: Improving identification of difficult small classes by balancing class distribution. In: Quaglini, S., Barahona, P., Andreassen, S. (eds.) AIME, Lecture Notes in Computer Science, vol. 2101, pp. 63–66. Springer, Berlin/Heidelberg (2001)

  29. Lemaitre, G., Nogueira, F., Aridas, C.K.: Imbalanced-learn: a python toolbox to tackle the curse of imbalanced datasets in machine learning. J. Mach. Learn. Res. 18, 1–5 (2017)

  30. Liu, X.Y., Wu, J., Zhou, Z.H.: Exploratory undersampling for class-imbalance learning. IEEE Trans. Syst. Man Cybern. Part B 39(2), 539–550 (2009)

  31. Lin, J.J.: Mapreduce is good enough? If all you have is a hammer, throw away everything that’s not a nail! Big Data 1(1), 28–39 (2012)

  32. López, V., Fernández, A., García, S., Palade, V., Herrera, F.: An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Inf. Sci. 250, 113–141 (2013)

  33. Lunardon, N., Menardi, G., Torelli, N.: ROSE: a package for binary imbalanced learning. R J. 6, 82–92 (2014)

  34. Marx, V.: The big challenges of big data. Nature 498(7453), 255–260 (2013)

  35. McKinney, W.: Python for Data Analysis, 1st edn. O’Reilly, Sebastopol (2012)

  36. Menardi, G., Torelli, N.: Training and assessing classification rules with imbalanced data. Data Min. Knowl. Disc. 28, 92–122 (2014)

  37. Meng, X., Bradley, J.K., Yavuz, B., Sparks, E.R., Venkataraman, S., Liu, D., Freeman, J., Tsai, D.B., Amde, M., Owen, S., Xin, D., Xin, R., Franklin, M.J., Zadeh, R., Zaharia, M., Talwalkar, A.: Mllib: machine learning in Apache Spark. J. Mach. Learn. Res. 17(34), 1–7 (2016)

  38. Nguyen, H.M., Cooper, E.W., Kamei, K.: Borderline over-sampling for imbalanced data classification. Int. J. Knowl. Eng. Soft Data Paradigms 3(1), 4–21 (2011)

  39. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

  40. Raschka, S.: Python Machine Learning, 1st edn. PACKT Publishing, Birmingham (2015)

  41. Siriseriwan, W., Sinapiromsaran, K.: The effective redistribution for imbalance dataset: relocating safe-level smote with minority outcast handling. Chiang Mai J. Sci. 43(1), 234–246 (2014)

  42. Siriseriwan, W.: Smotefamily: a collection of oversampling techniques for class imbalance problem based on smote (2018). https://cran.r-project.org/web/packages/smotefamily/index.html

  43. Smith, M.R., Martinez, T.R., Giraud-Carrier, C.G.: An instance level analysis of data complexity. Mach. Learn. 95(2), 225–256 (2014)

  44. Tippmann, S.: Programming tools: adventures with R. Nature 517(7532), 109–110 (2015)

  45. Tomek, I.: An experiment with the edited nearest-neighbor rule. IEEE Trans. Syst. Man Cybern. 6(6), 448–452 (1976)

  46. Tomek, I.: Two modifications of CNN. IEEE Trans. Syst. Man Cybern. 7(2), 679–772 (1976)

  47. Torgo, L.: Data Mining with R: Learning with Case Studies. Chapman and Hall/CRC Press, Boca Raton (2010)

  48. Triguero, I., Galar, M., Merino, D., Maillo, J., Bustince, H., Herrera, F.: Evolutionary undersampling for extremely imbalanced big data classification under Apache Spark. In: IEEE Congress on Evolutionary Computation (CEC 2016), pp. 640–647. Vancouver (2016)

  49. White, T.: Hadoop, The Definitive Guide, 1st edn. O’Reilly Media, Inc., Sebastopol (2012)

  50. Wilson, D.L.: Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans. Syst. Man Cybern. 2(3), 408–421 (1972)

  51. Zhang, J., Mani, I.: KNN approach to unbalanced data distributions: a case study involving information extraction. In: Proceedings of the ICML’2003 Workshop on Learning from Imbalanced Datasets (2003)

  52. Zhang, H., Li, M.: RWO-sampling: a random walk over-sampling approach to imbalanced data classification. Inf. Fusion 20, 99–116 (2014)

Copyright information

© 2018 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Fernández, A., García, S., Galar, M., Prati, R.C., Krawczyk, B., Herrera, F. (2018). Software and Libraries for Imbalanced Classification. In: Learning from Imbalanced Data Sets. Springer, Cham. https://doi.org/10.1007/978-3-319-98074-4_14

  • DOI: https://doi.org/10.1007/978-3-319-98074-4_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-98073-7

  • Online ISBN: 978-3-319-98074-4

  • eBook Packages: Computer Science (R0)
