Software and Libraries for Imbalanced Classification

Fernández, Alberto; García, Salvador; Galar, Mikel; Prati, Ronaldo C.; Krawczyk, Bartosz; Herrera, Francisco

doi:10.1007/978-3-319-98074-4_14

Abstract

Over the years, researchers in imbalanced classification have proposed a large number of different approaches to address this issue. To keep developing this area of study, it is of extreme importance to make these methods available to the research community. This brings a double advantage: (1) the features and capabilities of the algorithms can be analyzed in depth; and (2) a fair comparison can be carried out with any novel proposal. With this in mind, different open source libraries and software packages for imbalanced classification can be found, built on different tools. In this chapter, we compile the most significant ones, focusing on their main characteristics and included methods, from standard data mining (DM) to Big Data applications. Our intention is to bring to researchers, practitioners and corporations a non-exhaustive list of the alternatives for applying diverse algorithms to their problem, in order to achieve the most accurate results with the lowest effort. To present these software tools, this chapter is organized as follows. First, in Sect. 14.1 the significance of software implementations for imbalanced classification is stressed. Then, Sect. 14.2 introduces the Java tools, i.e. KEEL [2] and WEKA [17]. Next, Sect. 14.3 focuses on different R packages. The “imbalanced-learn” Python toolbox [29], built on top of “scikit-learn” [39], is described in Sect. 14.4. Big Data solutions under Spark [26] are summarized in Sect. 14.5. Finally, Sect. 14.6 provides some concluding remarks.

References

Alcalá-Fdez, J., Sánchez, L., García, S., Jesus, M.J.D., Ventura, S., Garrell, J.M., Otero, J., Bacardit, J., Rivas, V.M., Fernández, J.C., Herrera, F.: KEEL: a software tool to assess evolutionary algorithms for data mining problems. Soft Comput. 13(3), 307–318 (2009)
Alcalá-Fdez, J., Fernández, A., Luengo, J., Derrac, J., García, S.: Keel data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. J. Mult.-Valued Log. Soft Comput. 17(2–3), 255–287 (2011)
Almogahed, B.A., Kakadiaris, I.A.: NEATER: filtering of over-sampled data using non-cooperative game theory. Soft Comput. 19(11), 3301–3322 (2014)
Barua, S., Islam, M.M., Yao, X., Murase, K.: MWMOTE–majority weighted minority oversampling technique for imbalanced data set learning. IEEE Trans. Knowl. Data Eng. 26(2), 405–425 (2014)
Batista, G.E.A.P.A., Bazzan, A.L.C., Monard, M.C.: Balancing training data for automated annotation of keywords: a case study. In: Lifschitz, S., Almeida Nalvo Jr. F., Joannis Pappas Jr. G., Linden, R. (eds.) Second Workshop Brasileiro de Bioinformática (WOB), pp. 10–18 (2003)
Batista, G.E.A.P.A., Prati, R.C., Monard, M.C.: A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor. 6(1), 20–29 (2004)
Bischl, B., Lang, M., Kotthoff, L., Schiffner, J., Richter, J., Studerus, E., Casalicchio, G., Jones, Z.M.: Mlr: machine learning in R. J. Mach. Learn. Res. 17(170), 1–5 (2016)
Chapelle, O., Schölkopf, B., Zien, A.: Semi-supervised learning, 1st edn. The MIT Press, Cambridge (2010)
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: Smote: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
Cordón, I., Fernández, A., García, S., Herrera, F.: Imbalance: oversampling algorithms for imbalanced classification in R. Knowl.-Based Syst. (2018, in press). https://doi.org/10.1016/j.knosys.2018.07.035
Crowston, K., Wei, K., Howison, J., Wiggins, A.: Free/libre open-source software development: what we know and what we do not know. ACM Comput. Surv. 44(2), 7 (2012)
Dal Pozzolo, A., Caelen, O., Waterschoot, S., Bontempi, G.: Racing for unbalanced methods selection. In: Yin, H., Tang, K., Gao, Y., Klawonn, F., Lee, M., Weise, T., Li, B., Yao, X. (eds.) IDEAL, Hefei, China. Lecture Notes in Computer Science, vol. 8206, pp. 24–31. Springer (2013)
Das, B., Krishnan, N.C., Cook, D.J.: RACOG and wRACOG: two probabilistic oversampling techniques. IEEE Trans. Knowl. Data Eng. 27(1), 222–234 (2015)
Fernandez, A., del Río, S., López, V., Bawakid, A., del Jesus, M.J., Benítez, J.M., Herrera, F.: Big data with cloud computing: an insight on the computing environment, mapreduce, and programming frameworks. Wiley Interdisciplinary Rev. Data Min. Knowl. Discov. 4(5), 380–409 (2014)
Fernandez, A., del Rio, S., Chawla, N.V., Herrera, F.: An insight into imbalanced big data classification: outcomes and challenges. Complex Intell. Syst. 3(2), 105–120 (2017)
Fernandez, A., Garcia, S., Herrera, F., Chawla, N.: Smote for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. J. Artif. Intell. Res. 61, 863–905 (2018)
Frank, E., Hall, M.A., Witten, I.H.: The WEKA Workbench. Online Appendix for “Data Mining: Practical Machine Learning Tools and Techniques”, 4th edn. Morgan Kaufmann, Burlington (2016)
Galar, M., Fernández, A., Barrenechea, E., Bustince, H., Herrera, F.: A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 42(4), 463–484 (2012)
Gao, M., Hong, X., Chen, S., Harris, C.J., Khalaf, E.: PDFOS: PDF estimation based over-sampling for imbalanced two-class problems. Neurocomputing 138, 248–259 (2014)
García, S., Luengo, J., Herrera, F.: Data Preprocessing in Data Mining. Intelligent Systems Reference Library, vol. 72. Springer, Cham (2015)
Han, H., Wang, W., Mao, B.: Borderline-smote: a new over-sampling method in imbalanced data sets learning. In: Huang, D.S., Zhang, X.P., Huang, G.B. (eds.) ICIC, Hefei, China. Lecture Notes in Computer Science, vol. 3644, pp. 878–887. Springer (2005)
Han, J., Kamber, M., Pei, J.: Data Mining: Concepts and Techniques, 3rd edn. Morgan Kaufmann Publishers Inc., Amsterdam (2011)
Hart, P.: The condensed nearest neighbor rule. IEEE Trans. Inf. Theory 14(3), 515–516 (1967)
He, H., Bai, Y., Garcia, E.A., Li, S.: Adasyn: adaptive synthetic sampling approach for imbalanced learning. In: IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, pp. 1322–1328. IEEE (2008)
Hornik, K.: R CRAN (2018). https://CRAN.R-project.org/
Karau, H., Konwinski, A., Wendell, P., Zaharia, M.: Learning Spark: lightning-fast big data analytics, 1st edn. O’Reilly Media, Sebastopol (2015)
Kubat, M., Matwin, S.: Addressing the curse of imbalanced training sets: one-sided selection. In: Fisher, D.H. (ed.) ICML, vol. 97, pp. 179–186. Morgan Kaufmann, San Mateo (1997)
Laurikkala, J.: Improving identification of difficult small classes by balancing class distribution. In: Quaglini, S., Barahona, P., Andreassen, S. (eds.) AIME, Lecture Notes in Computer Science, vol. 2101, pp. 63–66. Springer, Berlin/Heidelberg (2001)
Lemaitre, G., Nogueira, F., Aridas, C.K.: Imbalanced-learn: a python toolbox to tackle the curse of imbalanced datasets in machine learning. J. Mach. Learn. Res. 18, 1–5 (2017)
Liu, X.Y., Wu, J., Zhou, Z.H.: Exploratory undersampling for class-imbalance learning. IEEE Trans. Syst. Man Cybern. Part B 39(2), 539–550 (2009)
Lin, J.J.: Mapreduce is good enough? If all you have is a hammer, throw away everything that’s not a nail! Big Data 1(1), 28–39 (2012)
López, V., Fernández, A., García, S., Palade, V., Herrera, F.: An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Inf. Sci. 250, 113–141 (2013)
Lunardon, N., Menardi, G., Torelli, N.: ROSE: a package for binary imbalanced learning. R J. 6, 82–92 (2014)
Marx, V.: The big challenges of big data. Nature 498(7453), 255–260 (2013)
McKinney, W.: Python for Data Analysis, 1st edn. O’Reilly, Sebastopol (2012)
Menardi, G., Torelli, N.: Training and assessing classification rules with imbalanced data. Data Min. Knowl. Disc. 28, 92–122 (2014)
Meng, X., Bradley, J.K., Yavuz, B., Sparks, E.R., Venkataraman, S., Liu, D., Freeman, J., Tsai, D.B., Amde, M., Owen, S., Xin, D., Xin, R., Franklin, M.J., Zadeh, R., Zaharia, M., Talwalkar, A.: Mllib: machine learning in Apache Spark. J. Mach. Learn. Res. 17(34), 1–7 (2016)
Nguyen, H.M., Cooper, E.W., Kamei, K.: Borderline over-sampling for imbalanced data classification. Int. J. Knowl. Eng. Soft Data Paradigms 3(1), 4–21 (2011)
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Raschka, S.: Python Machine Learning, 1st edn. PACKT Publishing, Birmingham (2015)
Siriseriwan, W., Sinapiromsaran, K.: The effective redistribution for imbalance dataset: relocating safe-level smote with minority outcast handling. Chiang Mai J. Sci. 43(1), 234–246 (2014)
Siriseriwan, W.: Smotefamily: a collection of oversampling techniques for class imbalance problem based on smote (2018). https://cran.r-project.org/web/packages/smotefamily/index.html
Smith, M.R., Martinez, T.R., Giraud-Carrier, C.G.: An instance level analysis of data complexity. Mach. Learn. 95(2), 225–256 (2014)
Tippmann, S.: Programming tools: adventures with R. Nature 517(7532), 109–110 (2015)
Tomek, I.: An experiment with the edited nearest-neighbor rule. IEEE Trans. Syst. Man Cybern. 6(6), 448–452 (1976)
Tomek, I.: Two modifications of CNN. IEEE Trans. Syst. Man Cybern. 7(2), 769–772 (1976)
Torgo, L.: Data Mining with R: Learning with Case Studies. Chapman and Hall/CRC Press, Boca Raton (2010)
Triguero, I., Galar, M., Merino, D., Maillo, J., Bustince, H., Herrera, F.: Evolutionary undersampling for extremely imbalanced big data classification under Apache Spark. In: IEEE Congress on Evolutionary Computation (CEC 2016), pp. 640–647. Vancouver (2016)
White, T.: Hadoop, The Definitive Guide, 1st edn. O’Reilly Media, Inc., Sebastopol (2012)
Wilson, D.L.: Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans. Syst. Man Cybern. 2(3), 408–421 (1972)
Zhang, J., Mani, I.: KNN approach to unbalanced data distributions: a case study involving information extraction. In: Proceedings of the ICML’2003 Workshop on Learning from Imbalanced Datasets (2003)
Zhang, H., Li, M.: RWO-sampling: a random walk over-sampling approach to imbalanced data classification. Inf. Fusion 20, 99–116 (2014)

Author information

Authors and Affiliations

Department of Computer Science and AI, University of Granada, Granada, Spain
Alberto Fernández & Salvador García
Institute of Smart Cities, Public University of Navarre, Pamplona, Spain
Mikel Galar
Department of Computer Science, Universidade Federal do ABC, Santo Andre, Brazil
Ronaldo C. Prati
Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
Bartosz Krawczyk
Department of Computer Science and AI, University of Granada, Granada, Spain
Francisco Herrera

About this chapter

Cite this chapter

Fernández, A., García, S., Galar, M., Prati, R.C., Krawczyk, B., Herrera, F. (2018). Software and Libraries for Imbalanced Classification. In: Learning from Imbalanced Data Sets. Springer, Cham. https://doi.org/10.1007/978-3-319-98074-4_14

DOI: https://doi.org/10.1007/978-3-319-98074-4_14
Published: 23 October 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-98073-7
Online ISBN: 978-3-319-98074-4
eBook Packages: Computer Science, Computer Science (R0)