Abstract
The first mechanism developed to address the problem of imbalanced learning was the use of sampling methods. These consist of modifying an imbalanced dataset with different procedures in order to provide a balanced or more adequate data distribution for the subsequent learning task. In the specialized literature, many studies have shown that, for several types of classifiers, rebalancing the dataset significantly improves the overall classification performance compared with a non-preprocessed dataset. Over the years, this procedure has become common practice and the use of sampling methods for imbalanced learning has been standardized. Still, classifiers do not always require this kind of preprocessing, because many of them are able to deal with imbalanced datasets directly. There is no clear rule telling us which strategy is better: adapting the behavior of the learning algorithm or applying data preprocessing techniques. Nevertheless, data sampling and preprocessing techniques remain standard in imbalanced learning and are widely used in Data Science problems: they are simple, easily configurable and can be used in synergy with any learning algorithm. This chapter reviews sampling techniques: undersampling (classical methods in Sect. 5.2 and advanced approaches in Sect. 5.3), oversampling in Sect. 5.4, and the best-known algorithm, SMOTE, together with its derivatives in Sect. 5.5. Some hybridizations of undersampling and oversampling are described in Sect. 5.6. Experiments with graphical illustrations are carried out to show the behavior of these techniques.
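As a concrete illustration of the two preprocessing strategies mentioned above, the sketch below resamples a synthetic two-class problem with the imbalanced-learn toolbox of Lemaitre et al. (referenced below). The dataset, the 1:9 class ratio and all parameter values are illustrative assumptions, not the chapter's experimental setup.

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE
    from imblearn.under_sampling import RandomUnderSampler

    # Hypothetical two-class dataset with a 1:9 minority/majority ratio
    X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                               n_redundant=0, weights=[0.1, 0.9], random_state=42)
    print("original:", Counter(y))

    # Undersampling: randomly discard majority examples until classes are balanced
    X_rus, y_rus = RandomUnderSampler(random_state=42).fit_resample(X, y)
    print("undersampled:", Counter(y_rus))

    # Oversampling with SMOTE: synthesize minority examples by interpolating
    # between each minority instance and its k nearest minority neighbors
    X_sm, y_sm = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
    print("SMOTE:", Counter(y_sm))

The class counters make the contrast visible: undersampling shrinks the majority class to the minority size, whereas SMOTE enlarges the minority class with interpolated synthetic points, which is the behavior examined graphically throughout the chapter.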
Notes
1. We would like to thank Sergio González for the development of a visualization module for this task. See more at https://github.com/sergiogvz/imbalanced_synthetic_data_plots
References
Abdi, L., Hashemi, S.: To combat multi-class imbalanced problems by means of over-sampling techniques. IEEE Trans. Knowl. Data Eng. 28(1), 238–251 (2016)
Almogahed, B.A., Kakadiaris, I.A.: NEATER: filtering of over-sampled data using non-cooperative game theory. Soft Comput. 19(11), 3301–3322 (2015)
Anand, A., Pugalenthi, G., Fogel, G.B., Suganthan, P.N.: An approach for classification of highly imbalanced data using weighting and undersampling. Amino Acids 39(5), 1385–1391 (2010)
Angiulli, F., Basta, S., Pizzuti, C.: Distance-based detection and prediction of outliers. IEEE Trans. Knowl. Data Eng. 18(2), 145–160 (2006)
Barandela, R., Sánchez, J.S., García, V., Rangel, E.: Strategies for learning in class imbalance problems. Pattern Recogn. 36(3), 849–851 (2003)
Barella, V., Costa, E., Carvalho, A.C.P.L.F.: ClusterOSS: a new undersampling method for imbalanced learning. Technical report (2014)
Barua, S., Islam, M.M., Murase, K.: A novel synthetic minority oversampling technique for imbalanced data set learning. In: 18th International Conference on Neural Information Processing, ICONIP, Shanghai, pp. 735–744 (2011)
Barua, S., Islam, M.M., Yao, X., Murase, K.: MWMOTE-majority weighted minority oversampling technique for imbalanced data set learning. IEEE Trans. Knowl. Data Eng. 26(2), 405–425 (2014)
Basu, M., Ho, T.K. (ed.): Data Complexity in Pattern Recognition. Springer, London (2006)
Batista, G.E.A.P.A., Prati, R.C., Monard, M.C.: A study of the behaviour of several methods for balancing machine learning training data. SIGKDD Explor. 6(1), 20–29 (2004)
Bellinger, C., Drummond, C., Japkowicz, N.: Beyond the boundaries of SMOTE – a framework for manifold-based synthetically oversampling. In: European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD), Riva del Garda, pp. 248–263 (2016)
Błaszczyński, J., Deckert, M., Stefanowski, J., Wilk, S.: Integrating selective pre-processing of imbalanced data with ivotes ensemble. In: Szczuka, M., Kryszkiewicz, M., Ramanna, S., Jensen, R., Hu, Q. (eds.) Rough Sets and Current Trends in Computing. LNSC, vol. 6086, pp. 148–157. Springer, Berlin/Heidelberg (2010)
Bradley, A.P.: The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recogn. 30(7), 1145–1159 (1997)
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression Trees. Chapman and Hall, New York/Wadsworth Inc., Belmont (1984)
Brodley, C.E., Friedl, M.A.: Identifying mislabeled training data. J. Artif. Intell. Res. 11, 131–167 (1999)
Bunkhumpornpat, C., Sinapiromsaran, K., Lursinsap, C.: Safe-Level-SMOTE: safe-level-synthetic minority over-sampling TEchnique for handling the class imbalanced problem. In: Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining PAKDD’09, Bangkok, pp. 475–482 (2009)
Bunkhumpornpat, C., Sinapiromsaran, K., Lursinsap, C.: DBSMOTE: density-based synthetic minority over-sampling TEchnique. Appl. Intell. 36(3), 664–684 (2012)
Cano, J.R., Herrera, F., Lozano, M.: Using evolutionary algorithms as instance selection for data reduction in KDD: an experimental study. IEEE Trans. Evol. Comput. 7(6), 561–575 (2003)
Chawla, N.V.: Data mining for imbalanced datasets: an overview. In: Maimon, O., Rokach, L. (eds.) Data Mining and Knowledge Discovery Handbook, pp. 853–867. Springer, New York (2005)
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
Chawla, N.V., Japkowicz, N., Kotcz, A.: Editorial: special issue on learning from imbalanced data sets. SIGKDD Explor. 6(1), 1–6 (2004)
Chawla, N.V., Cieslak, D.A., Hall, L.O., Joshi, A.: Automatically countering imbalance and its empirical relationship to cost. Data Min. Knowl. Disc. 17(2), 225–252 (2008)
Chen, S., Guo, G., Chen, L.: A new over-sampling method based on cluster ensembles. In: 7th International Conference on Advanced Information Networking and Applications Workshops, Perth, pp. 599–604 (2010)
Cohen, G., Hilario, M., Sax, H., Hugonnet, S., Geissbuhler, A.: Learning from imbalanced data in surveillance of nosocomial infection. Artif. Intell. Med. 37, 7–18 (2006)
Cover, T.M., Hart, P.E.: Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 13, 21–27 (1967)
de la Calleja, J., Fuentes, O.: A distance-based over-sampling method for learning from imbalanced data sets. In: Proceedings of the Twentieth International Florida Artificial Intelligence, pp. 634–635 (2007)
Das, B., Krishnan, N.C., Cook, D.J.: RACOG and wRACOG: two probabilistic oversampling techniques. IEEE Trans. Knowl. Data Eng. 27(1), 222–234 (2015)
Dietterich, T.G.: An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Mach. Learn. 40, 139–157 (2000)
Drown, D.J., Khoshgoftaar, T.M., Seliya, N.: Evolutionary sampling and software quality modeling of high-assurance systems. IEEE Trans. Syst. Man Cybern. Part A 39(5), 1097–1107 (2009)
Estabrooks, A., Jo, T., Japkowicz, N.: A multiple resampling method for learning from imbalanced data sets. Comput. Intell. 20(1), 18–36 (2004)
Fernández, A., García, S., Herrera, F., Chawla, N.V.: SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. J. Artif. Intell. Res. 61, 863–905 (2018)
Fernández-Navarro, F., Hervás-Martínez, C., Gutiérrez, P.A.: A dynamic over-sampling procedure based on sensitivity for multi-class problems. Pattern Recognit. 44(8), 1821–1833 (2011)
Galar, M., Fernández, A., Barrenechea, E., Bustince, H., Herrera, F.: A review on ensembles for class imbalance problem: bagging, boosting and hybrid based approaches. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 42(4), 463–484 (2012)
Gao, M., Hong, X., Chen, S., Harris, C.J., Khalaf, E.: PDFOS: PDF estimation based over-sampling for imbalanced two-class problems. Neurocomputing 138, 248–259 (2014)
García, S., Herrera, F.: Evolutionary under-sampling for classification with imbalanced data sets: proposals and taxonomy. Evol. Comput. 17(3), 275–306 (2009)
García, V., Mollineda, R.A., Sánchez, J.S.: On the k-NN performance in a challenging scenario of imbalance and overlapping. Pattern Anal. Appl. 11(3–4), 269–280 (2008)
García, S., Fernández, A., Herrera, F.: Enhancing the effectiveness and interpretability of decision tree and rule induction classifiers with evolutionary training set selection over imbalanced problems. Appl. Soft Comput. 9, 1304–1314 (2009)
García, S., Derrac, J., Triguero, I., Carmona, C.J., Herrera, F.: Evolutionary-based selection of generalized instances for imbalanced classification. Knowl. Based Syst. 25(1), 3–12 (2012)
García, V., Sánchez, J.S., Mollineda, R.A.: On the effectiveness of preprocessing methods when dealing with different levels of class imbalance. Knowl. Based Syst. 25(1), 13–21 (2012)
García-Pedrajas, N., Pérez-Rodríguez, J., de Haro-García, A.: OligoIS: scalable instance selection for class-imbalanced data sets. IEEE Trans. Cybern. 43(1), 332–346 (2013)
Gazzah, S., Amara, N.E.B.: New oversampling approaches based on polynomial fitting for imbalanced data sets. In: The Eighth IAPR International Workshop on Document Analysis Systems, Nara, pp. 677–684 (2008)
Han, H., Wang, W.Y., Mao, B.H.: Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: Proceedings of the 2005 International Conference on Intelligent Computing (ICIC’05), Hefei. Lecture Notes in Computer Science, vol. 3644, pp. 878–887 (2005)
Hart, P.E.: The condensed nearest neighbor rule. IEEE Trans. Inf. Theory 14, 515–516 (1968)
He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21(9), 1263–1284 (2009)
He, H., Bai, Y., Garcia, E.A., Li, S.: ADASYN: adaptive synthetic sampling approach for imbalanced learning. In: Proceedings of the 2008 IEEE International Joint Conference Neural Networks (IJCNN’08), Hong Kong, pp. 1322–1328 (2008)
Hu, F., Li, H.: A novel boundary oversampling algorithm based on neighborhood rough set model: NRSBoundary-SMOTE. Math. Probl. Eng. 2013, Article ID 694809, 10 pp. (2013)
Huang, J., Ling, C.X.: Using AUC and accuracy in evaluating learning algorithms. IEEE Trans. Knowl. Data Eng. 17(3), 299–310 (2005)
Kang, Y.I., Won, S.: Weight decision algorithm for oversampling technique on class-imbalanced learning. In: ICCAS, Gyeonggi-do, pp. 182–186 (2010)
Kim, H., Jo, N., Shin, K.: Optimization of cluster-based evolutionary undersampling for the artificial neural networks in corporate bankruptcy prediction. Expert Syst. Appl. 59, 226–234 (2016)
Kubat, M., Holte, R.C., Matwin, S.: Learning when negative examples abound. In: van Someren, M., Widmer, G. (eds.) Proceedings of the 9th European Conference on Machine Learning (ECML’97). Lecture Notes in Computer Science, vol. 1224, pp. 146–153. Springer, Berlin/New York (1997)
Laurikkala, J.: Improving identification of difficult small classes by balancing class distribution. In: AIME’01: Proceedings of the 8th Conference on AI in Medicine in Europe, Cascais, pp. 63–66 (2001)
Lemaitre, G., Nogueira, F., Aridas, C.K.: Imbalanced-learn: a Python toolbox to tackle the curse of imbalanced datasets in machine learning. J. Mach. Learn. Res. 18(17), 1–5 (2017)
Liang, Y., Hu, S., Ma, L., He, Y.: MSMOTE: improving classification performance when training data is imbalanced. In: International Workshop on Computer Science and Engineering, Qingdao, vol. 2, pp. 13–17 (2009)
Liu, X.Y., Wu, J., Zhou, Z.H.: Exploratory undersampling for class-imbalance learning. IEEE Trans. Syst. Man Cybern. B 39(2), 539–550 (2009)
López, V., Fernández, A., García, S., Palade, V., Herrera, F.: An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Inf. Sci. 250, 113–141 (2013)
López, V., Triguero, I., Carmona, C.J., García, S., Herrera, F.: Addressing imbalanced classification with instance generation techniques: IPADE-ID. Neurocomputing 126, 15–28 (2014)
Luengo, J., Fernández, A., García, S., Herrera, F.: Addressing data complexity for imbalanced data sets: analysis of SMOTE-based oversampling and evolutionary undersampling. Soft Comput. 15(10), 1909–1936 (2011)
Ma, L., Fan, S.: CURE-SMOTE algorithm and hybrid algorithm for feature selection and parameter optimization based on random forests. BMC Bioinf. 18, 169 (2017)
Mahalanobis, P.C.: On the generalized distance in statistics. Proc. Nat. Inst. Sci. (Calcutta) 2, 49–55 (1936)
Menardi, G., Torelli, N.: Training and assessing classification rules with imbalanced data. Data Min. Knowl. Disc. 28(1), 92–122 (2014)
Nakamura, M., Kajiwara, Y., Otsuka, A., Kimura, H.: LVQ-SMOTE – learning vector quantization based synthetic minority over-sampling technique for biomedical data. BioData Min. 6, 16 (2013)
Ng, W.W.Y., Hu, J., Yeung, D.S., Yin, S., Roli, F.: Diversified sensitivity-based undersampling for imbalance classification problems. IEEE Trans. Cybern. 45(11), 2402–2412 (2015)
Pérez-Ortiz, M., Gutiérrez, P.A., Hervás-Martínez, C.: Borderline kernel based over-sampling. In: 8th International Conference on Hybrid Artificial Intelligent Systems (HAIS), Salamanca, pp. 472–481 (2013)
Pérez-Ortiz, M., Gutiérrez, P.A., Tiño, P., Hervás-Martínez, C.: Oversampling the minority class in the feature space. IEEE Trans. Neural Netw. Learn. Syst. 27(9), 1947–1961 (2016)
Prati, R.C., Batista, G.E.A.P.A., Monard, M.C.: A survey on graphical methods for classification predictive performance evaluation. IEEE Trans. Knowl. Data Eng. 23(11), 1601–1618 (2011)
Puntumapon, K., Waiyamai, K.: A pruning-based approach for searching precise and generalized region for synthetic minority over-sampling. In: 16th Pacific-Asia Conference Advances in Knowledge Discovery and Data Mining (PAKDD), Kuala Lumpur, pp. 371–382 (2012)
Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kauffman, San Mateo (1993)
Ramentol, E., Caballero, Y., Bello, R., Herrera, F.: SMOTE-RSB*: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory. Knowl. Inf. Syst. 33(2), 245–265 (2012)
Ramentol, E., Gondres, I., Lajes, S., Bello, R., Caballero, Y., Cornelis, C., Herrera, F.: Fuzzy-rough imbalanced learning for the diagnosis of high voltage circuit breaker maintenance: the SMOTE-FRST-2T algorithm. Eng. Appl. AI 48, 134–139 (2016)
Rivera, W.A., Xanthopoulos, P.: A priori synthetic over-sampling methods for increasing classification sensitivity in imbalanced data sets. Expert Syst. Appl. 66, 124–135 (2016)
Rokach, L.: Ensemble-based classifiers. Artif. Intell. Rev. 33(1), 1–39 (2010)
Sáez, J.A., Luengo, J., Stefanowski, J., Herrera, F.: SMOTE-IPF: addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering. Inf. Sci. 291, 184–203 (2015)
Smith, M.R., Martinez, T.R., Giraud-Carrier, C.G.: An instance level analysis of data complexity. Mach. Learn. 95(2), 225–256 (2014)
Stefanowski, J., Wilk, S.: Selective pre-processing of imbalanced data for improving classification performance. In: Proceedings of the 10th International Conference on Data Warehousing and Knowledge Discovery (DaWaK08), Turin, pp. 283–292 (2008)
Sun, Y., Wong, A.K.C., Kamel, M.S.: Classification of imbalanced data: a review. Int. J. Pattern Recogn. Artif. Intell. 23(4), 687–719 (2009)
Sundarkumar, G.G., Ravi, V.: A novel hybrid undersampling method for mining unbalanced datasets in banking and insurance. Eng. Appl. Artif. Intell. 37, 368–377 (2015)
Tahir, M.A., Kittler, J., Yan, F.: Inverse random under sampling for class imbalance problem and its application to multi-label classification. Pattern Recogn. 45(10), 3738–3750 (2012)
Tang, S., Chen, S.: The generation mechanism of synthetic minority class examples. In: 5th International Conference on Information Technology and Applications in Biomedicine (ITAB), Shenzhen, pp. 444–447 (2008)
Tomek, I.: Two modifications of CNN. IEEE Trans. Syst. Man Cybern. 6, 769–772 (1976)
Wang, J., Xu, M., Wang, H., Zhang, J.: Classification of imbalanced data by using the SMOTE algorithm and locally linear embedding. In: 8th International Conference on Signal Processing (ICSP), Beijing, vol. 3, pp. 1–6. IEEE (2006)
Wilson, D.L.: Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans. Syst. Man Cybern. 2(3), 408–421 (1972)
Wu, X., Kumar, V. (eds.): The Top Ten Algorithms in Data Mining. Data Mining and Knowledge Discovery Series. Chapman and Hall/CRC Press, London (2009)
Xie, Z., Jiang, L., Ye, T., Li, X.: A synthetic minority oversampling method based on local densities in low-dimensional space for imbalanced learning. In: 20th International Conference on Database Systems for Advanced Applications (DASFAA), Hanoi, pp. 3–18 (2015)
Yen, S., Lee, Y.: Under-sampling approaches for improving prediction of the minority class in an imbalanced dataset. In: ICIC, Kunming. LNCIS, vol. 344, pp. 731–740 (2006)
Yen, S.J., Lee, Y.S.: Cluster-based under-sampling approaches for imbalanced data distributions. Expert Syst. Appl. 36(3), 5718–5727 (2009)
Yeung, D.S., Ng, W.W.Y., Wang, D., Tsang, E.C.C., Wang, X.: Localized generalization error model and its application to architecture selection for radial basis function neural network. IEEE Trans. Neural Netw. 18(5), 1294–1305 (2007)
Yoon, K., Kwek, S.: An unsupervised learning approach to resolving the data imbalanced issue in supervised learning problems in functional genomics. In: HIS’05: Proceedings of the Fifth International Conference on Hybrid Intelligent Systems, Rio de Janeiro, pp. 303–308 (2005)
Yu, H., Ni, J., Zhao, J.: ACOSampling: an ant colony optimization-based undersampling method for classifying imbalanced DNA microarray data. Neurocomputing 101, 309–318 (2013)
Zhang, H., Li, M.: RWO-Sampling: a random walk over-sampling approach to imbalanced data classification. Inf. Fusion 20, 99–116 (2014)
Zhang, J., Mani, I.: KNN approach to unbalanced data distributions: a case study involving information extraction. In: Proceedings of the 20th International Conference on Machine Learning (ICML’03), Workshop Learning from Imbalanced Data Sets (2003)
Copyright information
© 2018 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Fernández, A., García, S., Galar, M., Prati, R.C., Krawczyk, B., Herrera, F. (2018). Data Level Preprocessing Methods. In: Learning from Imbalanced Data Sets. Springer, Cham. https://doi.org/10.1007/978-3-319-98074-4_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-98073-7
Online ISBN: 978-3-319-98074-4
eBook Packages: Computer Science (R0)