Abstract
The first mechanism developed to address the problem of imbalanced learning was the use of sampling methods. These consist of modifying an imbalanced dataset with different procedures in order to provide a balanced or more adequate data distribution for the subsequent learning task. In the specialized literature, many studies have shown that, for several types of classifiers, rebalancing the dataset significantly improves the overall classification performance compared with a non-preprocessed dataset. Over the years, this procedure has become common practice and the use of sampling methods for imbalanced learning has been standardized. Still, classifiers do not always require this kind of preprocessing, because many of them are able to deal with imbalanced datasets directly. There is no clear rule telling us which strategy is better: adapting the behavior of the learning algorithm or applying data preprocessing techniques. Nevertheless, data sampling and preprocessing techniques remain standard in imbalanced learning and are widely used in Data Science problems: they are simple, easily configurable and can be used in synergy with any learning algorithm. This chapter reviews sampling techniques: undersampling (classical methods in Sect. 5.2 and advanced approaches in Sect. 5.3), oversampling in Sect. 5.4, and the best-known algorithm, SMOTE, together with its derivatives in Sect. 5.5. Some hybridizations of undersampling and oversampling are described in Sect. 5.6. Experiments with graphical illustrations are carried out to show the behavior of these techniques.
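As a concrete illustration of the two preprocessing strategies mentioned above, the sketch below resamples a synthetic two-class problem with the imbalanced-learn toolbox of Lemaitre et al. (referenced below). The dataset, the 1:9 class ratio and all parameter values are illustrative assumptions, not the chapter's experimental setup.

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE
    from imblearn.under_sampling import RandomUnderSampler

    # Hypothetical two-class dataset with a 1:9 minority/majority ratio
    X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                               n_redundant=0, weights=[0.1, 0.9], random_state=42)
    print("original:", Counter(y))

    # Undersampling: randomly discard majority examples until classes are balanced
    X_rus, y_rus = RandomUnderSampler(random_state=42).fit_resample(X, y)
    print("undersampled:", Counter(y_rus))

    # Oversampling with SMOTE: synthesize minority examples by interpolating
    # between each minority instance and its k nearest minority neighbors
    X_sm, y_sm = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
    print("SMOTE:", Counter(y_sm))

The class counters make the contrast visible: undersampling shrinks the majority class to the minority size, whereas SMOTE enlarges the minority class with interpolated synthetic points, which is the behavior examined graphically throughout the chapter.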
Notes
1. We would like to thank Sergio González for the development of a visualization module for this task. See more at https://github.com/sergiogvz/imbalanced_synthetic_data_plots
References
Abdi, L., Hashemi, S.: To combat multi-class imbalanced problems by means of over-sampling techniques. IEEE Trans. Knowl. Data Eng. 28(1), 238–251 (2016)
Almogahed, B.A., Kakadiaris, I.A.: NEATER: filtering of over-sampled data using non-cooperative game theory. Soft Comput. 19(11), 3301–3322 (2015)
Anand, A., Pugalenthi, G., Fogel, G.B., Suganthan, P.N.: An approach for classification of highly imbalanced data using weighting and undersampling. Amino Acids 39(5), 1385–1391 (2010)
Angiulli, F., Basta, S., Pizzuti, C.: Distance-based detection and prediction of outliers. IEEE Trans. Knowl. Data Eng. 18(2), 145–160 (2006)
Barandela, R., Sánchez, J.S., García, V., Rangel, E.: Strategies for learning in class imbalance problems. Pattern Recogn. 36(3), 849–851 (2003)
Barella, V., Costa, E., Carvalho, A.C.P.L.F.: ClusterOSS: a new undersampling method for imbalanced learning. Technical report (2014)
Barua, S., Islam, M.M., Murase, K.: A novel synthetic minority oversampling technique for imbalanced data set learning. In: 18th International Conference on Neural Information Processing, ICONIP, Shanghai, pp. 735–744 (2011)
Barua, S., Islam, M.M., Yao, X., Murase, K.: MWMOTE-majority weighted minority oversampling technique for imbalanced data set learning. IEEE Trans. Knowl. Data Eng. 26(2), 405–425 (2014)
Basu, M., Ho, T.K. (ed.): Data Complexity in Pattern Recognition. Springer, London (2006)
Batista, G.E.A.P.A., Prati, R.C., Monard, M.C.: A study of the behaviour of several methods for balancing machine learning training data. SIGKDD Explor. 6(1), 20–29 (2004)
Bellinger, C., Drummond, C., Japkowicz, N.: Beyond the boundaries of SMOTE – a framework for manifold-based synthetically oversampling. In: European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD), Riva del Garda, pp. 248–263 (2016)
Błaszczyński, J., Deckert, M., Stefanowski, J., Wilk, S.: Integrating selective pre-processing of imbalanced data with ivotes ensemble. In: Szczuka, M., Kryszkiewicz, M., Ramanna, S., Jensen, R., Hu, Q. (eds.) Rough Sets and Current Trends in Computing. LNSC, vol. 6086, pp. 148–157. Springer, Berlin/Heidelberg (2010)
Bradley, A.P.: The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recogn. 30(7), 1145–1159 (1997)
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression Trees. Chapman and Hall, New York/Wadsworth Inc., Belmont (1984)
Brodley, C.E., Friedl, M.A.: Identifying mislabeled training data. J. Artif. Intell. Res. 11, 131–167 (1999)
Bunkhumpornpat, C., Sinapiromsaran, K., Lursinsap, C.: Safe-Level-SMOTE: safe-level-synthetic minority over-sampling TEchnique for handling the class imbalanced problem. In: Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining PAKDD’09, Bangkok, pp. 475–482 (2009)
Bunkhumpornpat, C., Sinapiromsaran, K., Lursinsap, C.: DBSMOTE: density-based synthetic minority over-sampling TEchnique. Appl. Intell. 36(3), 664–684 (2012)
Cano, J.R., Herrera, F., Lozano, M.: Using evolutionary algorithms as instance selection for data reduction in KDD: an experimental study. IEEE Trans. Evol. Comput. 7(6), 561–575 (2003)
Chawla, N.V.: Data mining for imbalanced datasets: an overview. In: Maimon, O., Rokach, L. (eds.) Data Mining and Knowledge Discovery Handbook, pp. 853–867. Springer, New York (2005)
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
Chawla, N.V., Japkowicz, N., Kotcz, A.: Editorial: special issue on learning from imbalanced data sets. SIGKDD Explor. 6(1), 1–6 (2004)
Chawla, N.V., Cieslak, D.A., Hall, L.O., Joshi, A.: Automatically countering imbalance and its empirical relationship to cost. Data Min. Knowl. Disc. 17(2), 225–252 (2008)
Chen, S., Guo, G., Chen, L.: A new over-sampling method based on cluster ensembles. In: 7th International Conference on Advanced Information Networking and Applications Workshops, Perth, pp. 599–604 (2010)
Cohen, G., Hilario, M., Sax, H., Hugonnet, S., Geissbuhler, A.: Learning from imbalanced data in surveillance of nosocomial infection. Artif. Intell. Med. 37, 7–18 (2006)
Cover, T.M., Hart, P.E.: Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 13, 21–27 (1967)
de la Calleja, J., Fuentes, O.: A distance-based over-sampling method for learning from imbalanced data sets. In: Proceedings of the Twentieth International Florida Artificial Intelligence, pp. 634–635 (2007)
Das, B., Krishnan, N.C., Cook, D.J.: RACOG and wRACOG: two probabilistic oversampling techniques. IEEE Trans. Knowl. Data Eng. 27(1), 222–234 (2015)
Dietterich, T.G.: An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Mach. Learn. 40, 139–157 (2000)
Drown, D.J., Khoshgoftaar, T.M., Seliya, N.: Evolutionary sampling and software quality modeling of high-assurance systems. IEEE Trans. Syst. Man Cybern. Part A 39(5), 1097–1107 (2009)
Estabrooks, A., Jo, T., Japkowicz, N.: A multiple resampling method for learning from imbalanced data sets. Comput. Intell. 20(1), 18–36 (2004)
Fernández, A., García, S., Herrera, F., Chawla, N.V.: SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. J. Artif. Intell. Res. 61, 863–905 (2018)
Fernández-Navarro, F., Hervás-Martínez, C., Gutiérrez, P.A.: A dynamic over-sampling procedure based on sensitivity for multi-class problems. Pattern Recognit. 44(8), 1821–1833 (2011)
Galar, M., Fernández, A., Barrenechea, E., Bustince, H., Herrera, F.: A review on ensembles for class imbalance problem: bagging, boosting and hybrid based approaches. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 42(4), 463–484 (2012)
Gao, M., Hong, X., Chen, S., Harris, C.J., Khalaf, E.: PDFOS: PDF estimation based over-sampling for imbalanced two-class problems. Neurocomputing 138, 248–259 (2014)
García, S., Herrera, F.: Evolutionary under-sampling for classification with imbalanced data sets: proposals and taxonomy. Evol. Comput. 17(3), 275–306 (2009)
García, V., Mollineda, R.A., Sánchez, J.S.: On the k-NN performance in a challenging scenario of imbalance and overlapping. Pattern Anal. Appl. 11(3–4), 269–280 (2008)
García, S., Fernández, A., Herrera, F.: Enhancing the effectiveness and interpretability of decision tree and rule induction classifiers with evolutionary training set selection over imbalanced problems. Appl. Soft Comput. 9, 1304–1314 (2009)
García, S., Derrac, J., Triguero, I., Carmona, C.J., Herrera, F.: Evolutionary-based selection of generalized instances for imbalanced classification. Knowl. Based Syst. 25(1), 3–12 (2012)
García, V., Sánchez, J.S., Mollineda, R.A.: On the effectiveness of preprocessing methods when dealing with different levels of class imbalance. Knowl. Based Syst. 25(1), 13–21 (2012)
García-Pedrajas, N., Pérez-Rodríguez, J., de Haro-García, A.: OligoIS: scalable instance selection for class-imbalanced data sets. IEEE Trans. Cybern. 43(1), 332–346 (2013)
Gazzah, S., Amara, N.E.B.: New oversampling approaches based on polynomial fitting for imbalanced data sets. In: The Eighth IAPR International Workshop on Document Analysis Systems, Nara, pp. 677–684 (2008)
Han, H., Wang, W.Y., Mao, B.H.: Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: Proceedings of the 2005 International Conference on Intelligent Computing (ICIC’05), Hefei. Lecture Notes in Computer Science, vol. 3644, pp. 878–887 (2005)
Hart, P.E.: The condensed nearest neighbor rule. IEEE Trans. Inf. Theory 14, 515–516 (1968)
He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21(9), 1263–1284 (2009)
He, H., Bai, Y., Garcia, E.A., Li, S.: ADASYN: adaptive synthetic sampling approach for imbalanced learning. In: Proceedings of the 2008 IEEE International Joint Conference Neural Networks (IJCNN’08), Hong Kong, pp. 1322–1328 (2008)
Hu, F., Li, H.: A novel boundary oversampling algorithm based on neighborhood rough set model: NRSBoundary-SMOTE. Math. Probl. Eng. 2013, Article ID 694809, 10 pp. (2013)
Huang, J., Ling, C.X.: Using AUC and accuracy in evaluating learning algorithms. IEEE Trans. Knowl. Data Eng. 17(3), 299–310 (2005)
Kang, Y.I., Won, S.: Weight decision algorithm for oversampling technique on class-imbalanced learning. In: ICCAS, Gyeonggi-do, pp. 182–186 (2010)
Kim, H., Jo, N., Shin, K.: Optimization of cluster-based evolutionary undersampling for the artificial neural networks in corporate bankruptcy prediction. Expert Syst. Appl. 59, 226–234 (2016)
Kubat, M., Holte, R.C., Matwin, S.: Learning when negative examples abound. In: van Someren, M., Widmer, G. (eds.) Proceedings of the 9th European Conference on Machine Learning (ECML’97). Lecture Notes in Computer Science, vol. 1224, pp. 146–153. Springer, Berlin/New York (1997)
Laurikkala, J.: Improving identification of difficult small classes by balancing class distribution. In: AIME’01: Proceedings of the 8th Conference on AI in Medicine in Europe, Cascais, pp. 63–66 (2001)
Lemaitre, G., Nogueira, F., Aridas, C.K.: Imbalanced-learn: a Python toolbox to tackle the curse of imbalanced datasets in machine learning. J. Mach. Learn. Res. 18(17), 1–5 (2017)
Liang, Y., Hu, S., Ma, L., He, Y.: MSMOTE: improving classification performance when training data is imbalanced. In: International Workshop on Computer Science and Engineering, Qingdao, vol. 2, pp. 13–17 (2009)
Liu, X.Y., Wu, J., Zhou, Z.H.: Exploratory undersampling for class-imbalance learning. IEEE Trans. Syst. Man Cybern. B 39(2), 539–550 (2009)
López, V., Fernández, A., García, S., Palade, V., Herrera, F.: An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Inf. Sci. 250, 113–141 (2013)
López, V., Triguero, I., Carmona, C.J., García, S., Herrera, F.: Addressing imbalanced classification with instance generation techniques: IPADE-ID. Neurocomputing 126, 15–28 (2014)
Luengo, J., Fernández, A., García, S., Herrera, F.: Addressing data complexity for imbalanced data sets: analysis of SMOTE-based oversampling and evolutionary undersampling. Soft Comput. 15(10), 1909–1936 (2011)
Ma, L., Fan, S.: CURE-SMOTE algorithm and hybrid algorithm for feature selection and parameter optimization based on random forests. BMC Bioinf. 18, 169 (2017)
Mahalanobis, P.C.: On the generalized distance in statistics. Proc. Nat. Inst. Sci. (Calcutta) 2, 49–55 (1936)
Menardi, G., Torelli, N.: Training and assessing classification rules with imbalanced data. Data Min. Knowl. Disc. 28(1), 92–122 (2014)
Nakamura, M., Kajiwara, Y., Otsuka, A., Kimura, H.: LVQ-SMOTE – learning vector quantization based synthetic minority over-sampling technique for biomedical data. BioData Min. 6, 16 (2013)
Ng, W.W.Y., Hu, J., Yeung, D.S., Yin, S., Roli, F.: Diversified sensitivity-based undersampling for imbalance classification problems. IEEE Trans. Cybern. 45(11), 2402–2412 (2015)
Pérez-Ortiz, M., Gutiérrez, P.A., Hervás-Martínez, C.: Borderline kernel based over-sampling. In: 8th International Conference on Hybrid Artificial Intelligent Systems (HAIS), Salamanca, pp. 472–481 (2013)
Pérez-Ortiz, M., Gutiérrez, P.A., Tiño, P., Hervás-Martínez, C.: Oversampling the minority class in the feature space. IEEE Trans. Neural Netw. Learn. Syst. 27(9), 1947–1961 (2016)
Prati, R.C., Batista, G.E.A.P.A., Monard, M.C.: A survey on graphical methods for classification predictive performance evaluation. IEEE Trans. Knowl. Data Eng. 23(11), 1601–1618 (2011)
Puntumapon, K., Waiyamai, K.: A pruning-based approach for searching precise and generalized region for synthetic minority over-sampling. In: 16th Pacific-Asia Conference Advances in Knowledge Discovery and Data Mining (PAKDD), Kuala Lumpur, pp. 371–382 (2012)
Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kauffman, San Mateo (1993)
Ramentol, E., Caballero, Y., Bello, R., Herrera, F.: SMOTE-RSB*: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory. Knowl. Inf. Syst. 33(2), 245–265 (2012)
Ramentol, E., Gondres, I., Lajes, S., Bello, R., Caballero, Y., Cornelis, C., Herrera, F.: Fuzzy-rough imbalanced learning for the diagnosis of high voltage circuit breaker maintenance: the SMOTE-FRST-2T algorithm. Eng. Appl. AI 48, 134–139 (2016)
Rivera, W.A., Xanthopoulos, P.: A priori synthetic over-sampling methods for increasing classification sensitivity in imbalanced data sets. Expert Syst. Appl. 66, 124–135 (2016)
Rokach, L.: Ensemble-based classifiers. Artif. Intell. Rev. 33(1), 1–39 (2010)
Sáez, J.A., Luengo, J., Stefanowski, J., Herrera, F.: SMOTE-IPF: addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering. Inf. Sci. 291, 184–203 (2015)
Smith, M.R., Martinez, T.R., Giraud-Carrier, C.G.: An instance level analysis of data complexity. Mach. Learn. 95(2), 225–256 (2014)
Stefanowski, J., Wilk, S.: Selective pre-processing of imbalanced data for improving classification performance. In: Proceedings of the 10th International Conference on Data Warehousing and Knowledge Discovery (DaWaK08), Turin, pp. 283–292 (2008)
Sun, Y., Wong, A.K.C., Kamel, M.S.: Classification of imbalanced data: a review. Int. J. Pattern Recogn. Artif. Intell. 23(4), 687–719 (2009)
Sundarkumar, G.G., Ravi, V.: A novel hybrid undersampling method for mining unbalanced datasets in banking and insurance. Eng. Appl. Artif. Intell. 37, 368–377 (2015)
Tahir, M.A., Kittler, J., Yan, F.: Inverse random under sampling for class imbalance problem and its application to multi-label classification. Pattern Recogn. 45(10), 3738–3750 (2012)
Tang, S., Chen, S.: The generation mechanism of synthetic minority class examples. In: 5th International Conference on Information Technology and Applications in Biomedicine (ITAB), Shenzhen, pp. 444–447 (2008)
Tomek, I.: Two modifications of CNN. IEEE Trans. Syst. Man Cybern. 6, 769–772 (1976)
Wang, J., Xu, M., Wang, H., Zhang, J.: Classification of imbalanced data by using the SMOTE algorithm and locally linear embedding. In: 8th International Conference on Signal Processing (ICSP), Beijing, vol. 3, pp. 1–6. IEEE (2006)
Wilson, D.L.: Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans. Syst. Man Cybern. 2(3), 408–421 (1972)
Wu, X., Kumar, V. (eds.): The Top Ten Algorithms in Data Mining. Data Mining and Knowledge Discovery Series. Chapman and Hall/CRC Press, London (2009)
Xie, Z., Jiang, L., Ye, T., Li, X.: A synthetic minority oversampling method based on local densities in low-dimensional space for imbalanced learning. In: 20th International Conference on Database Systems for Advanced Applications (DASFAA), Hanoi, pp. 3–18 (2015)
Yen, S., Lee, Y.: Under-sampling approaches for improving prediction of the minority class in an imbalanced dataset. In: ICIC, Kunming. LNCIS, vol. 344, pp. 731–740 (2006)
Yen, S.J., Lee, Y.S.: Cluster-based under-sampling approaches for imbalanced data distributions. Expert Syst. Appl. 36(3), 5718–5727 (2009)
Yeung, D.S., Ng, W.W.Y., Wang, D., Tsang, E.C.C., Wang, X.: Localized generalization error model and its application to architecture selection for radial basis function neural network. IEEE Trans. Neural Netw. 18(5), 1294–1305 (2007)
Yoon, K., Kwek, S.: An unsupervised learning approach to resolving the data imbalanced issue in supervised learning problems in functional genomics. In: HIS’05: Proceedings of the Fifth International Conference on Hybrid Intelligent Systems, Rio de Janeiro, pp. 303–308 (2005)
Yu, H., Ni, J., Zhao, J.: ACOSampling: an ant colony optimization-based undersampling method for classifying imbalanced DNA microarray data. Neurocomputing 101, 309–318 (2013)
Zhang, H., Li, M.: RWO-Sampling: a random walk over-sampling approach to imbalanced data classification. Inf. Fusion 20, 99–116 (2014)
Zhang, J., Mani, I.: KNN approach to unbalanced data distributions: a case study involving information extraction. In: Proceedings of the 20th International Conference on Machine Learning (ICML’03), Workshop Learning from Imbalanced Data Sets (2003)
Copyright information
© 2018 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Fernández, A., García, S., Galar, M., Prati, R.C., Krawczyk, B., Herrera, F. (2018). Data Level Preprocessing Methods. In: Learning from Imbalanced Data Sets. Springer, Cham. https://doi.org/10.1007/978-3-319-98074-4_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-98073-7
Online ISBN: 978-3-319-98074-4
eBook Packages: Computer Science (R0)