Abstract
Imbalanced learning is a challenging task for most standard machine learning algorithms. The Synthetic Minority Oversampling Technique (SMOTE) is a well-known preprocessing approach for handling imbalanced datasets, in which the minority class is over-sampled by generating synthetic examples in feature space rather than data space. However, many recent works have shown that the imbalance ratio by itself is not the problem; deterioration of model performance is instead caused by other factors related to the distribution of the minority class samples. Blind oversampling by SMOTE gives rise to two major problems: noisy and borderline examples. Noisy examples are those from one class located in the safe zone of the other, while borderline examples are those located in the neighborhood of the class boundary. Both are associated with degraded performance of the resulting models. It is therefore critical to account for the structure of the minority class data and to control the placement of newly introduced minority class samples. Hence, this paper proposes an advanced SMOTE, denoted A-SMOTE, which adjusts the newly introduced minority class examples based on their distance to the original minority class samples. To this end, we first employ the SMOTE algorithm to introduce new samples into the minority class and then eliminate those samples that are closer to the majority class than to the minority class. We apply the proposed method to 44 datasets with various imbalance ratios. Ten widely used data sampling methods selected from the literature are employed for performance comparison, and the C4.5 and Naive Bayes classifiers are used for experimental validation. The results confirm the advantage of the proposed method over the other methods on almost all datasets and illustrate its suitability as a data preprocessing step for classification tasks.
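The two-step idea described above (SMOTE-style interpolation followed by removal of synthetic samples that lie closer to the majority class than to the minority class) can be sketched as follows. This is a minimal illustration under assumed details, not the authors' reference implementation: the function name `a_smote_sketch`, the use of Euclidean distance, and the tie-breaking rule are illustrative choices.

```python
import numpy as np

def a_smote_sketch(X_min, X_maj, n_new, k=5, rng=None):
    """Illustrative sketch of the A-SMOTE idea from the abstract:
    (1) generate synthetic minority samples by SMOTE-style interpolation,
    (2) discard synthetic samples closer to the majority class than the minority."""
    rng = np.random.default_rng(rng)

    # Step 1: classic SMOTE interpolation between a minority point
    # and one of its k nearest minority-class neighbors.
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        x = X_min[i]
        d = np.linalg.norm(X_min - x, axis=1)
        neighbors = np.argsort(d)[1:k + 1]   # exclude x itself
        j = rng.choice(neighbors)
        gap = rng.random()                    # random position on the segment
        synthetic.append(x + gap * (X_min[j] - x))
    synthetic = np.array(synthetic)

    # Step 2: keep only synthetic samples whose nearest original minority
    # point is at least as close as their nearest majority point.
    kept = []
    for s in synthetic:
        d_min = np.linalg.norm(X_min - s, axis=1).min()
        d_maj = np.linalg.norm(X_maj - s, axis=1).min()
        if d_min <= d_maj:
            kept.append(s)
    return np.array(kept).reshape(-1, X_min.shape[1])
```

With well-separated classes every synthetic sample survives the filter; samples generated near an overlap region would be the ones eliminated, which is the mechanism the paper uses to avoid introducing new noisy or borderline examples.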
Rights and permissions
This is an open access article distributed under the CC BY-NC 4.0 license (http://creativecommons.org/licenses/by-nc/4.0/).
Cite this article
Hussein, A.S., Li, T., Yohannese, C.W. et al. A-SMOTE: A New Preprocessing Approach for Highly Imbalanced Datasets by Improving SMOTE. Int J Comput Intell Syst 12, 1412–1422 (2019). https://doi.org/10.2991/ijcis.d.191114.002