Abstract
Imbalanced learning is a challenging task for most standard machine learning algorithms. The Synthetic Minority Oversampling Technique (SMOTE) is a well-known preprocessing approach for handling imbalanced datasets, in which the minority class is over-sampled by generating synthetic examples in feature space rather than data space. However, many recent works have shown that the imbalance ratio by itself is not the problem; deterioration of model performance is instead caused by other factors related to the distribution of the minority class samples. Blind oversampling by SMOTE gives rise to two major problems: noisy and borderline examples. Noisy examples are those from one class located in the safe zone of the other, while borderline examples are those located in the neighborhood of the class boundary. Both are associated with degraded performance of the resulting models. It is therefore critical to account for the structure of the minority class data and to control the placement of newly introduced minority class samples. Hence, this paper proposes an advanced SMOTE, denoted A-SMOTE, which adjusts the newly introduced minority class examples based on their distance to the original minority class samples. To this end, we first employ the SMOTE algorithm to introduce new samples into the minority class and then eliminate those samples that are closer to the majority class than to the minority class. We apply the proposed method to 44 datasets with various imbalance ratios. Ten widely used data sampling methods selected from the literature are employed for performance comparison, and the C4.5 and Naive Bayes classifiers are used for experimental validation. The results confirm the advantage of the proposed method over the other methods on almost all datasets and illustrate its suitability as a data preprocessing step for classification tasks.
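The two-step idea described above (SMOTE-style interpolation followed by removal of synthetic samples that lie closer to the majority class than to the minority class) can be sketched as follows. This is a minimal illustration under assumed details, not the authors' reference implementation: the function name `a_smote_sketch`, the use of Euclidean distance, and the tie-breaking rule are illustrative choices.

```python
import numpy as np

def a_smote_sketch(X_min, X_maj, n_new, k=5, rng=None):
    """Illustrative sketch of the A-SMOTE idea from the abstract:
    (1) generate synthetic minority samples by SMOTE-style interpolation,
    (2) discard synthetic samples closer to the majority class than the minority."""
    rng = np.random.default_rng(rng)

    # Step 1: classic SMOTE interpolation between a minority point
    # and one of its k nearest minority-class neighbors.
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        x = X_min[i]
        d = np.linalg.norm(X_min - x, axis=1)
        neighbors = np.argsort(d)[1:k + 1]   # exclude x itself
        j = rng.choice(neighbors)
        gap = rng.random()                    # random position on the segment
        synthetic.append(x + gap * (X_min[j] - x))
    synthetic = np.array(synthetic)

    # Step 2: keep only synthetic samples whose nearest original minority
    # point is at least as close as their nearest majority point.
    kept = []
    for s in synthetic:
        d_min = np.linalg.norm(X_min - s, axis=1).min()
        d_maj = np.linalg.norm(X_maj - s, axis=1).min()
        if d_min <= d_maj:
            kept.append(s)
    return np.array(kept).reshape(-1, X_min.shape[1])
```

With well-separated classes every synthetic sample survives the filter; samples generated near an overlap region would be the ones eliminated, which is the mechanism the paper uses to avoid introducing new noisy or borderline examples.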
Rights and permissions
This is an open access article distributed under the CC BY-NC 4.0 license (http://creativecommons.org/licenses/by-nc/4.0/).
Cite this article
Hussein, A.S., Li, T., Yohannese, C.W. et al. A-SMOTE: A New Preprocessing Approach for Highly Imbalanced Datasets by Improving SMOTE. Int J Comput Intell Syst 12, 1412–1422 (2019). https://doi.org/10.2991/ijcis.d.191114.002