RSMOTE: improving classification performance over imbalanced medical datasets

Naseriparsa, Mehdi; Al-Shammari, Ahmed; Sheng, Ming; Zhang, Yong; Zhou, Rui

doi:10.1007/s13755-020-00112-w

Research
Published: 12 June 2020

Volume 8, article number 22, (2020)

Health Information Science and Systems

Mehdi Naseriparsa¹,
Ahmed Al-Shammari¹,³,
Ming Sheng²,
Yong Zhang² &
…
Rui Zhou¹

Abstract

Introduction

Medical diagnosis is a crucial step in patient treatment. However, diagnosis is prone to bias due to imbalanced datasets. To overcome the imbalanced dataset problem, the synthetic minority over-sampling technique (SMOTE) was proposed; it generates new synthetic samples at the data level to balance the minority and majority classes. However, the synthetic samples are generated randomly, which causes a class mixture problem and thereby degrades classification performance and biases diagnosis.
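The basic SMOTE generation step can be sketched as follows. This is a minimal illustration and not the authors' code: the function names `k_nearest` and `smote`, the default `k=5`, and the use of Euclidean distance are assumptions made for the example. Each synthetic sample is placed at a random position on the line segment between a randomly chosen minority sample and one of its nearest minority neighbours.

```python
import math
import random


def k_nearest(sample, pool, k):
    """Return the k points of pool closest to sample (Euclidean), excluding sample itself."""
    others = [p for p in pool if p is not sample]
    return sorted(others, key=lambda p: math.dist(sample, p))[:k]


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by random interpolation.

    minority is a list of equal-length numeric tuples with at least two entries.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbour = rng.choice(k_nearest(base, minority, k))
        gap = rng.random()  # random position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic
```

Because the base sample and the position on the segment are chosen at random, nothing in this step checks whether the new point falls among majority samples, which is the class mixture risk described above.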

Purpose

To overcome the shortcomings of SMOTE, several modified methods have been proposed that generate synthetic samples along the line segment between selected minority samples. Most of these methods adopt one of two policies for selecting the minority samples used to generate synthetic samples: borderline region sampling or safe region sampling. However, both suffer from the over-generalisation problem. We propose a modified SMOTE-based resampling method called RSMOTE to alleviate the imbalanced medical dataset problem. We provide an in-depth analysis and verify the performance of RSMOTE over imbalanced medical datasets.

Methods

The proposed RSMOTE divides the minority sample domain into four regions (normal, semi-normal, semi-critical, and critical) based on minority sample density analysis. RSMOTE discovers the region of each minority sample globally and applies the resampling near a specific group of samples.
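The abstract does not state how density is measured or where the region boundaries lie, so the following is only a sketch under stated assumptions. It takes density to be the fraction of minority samples among the k nearest neighbours of a minority sample, splits that fraction into four equal-width bands, and maps the highest band to "normal" and the lowest to "critical". The density measure, the cut-offs, the mapping of names to bands, and all function names are hypothetical.

```python
import math

# Assumed ordering: lowest minority density first, highest last.
REGIONS = ("critical", "semi-critical", "semi-normal", "normal")


def minority_density(sample, minority, majority, k=5):
    """Fraction of minority points among the k nearest neighbours of sample."""
    labelled = [(p, 1) for p in minority if p is not sample]
    labelled += [(p, 0) for p in majority]
    nearest = sorted(labelled, key=lambda pl: math.dist(sample, pl[0]))[:k]
    return sum(label for _, label in nearest) / len(nearest)


def assign_region(density):
    """Map a density in [0, 1] to a region using hypothetical equal-width cut-offs."""
    return REGIONS[min(int(density * 4), 3)]


def partition(minority, majority, k=5):
    """Group every minority sample by its assumed density region."""
    regions = {name: [] for name in REGIONS}
    for sample in minority:
        density = minority_density(sample, minority, majority, k)
        regions[assign_region(density)].append(sample)
    return regions
```

With a partition like this, a resampling step could be restricted to the samples of one region, for example by passing `partition(minority, majority)["normal"]` to an interpolation routine.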

Results

Our analysis and experiments verify that if synthetic samples are generated in the regions with high minority sample density, classification performance will be improved due to low risk of class mixture. Unlike some safe region methods, RSMOTE decides the region of minority samples on a global basis, thus removing the over-generalisation problem. Classic and additional evaluation metrics are considered to measure the effectiveness of the modified method: Recall, FP Rate, Precision, F-Measure, ROC area, and Average Aggregated Metric. We carried out experiments over various imbalanced medical datasets.
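For reference, the four threshold-based metrics named above follow directly from the binary confusion matrix. The sketch below uses their standard definitions; the function name `binary_metrics` and the convention of returning 0.0 when a denominator is zero are assumptions made for the example. ROC area needs ranked scores rather than hard labels, and the Average Aggregated Metric is not defined in this abstract, so both are left out.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Recall, FP rate, precision and F-measure for the class labelled positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    def ratio(num, den):
        # Assumed convention: report 0.0 when the denominator is zero.
        return num / den if den else 0.0

    recall = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    return {
        "recall": recall,
        "fp_rate": ratio(fp, fp + tn),
        "precision": precision,
        "f_measure": ratio(2 * precision * recall, precision + recall),
    }
```

In an imbalanced medical setting the minority class (for example, patients with the condition) would normally be passed as `positive`, since overall accuracy can look high even when recall on that class is poor.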

Conclusion

Based on minority sample density analysis, we propose the RSMOTE method, which divides the minority sample domain into four regions. RSMOTE includes four resampling methods, each of which carries out resampling on a specific region. According to the experimental results, resampling on regions with high minority sample density obtained better results, while resampling on regions with lower minority sample density obtained inferior results. Thus, we conclude that RSMOTE is a more flexible resampling method for imbalanced medical datasets, capable of generating samples at various minority sample densities.

Author information

Authors and Affiliations

Swinburne University of Technology, Hawthorn, Australia
Mehdi Naseriparsa, Ahmed Al-Shammari & Rui Zhou
Tsinghua University, Beijing, China
Ming Sheng & Yong Zhang
University of Al-Qadisiyah, Al Diwaniyah, Iraq
Ahmed Al-Shammari

Corresponding author

Correspondence to Mehdi Naseriparsa.

About this article

Cite this article

Naseriparsa, M., Al-Shammari, A., Sheng, M. et al. RSMOTE: improving classification performance over imbalanced medical datasets. Health Inf Sci Syst 8, 22 (2020). https://doi.org/10.1007/s13755-020-00112-w

Received: 27 October 2019
Accepted: 20 May 2020
Published: 12 June 2020
DOI: https://doi.org/10.1007/s13755-020-00112-w
