Abstract
Imbalanced data classification is an important area of statistical learning that has attracted sustained research attention over the years. Despite extensive study, it remains one of the most challenging problems in data science and machine learning, especially for large datasets. In health data, class imbalance makes it difficult for machine learning models to classify observations accurately and can lead to biased, inaccurate predictions with severe consequences in medical settings. Most real-world datasets are skewed, and previous research suggests that sampling approaches are effective in handling data imbalance. In this study, we examine oversampling and undersampling methods, including Random Oversampling, SMOTE, ADASYN, Random Undersampling, Tomek links, and NearMiss, and conduct experiments on four skewed health datasets. The four imbalanced secondary health datasets concern diabetes, anaemia, lung cancer, and obesity classification, respectively. Results show that the Repeated Edited Nearest Neighbours (RENN) undersampling technique combined with logistic regression is the most effective at handling data skewness, and we recommend adopting RENN in cases of class imbalance in health research.
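To illustrate the chapter's headline technique, the following is a minimal, self-contained sketch of Repeated Edited Nearest Neighbours (RENN) undersampling: Wilson's Edited Nearest Neighbours rule (drop every point misclassified by a majority vote of its k nearest neighbours) applied repeatedly until no further points are removed. The toy dataset and all function names here are illustrative, not from the chapter; in practice one would use `RepeatedEditedNearestNeighbours` from the imbalanced-learn library before fitting a logistic regression.

```python
# Illustrative sketch of RENN undersampling (not the chapter's exact code).
# ENN: remove points whose label disagrees with the k-NN majority vote;
# RENN: repeat ENN until the dataset stops shrinking.
from collections import Counter

def knn_labels(X, y, i, k=3):
    """Labels of the k nearest neighbours of point i (squared Euclidean distance)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(X[i], X[j])), j)
        for j in range(len(X)) if j != i
    )
    return [y[j] for _, j in dists[:k]]

def enn(X, y, k=3):
    """One Edited Nearest Neighbours pass: keep only points whose label
    matches the majority label of their k nearest neighbours."""
    keep = [
        i for i in range(len(X))
        if Counter(knn_labels(X, y, i, k)).most_common(1)[0][0] == y[i]
    ]
    return [X[i] for i in keep], [y[i] for i in keep]

def renn(X, y, k=3, max_iter=100):
    """Repeated ENN: apply ENN until no point is removed (or max_iter passes)."""
    for _ in range(max_iter):
        Xn, yn = enn(X, y, k)
        if len(Xn) == len(X):
            break
        X, y = Xn, yn
    return X, y

# Hypothetical toy data: two clean clusters plus one mislabelled point
# (0.5, 0.5) sitting inside the class-0 cluster but labelled 1.
X = [(0, 0), (0, 1), (1, 0), (1, 1),
     (5, 5), (5, 6), (6, 5), (6, 6),
     (0.5, 0.5)]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]

X_res, y_res = renn(X, y, k=3)  # the mislabelled point is edited out
```

The cleaned sample would then be passed to a classifier (e.g. scikit-learn's `LogisticRegression`), mirroring the RENN-plus-logistic-regression pipeline the chapter reports as most effective.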
References
Awe, O. O., Dukhi, N., & Dias, R. (2023). Shrinkage heteroscedastic discriminant algorithms for classifying multi-class high-dimensional data: Insights from a national health survey. Machine Learning with Applications, 12, 100459. https://doi.org/10.1016/j.mlwa.2023.100459
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357. https://doi.org/10.1613/jair.953
Chicco, D., & Jurman, G. (2020). The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics, 21(1), 6. https://doi.org/10.1186/s12864-019-6413-7
Han, H., Wang, W.-Y., & Mao, B.-H. (2005). Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In Advances in intelligent computing (pp. 878–887). Springer. https://doi.org/10.1007/11538059_91
He, H., Bai, Y., Garcia, E. A., & Li, S. (2008). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE international joint conference on neural networks (IEEE World Congress on Computational Intelligence). IEEE. https://doi.org/10.1109/ijcnn.2008.4633969
Koehrsen, W. (2018). Beyond accuracy: Precision and recall. Available at https://towardsdatascience.com/beyond-accuracy-precision-and-recall-3da06bea9f6c. Accessed 2022.
Korstanje, J. (2021). The F1-score. Available at https://towardsdatascience.com/the-f1-score-bec2bbc38aa6. Accessed March 2023.
Menardi, G., & Torelli, N. (2012). Training and assessing classification rules with imbalanced data. Data Mining and Knowledge Discovery, 28(1), 92–122. https://doi.org/10.1007/s10618-012-0295-5
Mishra, A. (2018). Metrics to evaluate your machine learning algorithm. Available at https://www.towardsdatascience.com/metrics-to-evaluate-your-machine-learning-algorithmf10ba6e38234. Accessed March 2023.
Mohammed, R., Rawashdeh, J., & Abdullah, M. (2020). Machine learning with oversampling and undersampling techniques: Overview study and experimental results. In 2020 11th international conference on information and communication systems (ICICS). IEEE. https://doi.org/10.1109/icics49469.2020.23955
Nguyen, H., Cooper, E., & Kamei, K. (2011). Borderline over-sampling for imbalanced data classification. International Journal of Knowledge Engineering and Soft Data Paradigms, 3, 4–21. https://doi.org/10.1504/IJKESDP.2011.039875
Smith, J. W., Everhart, J. E., Dickson, W. C., Knowler, W. C., & Johannes, R. S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the symposium on computer applications and medical care (pp. 261–265). IEEE Computer Society Press.
Sun, Y., Wong, A. K. C., & Kamel, M. S. (2009). Classification of imbalanced data: A review. International Journal of Pattern Recognition and Artificial Intelligence, 23(4), 687–719. https://doi.org/10.1142/s0218001409007326
Tyagi, S., & Mittal, S. (2019). Sampling approaches for imbalanced data classification problem in machine learning. In Proceedings of ICRIC 2019 (Lecture Notes in Electrical Engineering) (pp. 209–221). Springer. https://doi.org/10.1007/978-3-030-29407-6_17
Wilson, D. (1972). Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man and Cybernetics, SMC-2(3), 408–421.
Xu, X., Chen, W., & Sun, Y. (2019). Over-sampling algorithm for imbalanced data classification. Journal of Systems Engineering and Electronics, 30(6), 1182–1191. https://doi.org/10.21629/jsee.2019.06.12
Zhu, M., Xia, J., Jin, X., Yan, M., Cai, G., Yan, J., & Ning, G. (2018). Class weights random forest algorithm for processing class imbalanced medical data. IEEE Access, 6, 4641–4652. https://doi.org/10.1109/access.2018.2789428
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Awe, O.O., Ojumu, J.B., Ayanwoye, G.A., Ojumoola, J.S., Dias, R. (2023). Machine Learning Approaches for Handling Imbalances in Health Data Classification. In: Awe, O.O., Vance, E.A. (eds) Sustainable Statistical and Data Science Methods and Practices. STEAM-H: Science, Technology, Engineering, Agriculture, Mathematics & Health. Springer, Cham. https://doi.org/10.1007/978-3-031-41352-0_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-41351-3
Online ISBN: 978-3-031-41352-0
eBook Packages: Mathematics and Statistics (R0)