Abstract
Imbalanced data classification is an important area of statistical learning that has attracted sustained research attention over the years. Despite extensive study, it remains one of the most challenging problems in data science and machine learning, especially for large datasets. In health data, class imbalance makes it difficult for machine learning models to classify observations accurately and can lead to biased, inaccurate predictions with severe consequences in medical settings. Most real-world datasets are skewed, and previous research suggests that sampling approaches are effective in handling data imbalance. In this study, we examine oversampling and undersampling methods, including Random Oversampling, SMOTE, ADASYN, Random Undersampling, Tomek links, and NearMiss, and conduct experiments on four skewed health datasets. The four imbalanced secondary health datasets concern diabetes, anaemia, lung cancer, and obesity classification, respectively. Results show that the Repeated Edited Nearest Neighbours (RENN) undersampling technique combined with logistic regression is the most effective at handling data skewness, and we recommend adopting RENN in cases of class imbalance in health research.
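To illustrate the chapter's headline technique, the following is a minimal, self-contained sketch of Repeated Edited Nearest Neighbours (RENN) undersampling: Wilson's Edited Nearest Neighbours rule (drop every point misclassified by a majority vote of its k nearest neighbours) applied repeatedly until no further points are removed. The toy dataset and all function names here are illustrative, not from the chapter; in practice one would use `RepeatedEditedNearestNeighbours` from the imbalanced-learn library before fitting a logistic regression.

```python
# Illustrative sketch of RENN undersampling (not the chapter's exact code).
# ENN: remove points whose label disagrees with the k-NN majority vote;
# RENN: repeat ENN until the dataset stops shrinking.
from collections import Counter

def knn_labels(X, y, i, k=3):
    """Labels of the k nearest neighbours of point i (squared Euclidean distance)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(X[i], X[j])), j)
        for j in range(len(X)) if j != i
    )
    return [y[j] for _, j in dists[:k]]

def enn(X, y, k=3):
    """One Edited Nearest Neighbours pass: keep only points whose label
    matches the majority label of their k nearest neighbours."""
    keep = [
        i for i in range(len(X))
        if Counter(knn_labels(X, y, i, k)).most_common(1)[0][0] == y[i]
    ]
    return [X[i] for i in keep], [y[i] for i in keep]

def renn(X, y, k=3, max_iter=100):
    """Repeated ENN: apply ENN until no point is removed (or max_iter passes)."""
    for _ in range(max_iter):
        Xn, yn = enn(X, y, k)
        if len(Xn) == len(X):
            break
        X, y = Xn, yn
    return X, y

# Hypothetical toy data: two clean clusters plus one mislabelled point
# (0.5, 0.5) sitting inside the class-0 cluster but labelled 1.
X = [(0, 0), (0, 1), (1, 0), (1, 1),
     (5, 5), (5, 6), (6, 5), (6, 6),
     (0.5, 0.5)]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]

X_res, y_res = renn(X, y, k=3)  # the mislabelled point is edited out
```

The cleaned sample would then be passed to a classifier (e.g. scikit-learn's `LogisticRegression`), mirroring the RENN-plus-logistic-regression pipeline the chapter reports as most effective.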
References
Awe, O. O., Dukhi, N., & Dias, R. (2023). Shrinkage heteroscedastic discriminant algorithms for classifying multi-class high-dimensional data: Insights from a national health survey. Machine Learning with Applications, 12, 100459. https://doi.org/10.1016/j.mlwa.2023.100459
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357. https://doi.org/10.1613/jair.953
Chicco, D., & Jurman, G. (2020). The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics, 21(1), 6. https://doi.org/10.1186/s12864-019-6413-7
Han, H., Wang, W.-Y., & Mao, B.-H. (2005). Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In Advances in intelligent computing (pp. 878–887). Springer. https://doi.org/10.1007/11538059_91
He, H., Bai, Y., Garcia, E. A., & Li, S. (2008). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE international joint conference on neural networks (IEEE World Congress on Computational Intelligence). IEEE. https://doi.org/10.1109/ijcnn.2008.4633969
Koehrsen, W. (2018). Beyond accuracy: Precision and recall. Available at https://towardsdatascience.com/beyond-accuracy-precision-and-recall-3da06bea9f6c. Accessed 2022.
Korstanje, J. (2021). The F1-score. Available at https://towardsdatascience.com/the-f1-score-bec2bbc38aa6. Accessed March 2023.
Menardi, G., & Torelli, N. (2012). Training and assessing classification rules with imbalanced data. Data Mining and Knowledge Discovery, 28(1), 92–122. https://doi.org/10.1007/s10618-012-0295-5
Mishra, A. (2018). Metrics to evaluate your machine learning algorithm. Available at https://www.towardsdatascience.com/metrics-to-evaluate-your-machine-learning-algorithmf10ba6e38234. Accessed March 2023.
Mohammed, R., Rawashdeh, J., & Abdullah, M. (2020). Machine learning with oversampling and undersampling techniques: Overview study and experimental results. In 2020 11th international conference on information and communication systems (ICICS). IEEE. https://doi.org/10.1109/icics49469.2020.23955
Nguyen, H., Cooper, E., & Kamei, K. (2011). Borderline over-sampling for imbalanced data classification. International Journal of Knowledge Engineering and Soft Data Paradigms, 3, 4–21. https://doi.org/10.1504/IJKESDP.2011.039875
Smith, J. W., Everhart, J. E., Dickson, W. C., Knowler, W. C., & Johannes, R. S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the symposium on computer applications and medical care (pp. 261–265). IEEE Computer Society Press.
Sun, Y., Wong, A. K. C., & Kamel, M. S. (2009). Classification of imbalanced data: A review. International Journal of Pattern Recognition and Artificial Intelligence, 23(4), 687–719. https://doi.org/10.1142/s0218001409007326
Tyagi, S., & Mittal, S. (2019). Sampling approaches for imbalanced data classification problem in machine learning. In Proceedings of ICRIC 2019 (Lecture Notes in Electrical Engineering) (pp. 209–221). Springer. https://doi.org/10.1007/978-3-030-29407-6_17
Wilson, D. (1972). Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man and Cybernetics, SMC-2(3), 408–421.
Xu, X., Chen, W., & Sun, Y. (2019). Over-sampling algorithm for imbalanced data classification. Journal of Systems Engineering and Electronics, 30(6), 1182–1191. https://doi.org/10.21629/jsee.2019.06.12
Zhu, M., Xia, J., Jin, X., Yan, M., Cai, G., Yan, J., & Ning, G. (2018). Class weights random forest algorithm for processing class imbalanced medical data. IEEE Access, 6, 4641–4652. https://doi.org/10.1109/access.2018.2789428
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Awe, O.O., Ojumu, J.B., Ayanwoye, G.A., Ojumoola, J.S., Dias, R. (2023). Machine Learning Approaches for Handling Imbalances in Health Data Classification. In: Awe, O.O., Vance, E.A. (eds) Sustainable Statistical and Data Science Methods and Practices. STEAM-H: Science, Technology, Engineering, Agriculture, Mathematics & Health. Springer, Cham. https://doi.org/10.1007/978-3-031-41352-0_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-41351-3
Online ISBN: 978-3-031-41352-0
eBook Packages: Mathematics and Statistics (R0)