Medical Imbalanced Data Classification Based on Random Forests

  • Engy El-shafeiy
  • Amr Abohany
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 1153)


This paper studies the class-imbalance problem in medical datasets. Modern machine learning techniques are increasingly popular for problems of this type, with many applications in health and medicine. A major difficulty is that the datasets involved are often highly imbalanced: one class greatly outnumbers the other. Under-sampling and over-sampling techniques are commonly used to work around this problem. In this paper, we apply random forests, which are ensembles of decision trees fitted to subsamples of the data, built using under-sampling and over-sampling. We then compare the fit metrics obtained across the various model specifications tested and evaluate their in-sample and out-of-sample results. We observed that random forests trained on sub-samples smaller than the original sample achieved the best performance among the random forests tested, and an improvement over previous results on the medical dataset.


Keywords: Random forest · Imbalanced medical dataset · Under-sampling · Over-sampling
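The resampling approach described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' actual experimental setup: the synthetic dataset, the helper names `random_undersample` and `random_oversample`, and all parameter values are assumptions chosen for clarity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

def random_undersample(X, y, rng):
    """Downsample every class to the size of the smallest class."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    return X[idx], y[idx]

def random_oversample(X, y, rng):
    """Resample every class (with replacement) up to the largest class size."""
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_max, replace=True)
        for c in classes
    ])
    return X[idx], y[idx]

rng = np.random.default_rng(0)
# Synthetic stand-in for an imbalanced medical dataset: ~5% minority class.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for name, (Xs, ys) in {
    "no resampling": (X_tr, y_tr),
    "under-sampling": random_undersample(X_tr, y_tr, rng),
    "over-sampling": random_oversample(X_tr, y_tr, rng),
}.items():
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(Xs, ys)
    score = balanced_accuracy_score(y_te, clf.predict(X_te))
    print(f"{name}: balanced accuracy = {score:.3f}")
```

Balanced accuracy is used here rather than plain accuracy because, on a 95/5 split, a classifier that always predicts the majority class already reaches 95% accuracy while being clinically useless.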



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Department of Computers Engineering, Faculty of Engineering, Mansoura University, Mansoura, Egypt
  2. Department of Information Systems, Faculty of Computer Science and Information, Kafrelsheikh University, Kafr El Sheikh, Egypt
  3. Scientific Research Group in Egypt, Cairo, Egypt
