Balanced training of a hybrid ensemble method for imbalanced datasets: a case of emergency department readmission prediction

  • Arkaitz Artetxe
  • Manuel Graña
  • Andoni Beristain
  • Sebastián Ríos
Original Article


Imbalanced datasets are a recurrent issue in health-care data processing. Most of the literature deals with small academic datasets, so results often do not extrapolate to large real-life datasets or have little real-life validity. When generating minority class samples by interpolation is meaningless, undersampling the majority class becomes necessary to reach acceptable results. Ensembles of classifiers offer the diversity of their members, which may allow adaptation to the imbalanced distribution. In this paper, we present a pipeline method that combines random undersampling with bootstrap aggregation (bagging) to train a hybrid ensemble of extreme learning machines and decision trees, whose diversity improves adaptation to the class-imbalanced dataset. The approach is demonstrated on a realistic, highly imbalanced dataset of emergency department patients from a Chilean hospital, with the goal of predicting patient readmission. Computational experiments show that our approach outperforms other well-known classification algorithms.
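
The sampling step described above, random undersampling combined with bagging, can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names `balanced_bags` and `majority_vote`, the choice to bootstrap the minority class and draw an equally sized majority subset for each ensemble member, and the toy label vector are all hypothetical. The base learners (extreme learning machines and decision trees) are left out; each bag of indices would be used to train one of them.

```python
import random
from collections import Counter


def balanced_bags(labels, n_bags, minority=1, seed=0):
    """Build index bags with equal class counts (hypothetical helper).

    Each bag holds a bootstrap sample of the minority class plus an
    equally sized random subset of the majority class. Assumes the
    minority class really is the smaller one.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    k = len(minority_idx)
    bags = []
    for _ in range(n_bags):
        boot = rng.choices(minority_idx, k=k)  # bagging: with replacement
        under = rng.sample(majority_idx, k)    # undersampling: without replacement
        bags.append(boot + under)
    return bags


def majority_vote(predictions):
    """Combine the labels predicted by the ensemble members for one sample."""
    return Counter(predictions).most_common(1)[0][0]


# Toy data (assumption): 5 positive cases among 100 samples.
labels = [1] * 5 + [0] * 95
bags = balanced_bags(labels, n_bags=10)
```

Every bag is balanced by construction, so each ensemble member sees both classes equally often while the different majority subsets keep the members diverse. Using an odd number of members avoids ties in `majority_vote`.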


Class imbalance, Hospital readmission, Ensemble learning, Extreme learning machine


Compliance with ethical standards

Conflict of interest

The authors declare that they have no conflict of interest.



Copyright information

© The Natural Computing Applications Forum 2017

Authors and Affiliations

  1. Vicomtech-IK4 Research Centre, San Sebastián, Spain
  2. Computation Intelligence Group, Basque University (UPV/EHU), San Sebastián, Spain
  3. CEINE, Universidad de Chile, Santiago, Chile
