Diagnosis system for imbalanced multi-minority medical dataset

  • Swati Shilaskar
  • Ashok Ghatol
Methodologies and Application


Medical datasets inherently suffer from class imbalance: some sub-pathologies occur far less often than others. In this work, a disease diagnosis system for multiclass classification is developed. A hybrid synthetic sampling technique is used for extremely imbalanced datasets, and a cluster-based self-class algorithm is proposed. Compared with the near miss algorithm, the proposed algorithm exhibits equivalent performance while requiring less time for sampling. The classification results are compared against baseline approaches that do not use clustering and synthetic sampling. A new technique based on a confidence measure is proposed for evaluating test samples with one-vs-one (OVO) classifiers. This technique, together with hybrid sampling, suggests an improvement over the classical approaches currently used in disease diagnosis systems.
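
The abstract names the near miss algorithm as the comparison baseline but does not spell out the procedure. As a rough sketch only, assuming the usual description of near miss-2 (keep the majority samples whose average distance to their k farthest minority samples is smallest), the selection step might look like the following. The function name `near_miss_2`, its parameters, and the default k = 3 are illustrative assumptions rather than details taken from the paper, and this is not the proposed self-class algorithm.

```python
import numpy as np


def near_miss_2(X_major, X_minor, n_keep, k=3):
    """Return indices of the majority samples kept by a NearMiss-2 style rule.

    Hypothetical helper for illustration: a majority sample is scored by the
    mean distance to its k farthest minority samples, and the n_keep samples
    with the smallest score are kept.
    """
    X_major = np.asarray(X_major, dtype=float)
    X_minor = np.asarray(X_minor, dtype=float)

    # Pairwise Euclidean distances, shape (n_major, n_minor).
    diffs = X_major[:, None, :] - X_minor[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))

    # Cannot use more neighbours than there are minority samples.
    k = min(k, X_minor.shape[0])

    # Rows sorted ascending, so the last k columns are the k farthest.
    farthest = np.sort(dists, axis=1)[:, -k:]
    score = farthest.mean(axis=1)

    # Smallest scores first; return the kept indices in original order.
    keep = np.argsort(score, kind="stable")[:n_keep]
    return np.sort(keep)


# Toy data: four majority samples, three minority samples.
majority = [[0, 0], [10, 10], [1, 1], [5, 5]]
minority = [[0, 1], [1, 0], [2, 2]]
kept = near_miss_2(majority, minority, n_keep=2)
print(kept)
```

On this toy data the two majority points nearest the minority cluster are retained, while the distant points are discarded. The sketch computes every pairwise distance, which is consistent with the sampling time of this baseline being a concern on larger datasets.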


Confidence measure, Cluster, Medical diagnosis system, Near miss-2, Self-class, Synthetic sampling


Compliance with ethical standards

Conflicts of interest

The authors declare that they have no conflicts of interest.

Ethical approval

All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and national research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.

Human and animal rights statement

This article does not contain any studies with animals performed by any of the authors.

Informed consent

Informed consent was obtained from all individual participants included in the study.



Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. Govt College of Engineering, Amravati, India
