Diagnosis system for imbalanced multi-minority medical dataset
Medical datasets inherently suffer from class imbalance: some sub-pathologies occur far less often than others. In this work, a disease diagnosis system for multiclass classification is developed. A hybrid synthetic sampling technique is used for extremely imbalanced datasets, and a cluster-based self-class algorithm is proposed. Compared with the near miss algorithm, it exhibits equivalent performance with reduced sampling time. The classification results are compared against baseline approaches that use neither clustering nor synthetic sampling. A new technique based on a confidence measure is proposed for evaluating test samples with one-vs-one (OVO) classifiers. The results suggest that this technique, together with hybrid sampling, improves on the classical approaches currently used in disease diagnosis systems.
Keywords: Confidence measure, Cluster, Medical diagnosis system, Near miss-2, Self-class, Synthetic sampling
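For readers unfamiliar with the near miss baseline named in the abstract and keywords, the following is a minimal sketch of the NearMiss-2 undersampling rule as it is commonly described: keep the majority-class samples whose mean distance to their k farthest minority-class samples is smallest. It is not the cluster-based self-class algorithm proposed in this work, whose details are not given here. The function name `near_miss_2`, its parameters, and the toy data are assumptions made for illustration only.

```python
import math


def near_miss_2(majority, minority, n_keep, k=3):
    """Return n_keep majority points selected by the NearMiss-2 rule.

    Each majority point is scored by its mean Euclidean distance to its
    k farthest minority points; the points with the smallest scores are kept.
    (Hypothetical helper written for this sketch, not taken from the paper.)
    """
    k = min(k, len(minority))
    scored = []
    for idx, point in enumerate(majority):
        # Distances to all minority points, largest first.
        dists = sorted((math.dist(point, m) for m in minority), reverse=True)
        scored.append((sum(dists[:k]) / k, idx))
    scored.sort()  # ascending score; ties fall back to the original index
    keep = sorted(idx for _, idx in scored[:n_keep])
    return [majority[i] for i in keep]


# Toy 2-D data, made up for illustration: six majority and two minority points.
majority = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.0), (9.0, 9.0), (2.0, 1.0)]
minority = [(5.0, 4.0), (6.0, 6.0)]
kept = near_miss_2(majority, minority, n_keep=2, k=2)
print(kept)
```

The rule computes a distance from every majority sample to every minority sample, so its cost grows with the product of the two class sizes. That cost is the sampling time the abstract refers to when comparing the proposed algorithm with near miss.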
Compliance with ethical standards
Conflicts of interest
The authors declare that they have no conflicts of interest.
All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and national research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.
Human and animals rights statement
This article does not contain any studies with animals performed by any of the authors.
Informed consent was obtained from all individual participants included in the study.