Abstract
Biomedical datasets pose a unique challenge to machine learning and data mining algorithms for classification because of their high dimensionality, multiple classes, noisy data and missing values. This paper provides a comprehensive evaluation of a set of diverse machine learning schemes on a number of biomedical datasets. To this end, we follow a four-step evaluation methodology: (1) pre-processing the datasets to remove any redundancy; (2) classifying the datasets with six different machine learning algorithms: Naive Bayes (probabilistic), multi-layer perceptron (neural network), SMO (support vector machine), IBk (instance-based learner), J48 (decision tree) and RIPPER (rule-based induction); (3) bagging and boosting each algorithm; and (4) combining the best version of each base classifier into a team of classifiers using stacking and voting techniques. Using this methodology, we have performed experiments on 31 different biomedical datasets. To the best of our knowledge, this is the first study in which such a diverse set of machine learning algorithms is evaluated on so many biomedical datasets. The important outcome of our extensive study is a set of promising guidelines that will help researchers choose the best classification scheme for a biomedical dataset of a particular nature.
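The four-step methodology described above can be sketched in code. The paper uses WEKA; the sketch below substitutes scikit-learn analogues, and the dataset, feature filter, and all parameter values are illustrative assumptions rather than the authors' actual experimental setup (only three of the six base classifiers are shown, since RIPPER has no direct scikit-learn counterpart).

```python
# A minimal sketch of the four-step evaluation methodology, using
# scikit-learn analogues of the WEKA classifiers named in the abstract.
# Dataset, filter, and parameters are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import VarianceThreshold
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier   # analogue of IBk
from sklearn.tree import DecisionTreeClassifier      # analogue of J48
from sklearn.ensemble import (BaggingClassifier, AdaBoostClassifier,
                              VotingClassifier, StackingClassifier)
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)  # stand-in biomedical dataset

# Step 1: pre-processing -- remove redundant (zero-variance) features.
X = VarianceThreshold().fit_transform(X)

# Step 2: base classifiers (three of the paper's six shown here).
base = [("nb", GaussianNB()),
        ("knn", KNeighborsClassifier()),
        ("tree", DecisionTreeClassifier(random_state=0))]

# Step 3: bagging and boosting a base classifier (boosting is shown on
# the tree only, since AdaBoost requires sample-weight support).
bagged = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                           n_estimators=10, random_state=0)
boosted = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                             n_estimators=10, random_state=0)

# Step 4: combine the base classifiers with voting and stacking.
voter = VotingClassifier(estimators=base)
stacker = StackingClassifier(estimators=base)

scores = cross_val_score(voter, X, y, cv=5)
print(f"voting accuracy: {scores.mean():.3f}")
```

In the paper, steps 3 and 4 are applied across all six algorithms and the resulting schemes are compared per dataset; the sketch keeps only enough to show the shape of the pipeline.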
© 2009 Springer-Verlag Berlin Heidelberg
Tanwani, A.K., Afridi, J., Shafiq, M.Z., Farooq, M. (2009). Guidelines to Select Machine Learning Scheme for Classification of Biomedical Datasets. In: Pizzuti, C., Ritchie, M.D., Giacobini, M. (eds) Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics. EvoBIO 2009. Lecture Notes in Computer Science, vol 5483. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01184-9_12
DOI: https://doi.org/10.1007/978-3-642-01184-9_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-01183-2
Online ISBN: 978-3-642-01184-9