The Role of Biomedical Dataset in Classification

  • Ajay Kumar Tanwani
  • Muddassar Farooq
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5651)


In this paper, we investigate the role of a biomedical dataset in the classification accuracy of an algorithm. We quantify the complexity of a biomedical dataset using five complexity measures: correlation-based feature selection subset merit, noise, imbalance ratio, missing values and information gain. The effect of these complexity measures on classification accuracy is evaluated using five diverse machine learning algorithms: J48 (decision tree), SMO (support vector machines), Naive Bayes (probabilistic), IBk (instance-based learner) and JRIP (rule-based induction). The results of our experiments show that noise and correlation-based feature selection subset merit, rather than a particular choice of algorithm, play a major role in determining the classification accuracy. Finally, we provide researchers with a meta-model and an empirical equation to estimate the classification potential of a dataset on the basis of its complexity. This will help researchers to efficiently pre-process the dataset for automatic knowledge extraction.
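As a rough illustration of three of the measures named above (missing values, imbalance ratio and information gain), the following Python sketch computes them on a toy dataset. The abstract does not give the formulas used in the paper, so the definitions here are assumptions: missing values as the fraction of empty cells, imbalance ratio as the majority-to-minority class count, and information gain as the standard entropy reduction for a nominal attribute. All function names are hypothetical. Noise and correlation-based feature selection subset merit are omitted because they require a trained filter or attribute evaluator.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """Entropy reduction from splitting the labels on a nominal feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the labels within each feature value.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def missing_ratio(rows):
    """Fraction of cells that are None (assumed encoding of a missing value)."""
    cells = [value for row in rows for value in row]
    return sum(value is None for value in cells) / len(cells)


def imbalance_ratio(labels):
    """Majority class count divided by minority class count (assumed definition)."""
    counts = Counter(labels).values()
    return max(counts) / min(counts)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    rows = [["high", 1.2], ["high", None], ["low", 0.4], ["low", 0.5]]
    labels = ["sick", "sick", "healthy", "healthy"]
    first_attribute = [row[0] for row in rows]

    print("missing ratio:", missing_ratio(rows))
    print("imbalance ratio:", imbalance_ratio(labels))
    print("information gain:", information_gain(first_attribute, labels))
```

Under these assumed definitions, a larger missing ratio or imbalance ratio marks a harder dataset, while a larger information gain marks an attribute that separates the classes well.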


Keywords: Classification, Complexity Measures, Biomedical Datasets





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Ajay Kumar Tanwani (1)
  • Muddassar Farooq (1)
  1. Next Generation Intelligent Networks Research Center (nexGIN RC), National University of Computer & Emerging Sciences (FAST-NUCES), Islamabad, Pakistan
