Building Biomedical Text Classifiers under Sample Selection Bias

  • R. Romero
  • E. L. Iglesias
  • L. Borrajo
Part of the Advances in Intelligent and Soft Computing book series (AINSC, volume 91)

Abstract

Scientific papers are a primary source of information for investigators who want to know the current state of a topic or compare their results with those of colleagues. However, mining biomedical texts remains a great challenge because of the huge volume of scientific literature stored in public databases and its imbalanced nature, with only a very small number of papers relevant to each user query. Classification in the presence of data imbalance is a major challenge for machine learning. Techniques such as support-vector machines (SVMs) perform very well on balanced data but may fail when applied to imbalanced datasets. In this paper, we study the effects of undersampling, resampling and subsampling balancing strategies on four biomedical text classifiers (with linear, sigmoid, exponential and polynomial SVM kernels, respectively). The best results were obtained with normalized linear and sigmoid kernels using the subsampling balancing technique. These results are compared with those obtained by other authors on the TREC Genomics 2005 public corpus.
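
As an illustration of the general idea behind class balancing, the following is a minimal sketch of random undersampling, in which examples are dropped at random until every class is as small as the rarest one. The abstract does not define the authors' undersampling, resampling or subsampling procedures, so this is a generic textbook variant rather than the method used in the paper; the function name `undersample`, the toy documents and the labels are all hypothetical.

```python
import random
from collections import Counter


def undersample(docs, labels, seed=0):
    """Randomly drop examples so every class matches the size of the rarest class.

    Generic random undersampling; not necessarily the paper's procedure.
    """
    rng = random.Random(seed)

    # Group documents by their class label.
    by_class = {}
    for doc, label in zip(docs, labels):
        by_class.setdefault(label, []).append(doc)

    # Every class is reduced to the size of the smallest class.
    target = min(len(items) for items in by_class.values())

    balanced = []
    for label, items in by_class.items():
        for doc in rng.sample(items, target):
            balanced.append((doc, label))

    # Shuffle so that the classes are interleaved in the training set.
    rng.shuffle(balanced)
    return [d for d, _ in balanced], [l for _, l in balanced]


# Hypothetical toy corpus: 2 relevant documents among 10.
docs = [f"abstract {i}" for i in range(10)]
labels = ["relevant"] * 2 + ["irrelevant"] * 8

balanced_docs, balanced_labels = undersample(docs, labels)
counts = Counter(balanced_labels)
print(counts)
```

The balanced set would then be used to train a classifier such as an SVM. Undersampling discards majority-class examples, which is one reason to compare it against alternative balancing strategies.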

Keywords

Biomedical text mining; classification techniques; SVMs; imbalanced data

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • R. Romero (1)
  • E. L. Iglesias (1)
  • L. Borrajo (1)
  1. Univ. of Vigo, Ourense, Spain
