Building Biomedical Text Classifiers under Sample Selection Bias
Scientific papers are a primary source of information for investigators who want to learn the current state of a topic or compare their results with those of colleagues. However, mining biomedical texts remains a great challenge because of the huge volume of scientific literature stored in public databases and its imbalanced nature: only a very small number of papers are relevant to each user query. Classification in the presence of class imbalance is a major challenge for machine learning. Techniques such as support-vector machines (SVMs) perform excellently on balanced data but may fail when applied to imbalanced datasets. In this paper, we study the effects of undersampling, resampling and subsampling balancing strategies on four biomedical text classifiers (SVMs with linear, sigmoid, exponential and polynomial kernels, respectively). The best results were obtained by the normalized linear and sigmoid kernels using the subsampling balancing technique. These results are compared with those obtained by other authors on the TREC Genomics 2005 public corpus.
Keywords: Biomedical text mining, classification techniques, SVMs, imbalanced data
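The abstract does not define its three balancing strategies in detail, so the following is only a minimal sketch of one common approach, random undersampling of the majority class, and is not necessarily the procedure used in the paper. The helper name `random_undersample`, the seed and the toy documents and labels are all hypothetical and chosen purely for illustration.

```python
import random
from collections import Counter


def random_undersample(docs, labels, seed=0):
    """Hypothetical helper: randomly keep, for every class, only as many
    documents as the smallest class contains."""
    rng = random.Random(seed)

    # Group documents by their class label.
    by_class = {}
    for doc, label in zip(docs, labels):
        by_class.setdefault(label, []).append(doc)

    # Size of the smallest (minority) class.
    target = min(len(items) for items in by_class.values())

    # Sample `target` documents without replacement from each class.
    balanced = []
    for label, items in by_class.items():
        balanced.extend((doc, label) for doc in rng.sample(items, target))

    rng.shuffle(balanced)
    return [d for d, _ in balanced], [l for _, l in balanced]


# Toy, made-up data: 2 relevant (1) vs. 8 non-relevant (0) documents.
docs = [f"doc{i}" for i in range(10)]
labels = [1, 1] + [0] * 8

bal_docs, bal_labels = random_undersample(docs, labels)
print(Counter(bal_labels))  # both classes now have the same count
```

The balanced lists could then be passed to any classifier, for example an SVM with one of the kernels named in the abstract. Undersampling discards majority-class documents, which is the usual trade-off of this family of methods.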