Building Biomedical Text Classifiers under Sample Selection Bias

Romero, R.; Iglesias, E. L.; Borrajo, L.

doi:10.1007/978-3-642-19934-9_2

Part of the book series: Advances in Intelligent and Soft Computing (AINSC, volume 91)

Abstract

Scientific papers are a primary source of information for investigators who want to know the current status of a topic or to compare their results with those of colleagues. However, mining biomedical texts remains a great challenge because of the huge volume of scientific literature stored in public databases and its imbalanced nature, with only a very small number of papers relevant to each user query. Classification in the presence of data imbalance is a major challenge for machine learning. Techniques such as support vector machines (SVMs) perform very well on balanced data but may fail when applied to imbalanced datasets. In this paper, we study the effects of undersampling, resampling and subsampling balancing strategies on four different biomedical text classifiers (with linear, sigmoid, exponential and polynomial SVM kernels, respectively). The best results were obtained by normalized linear and sigmoid kernels using the subsampling balancing technique. These results have been compared with those obtained by other authors using the TREC Genomics 2005 public corpus.
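The abstract does not define the three balancing strategies, so the following is only a minimal sketch of two generic ideas behind class balancing: randomly discarding majority-class documents (undersampling) and randomly duplicating minority-class documents (oversampling with replacement). The function names, the toy corpus and the labels are hypothetical and are not taken from the chapter; the authors' own procedures may differ.

```python
import random
from collections import Counter


def group_by_class(samples, labels):
    """Map each class label to the list of samples carrying it."""
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    return by_class


def undersample(samples, labels, seed=0):
    """Randomly keep only as many examples per class as the rarest class has."""
    rng = random.Random(seed)
    by_class = group_by_class(samples, labels)
    target = min(len(xs) for xs in by_class.values())
    balanced = [(x, y) for y, xs in by_class.items() for x in rng.sample(xs, target)]
    rng.shuffle(balanced)
    return balanced


def oversample(samples, labels, seed=0):
    """Randomly duplicate examples (with replacement) up to the largest class size."""
    rng = random.Random(seed)
    by_class = group_by_class(samples, labels)
    target = max(len(xs) for xs in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        balanced.extend((x, y) for x in xs + extra)
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy corpus: 2 relevant documents out of 20.
docs = [f"doc{i}" for i in range(20)]
labels = ["relevant" if i < 2 else "irrelevant" for i in range(20)]

under = undersample(docs, labels)
over = oversample(docs, labels)
under_counts = Counter(y for _, y in under)
over_counts = Counter(y for _, y in over)
print(under_counts, over_counts)
```

Undersampling shrinks the training set to the size of the minority class and so discards information, while oversampling keeps every document at the cost of repeated minority examples. Either balanced set could then be passed to an SVM trainer such as LIBSVM.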

Author information

Authors and Affiliations

Univ. of Vigo, Campus As Lagoas s/n, 32004, Ourense, Spain
R. Romero, E. L. Iglesias & L. Borrajo

Editor information

Editors and Affiliations

Machine Intelligence Research Labs (MIR Labs), Scientific Network for Innovation and Research Excellence (SNIRE), P.O. Box 2259, 98071-2259, Auburn, WA, USA
Ajith Abraham
Department of Computing Science and Control, Faculty of Science, University of Salamanca, Plaza de la Merced S/N, 37008, Salamanca, Spain
Juan M. Corchado
Department of Computing Science, Faculty of Science, University of Salamanca, Plaza de la Merced S/N, 37008, Salamanca, Spain
Sara Rodríguez González & Juan F. De Paz Santana

About this paper

Cite this paper

Romero, R., Iglesias, E.L., Borrajo, L. (2011). Building Biomedical Text Classifiers under Sample Selection Bias. In: Abraham, A., Corchado, J.M., González, S.R., De Paz Santana, J.F. (eds) International Symposium on Distributed Computing and Artificial Intelligence. Advances in Intelligent and Soft Computing, vol 91. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19934-9_2

DOI: https://doi.org/10.1007/978-3-642-19934-9_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19933-2
Online ISBN: 978-3-642-19934-9
eBook Packages: Engineering, Engineering (R0)
