Using Existing Biomedical Resources to Detect and Ground Terms in Biomedical Literature

  • Kaarel Kaljurand
  • Fabio Rinaldi
  • Thomas Kappeler
  • Gerold Schneider
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5651)


We present an approach to the automatic detection of names of proteins, genes, species, etc. in biomedical literature and to their grounding to widely accepted identifiers. The annotation relies on three components: a large term list that contains the common expressions of the terms, a normalization step that matches these terms against their actual representations in the texts, and a disambiguation step that resolves the ambiguity of matched terms. We describe various characteristics of the terms found in existing term resources and of the terms that are used in biomedical texts. We evaluate our results against a corpus of manually annotated protein mentions and achieve a precision of 57% and a recall of 72%.
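
The three components named in the abstract (term list, normalization, disambiguation) can be illustrated with a minimal sketch. This is not the authors' implementation: the term entries, the identifiers (`prot:A`, `sp:mouse`, ...), the function names, the normalization rule (lowercasing and dropping non-alphanumeric characters), and the species-based disambiguation heuristic are all assumptions chosen for illustration.

```python
import re
from collections import defaultdict

# Hypothetical term list: (surface form, identifier, species). All values are placeholders.
PROTEIN_TERMS = [
    ("IL-2", "prot:A", "sp:human"),
    ("interleukin 2", "prot:A", "sp:human"),
    ("IL-2", "prot:B", "sp:mouse"),
    ("interleukin 2", "prot:B", "sp:mouse"),
]
SPECIES_TERMS = {"human": "sp:human", "mouse": "sp:mouse"}


def normalize(text):
    """Lowercase and drop everything except letters and digits (assumed rule)."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def build_index(terms):
    """Map each normalized surface form to its set of (identifier, species) candidates."""
    index = defaultdict(set)
    for surface, identifier, species in terms:
        index[normalize(surface)].add((identifier, species))
    return index


def ground(text, max_len=3):
    """Detect term mentions (longest match first) and ground them to identifiers."""
    index = build_index(PROTEIN_TERMS)
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    # Species mentioned anywhere in the text serve as disambiguation context.
    species_in_text = {
        SPECIES_TERMS[normalize(t)] for t in tokens if normalize(t) in SPECIES_TERMS
    }
    results = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = normalize("".join(tokens[i:i + n]))
            if key in index:
                candidates = index[key]
                # Keep candidates whose species occurs in the text; else keep all.
                kept = [c for c in candidates if c[1] in species_in_text] or candidates
                ids = sorted({identifier for identifier, _ in kept})
                results.append((" ".join(tokens[i:i + n]), ids))
                i += n
                break
        else:
            i += 1  # no term starts at this token
    return results
```

Under these assumptions, `ground("Interleukin-2 is secreted by mouse T cells; IL 2 binds its receptor.")` grounds both mentions to `prot:B`, because "mouse" is the only species in the text. A text with no species mention, such as `"IL-2 was measured."`, stays ambiguous and returns both `prot:A` and `prot:B`.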


Keywords: Biomedical Text; Token Sequence; Term List; Disambiguation Method; Gene Normalization Task
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Kaarel Kaljurand (1)
  • Fabio Rinaldi (1)
  • Thomas Kappeler (1)
  • Gerold Schneider (1)
  1. Institute of Computational Linguistics, University of Zurich, Switzerland
