Rule-Based Protein Term Identification with Help from Automatic Species Tagging

  • Xinglong Wang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4394)


In biomedical articles, a single term often refers to several different protein entities. For example, an arbitrary occurrence of the term p53 might denote any of thousands of proteins across a number of species. A human annotator can resolve this ambiguity relatively easily by looking at the context and, if necessary, searching an appropriate protein database. However, the ambiguity poses a serious problem for a text mining system, which does not understand human language and hence cannot identify the correct protein that the term refers to. In this paper, we present a Term Identification system that automatically assigns unique identifiers, as found in a protein database, to ambiguous protein mentions in text. Unlike other solutions described in the literature, which only handle gene/protein mentions for a specific model organism, our system is able to tackle protein mentions across many species by integrating a machine-learning-based species tagger. We have compared the performance of our automatic system to that of human annotators, with very promising results.
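The role of the species tag in resolving an ambiguous mention can be illustrated with a minimal sketch. This is not the system described in the paper: the lexicon, the placeholder identifiers, the cue words, and the function names `tag_species` and `identify` are all assumptions made for illustration, and a simple cue-word count stands in for the machine-learning-based species tagger.

```python
from collections import Counter

# Hypothetical toy lexicon: term -> {species -> placeholder database identifier}.
# The identifiers are invented placeholders, not real database accessions.
LEXICON = {
    "p53": {"human": "ID_P53_HUMAN", "mouse": "ID_P53_MOUSE"},
}

# Hypothetical cue words; a crude stand-in for a trained species tagger.
SPECIES_CUES = {
    "human": {"human", "patient", "patients"},
    "mouse": {"mouse", "mice", "murine"},
}


def tag_species(context):
    """Return the species whose cue words occur most often, or None if none occur."""
    tokens = [t.strip(".,;:()").lower() for t in context.split()]
    counts = Counter()
    for species, cues in SPECIES_CUES.items():
        counts[species] = sum(1 for t in tokens if t in cues)
    species, hits = counts.most_common(1)[0]
    return species if hits > 0 else None


def identify(term, context):
    """Map a protein mention to one identifier, using the species tag to disambiguate."""
    candidates = LEXICON.get(term.lower(), {})
    if len(candidates) == 1:
        # Unambiguous term: no species information needed.
        return next(iter(candidates.values()))
    # Ambiguous (or unknown) term: pick the candidate for the tagged species, if any.
    return candidates.get(tag_species(context))


print(identify("p53", "p53 expression in murine fibroblasts"))
```

In this sketch a mention with no species cue in its context is left unresolved (`None`) rather than guessed, which is one possible design choice among several.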


Heuristic Rule, Biomedical Text, Species Tagger, Human Annotator, Protein Entity
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Xinglong Wang (1)
  1. School of Informatics, University of Edinburgh, 2 Buccleuch Place, Edinburgh EH8 9LW, Scotland
