Algorithm for Grounding Mutation Mentions from Text to Protein Sequences

  • Jonas Bergman Laurila
  • Rajaraman Kanagasabai
  • Christopher J. O. Baker
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6254)


Protein mutations derived from in vitro experimental analysis are described in detail in scientific papers. Reusing mutation impact annotations is an important task in bioinformatics, and mutation grounding is a critical step in it. Presented here is a method for grounding textual mentions, taken from papers, of mutational changes to proteins. We distinguish between grounding mutation entities to protein database identifiers and grounding them to the correct positions on sequences extracted from protein databases. The grounding workflow coordinates the extraction of mutation, protein and organism mentions from texts and uses these to identify target sequences. Mutation mentions are sequentially mapped onto candidate proteins to facilitate their correct grounding to a protein sequence, independent of a protein-mutation tuple extraction task. Using a gold-standard corpus of full-text articles and corresponding protein sequences, we show high precision and recall, and we discuss novel aspects of the algorithm in the context of previous work.
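The idea of mapping mutation mentions onto candidate sequences can be illustrated with a minimal Python sketch. It is not the authors' implementation, whose details are not given in this abstract. It assumes that mentions are point mutations written as wild-type residue, 1-based position, mutant residue (for example "K2A"), that candidate sequences have already been retrieved, and that a numbering shift between the paper and the database sequence can be modelled by a small integer offset. All names (parse_mutation, count_grounded, ground_mutations, max_offset) are hypothetical.

```python
import re
from typing import Dict, List, Optional, Tuple

# Assumed mention format: wild-type residue, 1-based position, mutant residue.
MUTATION_RE = re.compile(
    r"^([ACDEFGHIKLMNPQRSTVWY])(\d+)([ACDEFGHIKLMNPQRSTVWY])$"
)

Mutation = Tuple[str, int, str]


def parse_mutation(mention: str) -> Optional[Mutation]:
    """Parse a mention such as 'K2A'; return None if it does not match."""
    match = MUTATION_RE.match(mention)
    if match is None:
        return None
    return match.group(1), int(match.group(2)), match.group(3)


def count_grounded(mutations: List[Mutation], sequence: str, offset: int = 0) -> int:
    """Count mutations whose wild-type residue occurs at position + offset."""
    count = 0
    for wild_type, position, _mutant in mutations:
        index = position - 1 + offset  # convert 1-based position to 0-based index
        if 0 <= index < len(sequence) and sequence[index] == wild_type:
            count += 1
    return count


def ground_mutations(
    mentions: List[str],
    candidates: Dict[str, str],
    max_offset: int = 0,
) -> Tuple[Optional[str], int, int]:
    """Return (protein_id, offset, matches) for the best-matching candidate.

    Offsets are tried in order of increasing absolute value, so an
    unshifted numbering wins ties. Returns (None, 0, 0) if nothing matches.
    """
    mutations = [m for m in map(parse_mutation, mentions) if m is not None]
    best: Tuple[Optional[str], int, int] = (None, 0, 0)
    for protein_id, sequence in candidates.items():
        for offset in sorted(range(-max_offset, max_offset + 1), key=abs):
            matches = count_grounded(mutations, sequence, offset)
            if matches > best[2]:
                best = (protein_id, offset, matches)
    return best


# Toy data, invented for illustration only.
candidates = {"P1": "MKTAYIAKQR", "P2": "MATAYLAKQR"}
print(ground_mutations(["K2A", "Y5F", "I6L"], candidates))
```

With this toy data, all three mentions agree with the residues of "P1" at the stated positions, while only one agrees with "P2", so "P1" is selected. A real system would also need to handle other mention formats, ties between candidates, and mentions belonging to different proteins in the same paper.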


Natural Language Processing, Mutation Extraction, Mutation Grounding, Sequence Analysis





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Jonas Bergman Laurila (1)
  • Rajaraman Kanagasabai (2)
  • Christopher J. O. Baker (1)
  1. University of New Brunswick, Saint John, New Brunswick, Canada
  2. Institute for Infocomm Research, Singapore
