Algorithm for Grounding Mutation Mentions from Text to Protein Sequences
Protein mutations derived from in vitro experimental analysis are described in detail in scientific papers. Reuse of mutation impact annotations is an important subfield of bioinformatics for which mutation grounding is a critical step. Presented here is a method for grounding of textual mentions from papers describing mutational changes to proteins. We distinguish between grounding of mutation entities to protein database identifiers and to the correct positions on sequences extracted from protein databases. The grounding workflow coordinates the extraction of mutation, protein and organism mentions from texts and uses these to identify target sequences. Mutation mentions are sequentially mapped onto candidate proteins to facilitate their correct grounding to a protein sequence, independent of a protein-mutation tuple extraction task. Using a gold standard corpus of full text articles and corresponding protein sequences we show high performance precision and recall and discuss novel aspects of the algorithm in the context of previous work.
KeywordsNatural Language Processing Mutation Extraction Mutation Grounding Sequence Analysis
Unable to display preview. Download preview PDF.
- 6.Coulet, A., Shah, N., Hunter, L., Barral, C., Altman, R.B.: Extraction of Genotype-Phenotype-Drug Relationships from Text: From Entity Recognition to Bioinformatics Application. In: Pacific Symposium on Biocomputing, vol. 15, pp. 485–487 (2010)Google Scholar
- 7.Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V.: GATE: A Framework And Graphical Development Environment For Robust NLP Tools And Applications. In: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics, ACL 2002 (2002)Google Scholar
- 8.Forbes, S.A., Bhamra, G., Bamford, S., Dawson, E., Kok, C., Clements, J., Menzies, A., Teague, J.W., Futreal, P.A., Stratton, M.R.: The Catalogue of Somatic Mutations in Cancer (COSMIC). Curr. Protoc. Hum. Genet. 57, 10.11.1–10.11.26 (2008)Google Scholar
- 13.Izarzugaza, J.M.G., Baresic, A., McMillan, L.E.M., Yeats, C., Clegg, A.B., Orengo, C.A., Martin, A.C.R., Valencia, A.: An integrated approach to the interpretation of Single Amino Acid Polymorphisms within the framework of CATH and Gene3D. BMC Bioinformatics 10(Suppl. 8), S5 (2009)CrossRefGoogle Scholar