Text Mining for Discovery of Host–Pathogen Interactions

  • Stephen Anthony
  • Vitali Sintchenko
  • Enrico Coiera


Text processing systems now support the information needs of professionals across many industries. Applications such as relationship extraction, information retrieval, document summarization, question answering, and multilingual machine translation have demonstrated practical utility in both accuracy and speed. These advances are driven largely by performance improvements in underlying technologies such as syntactic parsing, named entity recognition, and semantic interpretation. Text mining combines these and other language processing technologies to extract meaningful information from text. This chapter surveys the field of biomedical text mining and develops a case study that illustrates the available resources and the commonly applied technologies. The case study is designed to identify and extract relationships between genotypes, pathogens, and syndromes. Text processing makes it possible to transform such relationships from disparate unstructured text sources into structured repositories. Identifying and organizing relationships that are scattered across diverse fields and literatures can enhance our understanding of the complex machinery that drives biological processes.
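The step from unstructured text to structured records can be pictured with a minimal sketch. It is not the chapter's method: it assumes a simple dictionary lookup for entity recognition and sentence-level co-occurrence as the relation signal, and the lexicon entries, function names, and record fields are all hypothetical choices made for illustration.

```python
import re
from itertools import combinations

# Hypothetical toy lexicons (assumption); a real system would rely on
# curated terminologies rather than a hand-written dictionary.
LEXICON = {
    "genotype": ["Beijing genotype"],
    "pathogen": ["Mycobacterium tuberculosis"],
    "syndrome": ["tuberculous meningitis"],
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tag_entities(sentence):
    """Case-insensitive dictionary lookup; returns (type, term) pairs."""
    lowered = sentence.lower()
    found = []
    for entity_type, terms in LEXICON.items():
        for term in terms:
            if term.lower() in lowered:
                found.append((entity_type, term))
    return found


def extract_relations(text):
    """Emit one structured record per pair of differently typed entities
    that co-occur in the same sentence."""
    records = []
    for sentence in split_sentences(text):
        for (type_a, ent_a), (type_b, ent_b) in combinations(tag_entities(sentence), 2):
            if type_a != type_b:
                records.append({
                    "type_a": type_a,
                    "entity_a": ent_a,
                    "type_b": type_b,
                    "entity_b": ent_b,
                    "evidence": sentence,
                })
    return records


if __name__ == "__main__":
    # Invented sample text, used only to exercise the functions.
    sample = (
        "The Beijing genotype of Mycobacterium tuberculosis was isolated "
        "from the patient. Tuberculous meningitis was diagnosed later."
    )
    for record in extract_relations(sample):
        print(record)
```

Co-occurrence is a deliberately crude signal: it records that two entities appear in the same sentence, not that the sentence asserts a relationship between them. The parsing and machine learning techniques named in the abstract address that gap.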


Keywords: Gene Symbol; Entity Recognition; HUGO Gene Nomenclature Committee; Beijing Genotype; Biomedical Text Mining



Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  • Stephen Anthony (1)
  • Vitali Sintchenko
  • Enrico Coiera
  1. Centre for Health Informatics, University of New South Wales, Sydney, Australia
