Computational Grammars for Interrogation of Genomes

  • Jaron Schaeffer
  • Afra Held
  • Guy Tsafnat


Antibiotic resistance genes are embedded in mobile genetic elements (MGEs), which spread genes between organisms, even across species. MGEs are large structures composed of genes and protein interaction sites. Although a considerable number of microbial DNA sequences have been published, searching them for multi-resistant MGEs remains largely a manual task, usually combining BLAST searches with keyword-based searches through sequence annotations and the literature. Computational grammars can automate the recognition of arbitrarily complex sequence structures. In this chapter, we describe computational grammars, show how they can be used to automate MGE annotation, and give examples of the annotation that such grammars enable.
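As a rough illustration of the idea, the sketch below recognises one kind of structure in a list of already-annotated features. The grammar rules, the token names (`intI`, `attI`, `gene`, `attC`) and the function names are hypothetical simplifications chosen for this example; they are not the grammar described in the chapter.

```python
from typing import List, Optional, Tuple

# Hypothetical toy grammar (an assumption, not the chapter's rules):
#   integron -> "intI" "attI" cassette*
#   cassette -> "gene" "attC"
# Tokens stand for features already annotated on a sequence, in order.

ParseTree = Tuple[str, list]


def parse_cassette(tokens: List[str], pos: int) -> Optional[Tuple[ParseTree, int]]:
    """Match one cassette (a gene followed by an attC site) starting at pos."""
    if tokens[pos:pos + 2] == ["gene", "attC"]:
        return ("cassette", ["gene", "attC"]), pos + 2
    return None


def parse_integron(tokens: List[str], pos: int = 0) -> Optional[Tuple[ParseTree, int]]:
    """Match intI, attI, then zero or more cassettes; return (tree, end)."""
    if tokens[pos:pos + 2] != ["intI", "attI"]:
        return None
    children: list = ["intI", "attI"]
    pos += 2
    while (match := parse_cassette(tokens, pos)) is not None:
        subtree, pos = match
        children.append(subtree)
    return ("integron", children), pos


def find_integrons(tokens: List[str]) -> List[Tuple[int, int, ParseTree]]:
    """Scan left to right and collect (start, end, parse tree) for each match."""
    found = []
    pos = 0
    while pos < len(tokens):
        match = parse_integron(tokens, pos)
        if match is None:
            pos += 1
        else:
            tree, end = match
            found.append((pos, end, tree))
            pos = end
    return found


features = ["gene", "intI", "attI", "gene", "attC", "gene", "attC", "gene"]
for start, end, tree in find_integrons(features):
    print(start, end, tree)
```

Each match comes back as a nested tuple that plays the role of a parse tree, so the recognised structure (here, an array of two cassettes) can be reported as an annotation rather than as a flat list of hits.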


Keywords: Insertion sequence, Gene cassette, Parse tree, Grammar rule, Protein interaction site



This research was supported by a Capacity Building Grant from New South Wales Health.



Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  • Jaron Schaeffer (1)
  • Afra Held (1)
  • Guy Tsafnat (1)

  1. Centre for Health Informatics, University of New South Wales, Sydney, Australia
