De Novo Discovery of Structured ncRNA Motifs in Genomic Sequences

  • Walter L. Ruzzo
  • Jan Gorodkin
Part of the Methods in Molecular Biology book series (MIMB, volume 1097)


De novo discovery of “motifs” capturing the commonalities among related noncoding structured RNAs is one of the most difficult problems in computational biology. This chapter outlines the challenges presented by this problem, together with some approaches towards solving them, using an approach based on the CMfinder program as a case study. Applications to de novo genomic screens for novel structured ncRNAs, including structured RNA elements in untranslated portions of protein-coding genes, are presented.

Key words

CMfinder; Mutual information; ncRNA discovery; ncRNA gene; ncRNA motif; Riboswitch



This work is supported by the Danish Council for Independent Research (Technology and Production Sciences), the Danish Council for Strategic Research (Programme Commission on Strategic Growth Technologies), as well as the Danish Center for Scientific Computing.



Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Walter L. Ruzzo (1, 2, 3)
  • Jan Gorodkin (4)
  1. Fred Hutchinson Cancer Research Center, Seattle, USA
  2. Department of Computer Science & Engineering, University of Washington, Seattle, USA
  3. Department of Genome Sciences, University of Washington, Seattle, USA
  4. Center for non-coding RNA in Technology and Health, IKVH, University of Copenhagen, Frederiksberg C, Denmark
