Abstract
De novo discovery of “motifs” capturing the commonalities among related noncoding structured RNAs is among the most difficult problems in computational biology. This chapter outlines the challenges this problem presents, together with some approaches to solving them, with an emphasis on the CMfinder program as a case study. Applications to de novo genomic screens for novel structured ncRNAs, including structured RNA elements in untranslated portions of protein-coding genes, are presented.
Acknowledgements
This work is supported by the Danish Council for Independent Research (Technology and Production Sciences), the Danish Council for Strategic Research (Programme Commission on Strategic Growth Technologies), as well as the Danish Center for Scientific Computing.
Copyright information
© 2014 Springer Science+Business Media New York
About this protocol
Cite this protocol
Ruzzo, W.L., Gorodkin, J. (2014). De Novo Discovery of Structured ncRNA Motifs in Genomic Sequences. In: Gorodkin, J., Ruzzo, W. (eds) RNA Sequence, Structure, and Function: Computational and Bioinformatic Methods. Methods in Molecular Biology, vol 1097. Humana Press, Totowa, NJ. https://doi.org/10.1007/978-1-62703-709-9_15
DOI: https://doi.org/10.1007/978-1-62703-709-9_15
Publisher Name: Humana Press, Totowa, NJ
Print ISBN: 978-1-62703-708-2
Online ISBN: 978-1-62703-709-9
eBook Packages: Springer Protocols