Skip to main content
Log in

Evaluation of a sophisticated SCFG design for RNA secondary structure prediction

  • Original Paper
  • Published:
Theory in Biosciences Aims and scope Submit manuscript

Abstract

Predicting secondary structures of RNA molecules is one of the fundamental problems of and thus a challenging task in computational structural biology. Over the past decades, mainly two different approaches have been considered to compute predictions of RNA secondary structures from a single sequence: the first one relies on physics-based and the other on probabilistic RNA models. Particularly, the free energy minimization (MFE) approach is usually considered the most popular and successful method. Moreover, based on the paradigm-shifting work by McCaskill which proposes the computation of partition functions (PFs) and base pair probabilities based on thermodynamics, several extended partition function algorithms, statistical sampling methods and clustering techniques have been invented over the last years. However, the accuracy of the corresponding algorithms is limited by the quality of underlying physics-based models, which include a vast number of thermodynamic parameters and are still incomplete. The competing probabilistic approach is based on stochastic context-free grammars (SCFGs) or corresponding generalizations, like conditional log-linear models (CLLMs). These methods abstract from free energies and instead try to learn about the structural behavior of the molecules by learning (a manageable number of) probabilistic parameters from trusted RNA structure databases. In this work, we introduce and evaluate a sophisticated SCFG design that mirrors state-of-the-art physics-based RNA structure prediction procedures by distinguishing between all features of RNA that imply different energy rules. This SCFG actually serves as the foundation for a statistical sampling algorithm for RNA secondary structures of a single sequence that represents a probabilistic counterpart to the sampling extension of the PF approach. Furthermore, some new ways to derive meaningful structure predictions from generated sample sets are presented. They are used to compare the predictive accuracy of our model to that of other probabilistic and energy-based prediction methods. Particularly, comparisons to lightweight SCFGs and corresponding CLLMs for RNA structure prediction indicate that more complex SCFG designs might yield higher accuracy but eventually require more comprehensive and pure training sets. Investigations on both the accuracies of predicted foldings and the overall quality of generated sample sets (especially on an abstraction level, called abstract shapes of generated structures, that is relevant for biologists) yield the conclusion that the Boltzmann distribution of the PF sampling approach is more centered than the ensemble distribution induced by the sophisticated SCFG model, which implies a greater structural diversity within generated samples. In general, neither of the two distinct ensemble distributions is more adequate than the other and the corresponding results obtained by statistical sampling can be expected to bare fundamental differences, such that the method to be preferred for a particular input sequence strongly depends on the considered RNA type.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1

Similar content being viewed by others

Notes

  1. In this article, we call an approach probabilistic if it makes no use of free energy-based models; even if a PF-based Boltzmann sample is a random event, we accordingly do not assume it probabilistic.

  2. A structurally ambiguous SCFG mirror of modern energy-based algorithms for single sequence structure prediction has already been described in Rivas and Eddy (2000).

  3. Note that in order to avoid ambiguity, we will denote a particular sequence by r and the corresponding secondary structure by s in the sequel.

  4. To ensure that a SCFG gets consistent one can, e.g., assign relative frequencies to the productions, which are computed by counting the production rules used in the leftmost derivations of a finite sample (RNA database) of words from the generated language (Chaudhuri et al. 1983).

  5. Here, we decided to consider any possible pair as valid base pair, where non-canonical ones are mostly prohibited due to small probabilities. Thus, in contrast to the thermodynamics-based PF approach which can only handle canonical base pairs, our algorithm is able to deal with arbitrary base pairs, in a convenient way: when using appropriate probabilities, canonical base pairs will be very likely and non-canonical ones will be very unprobable (but not necessarily impossible) to be formed. However, since non-canonical base pairs are usually not permitted in secondary structure models (to limit the number of possible foldings), it would also be adequate to allow only canonical ones. The probabilities for non-canonical base pairs would then be equal to zero.

  6. Note that this separation into transition and emission probabilities corresponds to the standard treatment applied in hidden Markov models.

  7. All references starting with Sm are references to the Supplementary material available at http://wwwagak.cs.uni-kl.de/research/publications/.

  8. Both methods can be implemented to run in \({{\mathcal O}(n^{3})}\) time and with \({{\mathcal O}(n^{2})}\) space requirements for a sequence of length n, where a single secondary structure can be drawn in \({{\mathcal O}(n^{2})}\) time.

  9. Note that the positive predictive value is often called specificity, like, for example, in Do et al. (2006), which will be extensively referenced in the sequel.

  10. It should be clear that this formula would only be correct for \(m_s = 1.\) For choices of \(m_s > 1,\) however, it would only yield approximate results.

  11. Note that the most probable structure is assumed to be (nearly) the MFE structure when sampling is realized via PFs.

  12. TN is the number of base pairs which were correctly predicted as non-matching (true negatives).

  13. In order to perform a k-fold cross-validation, k ≥ 2, on the basis of a given probabilistic model and a set of real-world data, we first have to partition the data randomly into k approximately equal-sized subsets (“folds”). Then, for any \(i \in \{ 1, \ldots, k \},\) we must estimate the model parameters from all objects that are not contained in fold i (training set) and validate the results obtained for all objects that actually belong to fold i (benchmark set). The corresponding result of the cross-validation process is then the average of the results derived for the different folds i, 1 ≤ i ≤ k.

  14. This sample size has proven to be adequate for most applications, as even for a huge set of possible secondary structures of a given sequence, a sample of only 1,000 structures can yield statistical reproducibility of typical sampling statistics, even if samples can be entirely different (see Ding and Lawrence 2003).

  15. Note that in the few cases where the PF approach yields more different shapes, we generally further restricted the possible structures by prohibiting isolated base pairs \((\hbox{min}_{\rm hel}> 1),\) which are in fact allowed in PF calculations.

References

  • Baldi P, Brunak S, Chauvin Y, Andersen CA, Nielsen H (2000) Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics 16(5):412–424

    Article  PubMed  CAS  Google Scholar 

  • Chaudhuri R, Pham S, Garcia ON (1983) Solution to an open problem on probabilistic grammars. IEEE Trans Comput C 32(8):748–750

    Article  Google Scholar 

  • Ding Y (2006) Statistical and bayesian approaches to RNA secondary structure prediction. RNA 12:323–331

    Article  PubMed  CAS  Google Scholar 

  • Ding Y, Lawrence CE (2003) A statistical sampling algorithm for RNA secondary structure prediction. Nucleic Acids Res 31(24):7280–7301

    Article  PubMed  CAS  Google Scholar 

  • Ding Y, Chan CY, Lawrence CE (2004) Sfold web server for statistical folding and rational design of nucleic acids. Nucleic Acids Res 32:W135–W141

    Article  PubMed  CAS  Google Scholar 

  • Ding Y, Yu Chan C, Lawrence CE (2005) RNA secondary structure prediction by centroids in a Boltzmann weighted ensemble. RNA 11:1157–1166

    Article  PubMed  CAS  Google Scholar 

  • Dirks RM, Pierce NA (2003) A partition function algorithm for nucleic acid secondary structure including pseudoknots. J Comput Chem 24:1664–1677

    Article  PubMed  CAS  Google Scholar 

  • Dirks RM, Pierce NA (2004) An algorithm for computing nucleic acid base-pairing probabilities including pseudoknots. J Comput Chem 25:1295–1304

    Article  PubMed  CAS  Google Scholar 

  • Do CB, Woods DA, Batzoglou S (2006) CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics 22(14):e90–e98

    Article  PubMed  CAS  Google Scholar 

  • Dowell RD, Eddy SR (2004) Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction. BMC Bioinform 5:71

    Article  Google Scholar 

  • Eddy SR, Durbin R (1994) RNA sequence analysis using covariance models. Nucleic Acids Res 2(11):2079–2088

    Article  Google Scholar 

  • Fu KS, Huang T (1972) Stochastic grammars and languages. Int J Comput Inform Sci 1(2):135–170

    Article  Google Scholar 

  • Giegerich R, zu Siederdissen R (2011) Semantics and ambiguity of stochastic RNA family models. IEEE/ACM Trans Comput Biol Bioinform 8:499–516

    Article  PubMed  CAS  Google Scholar 

  • Giegerich R, Voß B, Rehmsmeier M (2004) Abstract shapes of RNA. Nucleic Acids Res 32(16):4843–4851

    Article  PubMed  CAS  Google Scholar 

  • Goodman JT (1998) Parsing inside-out. PhD thesis, Harvard University, Cambridge, MA

  • Goodman J (1999) Semiring parsing. Comput Linguist 25(4):573–605

    Google Scholar 

  • Griffiths-Jones S, Bateman A, Marshall M, Khanna A, Eddy SR (2003) Rfam: an RNA family database. Nucleic Acids Res 31(1):439–441

    Article  PubMed  CAS  Google Scholar 

  • Griffiths-Jones S, Moxon S, Marshall M, Khanna A, Eddy SR, Bateman A (2005) Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res 33:D121–D124

    Article  PubMed  CAS  Google Scholar 

  • Hamada M, Kiryu H, Sato K, Mituyama T, Asai K (2009) Prediction of RNA secondary structure using generalized centroid estimators. Bioinformatics 25(4):465–473

    Article  PubMed  CAS  Google Scholar 

  • Hofacker IL (2003) The Vienna RNA secondary structure server. Nucleic Acids Res 31(13):3429–3431

    Article  PubMed  CAS  Google Scholar 

  • Hofacker IL, Fontana W, Stadler PF, Bonhoeffer LS, Schuster P (1994) Fast folding and comparison of RNA secondary structures (the Vienna RNA package). Monatsh Chem 125:167–188

    Article  CAS  Google Scholar 

  • Huang T, Fu KS (1971) On stochastic context-free languages. Inform Sci 3:201–224

    Article  Google Scholar 

  • Janssen S, Reeder J, Giegerich R (2008) Shape based indexing for faster search of RNA family databases. BMC Bioinform 9:131

    Google Scholar 

  • Knudsen B, Hein J (1999) RNA secondary structure prediction using stochastic context-free grammars and evolutionary history. Bioinformatics 15(6):446–454

    Article  PubMed  CAS  Google Scholar 

  • Knudsen B, Hein J (2003) Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucleic Acids Res 31(13):3423–3428

    Article  PubMed  CAS  Google Scholar 

  • Mathews DH, Sabina J, Zuker M, Turner DH (1999) Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J Mol Biol 288:911–940

    Article  PubMed  CAS  Google Scholar 

  • McCaskill JS (1990) The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers 29:1105–1119

    Article  PubMed  CAS  Google Scholar 

  • Nebel ME, Scheid A (2009) On quantitative effects of RNA shape abstraction. Theory Biosci 128(4):211

    Article  PubMed  CAS  Google Scholar 

  • Nebel ME, Scheid A, Weinberg F (2011) Random generation of RNA secondary structures according to native distributions. Algorithms Mol Biol 6(1):24

    Google Scholar 

  • Nussinov R, Jacobson AB (1980) Fast algorithms for predicting the secondary structure of single-stranded RNA. Proc Natl Acad Sci USA 77(11):6309–6313

    Article  PubMed  CAS  Google Scholar 

  • Nussinov R, Pieczenik G, Griggs JR, Kleitman DJ (1978) Algorithms for loop matchings. SIAM J Appl Math 35:68–82

    Article  Google Scholar 

  • Reeder J, Giegerich R (2004) Design, implementation and evaluation of a practical pseudoknot folding algorithm based on thermodynamics. BMC Bioinform 5:104

    Article  Google Scholar 

  • Rivas E, Eddy SR (1999) A dynamic programming algorithm for RNA structure prediction including pseudoknots. J Mol Biol 285:2053–2068

    Article  PubMed  CAS  Google Scholar 

  • Rivas E, Eddy SR (2000) Secondary structure alone is generally not statistically significant for the detection of noncoding RNAs. Bioinformatics 6:583–605

    Article  Google Scholar 

  • Rivas E, Lang R, Eddy SR (2011) A range of complex probabilistic models for RNA secondary structure prediction that include the nearest neighbor model and more

  • Rozenski J, Crain PF, McCloskey JA (1999) The RNA modification database. Nucleic Acids Res 27:196–197

    Article  PubMed  CAS  Google Scholar 

  • Ruan J, Stormo GD, Zhang W (2004) An iterated loop matching approach to the prediction of RNA secondary structures with pseudoknots. Bioinformatics 20(1):58–66

    Article  PubMed  CAS  Google Scholar 

  • Sprinzl M, Horn C, Brown M, Ioudovitch A, Steinberg S (1998) Compilation of tRNA sequences and sequences of tRNA genes. Nucleic Acids Res. 26:148–153

    Article  PubMed  CAS  Google Scholar 

  • Steffen P, Voß B, Rehmsmeier M, Reeder J, Giegerich R (2006) RNAshapes 2.1.1 manual

  • Szymanski M, Barciszewska MZ, Erdmann VA, Barciszewski J (2002) 5s ribosomal RNA database. Nucleic Acids Res 30:176–178

    Article  PubMed  CAS  Google Scholar 

  • Viennot G, Vauchaussade De Chaumont M (1985) Enumeration of RNA secondary structures by complexity. Math Med Biol. Lect Notes Biomath 57:360–365

  • Waterman MS (1978) Secondary structure of single-stranded nucleic acids. Adv Math Suppl Stud 1:167–212

    Google Scholar 

  • Weinberg F, Nebel ME (2011) Applying length-dependent stochastic context-free grammars to RNA secondary structure prediction. Algorithms 4(4):223–238

    Article  CAS  Google Scholar 

  • Wuchty S, Fontana W, Hofacker I, Schuster P (1999) Complete suboptimal folding of RNA and the stability of secondary structures. Biopolymers 49:145–165

    Article  PubMed  CAS  Google Scholar 

  • Xia T, SantaLucia J Jr, Burkard ME, Kierzek R, Schroeder SJ, Jiao X, Cox C, Turner DH (1998) Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson-Crick base pairs. Biochemistry 37:14719–14735

    Article  PubMed  CAS  Google Scholar 

  • Zuker M (1989) On finding all suboptimal foldings of an RNA molecule. Science 244:48–52

    Article  PubMed  CAS  Google Scholar 

  • Zuker M (2003) Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res 31(13):3406–3415

    Article  PubMed  CAS  Google Scholar 

  • Zuker M, Stiegler P (1981) Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res 9:133–148

    Article  PubMed  CAS  Google Scholar 

  • Zuker M, Mathews DH, Turner DH (1999) Algorithms and thermodynamics for RNA secondary structure prediction: a practical guide. In: Barciszewski J, Clark BFC (eds) RNA biochemistry and biotechnology. NATO ASI series. Kluwer Academic Publishers, Dordrecht, pp 11–43

Download references

Acknowledgments

The authors wish to thank two anonymous reviewers for their careful and helpful remarks and suggestions made for a previous version of this article.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Anika Scheid.

Additional information

The research of Anika Scheid is supported by Carl-Zeiss foundation.

Electronic supplementary material

Below is the link to the electronic supplementary material.

PDF (596 KB)

Rights and permissions

Reprints and permissions

About this article

Cite this article

Nebel, M.E., Scheid, A. Evaluation of a sophisticated SCFG design for RNA secondary structure prediction. Theory Biosci. 130, 313–336 (2011). https://doi.org/10.1007/s12064-011-0139-7

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s12064-011-0139-7

Keywords

Navigation