Evaluation of a sophisticated SCFG design for RNA secondary structure prediction

Nebel, Markus E.; Scheid, Anika

doi:10.1007/s12064-011-0139-7

Evaluation of a sophisticated SCFG design for RNA secondary structure prediction

Original Paper
Published: 02 December 2011

Volume 130, pages 313–336, (2011)
Cite this article

Theory in Biosciences Aims and scope Submit manuscript

Markus E. Nebel¹ &
Anika Scheid¹

289 Accesses
12 Citations
Explore all metrics

Abstract

Predicting secondary structures of RNA molecules is one of the fundamental problems of and thus a challenging task in computational structural biology. Over the past decades, mainly two different approaches have been considered to compute predictions of RNA secondary structures from a single sequence: the first one relies on physics-based and the other on probabilistic RNA models. Particularly, the free energy minimization (MFE) approach is usually considered the most popular and successful method. Moreover, based on the paradigm-shifting work by McCaskill which proposes the computation of partition functions (PFs) and base pair probabilities based on thermodynamics, several extended partition function algorithms, statistical sampling methods and clustering techniques have been invented over the last years. However, the accuracy of the corresponding algorithms is limited by the quality of underlying physics-based models, which include a vast number of thermodynamic parameters and are still incomplete. The competing probabilistic approach is based on stochastic context-free grammars (SCFGs) or corresponding generalizations, like conditional log-linear models (CLLMs). These methods abstract from free energies and instead try to learn about the structural behavior of the molecules by learning (a manageable number of) probabilistic parameters from trusted RNA structure databases. In this work, we introduce and evaluate a sophisticated SCFG design that mirrors state-of-the-art physics-based RNA structure prediction procedures by distinguishing between all features of RNA that imply different energy rules. This SCFG actually serves as the foundation for a statistical sampling algorithm for RNA secondary structures of a single sequence that represents a probabilistic counterpart to the sampling extension of the PF approach. Furthermore, some new ways to derive meaningful structure predictions from generated sample sets are presented. They are used to compare the predictive accuracy of our model to that of other probabilistic and energy-based prediction methods. Particularly, comparisons to lightweight SCFGs and corresponding CLLMs for RNA structure prediction indicate that more complex SCFG designs might yield higher accuracy but eventually require more comprehensive and pure training sets. Investigations on both the accuracies of predicted foldings and the overall quality of generated sample sets (especially on an abstraction level, called abstract shapes of generated structures, that is relevant for biologists) yield the conclusion that the Boltzmann distribution of the PF sampling approach is more centered than the ensemble distribution induced by the sophisticated SCFG model, which implies a greater structural diversity within generated samples. In general, neither of the two distinct ensemble distributions is more adequate than the other and the corresponding results obtained by statistical sampling can be expected to bare fundamental differences, such that the method to be preferred for a particular input sequence strongly depends on the considered RNA type.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Predicting RNA Secondary Structures: One-grammar-fits-all Solution

SCFGs in RNA Secondary Structure Prediction : A Hands-on Approach

Fast RNA Secondary Structure Prediction Using Fuzzy Stochastic Models

Notes

In this article, we call an approach probabilistic if it makes no use of free energy-based models; even if a PF-based Boltzmann sample is a random event, we accordingly do not assume it probabilistic.
A structurally ambiguous SCFG mirror of modern energy-based algorithms for single sequence structure prediction has already been described in Rivas and Eddy (2000).
Note that in order to avoid ambiguity, we will denote a particular sequence by r and the corresponding secondary structure by s in the sequel.
To ensure that a SCFG gets consistent one can, e.g., assign relative frequencies to the productions, which are computed by counting the production rules used in the leftmost derivations of a finite sample (RNA database) of words from the generated language (Chaudhuri et al. 1983).
Here, we decided to consider any possible pair as valid base pair, where non-canonical ones are mostly prohibited due to small probabilities. Thus, in contrast to the thermodynamics-based PF approach which can only handle canonical base pairs, our algorithm is able to deal with arbitrary base pairs, in a convenient way: when using appropriate probabilities, canonical base pairs will be very likely and non-canonical ones will be very unprobable (but not necessarily impossible) to be formed. However, since non-canonical base pairs are usually not permitted in secondary structure models (to limit the number of possible foldings), it would also be adequate to allow only canonical ones. The probabilities for non-canonical base pairs would then be equal to zero.
Note that this separation into transition and emission probabilities corresponds to the standard treatment applied in hidden Markov models.
All references starting with Sm are references to the Supplementary material available at http://wwwagak.cs.uni-kl.de/research/publications/.
Both methods can be implemented to run in \({{\mathcal O}(n^{3})}\) time and with \({{\mathcal O}(n^{2})}\) space requirements for a sequence of length n, where a single secondary structure can be drawn in \({{\mathcal O}(n^{2})}\) time.
Note that the positive predictive value is often called specificity, like, for example, in Do et al. (2006), which will be extensively referenced in the sequel.
It should be clear that this formula would only be correct for \(m_s = 1.\) For choices of \(m_s > 1,\) however, it would only yield approximate results.
Note that the most probable structure is assumed to be (nearly) the MFE structure when sampling is realized via PFs.
TN is the number of base pairs which were correctly predicted as non-matching (true negatives).
In order to perform a k-fold cross-validation, k ≥ 2, on the basis of a given probabilistic model and a set of real-world data, we first have to partition the data randomly into k approximately equal-sized subsets (“folds”). Then, for any \(i \in \{ 1, \ldots, k \},\) we must estimate the model parameters from all objects that are not contained in fold i (training set) and validate the results obtained for all objects that actually belong to fold i (benchmark set). The corresponding result of the cross-validation process is then the average of the results derived for the different folds i, 1 ≤ i ≤ k.
This sample size has proven to be adequate for most applications, as even for a huge set of possible secondary structures of a given sequence, a sample of only 1,000 structures can yield statistical reproducibility of typical sampling statistics, even if samples can be entirely different (see Ding and Lawrence 2003).
Note that in the few cases where the PF approach yields more different shapes, we generally further restricted the possible structures by prohibiting isolated base pairs \((\hbox{min}_{\rm hel}> 1),\) which are in fact allowed in PF calculations.

References

Baldi P, Brunak S, Chauvin Y, Andersen CA, Nielsen H (2000) Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics 16(5):412–424
Article PubMed CAS Google Scholar
Chaudhuri R, Pham S, Garcia ON (1983) Solution to an open problem on probabilistic grammars. IEEE Trans Comput C 32(8):748–750
Article Google Scholar
Ding Y (2006) Statistical and bayesian approaches to RNA secondary structure prediction. RNA 12:323–331
Article PubMed CAS Google Scholar
Ding Y, Lawrence CE (2003) A statistical sampling algorithm for RNA secondary structure prediction. Nucleic Acids Res 31(24):7280–7301
Article PubMed CAS Google Scholar
Ding Y, Chan CY, Lawrence CE (2004) Sfold web server for statistical folding and rational design of nucleic acids. Nucleic Acids Res 32:W135–W141
Article PubMed CAS Google Scholar
Ding Y, Yu Chan C, Lawrence CE (2005) RNA secondary structure prediction by centroids in a Boltzmann weighted ensemble. RNA 11:1157–1166
Article PubMed CAS Google Scholar
Dirks RM, Pierce NA (2003) A partition function algorithm for nucleic acid secondary structure including pseudoknots. J Comput Chem 24:1664–1677
Article PubMed CAS Google Scholar
Dirks RM, Pierce NA (2004) An algorithm for computing nucleic acid base-pairing probabilities including pseudoknots. J Comput Chem 25:1295–1304
Article PubMed CAS Google Scholar
Do CB, Woods DA, Batzoglou S (2006) CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics 22(14):e90–e98
Article PubMed CAS Google Scholar
Dowell RD, Eddy SR (2004) Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction. BMC Bioinform 5:71
Article Google Scholar
Eddy SR, Durbin R (1994) RNA sequence analysis using covariance models. Nucleic Acids Res 2(11):2079–2088
Article Google Scholar
Fu KS, Huang T (1972) Stochastic grammars and languages. Int J Comput Inform Sci 1(2):135–170
Article Google Scholar
Giegerich R, zu Siederdissen R (2011) Semantics and ambiguity of stochastic RNA family models. IEEE/ACM Trans Comput Biol Bioinform 8:499–516
Article PubMed CAS Google Scholar
Giegerich R, Voß B, Rehmsmeier M (2004) Abstract shapes of RNA. Nucleic Acids Res 32(16):4843–4851
Article PubMed CAS Google Scholar
Goodman JT (1998) Parsing inside-out. PhD thesis, Harvard University, Cambridge, MA
Goodman J (1999) Semiring parsing. Comput Linguist 25(4):573–605
Google Scholar
Griffiths-Jones S, Bateman A, Marshall M, Khanna A, Eddy SR (2003) Rfam: an RNA family database. Nucleic Acids Res 31(1):439–441
Article PubMed CAS Google Scholar
Griffiths-Jones S, Moxon S, Marshall M, Khanna A, Eddy SR, Bateman A (2005) Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res 33:D121–D124
Article PubMed CAS Google Scholar
Hamada M, Kiryu H, Sato K, Mituyama T, Asai K (2009) Prediction of RNA secondary structure using generalized centroid estimators. Bioinformatics 25(4):465–473
Article PubMed CAS Google Scholar
Hofacker IL (2003) The Vienna RNA secondary structure server. Nucleic Acids Res 31(13):3429–3431
Article PubMed CAS Google Scholar
Hofacker IL, Fontana W, Stadler PF, Bonhoeffer LS, Schuster P (1994) Fast folding and comparison of RNA secondary structures (the Vienna RNA package). Monatsh Chem 125:167–188
Article CAS Google Scholar
Huang T, Fu KS (1971) On stochastic context-free languages. Inform Sci 3:201–224
Article Google Scholar
Janssen S, Reeder J, Giegerich R (2008) Shape based indexing for faster search of RNA family databases. BMC Bioinform 9:131
Google Scholar
Knudsen B, Hein J (1999) RNA secondary structure prediction using stochastic context-free grammars and evolutionary history. Bioinformatics 15(6):446–454
Article PubMed CAS Google Scholar
Knudsen B, Hein J (2003) Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucleic Acids Res 31(13):3423–3428
Article PubMed CAS Google Scholar
Mathews DH, Sabina J, Zuker M, Turner DH (1999) Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J Mol Biol 288:911–940
Article PubMed CAS Google Scholar
McCaskill JS (1990) The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers 29:1105–1119
Article PubMed CAS Google Scholar
Nebel ME, Scheid A (2009) On quantitative effects of RNA shape abstraction. Theory Biosci 128(4):211
Article PubMed CAS Google Scholar
Nebel ME, Scheid A, Weinberg F (2011) Random generation of RNA secondary structures according to native distributions. Algorithms Mol Biol 6(1):24
Google Scholar
Nussinov R, Jacobson AB (1980) Fast algorithms for predicting the secondary structure of single-stranded RNA. Proc Natl Acad Sci USA 77(11):6309–6313
Article PubMed CAS Google Scholar
Nussinov R, Pieczenik G, Griggs JR, Kleitman DJ (1978) Algorithms for loop matchings. SIAM J Appl Math 35:68–82
Article Google Scholar
Reeder J, Giegerich R (2004) Design, implementation and evaluation of a practical pseudoknot folding algorithm based on thermodynamics. BMC Bioinform 5:104
Article Google Scholar
Rivas E, Eddy SR (1999) A dynamic programming algorithm for RNA structure prediction including pseudoknots. J Mol Biol 285:2053–2068
Article PubMed CAS Google Scholar
Rivas E, Eddy SR (2000) Secondary structure alone is generally not statistically significant for the detection of noncoding RNAs. Bioinformatics 6:583–605
Article Google Scholar
Rivas E, Lang R, Eddy SR (2011) A range of complex probabilistic models for RNA secondary structure prediction that include the nearest neighbor model and more
Rozenski J, Crain PF, McCloskey JA (1999) The RNA modification database. Nucleic Acids Res 27:196–197
Article PubMed CAS Google Scholar
Ruan J, Stormo GD, Zhang W (2004) An iterated loop matching approach to the prediction of RNA secondary structures with pseudoknots. Bioinformatics 20(1):58–66
Article PubMed CAS Google Scholar
Sprinzl M, Horn C, Brown M, Ioudovitch A, Steinberg S (1998) Compilation of tRNA sequences and sequences of tRNA genes. Nucleic Acids Res. 26:148–153
Article PubMed CAS Google Scholar
Steffen P, Voß B, Rehmsmeier M, Reeder J, Giegerich R (2006) RNAshapes 2.1.1 manual
Szymanski M, Barciszewska MZ, Erdmann VA, Barciszewski J (2002) 5s ribosomal RNA database. Nucleic Acids Res 30:176–178
Article PubMed CAS Google Scholar
Viennot G, Vauchaussade De Chaumont M (1985) Enumeration of RNA secondary structures by complexity. Math Med Biol. Lect Notes Biomath 57:360–365
Waterman MS (1978) Secondary structure of single-stranded nucleic acids. Adv Math Suppl Stud 1:167–212
Google Scholar
Weinberg F, Nebel ME (2011) Applying length-dependent stochastic context-free grammars to RNA secondary structure prediction. Algorithms 4(4):223–238
Article CAS Google Scholar
Wuchty S, Fontana W, Hofacker I, Schuster P (1999) Complete suboptimal folding of RNA and the stability of secondary structures. Biopolymers 49:145–165
Article PubMed CAS Google Scholar
Xia T, SantaLucia J Jr, Burkard ME, Kierzek R, Schroeder SJ, Jiao X, Cox C, Turner DH (1998) Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson-Crick base pairs. Biochemistry 37:14719–14735
Article PubMed CAS Google Scholar
Zuker M (1989) On finding all suboptimal foldings of an RNA molecule. Science 244:48–52
Article PubMed CAS Google Scholar
Zuker M (2003) Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res 31(13):3406–3415
Article PubMed CAS Google Scholar
Zuker M, Stiegler P (1981) Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res 9:133–148
Article PubMed CAS Google Scholar
Zuker M, Mathews DH, Turner DH (1999) Algorithms and thermodynamics for RNA secondary structure prediction: a practical guide. In: Barciszewski J, Clark BFC (eds) RNA biochemistry and biotechnology. NATO ASI series. Kluwer Academic Publishers, Dordrecht, pp 11–43

Download references

Acknowledgments

The authors wish to thank two anonymous reviewers for their careful and helpful remarks and suggestions made for a previous version of this article.

Author information

Authors and Affiliations

Department of Computer Science, University of Kaiserslautern, Kaiserslautern, Germany
Markus E. Nebel & Anika Scheid

Authors

Markus E. Nebel
View author publications
You can also search for this author in PubMed Google Scholar
Anika Scheid
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Anika Scheid.

Additional information

The research of Anika Scheid is supported by Carl-Zeiss foundation.

Electronic supplementary material

Below is the link to the electronic supplementary material.

PDF (596 KB)

Rights and permissions

Reprints and permissions

About this article

Cite this article

Nebel, M.E., Scheid, A. Evaluation of a sophisticated SCFG design for RNA secondary structure prediction. Theory Biosci. 130, 313–336 (2011). https://doi.org/10.1007/s12064-011-0139-7

Download citation

Received: 20 July 2011
Accepted: 07 November 2011
Published: 02 December 2011
Issue Date: December 2011
DOI: https://doi.org/10.1007/s12064-011-0139-7

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Evaluation of a sophisticated SCFG design for RNA secondary structure prediction

Abstract

Access this article

Similar content being viewed by others

Predicting RNA Secondary Structures: One-grammar-fits-all Solution

SCFGs in RNA Secondary Structure Prediction : A Hands-on Approach

Fast RNA Secondary Structure Prediction Using Fuzzy Stochastic Models

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Additional information

Electronic supplementary material

PDF (596 KB)

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Evaluation of a sophisticated SCFG design for RNA secondary structure prediction

Abstract

Access this article

Similar content being viewed by others

Predicting RNA Secondary Structures: One-grammar-fits-all Solution

SCFGs in RNA Secondary Structure Prediction : A Hands-on Approach

Fast RNA Secondary Structure Prediction Using Fuzzy Stochastic Models

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Additional information

Electronic supplementary material

PDF (596 KB)

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation