Asymptotic structural properties of quasi-random saturated structures of RNA
RNA folding depends on the distribution of kinetic traps in the landscape of all secondary structures. Kinetic traps in the Nussinov energy model are precisely those secondary structures that are saturated, meaning that no base pair can be added without introducing either a pseudoknot or base triple. In previous work, we investigated asymptotic combinatorics of both random saturated structures and of quasi-random saturated structures, where the latter are constructed by a natural stochastic process.
We prove that for quasi-random saturated structures with the uniform distribution, the asymptotic expected number of external loops is O(log n) and the asymptotic expected maximum stem length is O(log n), while under the Zipf distribution, the asymptotic expected number of external loops is O(log² n) and the asymptotic expected maximum stem length is O(log n / log log n).
Quasi-random saturated structures are generated by a stochastic greedy method, which is simple to implement. Structural features of random saturated structures appear to resemble those of quasi-random saturated structures, and the latter appear to constitute a class for which both the generation of sampled structures as well as a combinatorial investigation of structural features may be simpler to undertake.
Keywords: RNA secondary structure; Kinetic trap; Combinatorial analysis; Zipf distribution
RNA is an important biomolecule, now known to play both an information carrying role, as in retroviruses, such as HIV, whose genome consists of RNA, as well as a catalytic role, as in the peptidyl transferase catalysis by RNA, which concatenates an amino acid to a growing peptide chain in the formation of a protein on the ribosome. It has recently emerged that RNA plays a wide range of previously unsuspected roles in many biological processes, including retranslation of the genetic code (selenocysteine insertion, ribosomal frameshift), transcriptional and translational gene regulation [4, 5], temperature sensitive conformational switches [6, 7], chemical modification of specific nucleotides in the ribosome, regulation of alternative splicing, etc.
The diverse and biologically important functions performed by RNA molecules depend for the most part on RNA tertiary structure, which is known to be constrained by secondary structure, the latter acting as a scaffold for tertiary contact formation . For this reason, much work has focused on RNA secondary structure prediction [11, 12, 13, 14] and on the kinetics of RNA folding [15, 16, 17]. In , Stein and Waterman pioneered work on asymptotic combinatorics of RNA secondary structures, where they developed recurrence relations to count the number of secondary structures. These recurrence relations were later modified by Nussinov and Jacobson  and especially by Zuker  to compute the minimum free energy secondary structure.
Formally, a secondary structure for a given RNA nucleotide sequence a_1, …, a_n is a set S of base pairs (i, j), such that (i) if (i, j) ∈ S then a_i, a_j form either a Watson-Crick (AU, UA, CG, GC) or wobble (GU) base pair, (ii) if (i, j) ∈ S then j − i > θ = 3 (a steric constraint requiring that there be at least θ = 3 unpaired bases between any two paired bases), (iii) if (i, j) ∈ S then for all j′ ≠ j and i′ ≠ i, (i′, j) ∉ S and (i, j′) ∉ S (nonexistence of base triples), (iv) if (i, j) ∈ S and (k, ℓ) ∈ S, then it is not the case that i < k < j < ℓ (nonexistence of pseudoknots). For the purposes of this paper, following Stein and Waterman, we consider the homopolymer model of RNA, in which condition (i) is dropped, thus entailing that any base can pair with any other base, and we modify condition (ii) so that θ = 1. With inessential additional complications in the combinatorics, we could handle the situation where θ is any fixed positive constant.
For a given RNA sequence, a saturated secondary structure is one such that no base pair can be added without introducing either a pseudoknot or base triple; in other words, saturated structures have a maximal number of base pairs, while the Nussinov minimum energy structure has a maximum number of base pairs. Since the kinetics of RNA structure formation depend on the secondary structure energy landscape, and more particularly on the distribution of kinetic traps (saturated structures), in previous work we have designed an algorithm to compute the number of saturated structures, determined the asymptotic number of saturated secondary structures, and determined the expected number of base pairs in saturated and quasi-random saturated structures.
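To make the definition concrete, the following Python sketch tests saturation by brute force in the homopolymer model with threshold θ = 1. The function name `is_saturated` and the representation of a structure as a list of 1-based pairs (i, j) are assumptions made for this illustration; the check is quadratic in the number of unpaired positions and is not the counting algorithm from the previous work mentioned above.

```python
def is_saturated(n, pairs, theta=1):
    """Return True if no base pair can be added to `pairs` on positions 1..n.

    Homopolymer model: any two positions i < j may pair provided that
    j - i > theta, both are unpaired (no base triple), and the new pair
    crosses no existing pair (no pseudoknot).
    """
    paired = {p for bp in pairs for p in bp}
    unpaired = [p for p in range(1, n + 1) if p not in paired]
    for a, i in enumerate(unpaired):
        for j in unpaired[a + 1:]:
            if j - i <= theta:
                continue  # violates the threshold constraint
            # (i, j) crosses (k, l) when exactly one endpoint lies inside
            crosses = any(k < i < l < j or i < k < j < l for k, l in pairs)
            if not crosses:
                return False  # (i, j) could still be added
    return True
```

For example, with n = 4 the structure containing only the pair (1, 4) is saturated, since positions 2 and 3 are too close to pair, whereas the empty structure is not, since (1, 3) could be added.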
Secondary structures are conveniently displayed in Vienna dot bracket notation, consisting of a balanced parenthesis expression with dots, where an unpaired nucleotide at position i is depicted by a dot at that position, while a base pair (i, j) is depicted by the presence of matching left and right parentheses located respectively at positions i and j. The minimum free energy secondary structure of the selenocysteine insertion (SECIS) sequence fruA, given by
is a saturated structure. In contrast, the following structure for the Gag/pro ribosomal frameshift site of mouse mammary tumor virus  is not only not saturated, but includes a pseudoknot, as shown by the square bracket notation necessary to show the crossing base pairs.
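The dot bracket notation described above can be decoded mechanically. The sketch below is an illustration only: the function names `parse_dot_bracket` and `has_pseudoknot` are hypothetical, and the handling of square brackets as a second, independent bracket family is an assumption about how crossing pairs are written.

```python
def parse_dot_bracket(s):
    """Convert a dot bracket string into a sorted list of 1-based pairs (i, j)."""
    openers = {"(": ")", "[": "]"}
    closers = {close: open_ for open_, close in openers.items()}
    stacks = {open_: [] for open_ in openers}  # one stack per bracket family
    pairs = []
    for pos, ch in enumerate(s, start=1):
        if ch in openers:
            stacks[ch].append(pos)
        elif ch in closers:
            stack = stacks[closers[ch]]
            if not stack:
                raise ValueError(f"unmatched {ch!r} at position {pos}")
            pairs.append((stack.pop(), pos))
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r} at position {pos}")
    if any(stacks.values()):
        raise ValueError("unmatched opening bracket")
    return sorted(pairs)


def has_pseudoknot(pairs):
    """Return True if two pairs (i, j), (k, l) satisfy i < k < j < l."""
    return any(i < k < j < l for i, j in pairs for k, l in pairs)
```

For instance, `parse_dot_bracket("((..).)")` yields the nested pairs (1, 7) and (2, 5), while `"([.)]"` yields the crossing pairs (1, 4) and (2, 5), which `has_pseudoknot` reports as a pseudoknot.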
Having defined saturated structure, we now define a stochastic greedy process to generate random saturated structures, technically denoted quasi-random saturated structures. This notion was defined in , where we showed that the expected number of base pairs in quasi-random saturated structures is 0.340633 · n, just slightly more than the expected number 0.337361 · n of base pairs in all saturated structures.
Results and discussion
With these definitions, we are now in a position to state some results concerning structural features of (quasi) random saturated structures. Under the uniform distribution, we show that the asymptotic expected number of external loops is O(log n), and the expected maximum stem length is O(log n). In contrast, under the Zipf distribution, the asymptotic expected number of external loops is O(log² n), and the expected maximum stem length is O(log n / log log n) (see endnote a).
In the literature on RNA combinatorics ( and subsequent papers), combinatorial results have been proved for the homopolymer as well as for the Bernoulli model, in which one assumes a stickiness parameter p = 2(p_A p_U + p_G p_U + p_G p_C), the probability that any two positions can base-pair. To the best of our knowledge, the current paper is one of the first combinatorial analyses of RNA secondary structures that involves the Zipf distribution for base pairs.
Saturated secondary structures form natural kinetic traps in the energy landscape with respect to the Nussinov energy model , in that it is energetically unfavorable to move from a saturated structure to any neighboring structure that differs by one base pair. However, there is currently no program to sample saturated secondary structures with respect to the Nussinov energy (given either a homopolymer or an RNA sequence), although the programs we developed in [21, 22] could be extended to do so for both homopolymers and RNA sequences. (Note that the program RNAsat, described in , can sample saturated structures in the Turner energy landscape, and the program RNAlocopt, described in , can sample locally optimal structures in the Turner energy landscape). In contrast, it is extremely simple to implement a program to sample quasi-random saturated structures, thus permitting one to easily obtain an idea of various structural features in the ensemble of quasi-random structures. We expect many structural features to be approximately shared between the random saturated structures and quasi-random saturated structures – for instance, as earlier mentioned, the expected number of base pairs in quasi-random saturated structures is 0.340633·n, while the expected number of base pairs in saturated structures is 0.337361·n, almost the same value .
Generally, it requires substantial effort involving the application of deep results from complex analysis, such as the Flajolet-Odlyzko theorem or the Drmota-Lalley-Woods theorem [28, 29, 30] (see also the text by Flajolet and Sedgewick) to prove asymptotic results, such as the fact that the asymptotic number of saturated structures is 1.07427 · n^(−3/2) · 2.35467^n and the asymptotic expected number of base pairs is 0.337361 · n, and the asymptotic expected number of hairpins is 0.323954 · 1.69562 n. In contrast, the argument given in this paper is elementary, not requiring complex analysis. Taken together, we believe that the stochastic greedy method, described in Figure 1, performs reasonably well in sampling saturated structures that appear to be representative of the ensemble of all saturated structures, and supports a combinatorial analysis that may be simpler than that required for all saturated structures.
Structural properties of quasi-random saturated secondary structures
In general G(S) is a forest; i.e., a set of trees. In the sequel we determine the size of several structural parameters of random saturated secondary structures, in particular, expected stem length and expected number of external loops. These parameters are studied both for the uniform and Zipf distributions. Before proceeding any further, we first define the probability distributions to be considered.
Zipf’s law is the observation first made by the late Harvard linguist, George Kingsley Zipf, that the frequency p_i of English words, when graphed against their rank i (in the list of English words sorted in decreasing order with respect to frequency), obeys the power law p_i ≈ i^(−α). More generally, Zipf’s law is the statement of a power law, when plotting frequency against rank (Zipf’s first law) or when plotting frequency against reverse rank (Zipf’s second law). In bioinformatics, Zipf’s law has been observed in the frequency/rank plot of differentially expressed genes in microarray data, as well as in the frequency/rank plot for protein structures, where there are a few very frequent structures, and very many rare structures. In the remainder of the paper, we consider probability distributions related to Zipf’s law.
for all n ≥ 2.
Observe that when α = 0 the α-Zipf distribution is the same as the uniform distribution, while if α = 1, we have the (classical) Zipf distribution . Moreover, observe that as α increases, “shorter” base pairs are being selected with higher probability by the stochastic process described in equation (1).
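The dependence on α can be illustrated numerically. The sketch below assumes that the partner u of base 1 is chosen among 2, …, n with probability proportional to (u − 1)^(−α); this form is an assumption chosen to agree with the observations above (uniform for α = 0, classical Zipf for α = 1, shorter pairs favoured as α grows), and the function name `zipf_probabilities` is hypothetical.

```python
def zipf_probabilities(n, alpha):
    """Return {u: probability} for u = 2..n under the assumed alpha-Zipf law.

    Assumption: P(u) is proportional to (u - 1) ** (-alpha), normalised so
    that the probabilities over u = 2..n sum to 1.
    """
    if n < 2:
        raise ValueError("need n >= 2")
    weights = {u: (u - 1) ** -float(alpha) for u in range(2, n + 1)}
    total = sum(weights.values())  # generalised harmonic number for this alpha
    return {u: w / total for u, w in weights.items()}
```

With α = 0 every partner u receives probability 1/(n − 1); with α = 1 and n = 3 the two candidates u = 2 and u = 3 receive probabilities 2/3 and 1/3 respectively.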
The stochastic process of generating random saturated secondary structures, according to equation (1), is of the “divide-and-conquer” type, very common in computer science, where well-known algorithms such as QUICKSORT choose a division point according to the uniform distribution. Stochastic algorithms of this kind have been intensively studied for the uniform distribution. Known results suggest that the probability distribution for the number of base pairs in random saturated structures, generated by the earlier described stochastic process (uniform choice of base pairs) is asymptotically Gaussian (see  and ). We also note that structural features of trees have been well studied including the expected depth and the exact distribution of the depth; see, for instance, [36, 38, 39]. In the sequel, we consider a random binary search tree with n nodes obtained by inserting n i.i.d. random variables X_1, …, X_n. Careful analysis of  and  implies our results in the section on the uniform distribution. However, we will use a different and simpler technique that enables the analysis not only for the uniform distribution, treated in the following section, but also for the Zipf distribution, treated in the section after that.
An important observation concerns the threshold θ considered above. All the results proved in this section are “upper bounds” and therefore it is easily seen that they are valid for any threshold θ ≥ 0. Therefore to simplify proofs in the sequel we consider the case of threshold θ = 0.
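The stochastic process with threshold θ = 0 is simple enough to simulate directly. The following Python sketch is a minimal illustration under stated assumptions, not the authors' program: the leftmost base of each segment is paired with a partner drawn with weight proportional to (offset)^(−α) (an assumed form, uniform when α = 0), and the two resulting segments are processed in the same way. The names `sample_structure` and `external_pairs_and_depth` are hypothetical. The second function reports the number of outermost base pairs, used here as a proxy for the number of external loops, and the maximum number of nested base pairs, used here as a proxy for maximum stem length; both proxies are assumptions about the quantities analysed in this section.

```python
import random


def sample_structure(n, alpha=0.0, rng=None):
    """Sample base pairs on positions 1..n with threshold theta = 0."""
    rng = rng or random.Random()
    pairs = []
    stack = [(1, n)]  # segments still to be processed (explicit stack)
    while stack:
        lo, hi = stack.pop()
        m = hi - lo + 1
        if m < 2:
            continue  # no pair fits in a segment with fewer than 2 bases
        offsets = range(1, m)  # partner is lo + offset
        weights = [d ** -float(alpha) for d in offsets]  # assumed Zipf weights
        u = lo + rng.choices(offsets, weights=weights)[0]
        pairs.append((lo, u))
        stack.append((lo + 1, u - 1))  # bases enclosed by the chord (lo, u)
        stack.append((u + 1, hi))      # remaining bases to the right of u
    return pairs


def external_pairs_and_depth(pairs):
    """Return (number of outermost pairs, maximum nesting depth)."""
    opening = {i for i, _ in pairs}
    depth = max_depth = outer = 0
    for pos in sorted(p for bp in pairs for p in bp):
        if pos in opening:
            if depth == 0:
                outer += 1
            depth += 1
            max_depth = max(max_depth, depth)
        else:
            depth -= 1
    return outer, max_depth
```

Averaging the two returned quantities over many samples, for increasing n and for α = 0 and α = 1, gives an empirical impression of the growth rates discussed in the remainder of this section.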
The main theorem of this section concerns stem length and number of external loops of random saturated structures S, generated by a natural stochastic process associated with the tree graph G(S). Throughout the remainder of the paper, we state results in terms of random saturated structures, although we intend to mean only those structures generated by the stochastic process associated with the graph G(S); we will distinguish between the uniform and α-Zipf variant of the stochastic process. Without this convention, statements of lemmas and theorems would be too cumbersome.
Theorem 1. With high probability, the number of external loops and the maximum stem length of random saturated structures generated by the uniform distribution variant is O(log n).
Proof. Before we give the proof of the main theorem it will be necessary to give the proof of two lemmas. In the first lemma we consider the expected number of external loops.
Lemma 1. With high probability, the number of external loops is O(log n).
where the last inequality is valid for k + 3 ≤ n.
In particular, . This completes the proof of Lemma 1. □
Next we prove the following lemma.
Lemma 2. With high probability, the maximum stem length is O(log n).
(notice the dependence of the random variable T′ on n).
In particular, since , we conclude that . Finally, we can derive It follows that
As a consequence we conclude that This completes the proof of Lemma 2. □
Finally, we can complete the proof of the main result of Theorem 1 since this is now immediate from Lemmas 1 and 2.
The quantity H(n − 1) = Σ_{k=1}^{n−1} 1/k is defined to be the (n − 1)st harmonic number. As before, the chord joining 1 and u partitions the ring into two parts. One part has k bases between 1 and u, where k ≤ n − 2, and the other part has the remaining n − k − 2 bases (see Figure 1).
Define Z_n to be the expected number of base pairs of a random saturated secondary structure with n bases, where n ≥ 2. A base pair (1, u) is added as follows. Select u ≥ 2 at random among 2, 3, …, n with probability 1/((u − 1) · H(n − 1)).
for all n ≥ 2. The main theorem of this section concerns the overall structure of random secondary structures.
Theorem 2. With high probability, random saturated secondary structures generated by the Zipf distribution have O(log² n) external loops and stem length O(log n / log log n).
Proof. Before we give the proof, it will be necessary to give the proof of two lemmas. In the first lemma we look at the number of external loops.
Lemma 3. With high probability, the number of external loops is O(log² n).
In particular, since H(n − 1) ∼ ln n, we conclude that the number of external loops is O(log² n) with high probability. This completes the proof of Lemma 3. □
The next result concerns the maximum stem length. We can prove the following result.
Lemma 4. With high probability, the maximum stem length is O(log n / log log n).
(notice the dependence of the random variable T′ on n).
The proof shows that the leftmost sequence of base pairs given by the recursive construction of the random secondary structure has length at most O(log n / log log n) with high probability. We would like to prove the same for any sequence of nested base pairs. It is easily seen that a proof similar to the one presented above works. This completes the proof of Lemma 4. □
If we now combine Lemmas 3 and 4 we derive the proof of Theorem 2.
Endnote a: Throughout this paper, all logarithms are in base 2.
Many thanks to the anonymous referees for useful comments that significantly improved the presentation.
Funding for the research of P. Clote was provided by National Science Foundation grant DMS-1016618. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. Additional support was provided to P. Clote by the Deutscher Akademischer Austauschdienst (DAAD) during a visit to the Computational Molecular Biology Department of Martin Vingron, at the Max Planck Institute for Molecular Genetics. Funding for the research of E. Kranakis was provided by the Natural Sciences and Engineering Research Council of Canada (NSERC) and Mathematics of Information Technology and Complex Systems (MITACS).
- 31. Sedgewick R, Flajolet P: Analytic Combinatorics. 2009, Cambridge: Cambridge University Press [ISBN-13: 9780521898065]
- 35. Zipf G: Human Behavior and the Principle of Least Effort. 1949, Cambridge: Addison Wesley
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.