Abstract
Background
Pairwise stochastic contextfree grammars (Pair SCFGs) are powerful tools for evolutionary analysis of RNA, including simultaneous RNA sequence alignment and secondary structure prediction, but the associated algorithms are intensive in both CPU and memory usage. The same problem is faced by other RNA alignmentandfolding algorithms based on Sankoff's 1985 algorithm. It is therefore desirable to constrain such algorithms, by preprocessing the sequences and using this first pass to limit the range of structures and/or alignments that can be considered.
Results
We demonstrate how flexible classes of constraint can be imposed, greatly reducing the computational costs while maintaining a high quality of structural homology prediction. Any scoreattributed contextfree grammar (e.g. energybased scoring schemes, or conditionally normalized Pair SCFGs) is amenable to this treatment. It is now possible to combine independent structural and alignment constraints of unprecedented general flexibility in Pair SCFG alignment algorithms. We outline several applications to the bioinformatics of RNA sequence and structure, including WatermanEggert Nbest alignments and progressive multiple alignment. We evaluate the performance of the algorithm on test examples from the RFAM database.
Conclusion
A program, Stemloc, that implements these algorithms for efficient RNA sequence alignment and structure prediction is available under the GNU General Public License.
Similar content being viewed by others
Background
As our acquaintance with RNA's diverse functional repertoire develops [1–5], so does demand for faster and more accurate tools for RNA sequence analysis. In particular, comparative genomics approaches hold great promise for RNA, due to the wellbehaved basepairing correlations in an RNA gene family with conserved secondary structure (at least, wellbehaved compared to protein structures). Whereas the structural signal encoded in a single RNA gene is rather weak and may be barely (if at all) distinguishable from the secondary structure of a random sequence [6], the covariation signal increases with every additional sequence considered.
Many programs for comparative analysis of RNA require the sequences to be prealigned [7–9]. This can be a source of error, since misaligned bases can add noise that swamps the covariation signal. The most recent of these methods allows for some uncertainty in the alignment [7]. More generally, one can view the alignment and structure prediction as a combined problem, to be solved simultaneously. This is the approach taken in this paper, and by earlier programs such as FOLDALIGN [10], DYNALIGN [11], CARNAC [12], QRNA [9] and our dart library, introduced in a previous paper [13] and extended here. In this framework, fixing of the alignment can be viewed as a partial constraint on the simultaneous alignment/folding problem.
A powerful, general dynamic programming algorithm for simultaneously aligning and predicting the structure of multiple RNA sequences was developed by David Sankoff [14]. The energybased folding of Zuker et al [15] and recent approaches based on Stochastic ContextFree Grammars (SCFGs) [9, 13, 16–20] are both closely related to Sankoff's algorithm. The method takes time O(L^{3N}) and memory O(L^{2N}) for N sequences of length L. This is prohibitively expensive at the time of writing, except for fairly short sequences, which has motivated the development of various constrained versions of these algorithms [9–11, 13, 21].
The purpose of this paper is to report our progress on general pairwise constrained versions of Sankoff's algorithm (or, more precisely, constrained versions of some related dynamic programming algorithms for SCFGs). The overall aim is the simultaneous alignment and structure prediction of two RNA sequences, X and Y, subject to an SCFGbased scoring scheme and usersupplied constraints. Additionally, we wish to be able to parameterize the model automatically from training data. Without constraints, the above tasks are addressed by the resourceintensive CYK and InsideOutside algorithms; here, we present constrained versions of these algorithms that work in reduced space and time (the exact complexity depends nontrivially on the constraints).
Our system of constraints is quite general. Previous constrained versions of Sankofflike algorithms, such as the programs DYNALIGN [11] and FOLDALIGN [10], have been restricted to "banding" the algorithm e.g. by constraining the maximum insertion/deletion distance between the two sequences or the maximum separation between paired bases. Alternately, constraints on the accessible structures [13] or alignments [9] have been described.
The algorithms described here can reproduce nearly all such banding constraints and, further, can take advantage of more flexible sequencetailored constraints. Specifically, the fold envelopes determine the subsequences of X and Y that can be considered by the algorithm, while the alignment envelope determines the permissible cutpoints in the pairwise alignment of X and Y. The fold envelopes can be used to prune the search over secondary structures (e.g. by including/excluding specific hydrogenbonded basepairings), while the alignment envelopes can be used to prune the search over alignments (e.g. by including/excluding specific residuelevel homologies). The fold envelopes can be precalculated for each sequence individually (e.g. by an energybased folding or a singlesequence SCFG), and the alignment envelope by comparing the two sequences without regard for secondary structure (e.g. using a pairwise Hidden Markov Model); both types of precomparison are much more resourcefriendly than the unconstrained Sankofflike algorithms. The design of the constrained algorithms is discussed using concepts from objectoriented programming: the dynamic programming matrix can be viewed as a sparsely populated container, whereas the main loop that fills the matrix is a complex iterator [22]. The algorithms have been implemented in a freely available program for RNA sequence alignment, stemloc, which also includes algorithms to determine appropriate constraints in an automatic fashion. Results demonstrating the program's efficient resource usage are presented.
The stemloc program also implements various familiar extensions to pairwise alignment, including local alignment [23], WatermanEggert Nbest suboptimal alignments [24] and progressive multiple alignment [25]. Although the envelope framework, rather than these extensions, is the main focus of this paper, implementation of the extensions is straightforward within this framework, and is briefly described.
Results
To investigate the comparative resource usage of the various different kinds of constraint that can be applied using fold and alignment envelopes, stemloc was tested on 22 pairwise alignments taken from version 6.1 of RFAM [37], spanning 7 different families of functional noncoding RNA. Each chosen test family had a consensus secondary structure published independently in the literature, and no two sequences in the test set had higher than 60% identity.
The EMBL accession numbers and coordinates of all sequences are listed in Table 5. The table shows the performance of stemloc using the 1000best fold envelope and the 100best alignment envelope. The various RFAM families are S15, the ribosomal S15 leader sequence; the U3 and U5 spliceosomal small nucleolar RNAs; IRE, the iron response element from UTRs of genes involved in vertebrate iron metabolism; glmS, the glucosamine6phosphate activated mRNAcleaving ribozyme; Purine, the prokaryotic purinebinding riboswitch; and 6S, the E.coli polymeraseassociated transcriptional repressor. The following three test regimes were used, each representing a different combination of fold and alignment envelopes:
Nbest alignments, all folds
The alignment envelope containing the N best primary sequence alignments, with the unconstrained fold envelopes (stemloc options: 'nalign N nfold 1'). This is the red curve in Figures 8, 9, 10, 11, 12, 13
Nbest folds, all alignments
The unconstrained alignment envelope, with the fold envelopes containing the N best singlesequence structure predictions (stemloc options: 'nalign 1 nfold N'). This is the green curve in Figures 8, 9, 10, 11, 12, 13
Nbest folds, 100best alignments
The alignment envelope containing the 100 best primary sequence alignments, with the fold envelopes containing the N best singlesequence structure predictions (stemloc options: 'nalign 100 nfold N'). This is the blue curve in Figures 8, 9, 10, 11, 12, 13
In the first two tests, N was varied from 1 to 100; in the latter test, N was varied from 1 to 10000. The lower ceiling for N in the first two tests was imposed by resource limitations. Note that the endpoint of the red curve ("Nbest alignments, all folds"), which occurs at N = 100, coincides with the asymptotic limit of the blue curve ("Nbest folds, 100best alignments") at high N.
A range of different values for the parameter N was used to test the above three strategies. As N was increased over the range, the size of the corresponding fold or alignment envelopes was found to be strongly correlated (Figures 6, 7). However, the actual size of the fold/alignment envelopes in each particular test case varies widely (see large error bars in Figures 6, 7), perhaps due to variable factors such as the sequence lengths, compositions and/or identities. Since it is easier to control the envelope construction parameter N than to control the envelope sizes directly, the following section will report performance indicators as a direct function of N, rather than as a function of the stronglycorrelated but widelyvarying envelope sizes. We report performance indicators for stemloc as follows. Let A and B be alignments of a given pair of sequences, each represented as a set of aligned residuepairs {(i, k)}. Suppose that A is the alignment according to RFAM, and B is the alignment predicted by stemloc. Then define the alignment sensitivity to be A ∩ B/A and the alignment specificity to be A ∩ B/B. Further, let S and T be possible secondary structures for a given sequence, each represented as a set of basepairs {(i, j)}. Suppose that S is the published structure, and T is the structure predicted by stemloc. Then the basepair sensitivity is S ∩ T/S and the basepair specificity is S ∩ T/T.
These performance indicators are averaged over all 22 pairwise alignments and plotted for the three test regimes in Figure 8 (alignment sensitivity), Figure 9 (alignment specificity), Figure 10 (basepair sensitivity) and Figure 11 (basepair specificity). As can be seen, the N best alignment regime empirically seems to achieve an asymptotic maximum performance around N ≃ 100 (possibly even around N ≃ 10), while the N best fold envelope underperforms compared to the unconstrained fold envelope up to around N ≃ 1000. The tests were performed on a 2.3 GHz Apple PowerPC G5. The resource usages of the test regimes are plotted in Figure 12 (usermode running time) and Figure 13 (memory usage). The resource usage of the constrained algorithms is substantially reduced when the envelopes are smaller (i.e. at lower N ). This is especially notable when contrasting the resource usage of the "N best folds, all alignments" test with the more constrained "N best folds, 100best alignments" test.
Three main conclusions can be drawn from these data. First, allowing the search to consider more than a single alignment greatly improves structure prediction (the red curve). Second, constraining the alignment search while exhaustively scanning fold space (the red curve) outperforms constraining the fold search while exhaustively scanning alignment space (the green curve). Third, the hybrid strategy (the blue curve), which partially constrains both searches, approaches the alignmentconstrained, foldunconstrained strategy (the red curve) in performance, with a significant saving in CPU and memory resources. Memory is the limiting factor in pairwise RNA alignment, and the primary motivation for constraints. For example, without constraints, alignment of two 16S ribosomal subunits using the stemloc grammar would take approximately 500 terabytes. (Using fold envelope constraints with structures fully specified, it can be done in under 5 gigabytes.)
Based on the results of these tests, the default envelope options for stemloc were chosen to be the 100best alignment envelope and the 1000best fold envelope. The performance of stemloc with these envelopes on each of the pairwise test alignments is given in Table 5.
Discussion
The algorithms presented here include constrained versions of PairSCFG dynamic programming algorithms that run in significantly reduced space and time. The primary advance over previous work is the simultaenous imposition of fold and alignment constraints, including alignment constraints that are more general than others previously described. Thes constraints lead to significant reductions in requirements for processor and memory usage, which will increase the length of RNA sequences that can be analyzed on mainstream computer hardware.
These algorithms have been used to implement stemloc, a fast, efficient software tool for multiple RNA sequence alignment implementing numerous extra features such as local alignment, WatermanEggert N best suboptimal alignment and progressive multiple alignment. The source code for the program is freely available from http://www.biowiki.org.
The results given here should be regarded as preliminary. For example, we have only tested the pairwise alignment functionality; full evaluation/optimisation of the multiple alignment algorithm remains. Rather than using the CYK algorithm, one could use the InsideOutside algorithm with a decisiontheoretic dynamic programming step to maximize expected performance [38, 39]. As noted in the Parameterization section, it might also be possible to improve on the training procedure. We are also considering ways of elaborating the grammar to include basepair stacking terms. These and other improvements we hope to address in future work.
Conclusion
RNA sequence analysis has generated considerable interest over recent years, as many new roles for RNA in the cell have come to light. RNA genes and regulatory elements are components of many molecular systems and comparative genomics is a powerful way to probe this function, perhaps even more so for RNA than for protein (due to the "wellbehaved" statistical correlations found in RNAs with conserved secondary structure). Furthermore, statistical modeling of RNA evolution continues to play a fundamental role in the phylogenetic classification of new forms of life.
These biological motives have driven a demand for RNA sequence analysis tools that are faster, slimmer and more scaleable. It is hoped that the algorithms and approaches described here, together with development and analysis of RNA evolutionary models [36], may expand the applications of RNA informatics.
Methods
We begin our description of the envelope method with an explanatory note regarding our decision to present these constraints in terms of SCFGs, rather than other scoring schemes such as those based solely on energies [15] or on energy/informationtheoretic hybrids [11].
The reason for our choice of SCFGs is simple: stochastic grammars are, in our opinion, the most theoretically welldeveloped of the scoring schemes used for RNA. They come with welldocumented algorithms for sequence alignment, structure prediction, parameterization by supervised learning from various kinds of training data, and calculation of posterior probabilities [20]. Discussion of these algorithms is facilitated by a welldeveloped and widelyunderstood probabilistic vocabulary. Stochastic grammars are actively researched outside bioinformatics, principally in natural language processing [26]. Crucially, SCFGs are sufficiently general to express virtually all of the features offered by other scoring schemes [19]. We also acknowledge the appeal of free energybased scoring schemes, which have the advantage that the parameters can be determined experimentally. Energybased scores can also be used to find posterior probabilities of basepairings using a partition function [27]. However, to extend energybased methods to two or more sequences, one must incorporate substitution scores. These are informationtheoretic in nature and so are measured in bits, rather than kilocaloriespermole [28]. Reconciling these two units of score (in a principled way) is an open problem. However, in an SCFG framework, all scores are informationtheoretic and so there is no conflict of units.
Despite these arguments, many people continue to find calories preferable to bits as a unit of score. For such readers, we note that the system of constraints described here is entirely applicable to the general scoreattributed grammar. This includes energybased and heuristic scoring schemes as well as (for example) grammars whose rule "probabilities" actually represent logodds ratios, or which are conditionally normalized with respect to one sequence.
Notation
To implement SCFG dynamic programming algorithms efficiently for RNA, it is convenient to define a simplified (but universal) template for grammars, similar in principle to "Chomsky normal form" [29, 30]. Our "RNA normal form" preserves the RNAoptimized efficiency of the Pair SCFG form presented in an earlier paper [13] (based on a singlesequence form due to Durbin et al [20]) by introducing different types of production rule to minimize bifurcations and collect emissions. The form defined here is slightly different from the above forms, in that it classifies only production rules, and not nonterminals, into different types. Let be the "ungapped RNA alphabet", i.e. the set of four possible nucleotides in RNA. Let be the "gapped RNA alphabet", i.e. the ungapped RNA alphabet Ω plus the gap symbol . Finally, let Ψ = Ω' × Ω' be the "gappedpair RNA alphabet", i.e. a Cartesian product of two gapped RNA alphabets. We write Ψsymbols by vertically stacking pairs of Ω'symbols, like this or this .
A pairwise stochastic contextfree grammar
in RNA normal form consists of a nonterminal symbol alphabet Φ, a terminal symbol alphabet that is the gappedpair RNA alphabet Ψ, and (for each nonterminal L) a probability distribution over a set of transformation rules (or production rules), (L → R), where R = R_{1} ... R_{ K }is a sequence of nonterminal or terminal symbols, taking one of several stereotypical forms (see below). The nonterminal L is referred to as the lefthand side (LHS) of the production rule and the symbol sequence R as the right hand side (RHS).
Let S, U, V, W ∈ Φ denote nonterminal symbols. The allowable forms for production rules include terminations, transitions, bifurcations and emissions. These are defined as follows
Terminations: rules of the form L → ε.
Transitions: rules of the form L → V. The directed graph formed by transition rules on nonterminals must be acyclic, and the list of nonterminals Φ must be topologically reversesorted with respect to this graph; i.e. if (U → V) > 0, then V appears before U in Φ.
Bifurcations: rules of the form L → VW. There must be no transitiontermination path from V or W to ε, i.e. neither V or W can have completely empty inside sequencepairs (see next section for a formal definition of the "inside sequencepair").
Emissions: rules of the form where A', B', C', D' ∈ Ω' are gapped RNA symbols, at least one of which is a nongap symbol. For convenience, we also define A, B, C, D ∈ Ω* to be the corresponding ungapped RNA sequences, as follows: A is the empty string if and only if A' is the gap character; otherwise, A = A'. Similar definitions apply for B, C and D.
The particular RNA normal form described in this section is chosen for ease of presentation. The implementation in the dart library uses the slightly more restrictive form for Pair SCFGs defined in an earlier paper [13].
For presentational purposes, we will generally omit allgap columns from the pairwise alignment and the grammar. For example, an emission rule having the form
would be written as instead. Allgap columns are not very interesting to a sequence analyst, and only arise in our formalism because all emission rules have the same form.
Table 1 is an example of an RNA normal form grammar with two nonterminals, Stem and Loop. The grammar generates simple alignments of stems and loops, using two nonterminals (Stem and Loop); the starting nonterminal is Stem. The rule probabilities are functions of five scalar probability parameters (stemExtend, stemGap, bifurcate, loopExtend and loopGap) and four arrays of probability parameters (baseIndel[4], baseSubstitution[16], basepairIndel[16] and basepairSubstitution [256]). Here we introduce the notation X [N] for an array of N probability parameters normalized so that . It is also convenient to introduce some notation for ungapped sequences at this stage. Let X, Y ∈ Ω* denote ungapped RNA sequences, including (possibly) the empty string ε. Let X_{ i }be the i'th symbol of X (counting from 1, so e.g. X_{3} is the third symbol), let X_{ ij }denote the subsequence from X_{i+1}to X_{ j }inclusive, or the empty string if i = j (so e.g. X_{0,3} contains the first three symbols of X, while X_{3,3} = ε) and let X denote the length of X.
The parse tree and the sequence likelihood
The grammar
is a probabilistic model for deriving sequences X, Y from a single nonterminal. This derivation proceeds as follows: start with an initial sequence containing one starting nonterminal, S, then repeatedly apply probabilisticallysampled transformations to the nonterminals in the sequence. Eventually the sequence will contain only terminals from Ψ. This process generates a parse tree, rooted at node S, in which internal nodes are labeled with nonterminals and leaf nodes with terminals, with children of each node ordered lefttoright (Figure 1). Sequence X can be obtained by reading off ungapped RNA symbols from the top row of the output, and sequence Y by reading off the bottom row. Note that the subtree rooted at any internal Wlabeled node describes a subprocess that generates some pair of subsequences (X_{ ij }, Y_{ kl }) starting from nonterminal W. We will refer to this subsequencepair (X_{ ij }, Y_{ kl }) as the inside sequencepair of W.
The parse tree likelihood is the product of all the rule probabilities corresponding to the internal nodes. Summing the likelihoods of all parse trees rooted in state S and generating sequences X, Y, one obtains P(X, Y  S,), the sequence likelihood. This sum can be performed efficiently by the Inside algorithm, as will be described below.
Dynamic programming algorithms for Pair SCFGs
The following section describes the constrained and unconstrained dynamic programming (DP) algorithms used for Pair SCFGs.
The Inside algorithm
The Inside algorithm [26] computes P(X, Y  S,) by recursive decomposition via intermediate probabilities (i, j, k, l) ≡ P(X_{ ij }, Y_{ kl }U,). In RNA normal form, the timelimiting step in the Inside algorithm involves summing contributions to (i, j, k, l) from bifurcation rules of the form U → VW, such that the bifurcation splits the two sequences (X, Y) between bases (X_{ m }, Y_{ n }) and (X_{m+1}, Y_{n+1})
An asymptotically faster step involves summing contributions from matching emission rules of the form
(transition rules are represented as the special case A' = B' = C' = D' = )
(Recall that A is the ungapped version of A'. Thus A = 0 ⇔ A = ε ⇔ A' = and similarly for B, C and D.)
The intermediate probabilities of the Inside algorithm can then be expressed as
(i, j, k, l) = Q_{1} + Q_{2}
Termination of the recursion is provided by matching end rules, U → ε, if and only if the subsequences (X_{ ij }, Y_{ kl }) are empty
(i, i, k, k) = (U → ε)
The sequence likelihood is obtained as
P(X, Y  S,) ≡ (0,X,0,Y)
In pseudocode, the Inside algorithm is
Inputs: X, Y, S,
For i = X to 0 (descending)
• For j = i to X
• For k = Y to 0 (descending)
• For l = k to Y
• For each nonterminal U ∈ ε
• Set Q_{1} ← 0; calculate Q_{2}
• For m = i to j
• For n = k to l
• Calculate Q_{ B }(m, n) and add to Q_{1}
• Calculate
(i, j, k, l) and store
Return
(0, X, 0, Y)
The timelimiting step of the Inside algorithm (computing the Q_{ B }) involves six indices (i, j, k, l, m, n) and the time complexity of the full recursion is O(X^{3}Y^{3}). However, the stored intermediate probabilities involve only four indices (i, j, k, l) and so the memory complexity is O(X^{2}Y^{2}).
In RNA normal form, the emission rules (Q_{2}) account for homologous basepairings between residues (X_{ i }, X_{ j }) and (Y_{ k }, Y_{ l }), or unpaired residues at X_{ i }, X_{ j }, Y_{ k }or Y_{ l }. This may also imply that X_{ i }and Y_{ k }are aligned, or that X_{ j }and Y_{ l }are (Figure 2). The bifurcation rules (Q_{ B }(m, n)) account for conserved multiloop structures in the RNA, i.e. one homology between substructures X_{ im }and Y_{ kn }and another between substructures X_{ mj }and Y_{ nl }(Figure 3).
Imposing constraints
The high time and memory cost of the Inside and related algorithms motivate the development of slimmer, faster versions. To begin with, we impose constraints that narrow the search space. For example, we might want to preparse the sequences individually (using a singlesequence SCFG, or other O(X^{3}) RNAfolding method) and identify likely basepairs (X_{ i }, X_{ j }) or (Y_{ k }, Y_{ l }), and/or likely unpaired nucleotides X_{ i }or Y_{ k }. Even simpler, we could simply throw out basepairs (X_{ i }, X_{ j }) between distant residues (i.e. where j  i exceeds some cutoff) Alternatively, we might want to prealign the sequences (using a pairwise hidden Markov model, or other O(XY) alignment method) and identify likely alignment columns (X_{ i }, Y_{ k }) and/or likely indels X_{ i }or Y_{ k }. Again, more simply, we could simply exclude columns (X_{ i }, Y_{ k }) for which k  i is too large.
We can combine these various strategies into a generalized constraint on basepairs, alignmentcolumns or both. We stipulate that
(i, j, k, l) = P(X_{ ij }, Y_{ kl }U, ) = 0 unless the following conditions are satisfied
Here
and are sets of permissible coordinates for structurally discrete subsequences in X and Y. Foldrelated features (basepairs and unpaired residues) can be included or excluded by this set, and so we refer to it as a fold envelope [13]. The set is a set of possible cutpoints in the alignment of X and Y, and is referred to as an alignment envelope [31]. Both types of envelope are illustrated in Figure 2. The fold and alignment envelopes satisfy the following set relations
If equality holds in all three cases, then we recover the unconstrained Inside algorithm.
Note that the coordinates (i, j, k, l) for the cutpoints and subsequences lie between residues of X and Y.
There are (X + 1)(Y + 1) cutpoints in the maximal alignment envelope and (X + 1)(X + 2) subsequences in the maximal fold envelope .
As an alternative to the unconstrained Inside algorithm, we can partially initialize the envelopes to limit the maximum subsequence length and/or the maximum deviation of the alignment from the main diagonal (Figure 4). More flexibly, we can limit the recursion to a single alignment, a single structure, or a broadlyspecified set of alignments or structures (Figure 5). Applications such as alignment of two known structures [13, 32], alignment of an unstructured sequence to a known structure [33] or structure prediction from a known alignment [9] all reduce to simple application of the appropriate constraints.
Further possible constraints
The constraints given here allow the independent imposition of alignment or fold constraints. One can imagine further, even more general constraints. For example, one could exclude subsequencepairs (X_{ ij }, Y_{ kl }) of radically different lengths, i.e. for which (j  i)  (k  l) exceeds some cutoff. This constraint is employed by the FOLDALIGN program. It is not expressible as a combination of independent alignment and fold constraints, and has not been implemented for the present work, though it would be relatively straightforward to combine it with the other constraints described here [10].
Accelerating the iteration
Simply setting some intermediate probabilities to zero is not sufficient to accelerate the Inside algorithm. We also need to redesign the iteration to avoid visiting zeroprobability subsequencepairs (X_{ ij }, Y_{ kl }). This is achieved by preindexing the fold envelopes , and the alignment envelope so that we can quickly locate valid coordinates (i, j, k, l). The following is pseudocode for the algorithm with the redesigned iterator
Inputs: X, Y, S,
For i = X to 0 (descending)
• For each j satisfying (i, j) ∈ (ascending) (†)
• For each k satisfying (i, k) ∈ (descending) (†)
• For each l satisfying (k, l) ∈ (ascending) (†)
• If (j, l) ∈ then
• For each nonterminal U ∈ Φ
• Set Q_{1} ← 0; calculate Q_{2}
• For each m satisfying {(i, m), (m, j)} ⊂ (†)
• For each n satisfying {(k, n), (n, l)} ⊂ (†)
• If (m, n) ∈ then
• Calculate Q_{ B }(m, n) and add to Q_{1}
• Calculate
(i, j, k, l) and store
Return
(0, X, 0, Y)
(†) These ordered subsets of
, and can be precomputed for speed.
Alternative designs for the algorithm are possible, and indeed different circumstances may affect the choice of optimal design (e.g. depending on which envelopes are most constrained).
Slimming the container
Memory is the most prohibitively expensive resource demand of the Inside algorithm. In its simplest form, the algorithm stores the intermediate probabilities
(i, j, k, l) using a fivedimensional array indexed by U, i, j, k and l. To get the most benefit out of imposing constraints, it is necessary to replace this multidimensional array with an efficientlyindexed reducedspace container.
This design decision involves a close tradeoff between CPU and memory usage. Initially, we tested various combinations of generic containers with O(N)storage and O(N log N) accesstimes, such as balanced search trees [22]. These were found to be unbearably slow, and so we settled on a configuration of nested preindexed arrays. This configuration wastes some space, but has the important advantage of constant access time for given indices (U, i, j, k, l).
For fold envelope
, we precompute and sort the list = {(i, j)} ⊆ of subsequences starting at each position i. For each subsequence (i, j) ∈ , let be its rank in . These are also precomputed and stored. Similar precomputed sets and ranks are stored for .
Our DP matrix then uses an inner twodimensional array nested inside an outer twodimensional array.
The outer array has dimensions (X, Y) and is indexed by (i, k). The inner array has dimensions (, ) and is indexed by (, ). Cells of this inner array are further subindexed by nonterminal U using a standard fixedlength array, yielding (i, j, k, l).
This particular configuration is efficient when the alignment envelope is densely populated and the fold envelopes are sparsely populated. As with the redesigned iterator, there may be alternative designs that are resourceoptimal under various different circumstances, depending on the nature of the envelopes.
The CYK algorithm
The CockeYoungerKasami (CYK) dynamic programming algorithm [20] is related to the Inside algorithm, but replaces "p + q" with "max(p, q)" in all formulae (that is, instead of summing probabilities, the maximum probability is always taken). The entries of the DP matrix, (i, j, k, l), represent the maximum likelihood of any parse tree for (X_{ ij }, Y_{ kl }). A recursive/stackbased traceback algorithm can be used to recover this parse tree, beginning from subsequence (i, j, k, l) = (0, X, 0, Y) (for global alignment) or (i, j, k, l) = argmax (i, j, k, l) (for local alignment) [13].
The Outside and KYC algorithms
The Outside and KYC algorithms widen the applications of probabilistic inference with SCFGs. The Outside algorithm, together with the Inside, can be used to recover posterior probabilities of given basepairs/columns, which can be used as alignment reliability indicators or as update counts in Expectation Maximization parameter training [20, 26]. The KYC algorithm can be used with CYK to recover maximumlikelihood tracebacks from given coordinates. (We introduce the name "KYC" as a simple reversal of "CYK", reflecting the fact that KYC is to CYK as Outside is to Inside, i.e. the "reverse" version of the algorithm.)
These algorithms use dynamic programming recursions that are related to Inside and CYK. The Outside algorithm calculates intermediate probabilities of the form
(i, j, k, l) = P(X_{0,i}, Y_{0,k}, V, Y_{l,Y}, X_{j,X}S)
representing the sumoverprobabilities of all partial parse trees rooted at S and ending in V without having yet generated sequences X_{ ij }and Y_{ kl }. Then, for example, the posterior probability that some node V in the parse tree has inside sequencepair (X_{ ij }, Y_{ kl }) is
As CYK is related to Inside, the KYC algorithm is related to the Outside algorithm: the intermediate probabilities
(i, j, k, l) are found using similar recursions (see below), but with all sums "p + q" replaced by maxima "max(p, q)". The CYK and KYC DP matrices can be used to find the maximum likelihood of any parse tree containing a node V with inside sequencepair (X_{ ij }, Y_{ kl }): this maximum likelihood is (i, j, k, l) (i, j, k, l). A recursive/stackbased traceback algorithm can be used to find the parse tree with this maximum likelihood.
As with the Inside algorithm, we sum contributions to
(i, j, k, l) from various matching production rules. In contrast to the Inside algorithm, the nonterminal V that indexes (...) must now be matched on the righthandside, not the lefthandside, of these production rules. We first consider bifurcations from a source nonterminal U that adjoin adjacent subsequences to left (U → WV) or right (U → VW), so that the inside sequencepairs for U are (respectively) (X_{ mj }, Y_{ nl }) or (X_{ im },Y_{ kn })
Next we consider emission rules,
, again representing transitions as the special case A' = B' = C' = D' =
The intermediate Outside probabilities are thus
(i, j, k, l) = +
(0, X, 0, Y) = 1 (Termination condition)
Note that the Inside probabilities
(i, j, k, l) are needed to compute the Outside probabilities. We supply the Inside matrix, I, as an input to the Outside algorithm (they are usually calculated at the same time anyway).
In terms of the underlying iteration, the key difference between the Inside and Outside algorithms is as follows. Suppose subsequencepair INNER = (X_{ ij }, Y_{ kl }) is enclosed by subsequencepair OUTER = (X_{ i' j' }, Y_{ k' l' }) (that is, 0 ≤ i' ≤ i ≤ j ≤ j' ≤ X and 0 ≤ k' ≤ k ≤ l ≤ l' ≤ Y, and INNER ≠ OUTER). Then the Inside iterator visits INNER before OUTER, whereas the Outside iterator visits OUTER before INNER. The order in which the Outside iterator visits nonterminals is topologically forwardsorted with respect to the grammar's transitionrule graph (i.e. the reverse of the order used by the Inside iterator).
Inputs: X, Y, S, , I
Initialize
(0, X, 0, Y)
For i = 0 to X (ascending)
• For each j satisfying (i, j) ∈ (descending) (†)
• For each k satisfying (i, k) ∈ (ascending) (†)
• For each l satisfying (k, l) ∈ (descending) (†)
• If (j, l) ∈ then
• For each nonterminal U ∈ Φ (reverse order)
• Set
← 0; calculate
• For each m satisfying {(m, j), (m, i)} ⊂ (†)
• For each n satisfying {(n, l), (n, k)} ⊂ (†)
• If (m, n) ∈ then
• Calculate
(m, n) and add to
• For each m satisfying {(i, m), (j, m)} ⊂ (†)
• For each n satisfying {(k, n), (l, n)} ⊂ (†)
• If (m, n) ∈ then
• Calculate
(m, n) and add to
• Calculate
(i, j, k, l) and store
(†) These ordered subsets of
, and can be precomputed for speed.
The reducedspace dynamic programming matrix that was developed above for the constrained Inside algorithm can be reused for the constrained Outside algorithm.
Implementation
The abovedescribed algorithms were implemented in the C++ dart library. One dart program in particular, stemloc, is an efficient generalpurpose RNA multiplesequence alignment program that can be flexibly controlled by the user from the Unix command line, including reestimation of parameters from training data as well as a broad range of alignment functions.
The dart libraries provide Inside, Outside, CYK, KYC, traceback and training algorithms for any pairwise SCFG in RNA normal form, whose rule probabilities can be expressed as algebraic expressions of some set of probability parameters (with associated normalization constraints). The operatoroverloading features of C++ are utilized in full, so that the syntax of initializing a grammar object involves very few function calls and is essentially declarative.
dart source code releases can be downloaded under the terms of the GNU Public License, from the following URL (which also gives access to the latest development code in the CVS repository)
The grammars and algorithms described in this paper specifically refer to release 0.2 of the dart package, dated October 2004 (although the algorithms are also implemented in release 0.1, dated 10/2003).
Selecting appropriate fold and alignment envelopes
This section offers a nonexhaustive list of possible strategies for choosing appropriate fold/ alignment envelopes. (Italicized terms apply to fold envelopes, and bold terms to alignment envelopes.)

Choose some appropriately simplified grammar, such as a singlesequence SCFG/ pair HMM that models RNA folding /primary sequence alignment. Compute posterior probabilities of subsequences/cutpoints using the InsideOutside/ForwardBackward algorithm for some singlesequence SCFG/ pair HMM that models RNA folding/primary sequence alignment. Select all subsequences/cutpoints with posterior probability above some threshold (or select e.g. the top 10 percentile of the posterior probabability distribution). Ensure that each of these subsequences/cutpoints is on a valid traceback path, e.g. by running the CYKKYC/Viterbiforwardbackward algorithm to find the maximumlikelihood traceback path from any given subsequence/ cutpoint. Take the union of all subsequences/cutpoints on these traceback paths to obtain the required envelope.

As above, choose some appropriate singlesequence SCFG/pair HMM that models RNA folding /primary sequence alignment. Compute the maximum traceback pathlikelihood from all subsequences/cutpoints using the CYKKYC/Viterbiforwardbackward algorithm for this grammar. Select all subsequences/cutpoints with maximum traceback likelihood above some threshold (or select e.g. the top 10 percentile of the maxtraceback likelihood distribution) and do a traceback from each such cell. Take the union of all subsequences/cutpoints on these traceback paths to obtain the required envelope.

As above, choose some appropriate singlesequence SCFG/pair HMM that models RNA folding /primary sequence alignment. Sample some number of RNA structures /pairwise alignments using the Inside/Forward algorithm with stochastic traceback. Take the union of all subsequences/cutpoints on these traceback paths to obtain the required envelope.
The latter two strategies have been implemented in the stemloc package described below. Empirically, the stochastic strategy appears to be less reliable than the deterministic strategies (although in theory the stochastic strategy will eventually find the globally optimal alignment given sufficiently many random repetitions, which may be a useful property).
Multiple sequence alignment
A heuristic algorithm for performing multiple alignmentandfolding of RNA sequences with a pairwise SCFG by progressive singlelinkage clustering runs as follows

Start by making pairwise alignments (with predicted secondary structures) for all pairs of input sequences.

Mark the highestscoring pair, and extract the two marked sequences with their predicted secondary structures. This highestscoring pair is called the seed alignment.

While some sequences remain unmarked:

For each newlymarked sequence:

Align the marked sequence, with a fold envelope constrained by its predicted structure, to each unmarked sequence in turn. (The fold envelope can be tailored to allow e.g. extension of local alignments.)

Select the highestscoring of the pairwise (markedtounmarked) alignments. Use this alignment to merge the unmarked sequence into the seed alignment, and mark this sequence as newly aligned.


Return the seed alignment.
The above algorithms have been implemented in stemloc. The multiple alignments produced by this algorithm lack welldefined probabilistic scores unless the pair SCFG is conditionally normalized. It is also straightforward to retrieve the N best nonoverlapping alignments by repeatedly applying an incremental WatermanEggertstyle mask to the alignment envelope [24]. This is implemented in stemloc for the pairwise case.
A grammar for pairwise RNA alignment and structure prediction
After some empirical experimentation, we developed the grammar of Tables 2, 3, 4 for the stemloc program. The grammar is split over three tables due to its considerable number of rules. Table 2 contains rules describing the connectivity of stems, loops and multiloops, and contains the only bifurcation rule. Table 3 describes the connectivity of bulges and Table 4 handles emissions (basepaired, unpaired, aligned or gapped).
The starting nonterminal is Start. The nonterminals representing higherlevel units of RNA structure are Loop, Stem, LBulge, RBulge and LRBulge. Each of these has associated Match, Ins and Del states (e.g. StemMatch, StemIns and StemDel) and each of these states has an associated emission state, prefixed with x, y or xy superscripts (e.g. ^{xy}StemMatch, ^{y}StemIns and ^{x}StemDel). The Multi state models multiloops, using a bifurcation to LMulti and RMulti.
Tables 2, 3, 4 also refer to probabilistic parameters used by the models. Free probability parameters (allowed to range from 0 to 1) include startInStem (determining the probability of ending the alignment with a basepair), loopExtend, stemExtend and multiExtend (determining the geometric distribution over loop and stem lengths, or the number of branches in a multiloop), multiBulgeOpen (determining the probability of bulges in multiloops) and stemGapOpen, stemGapExtend, stemGapSwap, loopGapOpen, loopGapExtend and loopGapSwap (determining the geometric distribution over gap lengths in stems and loops). There are also five parameter arrays: postStem[4] (determining whether a stem is followed by a loop, a bulge, a doublestranded bulge or a multiloop); baseIndel[4] (a probability distribution for a single unaligned, unpaired nucleotide); baseSubstitution[16] (a joint probability distribution for two aligned, unpaired nucleotides); basepairIndel[16] (a joint probability distribution for two unaligned, basepaired nucleotides); and basepairSubstitution [256] (a joint probability distribution for four aligned, basepaired nucleotides). All of the above parameters were automatically estimated from training data by the dart software.
To summarize, the grammar models homologous stems, loops, multiloops and bulges in pairwise RNA alignments, with covariant substitution scores and affine gap penalties (geometric indel length distributions). It has the property that any combined alignment and structure prediction for two RNA sequences has a single, unambiguous parse tree. In our investigations, this unambiguity appeared to improve the accuracy of alignment and structure prediction substantially; see also writings on this topic by Giegerich [34] and Dowell, Eddy et al [35]. The grammar is also designed to minimize the number of bifurcation rules (the only bifurcation is Multi → LMulti RMulti).
The stemloc grammar does not model basepair stacking effects due to πorbital overlap, nor does it allow lowcost insertion or deletion of whole stems or substructures. Since the algorithms are implemented for any SCFG, it is straightforward to modify the program to experiment with grammars that model such phenomena. An example of a grammar that models the latter type of mutation (wholesubstructure indels), and is also fully derived from an evolutionary ratebased model, is presented in a companion paper [36].
Parameterization
Under the SCFG framework, the probability parameters for the grammar can be estimated directly from data using the InsideOutside algorithm with appropriate constraints, which are easy to supply (e.g. to sum over all parses consistent with a given alignment, one simply uses an appropriate alignment envelope). The parameters were trained from 56376 (nonindependent) pairwise alignments from RFAM [37], with each pairwise alignment making a weighted contribution of 1/N to the counts computed during Expectation Maximization training, where N is the number of sequences in the multiple alignment from which the pairwise alignments were derived. Furthermore, the pairwise alignments were binned according to sequence identity, providing four alternative parameterisations; the bin ranges were 30–40%, 50–60%, 70–80% and 90–100%.
stemloc allows the user to reestimate all parameters from their own personal training set of trusted alignments. This may be a useful feature, since the training procedure described above is probably biased. Since training was performed using all kinds of sequence available in RFAM, including RNA sequences with computationally predicted secondary structure as well as those for which structures were experimentally confirmed, it is possible that the stemloc parameters may be skewed by the parameters of the computational methods used by the RFAM curators to predict structure. These include the homology modeling program INFERNAL [37] and the de novo structure prediction program PFOLD [8].
References
Eddy SR: Noncoding RNA genes. Current Opinion in Genetics and Development 1999, 9(6):695–699. 10.1016/S0959437X(99)000222
Mandal M, Boese B, Barrick JE, Winkler WC, Breaker RR: Riboswitches Control Fundamental Biochemical Pathways in Bacillus subtilis and Other Bacteria. Cell 2003, 113: 577–586. 10.1016/S00928674(03)00391X
Sijen T, Plasterk RH: Transposon silencing in the Caenorhabditis elegans germ line by natural RNAi. Nature 2003, 426(6964):310–314. 10.1038/nature02107
Ambros V: The functions of animal microRNAs. Nature 2004, 431(7006):350–355. 10.1038/nature02871
Baulcombe D: RNA silencing in plants. Nature 2004, 431(7006):356–363. 10.1038/nature02874
Rivas E, Eddy SR: Secondary structure alone is generally not statistically significant for the detection of noncoding RNAs. Bioinformatics 2000, 16(7):583–605. 10.1093/bioinformatics/16.7.583
Coventry A, Kleitman DJ, Berger B: MSARI: Multiple sequence alignments for statistical detection of RNA secondary structure. Proceedings of the National Academy of Sciences of the USA 2004, 101: 12102–12107. 10.1073/pnas.0404193101
Knudsen B, Hein J: RNA secondary structure prediction using stochastic contextfree grammars and evolutionary history. Bioinformatics 1999, 15(6):446–454. 10.1093/bioinformatics/15.6.446
Rivas E, Eddy SR: Noncoding RNA gene detection using comparative sequence analysis. BMC Bioinformatics 2001, 2: 8. 10.1186/1471210528
Gorodkin J, Heyer LJ, Stormo GD: Finding the most significant common sequence and structure motifs in a set of RNA sequences. Nucleic Acids Research 1997, 25(18):3724–3732. 10.1093/nar/25.18.3724
Mathews DH, Turner DH: Dynalign: an algorithm for finding the secondary structure common to two RNA sequences. Journal of Molecular Biology 2002, 317(2):191–203. 10.1006/jmbi.2001.5351
Perriquet O, Touzet H, Dauchet M: Finding the common structure shared by two homologous RNAs. Bioinformatics 2003, 19: 108–116. 10.1093/bioinformatics/19.1.108
Holmes I, Rubin GM: Pairwise RNA structure comparison using stochastic contextfree grammars. Pac Symp Biocomput 2002, 163–174.
Sankoff D: Simultaneous solution of the RNA folding, alignment, and protosequence problems. SIAM Journal of Applied Mathematics 1985, 45: 810–825. 10.1137/0145048
Zuker M, Stiegler P: Optimal Computer Folding of Large RNA Sequences Using Thermodynamics and Auxiliary Information. Nucleic Acids Research 1981, 9: 133–148.
Eddy SR, Durbin R: RNA Sequence Analysis Using Covariance Models. Nucleic Acids Research 1994, 22: 2079–2088.
Sakakibara Y, Brown M, Hughey R, Mian IS, Sjölander K, Underwood RC, Haussler D: Stochastic ContextFree Grammars for tRNA Modeling. Nucleic Acids Research 1994, 22: 5112–5120.
Brown M, Wilson C: RNA Pseudoknot Modeling Using Intersections of Stochastic ContextFree Grammars with Applications to Database Search.1995. [http://www.cse.ucsc.edu/research/compbio/pseudoknot.html]
Lefebvre F: A GrammarBased Unification of Several Alignment and Folding Algorithms. In Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology. Edited by: States DJ, Agarwal P, Gaasterland T, Hunter L, Smith RF, Menlo Park. CA: AAAI Press; 1996:143–154.
Durbin R, Eddy S, Krogh A, Mitchison G: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge, UK: Cambridge University Press; 1998.
Hofacker IL, Bernhart SH, Stadler PF: Alignment of RNA base pairing probability matrices. Bioinformatics 2004, 20(14):2222–2227. 10.1093/bioinformatics/bth229
Austern MH: Generic Programming and the STL: Using and Extending the C++ Standard Template Library. AddisonWesley; 1999.
Smith TF, Waterman MS: Identification of Common Molecular Subsequences. Journal of Molecular Biology 1981, 147: 195–197. 10.1016/00222836(81)900875
Waterman MS, Eggert M: A new algorithm for best subsequence alignments with application to tRNArRNA comparisons. Journal of Molecular Biology 1987, 197: 723–725. 10.1016/00222836(87)904785
Higgins DG, Sharp PM: Fast and Sensitive Multiple Sequence Alignments on a Microcomputer. Computer Applications in the Biosciences 1989, 5: 151–153.
Lari K, Young SJ: The Estimation of Stochastic ContextFree Grammars Using the InsideOutside Algorithm. Computer Speech and Language 1990, 4: 35–56. 10.1016/08852308(90)90022X
McCaskill JS: The Equilibrium Partition Function and Base Pair Binding Probabilities for RNA Secondary Structure. Biopolymers 1990, 29: 1105–1119. 10.1002/bip.360290621
Altschul SF: Amino Acid Substitution Matrices from an Information Theoretic Perspective. Journal of Molecular Biology 1991, 219: 555–565. 10.1016/00222836(91)90193A
Chomsky N: Three Models for the Description of Language. IRE Transactions Information Theory 1956, 2: 113–124. 10.1109/TIT.1956.1056813
Chomsky N: On Certain Formal Properties of Grammars. Information and Control 1959, 2: 137–167. 10.1016/S00199958(59)903626
Holmes I: Studies in probabilistic sequence alignment and evolution. PhD thesis. The Sanger Centre; 1998.
Shapiro BA, Zhang KZ: Comparing multiple RNA secondary structures using tree comparisons. Computer Applications in the Biosciences 1990, 6(4):309–318.
Klein R, Eddy SR: Noncoding RNA gene detection using comparative sequence analysis. BMC Bioinformatics 2003., 4(44):
Giegerich R: Explaining and Controlling Ambiguity in Dynamic Programming. In Combinatorial Pattern Matching: 11th Annual Symposium. Volume 1848. Edited by: Giancarlo R, Sankoff D. SpringerVerlag Heidelberg; 2000:46–59.
Dowell RD, Eddy SR: Evaluation of several lightweight stochastic contextfree grammars for RNA secondary structure prediction. BMC Bioinformatics 2004, 5: 71. 10.1186/14712105571
Holmes I: A probabilistic model for the evolution of RNA structure. BMC Bioinformatics 2004., 5(166):
GriffithsJones S, Bateman A, Marshall M, Khanna A, Eddy SR: Rfam: an RNA family database. Nucleic Acids Research 2003, 31: 439–441. 10.1093/nar/gkg006
Holmes I, Durbin R: Dynamic programming alignment accuracy. Journal of Computational Biology 1998, 5(3):493–504.
Do CB, Brudno M, Batzoglou S: PROBCONS: Probabilistic Consistencybased Multiple Alignment of Amino Acid Sequences. , in press.
Acknowledgements
The author thanks Sean Eddy for inspiring discussions and three anonymous reviewers for their helpful suggestions. The work was conceived during an NIHfunded workshop on RNA informatics organised by Elena Rivas and Eric Westhof in Benasque, Spain, 2003.
Author information
Authors and Affiliations
Corresponding author
Additional information
Authors' contributions
IH designed, programmed, tested and documented the algorithms.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
This article is published under an open access license. Please check the 'Copyright Information' section either on this page or in the PDF for details of this license and what reuse is permitted. If your intended use exceeds what is permitted by the license or if you are unable to locate the licence and reuse information, please contact the Rights and Permissions team.
About this article
Cite this article
Holmes, I. Accelerated probabilistic inference of RNA structure evolution. BMC Bioinformatics 6, 73 (2005). https://doi.org/10.1186/14712105673
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/14712105673